HiveBrain v1.2.0
Get Started
← Back to all entries
patternrubyMinor

Small BM25+ implementation for probibalistic result ranking

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
resultprobibalisticsmallforrankingimplementationbm25

Problem

I came up with this small implementation of an Okapi BM25+ ranker using Ruby. I cooked it up in a very short time so it's very simple but I'm trying to think whether there's a better way to write it. In particular, I'm wondering if I should have something like a Ranker class -- intuitively I feel as though I should but it seems to work fine this way too.

The two classes used are Document and Collection. A Collection can contain many Documents. Very simple.

document.rb

class Document
  attr_accessor :body, :rank

  def initialize(params={:body=>'', :rank=>nil})
    @body = params[:body]
    @rank = params[:rank]
  end

  def length
    @body.length
  end

  def include?(term)
    @body.include?(term)
  end

  def term_freq(term)
    # Will need something better here to clean out the document.
    # This is temporary ;)
   @body.gsub(/[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/,'')
    .downcase
    .split
    .count(term) 
  end
end


collection.rb

```
class Collection

def initialize(params={:docs=>[], :query=>nil})
@docs = params[:docs]
@query = params[:query]
end

def containing_term(term)
total = 0
@docs.each do |doc|
total += 1 if doc.include?(term)
end
total
end

def avg_dl
doc_lengths = 0
@docs.each do |doc|
doc_lengths += doc.length
end
doc_lengths / total_docs
end

def total_docs
@docs.size
end

def idf(term)
numerator = total_docs - containing_term(term) + 0.5
denominator = containing_term(term) + 0.5
Math.log(numerator / denominator)
end

def bm25(params={:k => 1.2, :b => 0.75, :delta => 1.0})
@k = params[:k]
@b = params[:b]
@delta = params[:delta]

@docs.each do |doc|
score = 0
dl = doc.length
query_terms = @query.split

query_terms.each do |term|
dtf = doc.term_freq(term)
numerator = dtf * (@k + 1)
denominator = dtf + @k (1 - @b + @b (doc.length / avg_dl))

Solution

You should use built-ins to simplify the code:

def containing_term(term)
  total = 0
  @docs.each do |doc|
    total += 1 if doc.include?(term) 
   end
 total
end


Becomes:

def containing_term(term)
  @docs.count {|doc| doc.include?(term) }
end


And

def avg_dl
  doc_lengths = 0
  @docs.each do |doc|
    doc_lengths += doc.length
  end
  doc_lengths / total_docs
end


Becomes:

def avg_dl
  @docs.map(&:length).inject(:+) / total_docs
end


You are close to immutable, make the last step

In Collection you use sort!, disallowing method chaining, sort and returning a new object would be more expected.

Use a Hash for an easy but massive optimization

The method:

def term_freq(term)
   # Will need something better here to clean out the document.
   # This is temporary ;)
  @body.gsub(/[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/,'')
   .downcase
   .split
   .count(term) 
end


Should be a Hash lookup, where the Hash is created at init time, time complexity would become O(1) with an Hash instead of O(n).

Code Snippets

def containing_term(term)
  total = 0
  @docs.each do |doc|
    total += 1 if doc.include?(term) 
   end
 total
end
def containing_term(term)
  @docs.count {|doc| doc.include?(term) }
end
def avg_dl
  doc_lengths = 0
  @docs.each do |doc|
    doc_lengths += doc.length
  end
  doc_lengths / total_docs
end
def avg_dl
  @docs.map(&:length).inject(:+) / total_docs
end
def term_freq(term)
   # Will need something better here to clean out the document.
   # This is temporary ;)
  @body.gsub(/[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/,'')
   .downcase
   .split
   .count(term) 
end

Context

StackExchange Code Review Q#97304, answer score: 2

Revisions (0)

No revisions yet.