patternrubyMinor
Small BM25+ implementation for probibalistic result ranking
Viewed 0 times
resultprobibalisticsmallforrankingimplementationbm25
Problem
I came up with this small implementation of an Okapi BM25+ ranker using Ruby. I cooked it up in a very short time so it's very simple but I'm trying to think whether there's a better way to write it. In particular, I'm wondering if I should have something like a
The two classes used are
document.rb
collection.rb
```
class Collection
def initialize(params={:docs=>[], :query=>nil})
@docs = params[:docs]
@query = params[:query]
end
def containing_term(term)
total = 0
@docs.each do |doc|
total += 1 if doc.include?(term)
end
total
end
def avg_dl
doc_lengths = 0
@docs.each do |doc|
doc_lengths += doc.length
end
doc_lengths / total_docs
end
def total_docs
@docs.size
end
def idf(term)
numerator = total_docs - containing_term(term) + 0.5
denominator = containing_term(term) + 0.5
Math.log(numerator / denominator)
end
def bm25(params={:k => 1.2, :b => 0.75, :delta => 1.0})
@k = params[:k]
@b = params[:b]
@delta = params[:delta]
@docs.each do |doc|
score = 0
dl = doc.length
query_terms = @query.split
query_terms.each do |term|
dtf = doc.term_freq(term)
numerator = dtf * (@k + 1)
denominator = dtf + @k (1 - @b + @b (doc.length / avg_dl))
Ranker class -- intuitively I feel as though I should but it seems to work fine this way too.The two classes used are
Document and Collection. A Collection can contain many Documents. Very simple.document.rb
class Document
attr_accessor :body, :rank
def initialize(params={:body=>'', :rank=>nil})
@body = params[:body]
@rank = params[:rank]
end
def length
@body.length
end
def include?(term)
@body.include?(term)
end
def term_freq(term)
# Will need something better here to clean out the document.
# This is temporary ;)
@body.gsub(/[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/,'')
.downcase
.split
.count(term)
end
endcollection.rb
```
class Collection
def initialize(params={:docs=>[], :query=>nil})
@docs = params[:docs]
@query = params[:query]
end
def containing_term(term)
total = 0
@docs.each do |doc|
total += 1 if doc.include?(term)
end
total
end
def avg_dl
doc_lengths = 0
@docs.each do |doc|
doc_lengths += doc.length
end
doc_lengths / total_docs
end
def total_docs
@docs.size
end
def idf(term)
numerator = total_docs - containing_term(term) + 0.5
denominator = containing_term(term) + 0.5
Math.log(numerator / denominator)
end
def bm25(params={:k => 1.2, :b => 0.75, :delta => 1.0})
@k = params[:k]
@b = params[:b]
@delta = params[:delta]
@docs.each do |doc|
score = 0
dl = doc.length
query_terms = @query.split
query_terms.each do |term|
dtf = doc.term_freq(term)
numerator = dtf * (@k + 1)
denominator = dtf + @k (1 - @b + @b (doc.length / avg_dl))
Solution
You should use built-ins to simplify the code:
Becomes:
And
Becomes:
You are close to immutable, make the last step
In
Use a Hash for an easy but massive optimization
The method:
Should be a Hash lookup, where the Hash is created at init time, time complexity would become O(1) with an Hash instead of O(n).
def containing_term(term)
total = 0
@docs.each do |doc|
total += 1 if doc.include?(term)
end
total
endBecomes:
def containing_term(term)
@docs.count {|doc| doc.include?(term) }
endAnd
def avg_dl
doc_lengths = 0
@docs.each do |doc|
doc_lengths += doc.length
end
doc_lengths / total_docs
endBecomes:
def avg_dl
@docs.map(&:length).inject(:+) / total_docs
endYou are close to immutable, make the last step
In
Collection you use sort!, disallowing method chaining, sort and returning a new object would be more expected.Use a Hash for an easy but massive optimization
The method:
def term_freq(term)
# Will need something better here to clean out the document.
# This is temporary ;)
@body.gsub(/[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/,'')
.downcase
.split
.count(term)
endShould be a Hash lookup, where the Hash is created at init time, time complexity would become O(1) with an Hash instead of O(n).
Code Snippets
def containing_term(term)
total = 0
@docs.each do |doc|
total += 1 if doc.include?(term)
end
total
enddef containing_term(term)
@docs.count {|doc| doc.include?(term) }
enddef avg_dl
doc_lengths = 0
@docs.each do |doc|
doc_lengths += doc.length
end
doc_lengths / total_docs
enddef avg_dl
@docs.map(&:length).inject(:+) / total_docs
enddef term_freq(term)
# Will need something better here to clean out the document.
# This is temporary ;)
@body.gsub(/[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/,'')
.downcase
.split
.count(term)
endContext
StackExchange Code Review Q#97304, answer score: 2
Revisions (0)
No revisions yet.