Implementing a (small) full-text search in Ruby
Problem
I'm trying to implement a small full-text search in Ruby. I feel like I've laid the foundation, but I've hit a blocker that makes me think something in the design is incorrect.
The general concept is that a Document is the basic unit. Multiple Documents form a Collection. The InvertedIndex takes a collection to build the index, which is just a hash of stemmed tokens mapped to their respective document ids, for example:

```
inverted_index = {
  "new"      => [1, 4],        # These are document ids.
  "home"     => [1, 2, 3, 4],  # The keys are stemmed and some
  "sale"     => [1, 2, 3, 4],  # stop words have been removed.
  "top"      => [1],
  "forecast" => [1],
  "rise"     => [2, 4],
  "juli"     => [2, 3, 4],
  "increas"  => [3]
}
```

document.rb

```ruby
module Rankrb
  class Document
    attr_accessor :id, :body, :rank

    def initialize(params={})
      @id   = params.fetch :id, nil
      @body = params.fetch :body, ''
      @rank = params.fetch :rank, nil
    end

    def length
      tokens.join(' ').length
    end

    def include?(term)
      tokens.include? term_to_token(term)
    end

    def term_freq(term)
      tokens.count term_to_token(term)
    end

    def tokens
      Rankrb::Tokenizer.new(@body).tokenize
    end

    def uniq_tokens
      tokens.uniq
    end

    private

    # Normalize a search term through the same tokenizer as document text.
    def term_to_token(term)
      Rankrb::Tokenizer.new(term).tokenize.shift
    end
  end
end
```

tokenizer.rb

```ruby
module Rankrb
  # The same tokenizer should be used for document
  # tokenization and query tokenization to ensure that
  # the same terms are being searched and returned.
  class Tokenizer
    attr_accessor :str
    attr_reader :tokens

    def initialize(str='')
      @str = str
      @tokens = Array.new
      @stopwords = Rankrb.configuration.stopwords
      @lang = Rankrb.configuration.language
    end

    def tokenize
      regex = /[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/
      @tokens = @str.gsub(regex, '')
                    .downcase
                    .split # the posted snippet is cut off after .downcase; splitting into tokens is the minimal plausible continuation
    end
  end
end
```
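To make the index structure concrete, here is a standalone sketch (independent of the Rankrb classes above, with a naive whitespace tokenizer, a tiny stop-word list, and a toy suffix-stripping stand-in for a real stemmer) of how such an inverted index can be built:

```ruby
# Build a toy inverted index: stemmed token => array of document ids.
docs = {
  1 => "new home sales top forecasts",
  2 => "home sales rise in july",
  3 => "increase in home sales in july",
  4 => "july new home sales rise"
}

STOPWORDS = %w[in the a].freeze

# Toy "stemmer": just trims a trailing 's'. A real implementation
# would use a Porter stemmer instead.
def stem(word)
  word.sub(/s\z/, '')
end

inverted_index = Hash.new { |h, k| h[k] = [] }
docs.each do |id, body|
  body.downcase.split
      .reject { |t| STOPWORDS.include?(t) }
      .map    { |t| stem(t) }
      .uniq
      .each   { |t| inverted_index[t] << id }
end

inverted_index["home"] # => [1, 2, 3, 4]
inverted_index["new"]  # => [1, 4]
```

The `uniq` before inserting keeps each document id listed at most once per token, matching the id-list shape shown above.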
Solution
I gave a talk at RailsConf a few years back where I walked through some fundamentally similar code. In my talk I actually used JRuby and OpenNLP, but you can achieve the same result in Ruby MRI using some of the other libraries I call out in my deck.
https://speakerdeck.com/brandonblack/natural-language-processing-in-ruby?slide=31
In my example, I'm doing a poor man's summary of a block of text by identifying and grabbing the top "most significant" sentences in the document. It's a different goal, but many of the basic ideas translate over to what you're trying to do.
For a basic text search, you need to build an index that returns the most relevant items based on a given search phrase. Once you've built that index, searching it is simple.
For each segment of searchable text:
- Tokenize all the input blocks of text.
- Filter out stop words that carry little value. You can do this with a library, or for simple demo purposes with a regex or a stop-word blacklist.
- Use a word stemmer to identify the root word for each token in the text block.
- In your database, store an association between the word stem, the document being searched and a weighting for how many times that word stem appears in the text.
- In your search method, follow the same process (tokenize, filter, stem) on the search phrase, then return the documents found in your index for those stem words, sorted by weight descending.
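The steps above can be sketched end to end in plain Ruby. Everything here (the method names, the tiny stop-word list, and the crude suffix stemmer) is illustrative rather than taken from the answer:

```ruby
STOP = %w[the a of in and is].freeze

# Tokenize, filter stop words, and stem -- the same pipeline is used
# for indexing documents and for processing the search phrase.
def prepare(text)
  text.downcase.scan(/[a-z0-9]+/)
      .reject { |w| STOP.include?(w) }
      .map    { |w| w.sub(/(ing|ed|s)\z/, '') } # crude stand-in for a real stemmer
end

# index: stem => { document_id => weight (term count in that document) }
def build_index(docs)
  index = Hash.new { |h, k| h[k] = Hash.new(0) }
  docs.each do |id, body|
    prepare(body).each { |stem| index[stem][id] += 1 }
  end
  index
end

# Process the phrase identically, sum the weights per document,
# and return document ids sorted by weight descending.
def search(index, phrase)
  scores = Hash.new(0)
  prepare(phrase).each do |stem|
    index.fetch(stem, {}).each { |doc_id, w| scores[doc_id] += w }
  end
  scores.sort_by { |_id, w| -w }.map(&:first)
end

docs = {
  1 => "Ruby is great and Ruby is fast",
  2 => "Full text search in Ruby",
  3 => "Search engines index documents"
}
index = build_index(docs)
search(index, "ruby") # => [1, 2]
```

Document 1 ranks first for "ruby" because the stem appears in it twice, i.e. it carries the higher weight.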
In the database, a row in the search index might look something like this:
```
stem_word_id, document_id, weight
```

If you had a table full of those associations, you'd just need to index that table across all three columns, with weight in descending order. For best results, consider using a database that leverages in-memory mapping (e.g. MongoDB) for this index (similar to keeping this index in cache).
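As a rough illustration (the row values below are made up), the lookup over such a table amounts to filtering by the stems in the query and sorting by weight:

```ruby
# The search-index table as an in-memory array of rows. In a real
# system these rows would live in a database with a composite index
# covering all three columns.
rows = [
  { stem_word_id: 10, document_id: 1, weight: 3 },
  { stem_word_id: 10, document_id: 2, weight: 1 },
  { stem_word_id: 11, document_id: 2, weight: 2 }
]

# Roughly: SELECT document_id FROM search_index
#          WHERE stem_word_id IN (...) ORDER BY weight DESC
def matching_documents(rows, stem_ids)
  rows.select  { |r| stem_ids.include?(r[:stem_word_id]) }
      .sort_by { |r| -r[:weight] }
      .map     { |r| r[:document_id] }
      .uniq
end

matching_documents(rows, [10, 11]) # => [1, 2]
```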
I hope that helps draw a rough picture of the process.
That being said, there are many more nuances to full-text search. This is a very simplistic, mostly naïve approach, and without an appropriate data backend it would not scale well for you (you're spot on that storing these in memory is a losing strategy long term). If you're considering this for a production system, I would recommend using a database with full-text search support, or something like Apache Lucene, for which there are Ruby-based adapters.
Context
StackExchange Code Review Q#98845, answer score: 2