
Implementing a (small) full-text search in Ruby

Submitted by: @import:stackexchange-codereview

Problem

I'm trying to implement a small full-text search in Ruby. I feel like I've laid down the foundation, but I've hit a blocker that makes me think something in the design is incorrect.

The general concept is that a Document is the basic unit. Multiple Documents form a Collection. The InvertedIndex takes a collection to build the index, which is just a hash mapping stemmed tokens to their respective document ids, for example:

inverted_index = {
  "new" => [1, 4],         # These are document ids.
  "home" => [1, 2, 3, 4],  # The key is stemmed and some stop-words
  "sale" => [1, 2, 3, 4],  # are being removed.
  "top" => [1],
  "forecast" => [1],
  "rise" => [2, 4],
  "juli" => [2, 3, 4],
  "increas" => [3]
}
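
The InvertedIndex class itself isn't shown in the question. As a minimal sketch of the shape it might take (assuming the collection is enumerable and each document responds to id and uniq_tokens, as the Document class below does), building such a hash could look like:

```
module Rankrb
  class InvertedIndex
    def initialize(collection)
      @collection = collection
      @index = Hash.new { |hash, token| hash[token] = [] }
    end

    # Maps each stemmed token to the ids of the documents that contain it,
    # producing a hash like the example above.
    def build
      @collection.each do |document|
        document.uniq_tokens.each { |token| @index[token] << document.id }
      end
      @index
    end
  end
end
```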


document.rb

module Rankrb
  class Document
    attr_accessor :id, :body, :rank

    def initialize(params={})
      @id = params.fetch :id, nil
      @body = params.fetch :body, ''
      @rank = params.fetch :rank, nil
    end

    def length
      tokens.join(' ').length
    end

    def include?(term)
      tokens.include? term_to_token(term)
    end

    def term_freq(term)
      tokens.count term_to_token(term)
    end

    def tokens
      Rankrb::Tokenizer.new(@body).tokenize
    end

    def uniq_tokens
      tokens.uniq
    end

    private
    def term_to_token(term)
      Rankrb::Tokenizer.new(term).tokenize.shift
    end
  end
end
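
Assuming the tokenizer removes stop-words and stems as the question describes, a Document behaves roughly like this; the values shown are what Porter-style stemming would produce:

```
doc = Rankrb::Document.new(id: 1, body: 'New home sales top forecasts')

doc.tokens            # => ["new", "home", "sale", "top", "forecast"]
doc.include?('sales') # => true  ("sales" stems to "sale")
doc.term_freq('home') # => 1
doc.length            # => 26   (stemmed tokens joined by single spaces)
```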


tokenizer.rb

```
require 'lingua/stemmer' # ruby-stemmer gem; an assumed choice, see note in #tokenize

module Rankrb
  # The same tokenizer should be used for document
  # tokenization and query tokenization to ensure that
  # the same terms are being searched and returned.
  class Tokenizer
    attr_accessor :str
    attr_reader :tokens

    def initialize(str='')
      @str = str
      @tokens = Array.new
      @stopwords = Rankrb.configuration.stopwords
      @lang = Rankrb.configuration.language
    end

    def tokenize
      regex = /[^\s\p{Alnum}\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/
      # NOTE: the original snippet breaks off after `.downcase`; everything
      # past that point is an assumed reconstruction: strip punctuation,
      # lowercase, split on whitespace, drop stop-words, then stem each
      # remaining token with Lingua::Stemmer.
      stemmer = Lingua::Stemmer.new(language: @lang)
      @tokens = @str.gsub(regex, '')
                    .downcase
                    .split
                    .reject { |token| @stopwords.include?(token) }
                    .map { |token| stemmer.stem(token) }
    end
  end
end
```
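
Only Rankrb.configuration is referenced in the post; assuming the usual configure-block counterpart exists, usage would look something like:

```
Rankrb.configure do |config|
  config.stopwords = %w[a an in the to] # toy stop-word list
  config.language  = 'en'
end

Rankrb::Tokenizer.new('Home sales rise in July').tokenize
# => ["home", "sale", "rise", "juli"]  (matching the stems in the example index)
```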

Solution

I gave a talk at RailsConf a few years back where I walked through some fundamentally similar code. In my talk I actually used JRuby and OpenNLP, but you can achieve the same result in Ruby MRI using some of the other libraries I call out in my deck.

https://speakerdeck.com/brandonblack/natural-language-processing-in-ruby?slide=31

In my example, I'm doing a poor man's summary of a block of text by identifying and grabbing the top "most significant" sentences in the document. It's a different goal, but many of the basic ideas translate over to what you're trying to do.

For a basic text search, you need to build an index that returns the most relevant items based on a given search phrase. Once you've built that index, searching it is simple.

For each segment of searchable text:

  • Tokenize all the input blocks of text.
  • Filter out low-value stop words. You can do this with a library, or for simple demo purposes a simple regex or blacklist will do.
  • Use a word stemmer to identify the root word for each token in the text block.
  • In your database, store an association between the word stem, the document being searched, and a weighting for how many times that word stem appears in the text.
  • In your search method, follow the same process (tokenize, filter, stem) on the search phrase, then return the documents found in your index for those stems, sorted by weight descending (a sketch of all five steps follows this list).

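To make those five steps concrete, here is a minimal in-memory sketch. The InMemorySearchIndex class, its toy stop-word list, and the choice of the ruby-stemmer gem are illustrative assumptions, not something from the original answer; a production index would live in a database as described next.

```
require 'lingua/stemmer' # ruby-stemmer gem, assumed here for stemming

class InMemorySearchIndex
  STOPWORDS = %w[a an and are for in is of on the to].freeze # toy list

  def initialize(language: 'en')
    @stemmer = Lingua::Stemmer.new(language: language)
    # { stem => { document_id => weight } }
    @index = Hash.new { |hash, stem| hash[stem] = Hash.new(0) }
  end

  # Steps 1-4: tokenize, drop stop words, stem, and store a weighted
  # association between each stem and the document.
  def add(document_id, text)
    stems(text).each { |stem| @index[stem][document_id] += 1 }
  end

  # Step 5: process the query the same way, then return document ids
  # sorted by accumulated weight, descending.
  def search(phrase)
    scores = Hash.new(0)
    stems(phrase).each do |stem|
      @index[stem].each { |doc_id, weight| scores[doc_id] += weight }
    end
    scores.sort_by { |_, score| -score }.map(&:first)
  end

  private

  def stems(text)
    text.downcase
        .scan(/\p{Alnum}+/)
        .reject { |token| STOPWORDS.include?(token) }
        .map { |token| @stemmer.stem(token) }
  end
end

index = InMemorySearchIndex.new
index.add(1, 'New home sales top forecasts')
index.add(2, 'Home sales rise in July')
index.search('rising home sales') # => [2, 1]
```
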
In the database, a row in the search index might look something like this:

stem_word_id, document_id, weight


If you had a table full of those associations, you'd just need to index that table across all three columns, with weight in descending order. For best results, consider using a database that leverages in-memory mapping (e.g. MongoDB) for this index, similar to keeping the index in cache.
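
As one way to picture that table and its composite index, here is a sketch as a Rails migration; the table and column names are illustrative, and the descending index order assumes a database (e.g. PostgreSQL) that supports it:

```
class CreateSearchIndexEntries < ActiveRecord::Migration[7.0]
  def change
    create_table :search_index_entries do |t|
      t.integer :stem_word_id, null: false
      t.integer :document_id,  null: false
      t.integer :weight,       null: false, default: 0
    end

    # Composite index across all three columns, weight descending, so
    # lookups by stem return documents already ranked by weight.
    add_index :search_index_entries,
              [:stem_word_id, :document_id, :weight],
              order: { weight: :desc },
              name: 'idx_search_on_stem_doc_weight'
  end
end
```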

I hope that helps draw a rough picture of the process.

That being said, there are many more nuances to full-text search. This is a very simplistic, mostly naïve approach, and without an appropriate data backend it would not scale very far (you're spot on about storing these in memory being a long-term losing strategy). If you're considering this for a production system, I would recommend using a database with full-text search, or something like Apache Lucene, for which there are Ruby-based adapters.


Context

StackExchange Code Review Q#98845, answer score: 2
