HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Rabin Karp algorithm improvement for string contains A to Z only

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
rabinalgorithmcontainsforonlyimprovementstringkarp

Problem

I have made some improvement to Rabin Karp algorithm when the source and pattern only contain the characters from A to Z. The improvement is, I use base 26 for hashing (as there are total 26 characters). In this case, when the hashes match, there is no need to compare the actual string contents (since there are no hash collisions).

Below is my code in Python 2.7, any advice on code efficiency improvement (in terms of algorithm complexity perspective), bugs or code style is appreciated.

Source code in Python 2.7,

def rabin_karp_match(source, pattern):
    base = 26
    p_hash = 0
    s_hash = 0
    for i in pattern:
        p_hash = p_hash * base + ord(i)-ord('A')
    for j,v in enumerate(source):
        if j < len(pattern) - 1:
            s_hash = s_hash * base + ord(v)-ord('A')
            continue
        s_hash = s_hash * base + ord(v)-ord('A')
        if s_hash == p_hash:
            return j - len(pattern)+1
        s_hash = s_hash - (ord(source[j-len(pattern)+1]) - ord('A')) * base ** (len(pattern) - 1)
    return -1

if __name__ == "__main__":
    print rabin_karp_match('GACGCCA','CGC')
    print rabin_karp_match('GATACCCATCGAGTCGGATCGAGT', 'GAG')
    print rabin_karp_match('FOOBARWIDGET', 'WIDGETS')

Solution

The code looks good, this probably improves the runtime in some cases, but I am not sure if hash matches happen often enough for this to be effective on average.
Efficiency/Logic

-
It's probably a good idea to check for the pattern size at the beginning of your method, to avoid overflows. This technically won't happen in python as it will be automatically change the storage to big-int, but the performance of that will be much worse.

-
Minor optimization for runtime is to save the value of base ** (len(pattern) - 1) before your loop. Exponentiation is an expensive function, so it's better to do it only one time.

Readability

  • Try to avoid one letter variables like i and j, except for numeric loop indices. A more expressive variable name would improve readability a lot.



  • Look for repeated parts in your code, and see if it makes sense to extract them into functions. One example in your code is calculating the letter index which is repeated multiple times.



  • Might make more sense to return None instead of -1 in case of no match.



Modified code

def letter_position(letter):
  return ord(letter) - ord('A')

def rabin_karp_match(source, pattern):
    base = 26
    p_hash = 0
    s_hash = 0
    for letter in pattern:
        p_hash = p_hash * base + letter_position(letter)

    first_letter_base = base ** (len(pattern) - 1)
    for i, letter in enumerate(source):
        s_hash = s_hash * base + letter_position(letter)
        i < len(pattern) - 1:
          continue
 
        match_start_index = i - len(pattern) + 1
        if s_hash == p_hash:
          return match_start_indexj

        s_hash = s_hash - letter_position(source[match_start_index]) *  first_letter_base
    return None

Code Snippets

def letter_position(letter):
  return ord(letter) - ord('A')

def rabin_karp_match(source, pattern):
    base = 26
    p_hash = 0
    s_hash = 0
    for letter in pattern:
        p_hash = p_hash * base + letter_position(letter)

    first_letter_base = base ** (len(pattern) - 1)
    for i, letter in enumerate(source):
        s_hash = s_hash * base + letter_position(letter)
        i < len(pattern) - 1:
          continue
 
        match_start_index = i - len(pattern) + 1
        if s_hash == p_hash:
          return match_start_indexj

        s_hash = s_hash - letter_position(source[match_start_index]) *  first_letter_base
    return None

Context

StackExchange Code Review Q#154224, answer score: 4

Revisions (0)

No revisions yet.