patternpythonMinor
Rabin Karp algorithm improvement for string contains A to Z only
Viewed 0 times
rabinalgorithmcontainsforonlyimprovementstringkarp
Problem
I have made some improvement to Rabin Karp algorithm when the source and pattern only contain the characters from A to Z. The improvement is, I use base 26 for hashing (as there are total 26 characters). In this case, when the hashes match, there is no need to compare the actual string contents (since there are no hash collisions).
Below is my code in Python 2.7, any advice on code efficiency improvement (in terms of algorithm complexity perspective), bugs or code style is appreciated.
Source code in Python 2.7,
Below is my code in Python 2.7, any advice on code efficiency improvement (in terms of algorithm complexity perspective), bugs or code style is appreciated.
Source code in Python 2.7,
def rabin_karp_match(source, pattern):
base = 26
p_hash = 0
s_hash = 0
for i in pattern:
p_hash = p_hash * base + ord(i)-ord('A')
for j,v in enumerate(source):
if j < len(pattern) - 1:
s_hash = s_hash * base + ord(v)-ord('A')
continue
s_hash = s_hash * base + ord(v)-ord('A')
if s_hash == p_hash:
return j - len(pattern)+1
s_hash = s_hash - (ord(source[j-len(pattern)+1]) - ord('A')) * base ** (len(pattern) - 1)
return -1
if __name__ == "__main__":
print rabin_karp_match('GACGCCA','CGC')
print rabin_karp_match('GATACCCATCGAGTCGGATCGAGT', 'GAG')
print rabin_karp_match('FOOBARWIDGET', 'WIDGETS')Solution
The code looks good, this probably improves the runtime in some cases, but I am not sure if hash matches happen often enough for this to be effective on average.
Efficiency/Logic
-
It's probably a good idea to check for the pattern size at the beginning of your method, to avoid overflows. This technically won't happen in python as it will be automatically change the storage to big-int, but the performance of that will be much worse.
-
Minor optimization for runtime is to save the value of
Readability
Modified code
Efficiency/Logic
-
It's probably a good idea to check for the pattern size at the beginning of your method, to avoid overflows. This technically won't happen in python as it will be automatically change the storage to big-int, but the performance of that will be much worse.
-
Minor optimization for runtime is to save the value of
base ** (len(pattern) - 1) before your loop. Exponentiation is an expensive function, so it's better to do it only one time.Readability
- Try to avoid one letter variables like i and j, except for numeric loop indices. A more expressive variable name would improve readability a lot.
- Look for repeated parts in your code, and see if it makes sense to extract them into functions. One example in your code is calculating the letter index which is repeated multiple times.
- Might make more sense to return
Noneinstead of-1in case of no match.
Modified code
def letter_position(letter):
return ord(letter) - ord('A')
def rabin_karp_match(source, pattern):
base = 26
p_hash = 0
s_hash = 0
for letter in pattern:
p_hash = p_hash * base + letter_position(letter)
first_letter_base = base ** (len(pattern) - 1)
for i, letter in enumerate(source):
s_hash = s_hash * base + letter_position(letter)
i < len(pattern) - 1:
continue
match_start_index = i - len(pattern) + 1
if s_hash == p_hash:
return match_start_indexj
s_hash = s_hash - letter_position(source[match_start_index]) * first_letter_base
return NoneCode Snippets
def letter_position(letter):
return ord(letter) - ord('A')
def rabin_karp_match(source, pattern):
base = 26
p_hash = 0
s_hash = 0
for letter in pattern:
p_hash = p_hash * base + letter_position(letter)
first_letter_base = base ** (len(pattern) - 1)
for i, letter in enumerate(source):
s_hash = s_hash * base + letter_position(letter)
i < len(pattern) - 1:
continue
match_start_index = i - len(pattern) + 1
if s_hash == p_hash:
return match_start_indexj
s_hash = s_hash - letter_position(source[match_start_index]) * first_letter_base
return NoneContext
StackExchange Code Review Q#154224, answer score: 4
Revisions (0)
No revisions yet.