HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Iterate over large amount of words in Python

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
amountwordsiteratelargepythonover

Problem

I wrote a program that should check a dictionary that contains about 50000 other dictionaries. In those dictionaries, one of the keys has a list of words as value. Now, I iterate over those words, finding the words most similar to a queried input. However, it takes quite a while for it to finish. How can I speed up this process?

```
import pickle,sys

def levenshtein_distance(first, second):
if first == second: return 0
elif len(first) == 0: return len(second)
elif len(second) == 0: return len(first)
v0 = [None] * (len(second) + 1)
v1 = [None] * (len(second) + 1)
for i in range(len(v0)):
v0[i] = i
for i in range(len(first)):
v1[0] = i + 1
for j in range(len(second)):
cost = 0 if first[i] == second[j] else 1
v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
for j in range(len(v0)):
v0[j] = v1[j]

return v1[len(second)]

def remove_duplicates(seq):
seen = set()
seen_add = seen.add
return [ x for x in seq if not (x in seen or seen_add(x))]

def main():
dict_pkld = pickle.load(open('woordjes2.pkl', 'rb'))
postlist = pickle.load(open('postlist.pkl', 'rb'))
lowest_distance = []
words = []
f = sys.stdin.readlines()
for line in f:
woorden = line.rstrip()
query1 = woorden
print ('Your query is: "' + query1 + '"')
print ('A word that is similar to your query is: \n')
for sub in dict_pkld.values():
woorden = sub['termen']
if not query1 in woorden:
for woord in woorden:
x = levenshtein_distance(query1, woord)
temp_list_number = []
temp_list_word = []
if lowest_distance:

Solution

-
The largest performance problem I see is in the for woord in woorden loop. The code iterates over zip(lowest_distance, words) - which is growing - over and over again, effectively getting a quadratic performance. I recommend to calculate distances to each woord, and sort the resulting list:

woord_distance_list = [ (levenstein_distance(woord), woord) for woord in woorden ].sorted()


-
The words list undergoes remove_duplicates, which is a strong indication that it should be set in the first place.

-
I would consider delegating levenstein_distance to a C function. Python is not sell-suited for such kind of computations.

-
Not related to performance:

-
postlist is declared way early. Move it down to where it is used:

postlist = pickle.load(open('postlist.pkl', 'rb'))
    result = postlist[words_new[0]]


-
A massive main with 4 levels of loop nesting is a strong indication that something is not right. Try to make each loop in a function, just to give it a meaningful name.

Code Snippets

woord_distance_list = [ (levenstein_distance(woord), woord) for woord in woorden ].sorted()
postlist = pickle.load(open('postlist.pkl', 'rb'))
    result = postlist[words_new[0]]

Context

StackExchange Code Review Q#108215, answer score: 3

Revisions (0)

No revisions yet.