patternpythonMinor
Iterate over large amount of words in Python
Viewed 0 times
amountwordsiteratelargepythonover
Problem
I wrote a program that should check a dictionary that contains about 50000 other dictionaries. In those dictionaries, one of the keys has a list of words as value. Now, I iterate over those words, finding the words most similar to a queried input. However, it takes quite a while for it to finish. How can I speed up this process?
```
import pickle,sys
def levenshtein_distance(first, second):
if first == second: return 0
elif len(first) == 0: return len(second)
elif len(second) == 0: return len(first)
v0 = [None] * (len(second) + 1)
v1 = [None] * (len(second) + 1)
for i in range(len(v0)):
v0[i] = i
for i in range(len(first)):
v1[0] = i + 1
for j in range(len(second)):
cost = 0 if first[i] == second[j] else 1
v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
for j in range(len(v0)):
v0[j] = v1[j]
return v1[len(second)]
def remove_duplicates(seq):
seen = set()
seen_add = seen.add
return [ x for x in seq if not (x in seen or seen_add(x))]
def main():
dict_pkld = pickle.load(open('woordjes2.pkl', 'rb'))
postlist = pickle.load(open('postlist.pkl', 'rb'))
lowest_distance = []
words = []
f = sys.stdin.readlines()
for line in f:
woorden = line.rstrip()
query1 = woorden
print ('Your query is: "' + query1 + '"')
print ('A word that is similar to your query is: \n')
for sub in dict_pkld.values():
woorden = sub['termen']
if not query1 in woorden:
for woord in woorden:
x = levenshtein_distance(query1, woord)
temp_list_number = []
temp_list_word = []
if lowest_distance:
```
import pickle,sys
def levenshtein_distance(first, second):
if first == second: return 0
elif len(first) == 0: return len(second)
elif len(second) == 0: return len(first)
v0 = [None] * (len(second) + 1)
v1 = [None] * (len(second) + 1)
for i in range(len(v0)):
v0[i] = i
for i in range(len(first)):
v1[0] = i + 1
for j in range(len(second)):
cost = 0 if first[i] == second[j] else 1
v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
for j in range(len(v0)):
v0[j] = v1[j]
return v1[len(second)]
def remove_duplicates(seq):
seen = set()
seen_add = seen.add
return [ x for x in seq if not (x in seen or seen_add(x))]
def main():
dict_pkld = pickle.load(open('woordjes2.pkl', 'rb'))
postlist = pickle.load(open('postlist.pkl', 'rb'))
lowest_distance = []
words = []
f = sys.stdin.readlines()
for line in f:
woorden = line.rstrip()
query1 = woorden
print ('Your query is: "' + query1 + '"')
print ('A word that is similar to your query is: \n')
for sub in dict_pkld.values():
woorden = sub['termen']
if not query1 in woorden:
for woord in woorden:
x = levenshtein_distance(query1, woord)
temp_list_number = []
temp_list_word = []
if lowest_distance:
Solution
-
The largest performance problem I see is in the
-
The
-
I would consider delegating
-
Not related to performance:
-
-
A massive
The largest performance problem I see is in the
for woord in woorden loop. The code iterates over zip(lowest_distance, words) - which is growing - over and over again, effectively getting a quadratic performance. I recommend to calculate distances to each woord, and sort the resulting list:woord_distance_list = [ (levenstein_distance(woord), woord) for woord in woorden ].sorted()-
The
words list undergoes remove_duplicates, which is a strong indication that it should be set in the first place.-
I would consider delegating
levenstein_distance to a C function. Python is not sell-suited for such kind of computations.-
Not related to performance:
-
postlist is declared way early. Move it down to where it is used:postlist = pickle.load(open('postlist.pkl', 'rb'))
result = postlist[words_new[0]]-
A massive
main with 4 levels of loop nesting is a strong indication that something is not right. Try to make each loop in a function, just to give it a meaningful name.Code Snippets
woord_distance_list = [ (levenstein_distance(woord), woord) for woord in woorden ].sorted()postlist = pickle.load(open('postlist.pkl', 'rb'))
result = postlist[words_new[0]]Context
StackExchange Code Review Q#108215, answer score: 3
Revisions (0)
No revisions yet.