patternpythonModerate
Optimizing an Anagram Solver
Viewed 0 times
solveranagramoptimizing
Problem
I've built an anagram solver in Python 2.7. It goes through a text file with over 100,000 words and checks if any permutations of the input matches the line. It works great, except it is very slow. I've tried making the text file a list with each line as a list item, but the text file is too big to do that. This is the text file that I am using (it's saved on my flash drive).
If you test it out you can see that is very slow, especially as the words get larger. Is there anything I can do to improve the performance of this program?
# import permutations module
from itertools import permutations as prm
num_of_cases = int(raw_input("Input number of words needed to be unscrambled: "))
# take input
counter = 1
while counter <= num_of_cases:
scrambled_word = list(str(raw_input("Input scrambled word: ")))
# empty lists that will be appended to later
prm_list = []
possible_words = []
# takes each permutation of the input and puts in a list
for i in prm(scrambled_word):
prm_list.append("".join(i))
def check(x, y):
# open list of words
dictionary = file('E:\MRP\Text_Files\dictionary.txt')
# check each line in the dictionary against each item
# in the list of permutations and add it to another empty list if it's a match
for line in dictionary:
for i in x:
if i+'\n' == line:
y.append(i)
check(prm_list, possible_words)
# delete duplicates
possible_words = list(set(possible_words))
# print out possible words
if len(possible_words) == 0:
print "No matches found"
else:
for i in possible_words:
print "Possible Word for Scrambled Word #" + str(counter) + ": " + i
# add to the counter
counter+=1If you test it out you can see that is very slow, especially as the words get larger. Is there anything I can do to improve the performance of this program?
Solution
Your first comment is misleading (it's the
There's absolutely no reason to rename
Your
You don't need to convert
The prior improvements make the comment
The later comments are often misaligned. Have them in line with the text.
The words should have their newlines trimmed, not the other way around.
You do
but why bother having
Instead of
Now to deal with speed.
which is much faster.
We can improve this further by doing all the work upfront by making a cannonical, sorted version. This generates a mapping:
This can be generated like so:
This then makes
This gives the code:
The loop can be replaced with
The whole thing should be put into a
The dictionary parsing could be extracted to another function.
The
To complete Python 3 compatibility, add this to the top:
This gives:
```
from collections import defaultdict
try:
input = raw_input
except NameError:
pass
def parse_dictionary(filename):
word_anagrams = defaultdict(set)
with open(filename) as dictionary:
for word in dictionary:
word = word.rstrip()
word_anagrams[''.join(sorted(word))].add(word)
return word_anagrams
def main():
word_anagrams = parse_dictionary("dictionary.txt")
num_cases = int(input("Input number of words needed to be unscrambled: "))
for num_matched_anagrams in range(1, num_cases+1):
scrambled_word = input("Input scrambled word: ")
sorted_word = ''.join(sorted(scrambled_word))
if word_anagrams[sorted_word]:
for word in word_anagrams[sorted_word]:
print("Possible Word for Scrambled Word #{}: {}".format(num_matched_anagrams, word))
itertools module not the permutations module) and useless anyway. Remove it.There's absolutely no reason to rename
permutations to prm. Don't do it.num_of_cases would be better called num_cases, simply because it's more conventional.Your
take input comment is misplaced. It should be at the place it refers to.counter is too vague: what are you counting? I would call it num_matched_anagrams.You don't need to convert
scrambled_word to a list. permutations accepts any iterable. You don't need to call str on the output of raw_input; it already returns a string.The prior improvements make the comment
take input really obvious, so you can remove that.The later comments are often misaligned. Have them in line with the text.
prm_list can be generated with just list(permutations(scrambled_word)). This removes two comments.prm_list should just be called word_orders or similar; don't bother naming things by their type.check is badly named; it should be called something like find_anagrams.check should not use file, which is deprecated, but open. It should also use with to handle the file.check should save the dictionary between runs. In fact, this shouldn't really be check's job at all.The words should have their newlines trimmed, not the other way around.
check should only return anagrams, not deal with appending.check should have sensibly named parameters.check should just call permutations itself.You do
possible_words = list(set(possible_words))but why bother having
possible_words as a list at all?Instead of
if len(possible_words) == 0, do if not possible_words. I would also switch the if around.num_matched_anagrams+=1 should have spaces: num_matched_anagrams += 1. The comment is obvious, so remove it.Now to deal with speed.
words should be a set (since it's a set of words) and you loop over the permutations first. This allows a check of:def find_anagrams(scrambled_word, words):
for word_order in permutations(scrambled_word):
potential_word = ''.join(word_order)
if potential_word in words:
yield potential_wordwhich is much faster.
We can improve this further by doing all the work upfront by making a cannonical, sorted version. This generates a mapping:
sorted word → [list of words that sort to this word]This can be generated like so:
from collections import defaultdict
word_anagrams = defaultdict(set)
with open('dictionary.txt') as dictionary:
for word in dictionary:
word = word.rstrip()
word_anagrams[sorted(word)].add(word)This then makes
find_anagrams just word_anagrams[sorted(scrambled_word)]. This has high initial overhead (0.5 seconds for me with 170k words) but makes subsequent lookups insanely fast.This gives the code:
from collections import defaultdict
word_anagrams = defaultdict(set)
with open('dictionary.txt') as dictionary:
for word in dictionary:
word = word.rstrip()
word_anagrams[''.join(sorted(word))].add(word)
num_cases = int(raw_input("Input number of words needed to be unscrambled: "))
num_matched_anagrams = 1
while num_matched_anagrams <= num_cases:
scrambled_word = raw_input("Input scrambled word: ")
sorted_word = ''.join(sorted(scrambled_word))
if word_anagrams[sorted_word]:
for word in word_anagrams[sorted_word]:
print "Possible Word for Scrambled Word #" + str(num_matched_anagrams) + ": " + word
else:
print "No matches found"
num_matched_anagrams += 1The loop can be replaced with
for num_matched_anagrams in range(1, num_cases+1).The whole thing should be put into a
main function.The dictionary parsing could be extracted to another function.
The
print should use formatting and I'd add brackets for Python 3 compatibility:print("Possible Word for Scrambled Word #{}: {}".format(num_matched_anagrams, word))To complete Python 3 compatibility, add this to the top:
try:
input = raw_input
except NameError:
passThis gives:
```
from collections import defaultdict
try:
input = raw_input
except NameError:
pass
def parse_dictionary(filename):
word_anagrams = defaultdict(set)
with open(filename) as dictionary:
for word in dictionary:
word = word.rstrip()
word_anagrams[''.join(sorted(word))].add(word)
return word_anagrams
def main():
word_anagrams = parse_dictionary("dictionary.txt")
num_cases = int(input("Input number of words needed to be unscrambled: "))
for num_matched_anagrams in range(1, num_cases+1):
scrambled_word = input("Input scrambled word: ")
sorted_word = ''.join(sorted(scrambled_word))
if word_anagrams[sorted_word]:
for word in word_anagrams[sorted_word]:
print("Possible Word for Scrambled Word #{}: {}".format(num_matched_anagrams, word))
Code Snippets
possible_words = list(set(possible_words))def find_anagrams(scrambled_word, words):
for word_order in permutations(scrambled_word):
potential_word = ''.join(word_order)
if potential_word in words:
yield potential_wordsorted word → [list of words that sort to this word]from collections import defaultdict
word_anagrams = defaultdict(set)
with open('dictionary.txt') as dictionary:
for word in dictionary:
word = word.rstrip()
word_anagrams[sorted(word)].add(word)from collections import defaultdict
word_anagrams = defaultdict(set)
with open('dictionary.txt') as dictionary:
for word in dictionary:
word = word.rstrip()
word_anagrams[''.join(sorted(word))].add(word)
num_cases = int(raw_input("Input number of words needed to be unscrambled: "))
num_matched_anagrams = 1
while num_matched_anagrams <= num_cases:
scrambled_word = raw_input("Input scrambled word: ")
sorted_word = ''.join(sorted(scrambled_word))
if word_anagrams[sorted_word]:
for word in word_anagrams[sorted_word]:
print "Possible Word for Scrambled Word #" + str(num_matched_anagrams) + ": " + word
else:
print "No matches found"
num_matched_anagrams += 1Context
StackExchange Code Review Q#75023, answer score: 14
Revisions (0)
No revisions yet.