HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Optimizing an Anagram Solver

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
solveranagramoptimizing

Problem

I've built an anagram solver in Python 2.7. It goes through a text file with over 100,000 words and checks if any permutations of the input matches the line. It works great, except it is very slow. I've tried making the text file a list with each line as a list item, but the text file is too big to do that. This is the text file that I am using (it's saved on my flash drive).

# import permutations module
from itertools import permutations as prm

num_of_cases = int(raw_input("Input number of words needed to be unscrambled: "))
# take input
counter = 1
while counter <= num_of_cases:
   scrambled_word = list(str(raw_input("Input scrambled word: ")))

# empty lists that will be appended to later
   prm_list = []
   possible_words = []

# takes each permutation of the input and puts in a list
   for i in prm(scrambled_word):
      prm_list.append("".join(i))

   def check(x, y):

   # open list of words
      dictionary = file('E:\MRP\Text_Files\dictionary.txt')

      # check each line in the dictionary against each item
      # in the list of permutations and add it to another empty list if it's a match
      for line in dictionary:
         for i in x:
            if i+'\n' == line:
               y.append(i)

   check(prm_list, possible_words)

# delete duplicates
   possible_words = list(set(possible_words))

# print out possible words
   if len(possible_words) == 0:
      print "No matches found"
   else:
      for i in possible_words:
         print "Possible Word for Scrambled Word #" + str(counter) + ": " + i

# add to the counter
   counter+=1


If you test it out you can see that is very slow, especially as the words get larger. Is there anything I can do to improve the performance of this program?

Solution

Your first comment is misleading (it's the itertools module not the permutations module) and useless anyway. Remove it.

There's absolutely no reason to rename permutations to prm. Don't do it.

num_of_cases would be better called num_cases, simply because it's more conventional.

Your take input comment is misplaced. It should be at the place it refers to.

counter is too vague: what are you counting? I would call it num_matched_anagrams.

You don't need to convert scrambled_word to a list. permutations accepts any iterable. You don't need to call str on the output of raw_input; it already returns a string.

The prior improvements make the comment take input really obvious, so you can remove that.

The later comments are often misaligned. Have them in line with the text.

prm_list can be generated with just list(permutations(scrambled_word)). This removes two comments.

prm_list should just be called word_orders or similar; don't bother naming things by their type.

check is badly named; it should be called something like find_anagrams.

check should not use file, which is deprecated, but open. It should also use with to handle the file.

check should save the dictionary between runs. In fact, this shouldn't really be check's job at all.

The words should have their newlines trimmed, not the other way around.

check should only return anagrams, not deal with appending.

check should have sensibly named parameters.

check should just call permutations itself.

You do

possible_words = list(set(possible_words))


but why bother having possible_words as a list at all?

Instead of if len(possible_words) == 0, do if not possible_words. I would also switch the if around.

num_matched_anagrams+=1 should have spaces: num_matched_anagrams += 1. The comment is obvious, so remove it.

Now to deal with speed.

words should be a set (since it's a set of words) and you loop over the permutations first. This allows a check of:

def find_anagrams(scrambled_word, words):
    for word_order in permutations(scrambled_word):
        potential_word = ''.join(word_order)

        if potential_word in words:
            yield potential_word


which is much faster.

We can improve this further by doing all the work upfront by making a cannonical, sorted version. This generates a mapping:

sorted word → [list of words that sort to this word]


This can be generated like so:

from collections import defaultdict

word_anagrams = defaultdict(set)

with open('dictionary.txt') as dictionary:
   for word in dictionary:
      word = word.rstrip()

      word_anagrams[sorted(word)].add(word)


This then makes find_anagrams just word_anagrams[sorted(scrambled_word)]. This has high initial overhead (0.5 seconds for me with 170k words) but makes subsequent lookups insanely fast.

This gives the code:

from collections import defaultdict

word_anagrams = defaultdict(set)

with open('dictionary.txt') as dictionary:
   for word in dictionary:
      word = word.rstrip()

      word_anagrams[''.join(sorted(word))].add(word)

num_cases = int(raw_input("Input number of words needed to be unscrambled: "))

num_matched_anagrams = 1
while num_matched_anagrams <= num_cases:
   scrambled_word = raw_input("Input scrambled word: ")
   sorted_word = ''.join(sorted(scrambled_word))

   if word_anagrams[sorted_word]:
      for word in word_anagrams[sorted_word]:
         print "Possible Word for Scrambled Word #" + str(num_matched_anagrams) + ": " + word

   else:
      print "No matches found"

   num_matched_anagrams += 1


The loop can be replaced with for num_matched_anagrams in range(1, num_cases+1).

The whole thing should be put into a main function.

The dictionary parsing could be extracted to another function.

The print should use formatting and I'd add brackets for Python 3 compatibility:

print("Possible Word for Scrambled Word #{}: {}".format(num_matched_anagrams, word))


To complete Python 3 compatibility, add this to the top:

try:
   input = raw_input
except NameError:
   pass


This gives:

```
from collections import defaultdict

try:
input = raw_input
except NameError:
pass

def parse_dictionary(filename):
word_anagrams = defaultdict(set)

with open(filename) as dictionary:
for word in dictionary:
word = word.rstrip()

word_anagrams[''.join(sorted(word))].add(word)

return word_anagrams

def main():
word_anagrams = parse_dictionary("dictionary.txt")

num_cases = int(input("Input number of words needed to be unscrambled: "))

for num_matched_anagrams in range(1, num_cases+1):
scrambled_word = input("Input scrambled word: ")
sorted_word = ''.join(sorted(scrambled_word))

if word_anagrams[sorted_word]:
for word in word_anagrams[sorted_word]:
print("Possible Word for Scrambled Word #{}: {}".format(num_matched_anagrams, word))

Code Snippets

possible_words = list(set(possible_words))
def find_anagrams(scrambled_word, words):
    for word_order in permutations(scrambled_word):
        potential_word = ''.join(word_order)

        if potential_word in words:
            yield potential_word
sorted word → [list of words that sort to this word]
from collections import defaultdict

word_anagrams = defaultdict(set)

with open('dictionary.txt') as dictionary:
   for word in dictionary:
      word = word.rstrip()

      word_anagrams[sorted(word)].add(word)
from collections import defaultdict

word_anagrams = defaultdict(set)

with open('dictionary.txt') as dictionary:
   for word in dictionary:
      word = word.rstrip()

      word_anagrams[''.join(sorted(word))].add(word)

num_cases = int(raw_input("Input number of words needed to be unscrambled: "))

num_matched_anagrams = 1
while num_matched_anagrams <= num_cases:
   scrambled_word = raw_input("Input scrambled word: ")
   sorted_word = ''.join(sorted(scrambled_word))

   if word_anagrams[sorted_word]:
      for word in word_anagrams[sorted_word]:
         print "Possible Word for Scrambled Word #" + str(num_matched_anagrams) + ": " + word

   else:
      print "No matches found"

   num_matched_anagrams += 1

Context

StackExchange Code Review Q#75023, answer score: 14

Revisions (0)

No revisions yet.