HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Recursive search for combinations of words that have a specified MD5 hash

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
searchcombinationswordsmd5hashrecursivethatforhavespecified

Problem

This code solves a challenge problem that a local company is hosting to attract/screen applicants. The link to the challenge is here.

In brief, we are given a wordlist containing approximately 100,000 words and we are asked to find the combination of words which has MD5 hash 4624d200580677270a54ccff86b9610e. The constraint is that the correct phrase will be an anagram of "poultry outwits ants".

The code begins by first removing all words with wrong letters or too many of the right letters from the list. Then a recursive function is called to search through combinations of up to 3 words, check if the letter counts match the desired phrase, and then hash and compare to see if the correct phrase has been found.

This was my first python project and I'd be very interested in feedback on both the algorithm and my usage of the language. I'm a new programmer who is just starting out in tech (mathematics background) and need all the feedback/criticism I can get to learn fast.

```
# -- coding: utf-8 --
# md5 of our target: 4624d200580677270a54ccff86b9610e
# This code ran in 48.6 seconds on a 2.5Ghz i7 Macbook Pro OSX 10.10.5

# set recursive depth to limit the length of the word combo we check.
# change this if a different depth is desired:
depth = 3

# import useful things
from datetime import datetime
startTime = datetime.now()
from collections import defaultdict, deque
import hashlib
import itertools
import sys

# load words
mySet = set(open('wordlist.txt'))
mySet = set(map(lambda s: s.strip(), mySet))

# initialize check as our dictionary with letter counts to check against
check = defaultdict(int)
letterString = 'poouulttttrywissan'
for letter in letterString:
check[letter] += 1

# strip out words from mySet that have the wrong letters
goodLetters = str('poultrywisan')
newset = set()
for word in mySet:
for letter in word:
if letter not in goodLetters:
newset.add(word)
break
mySet.difference_update(newset)

# strip out w

Solution

I'll do the boring, nit-picky stuff first, then move onto something more interesting! Note that I haven't been able to test this, as you haven't provided an example input to match your expected target.

Style

Python has an official style guide, which you should read and follow. For example:

  • Lines should be 80 characters or fewer (72 in docstrings, see below)



  • There shouldn't be whitespace around colons in slices



  • Function and variable names should be lowercase_with_underscores



-
Imports should be in alphabetical order before any other code, i.e.

from collections import defaultdict, deque
from datetime import datetime
import hashlib
import itertools
import sys

DEPTH = 3  # 'constants' are UPPERCASE_WITH_UNDERSCORES by convention

...


Documentation

The correct way to document modules/functions in Python is using a docstring, e.g. rather than:

# main recursive function.
def recurseHash(combo, counts):
    ...


you would write:

def recurseHash(combo, counts):
    """Main recursive function."""
    ...


This information is then available, as recurseHash.__doc__, to e.g. help. You can also use this to document e.g. expected parameters and return values (in this case, defining what combo and counts are).

Comments within the function are fine, though, as long as rather than explaining what the code does (which should be clear from the code) they explain why it does it.

Structure

It is generally bad form to have code running at the top level of a script. At the very least, you can rearrange it to look like:

# imports

# CONSTANTS

def recurseHash(...):
    ...

if __name__ == '__main__':

    # everything else


The code after the conditional __name__ check only runs when the script is executed directly, so that you can later import and reuse your functionality elsewhere without actually running anything.

However, a better move would be to refactor the top level code into a few functions, then call them as appropriate, e.g. replacing:

mySet = set(open('wordlist.txt'))
mySet = set(map(lambda s: s.strip(), mySet))


with:

word_set = import_words('wordlist.txt')  # again note naming convention


and encapsulating the open and str.strip functionality in a function. This makes the code more readable now, and more reusable later.

Magic values

Although you have sensibly extracted depth to the top of the script, there are still some other "magic values" within the code, e.g.:

  • The file name, 'wordlist.txt'



  • The base letters 'poouulttttrywissan' and 'poultrywisan' (also note duplication here) - it is particularly unclear where these come from, so an explanatory comment would also be helpful



  • The expected digest '4624d200580677270a54ccff86b9610e' - again, where has this come from?



Timing

I would be inclined to extract the timing logic from the core code, allowing you to later choose whether or not to time the process (using e.g. timeit or the built-in profilers) independent of the actual functionality.

Similarly, I would remove the print, allowing the calling code to decide how it should report the outcome to the user.

Counting

You are importing two classes from collections, but apparently haven't spotted the Counter class, which would simplify check substantially;

check = defaultdict(int)
letterString = 'poouulttttrywissan'
for letter in letterString:
    check[letter] += 1


can be rewritten as:

check = Counter('poouulttttrywissan')


It also provides a most_common method which will list the characters and their counts in descending order of the latter.

Scoping

recurseHash relies on newList and startTime being in scope when it is called; this use of a global variable is bad form, and you should refactor to make the former an explicit argument of the function (so your call becomes recurseHash(deque(), defaultdict(int), newList)). As above, the latter shouldn't be relevant to recurseHash itself.

Sets

Sets are a good choice of data structure, but you're using them in a slightly odd way. For example, why not make goodLetters a set, giving you O(1) rather than O(n) membership testing? Note also that goodLetters is the same as the keys of check, which can again be tested in O(1) (a set is basically a dictionary with keys but no values).

Flags

Python has the True and False boolean types, which you should use instead of 0 and 1 for greaterFlag and lessFlag (shouldn't that be lesser?) You could also consider adding a break once both flags are truth-y, to avoid running through the rest of the loop unnecessarily.

Code Snippets

from collections import defaultdict, deque
from datetime import datetime
import hashlib
import itertools
import sys

DEPTH = 3  # 'constants' are UPPERCASE_WITH_UNDERSCORES by convention

...
# main recursive function.
def recurseHash(combo, counts):
    ...
def recurseHash(combo, counts):
    """Main recursive function."""
    ...
# imports

# CONSTANTS

def recurseHash(...):
    ...

if __name__ == '__main__':

    # everything else
mySet = set(open('wordlist.txt'))
mySet = set(map(lambda s: s.strip(), mySet))

Context

StackExchange Code Review Q#103755, answer score: 6

Revisions (0)

No revisions yet.