patternpythonMinor
Recursive search for combinations of words that have a specified MD5 hash
Viewed 0 times
searchcombinationswordsmd5hashrecursivethatforhavespecified
Problem
This code solves a challenge problem that a local company is hosting to attract/screen applicants. The link to the challenge is here.
In brief, we are given a wordlist containing approximately 100,000 words and we are asked to find the combination of words which has MD5 hash
The code begins by first removing all words with wrong letters or too many of the right letters from the list. Then a recursive function is called to search through combinations of up to 3 words, check if the letter counts match the desired phrase, and then hash and compare to see if the correct phrase has been found.
This was my first python project and I'd be very interested in feedback on both the algorithm and my usage of the language. I'm a new programmer who is just starting out in tech (mathematics background) and need all the feedback/criticism I can get to learn fast.
```
# -- coding: utf-8 --
# md5 of our target: 4624d200580677270a54ccff86b9610e
# This code ran in 48.6 seconds on a 2.5Ghz i7 Macbook Pro OSX 10.10.5
# set recursive depth to limit the length of the word combo we check.
# change this if a different depth is desired:
depth = 3
# import useful things
from datetime import datetime
startTime = datetime.now()
from collections import defaultdict, deque
import hashlib
import itertools
import sys
# load words
mySet = set(open('wordlist.txt'))
mySet = set(map(lambda s: s.strip(), mySet))
# initialize check as our dictionary with letter counts to check against
check = defaultdict(int)
letterString = 'poouulttttrywissan'
for letter in letterString:
check[letter] += 1
# strip out words from mySet that have the wrong letters
goodLetters = str('poultrywisan')
newset = set()
for word in mySet:
for letter in word:
if letter not in goodLetters:
newset.add(word)
break
mySet.difference_update(newset)
# strip out w
In brief, we are given a wordlist containing approximately 100,000 words and we are asked to find the combination of words which has MD5 hash
4624d200580677270a54ccff86b9610e. The constraint is that the correct phrase will be an anagram of "poultry outwits ants". The code begins by first removing all words with wrong letters or too many of the right letters from the list. Then a recursive function is called to search through combinations of up to 3 words, check if the letter counts match the desired phrase, and then hash and compare to see if the correct phrase has been found.
This was my first python project and I'd be very interested in feedback on both the algorithm and my usage of the language. I'm a new programmer who is just starting out in tech (mathematics background) and need all the feedback/criticism I can get to learn fast.
```
# -- coding: utf-8 --
# md5 of our target: 4624d200580677270a54ccff86b9610e
# This code ran in 48.6 seconds on a 2.5Ghz i7 Macbook Pro OSX 10.10.5
# set recursive depth to limit the length of the word combo we check.
# change this if a different depth is desired:
depth = 3
# import useful things
from datetime import datetime
startTime = datetime.now()
from collections import defaultdict, deque
import hashlib
import itertools
import sys
# load words
mySet = set(open('wordlist.txt'))
mySet = set(map(lambda s: s.strip(), mySet))
# initialize check as our dictionary with letter counts to check against
check = defaultdict(int)
letterString = 'poouulttttrywissan'
for letter in letterString:
check[letter] += 1
# strip out words from mySet that have the wrong letters
goodLetters = str('poultrywisan')
newset = set()
for word in mySet:
for letter in word:
if letter not in goodLetters:
newset.add(word)
break
mySet.difference_update(newset)
# strip out w
Solution
I'll do the boring, nit-picky stuff first, then move onto something more interesting! Note that I haven't been able to test this, as you haven't provided an example input to match your expected target.
Style
Python has an official style guide, which you should read and follow. For example:
-
Imports should be in alphabetical order before any other code, i.e.
Documentation
The correct way to document modules/functions in Python is using a docstring, e.g. rather than:
you would write:
This information is then available, as
Comments within the function are fine, though, as long as rather than explaining what the code does (which should be clear from the code) they explain why it does it.
Structure
It is generally bad form to have code running at the top level of a script. At the very least, you can rearrange it to look like:
The code after the conditional
However, a better move would be to refactor the top level code into a few functions, then call them as appropriate, e.g. replacing:
with:
and encapsulating the
Magic values
Although you have sensibly extracted
Timing
I would be inclined to extract the timing logic from the core code, allowing you to later choose whether or not to time the process (using e.g.
Similarly, I would remove the
Counting
You are importing two classes from
can be rewritten as:
It also provides a
Scoping
Sets
Sets are a good choice of data structure, but you're using them in a slightly odd way. For example, why not make
Flags
Python has the
Style
Python has an official style guide, which you should read and follow. For example:
- Lines should be 80 characters or fewer (72 in docstrings, see below)
- There shouldn't be whitespace around colons in slices
- Function and variable names should be
lowercase_with_underscores
-
Imports should be in alphabetical order before any other code, i.e.
from collections import defaultdict, deque
from datetime import datetime
import hashlib
import itertools
import sys
DEPTH = 3 # 'constants' are UPPERCASE_WITH_UNDERSCORES by convention
...Documentation
The correct way to document modules/functions in Python is using a docstring, e.g. rather than:
# main recursive function.
def recurseHash(combo, counts):
...you would write:
def recurseHash(combo, counts):
"""Main recursive function."""
...This information is then available, as
recurseHash.__doc__, to e.g. help. You can also use this to document e.g. expected parameters and return values (in this case, defining what combo and counts are). Comments within the function are fine, though, as long as rather than explaining what the code does (which should be clear from the code) they explain why it does it.
Structure
It is generally bad form to have code running at the top level of a script. At the very least, you can rearrange it to look like:
# imports
# CONSTANTS
def recurseHash(...):
...
if __name__ == '__main__':
# everything elseThe code after the conditional
__name__ check only runs when the script is executed directly, so that you can later import and reuse your functionality elsewhere without actually running anything.However, a better move would be to refactor the top level code into a few functions, then call them as appropriate, e.g. replacing:
mySet = set(open('wordlist.txt'))
mySet = set(map(lambda s: s.strip(), mySet))with:
word_set = import_words('wordlist.txt') # again note naming conventionand encapsulating the
open and str.strip functionality in a function. This makes the code more readable now, and more reusable later.Magic values
Although you have sensibly extracted
depth to the top of the script, there are still some other "magic values" within the code, e.g.:- The file name,
'wordlist.txt'
- The base letters
'poouulttttrywissan'and'poultrywisan'(also note duplication here) - it is particularly unclear where these come from, so an explanatory comment would also be helpful
- The expected digest
'4624d200580677270a54ccff86b9610e'- again, where has this come from?
Timing
I would be inclined to extract the timing logic from the core code, allowing you to later choose whether or not to time the process (using e.g.
timeit or the built-in profilers) independent of the actual functionality.Similarly, I would remove the
print, allowing the calling code to decide how it should report the outcome to the user.Counting
You are importing two classes from
collections, but apparently haven't spotted the Counter class, which would simplify check substantially;check = defaultdict(int)
letterString = 'poouulttttrywissan'
for letter in letterString:
check[letter] += 1can be rewritten as:
check = Counter('poouulttttrywissan')It also provides a
most_common method which will list the characters and their counts in descending order of the latter.Scoping
recurseHash relies on newList and startTime being in scope when it is called; this use of a global variable is bad form, and you should refactor to make the former an explicit argument of the function (so your call becomes recurseHash(deque(), defaultdict(int), newList)). As above, the latter shouldn't be relevant to recurseHash itself.Sets
Sets are a good choice of data structure, but you're using them in a slightly odd way. For example, why not make
goodLetters a set, giving you O(1) rather than O(n) membership testing? Note also that goodLetters is the same as the keys of check, which can again be tested in O(1) (a set is basically a dictionary with keys but no values).Flags
Python has the
True and False boolean types, which you should use instead of 0 and 1 for greaterFlag and lessFlag (shouldn't that be lesser?) You could also consider adding a break once both flags are truth-y, to avoid running through the rest of the loop unnecessarily.Code Snippets
from collections import defaultdict, deque
from datetime import datetime
import hashlib
import itertools
import sys
DEPTH = 3 # 'constants' are UPPERCASE_WITH_UNDERSCORES by convention
...# main recursive function.
def recurseHash(combo, counts):
...def recurseHash(combo, counts):
"""Main recursive function."""
...# imports
# CONSTANTS
def recurseHash(...):
...
if __name__ == '__main__':
# everything elsemySet = set(open('wordlist.txt'))
mySet = set(map(lambda s: s.strip(), mySet))Context
StackExchange Code Review Q#103755, answer score: 6
Revisions (0)
No revisions yet.