Decrypting a substitution cipher using n-gram frequency analysis
Problem
This is a solution for the June 2015 Community Challenge: a program that decrypts a monoalphabetic substitution cipher. It's written in Python 3, but should be portable to Python 2 if you use `from __future__ import division`.
Please review! Is the technique clear and the code easy to follow? Does the code correctly implement the algorithm? Can you recommend better names for functions and parameters? Would a different search technique (for example, simulated annealing, or beam search) be better?
The program has two main parts: a scoring function and a search function.
The scoring function evaluates the quality of a candidate decryption by computing the log-likelihood of the frequency of n-grams in the candidate, compared to a model corpus.
```
from collections import Counter
from math import log

def ngram_count(n, text):
    """Return the number of n-grams in text."""
    return len(text) - n + 1

def ngrams(n, text):
    """Generate the n-grams in text.

    >>> list(ngrams(3, 'HELLOWORLD'))
    ['HEL', 'ELL', 'LLO', 'LOW', 'OWO', 'WOR', 'ORL', 'RLD']

    """
    for i in range(ngram_count(n, text)):
        yield text[i:i+n]

def plaintext_score_function(n, corpus):
    """Return a function that scores a plaintext based on the
    log-likelihood of the occurrence of n-grams compared to those
    found in corpus.

    """
    # Number of n-grams in the corpus.
    k = ngram_count(n, corpus)
    # Count of occurrences of each n-gram in the corpus.
    counts = Counter(ngrams(n, corpus))
    # Map from n-gram to the logarithm of its frequency in the corpus.
    log_freq = {ngram: log(count / k) for ngram, count in counts.items()}
    # Log-frequency to use for n-grams that don't appear in the corpus
    # (an arbitrary value that's much smaller than the log-frequency
    # for any n-gram that does appear in the corpus).
    min_log_freq = log(0.01 / k)
    def score(plaintext):
        return sum(log_freq.get(ngram, min_log_freq)
                   for ngram in ngrams(n, plaintext))
    return score
```
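A quick way to see the scorer doing its job — this sketch restates the functions above in condensed form so it runs on its own, with a toy corpus standing in for a real model text:

```python
from collections import Counter
from math import log

def ngrams(n, text):
    # All n-character windows of text.
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def plaintext_score_function(n, corpus):
    k = len(corpus) - n + 1
    counts = Counter(ngrams(n, corpus))
    log_freq = {g: log(c / k) for g, c in counts.items()}
    min_log_freq = log(0.01 / k)  # penalty for unseen n-grams
    def score(plaintext):
        return sum(log_freq.get(g, min_log_freq) for g in ngrams(n, plaintext))
    return score

score = plaintext_score_function(2, 'THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG')
# Bigrams seen in the corpus score much higher than unseen ones.
assert score('THE DOG') > score('QXZJVKW')
```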
Solution
The first code block, with `ngram_count()`, `ngrams()`, and `plaintext_score_function()`, seems straightforward enough. The docstrings and comments were all helpful.

Since we are dealing with logarithms,
`log(count / k)` could be `log(count) - log(k)`, and `log(k)` can be precomputed. Subtraction should be faster than division, and it also happens to take care of your Python 2/3 compatibility problem.
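The identity that refactoring relies on is easy to check directly (toy numbers, not taken from the program):

```python
from math import log, isclose

k = 1000        # number of n-grams in the corpus
count = 37      # occurrences of one particular n-gram
log_k = log(k)  # computed once, reused for every n-gram

# log(count / k) == log(count) - log(k), and the right-hand side
# avoids division entirely, so Python 2's integer / is never an issue.
assert isclose(log(count / k), log(count) - log_k)
```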
`min_log_freq` is, for all practical purposes, `PENALTY = -9999999`.
- Have you tried multiplying the frequencies rather than adding the logs of the frequencies? Does it make a difference in answer quality or performance?
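On that question: multiplying raw frequencies is mathematically equivalent, but it underflows for texts of any length, which is the usual reason to sum logs instead — a quick illustration with made-up numbers:

```python
from math import log

freq = 1e-4       # a plausible frequency for one n-gram
product = 1.0
log_sum = 0.0
for _ in range(100):          # 100 n-grams in the candidate text
    product *= freq
    log_sum += log(freq)

assert product == 0.0         # underflowed: candidates become indistinguishable
assert log_sum < -900         # the log form keeps full precision
```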
In `decipher()` and `find_cipher()`, I'm not fond of the terminology. What you call the "cipher" is what I would call the "key".

The `neighbours()` closure makes this line a bit mysterious:

```
best_score, best_cipher = choice(nlargest(choices, neighbours()))
```

I would prefer a more explicit iteration feedback loop:
```
score, key = choice(nlargest(choices, neighbours(score, key)))
```

But the `(score, key)` tuple appears frequently together, and you've repeated `score_fun(decipher(…, …))`, so I think there is room for improvement.
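For what that loop might look like in full, here is a generic sketch of the suggested shape — the `(score, key)` pair threaded explicitly through each round — applied to a toy objective rather than the cipher search (`hill_climb` and the objective are illustrative, not the original program):

```python
from heapq import nlargest
from random import choice, seed

def hill_climb(score_fun, start, neighbours, choices=3, rounds=200):
    # The current (score, key) pair is explicit state, updated each round.
    score, key = score_fun(start), start
    for _ in range(rounds):
        candidates = [(score_fun(k), k) for k in neighbours(key)]
        best_score, best_key = choice(nlargest(choices, candidates))
        if best_score > score:          # only move uphill
            score, key = best_score, best_key
    return score, key

# Toy objective: maximise -(x - 7)**2 over the integers, stepping by +/-1.
seed(0)
score, x = hill_climb(lambda v: -(v - 7) ** 2, 0, lambda v: [v - 1, v + 1])
assert x == 7 and score == 0
```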
Context
StackExchange Code Review Q#93472, answer score: 5