Decrypting a substitution cipher using n-gram frequency analysis
Problem
This is a solution for the June 2015 Community Challenge: a program that decrypts a monoalphabetic substitution cipher. It's written in Python 3, but should be portable to Python 2 if you use `from __future__ import division`.
Please review! Is the technique clear and the code easy to follow? Does the code correctly implement the algorithm? Can you recommend better names for functions and parameters? Would a different search technique (for example, simulated annealing, or beam search) be better?
The program has two main parts: a scoring function and a search function.
The scoring function evaluates the quality of a candidate decryption by computing the log-likelihood of the frequency of n-grams in the candidate, compared to a model corpus.
```
from collections import Counter
from math import log

def ngram_count(n, text):
    """Return the number of n-grams in text."""
    return len(text) - n + 1

def ngrams(n, text):
    """Generate the n-grams in text.

    >>> list(ngrams(3, 'HELLOWORLD'))
    ['HEL', 'ELL', 'LLO', 'LOW', 'OWO', 'WOR', 'ORL', 'RLD']

    """
    for i in range(ngram_count(n, text)):
        yield text[i:i+n]

def plaintext_score_function(n, corpus):
    """Return a function that scores a plaintext based on the
    log-likelihood of the occurrence of n-grams compared to those
    found in corpus.

    """
    # Number of n-grams in the corpus.
    k = ngram_count(n, corpus)
    # Count of occurrences of each n-gram in the corpus.
    counts = Counter(ngrams(n, corpus))
    # Map from n-gram to the logarithm of its frequency in the corpus.
    log_freq = {ngram: log(count / k) for ngram, count in counts.items()}
    # Log-frequency to use for n-grams that don't appear in the corpus
    # (an arbitrary value that's much smaller than the log-frequency
    # for any n-gram that does appear in the corpus).
    min_log_freq = log(0.01 / k)
    def score(plaintext):
        return sum(log_freq.get(ngram, min_log_freq)
                   for ngram in ngrams(n, plaintext))
    return score
```
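A quick way to see the scorer doing its job — this sketch restates the functions above in condensed form so it runs on its own, with a toy corpus standing in for a real model text:

```python
from collections import Counter
from math import log

def ngrams(n, text):
    # All n-character windows of text.
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def plaintext_score_function(n, corpus):
    k = len(corpus) - n + 1
    counts = Counter(ngrams(n, corpus))
    log_freq = {g: log(c / k) for g, c in counts.items()}
    min_log_freq = log(0.01 / k)  # penalty for unseen n-grams
    def score(plaintext):
        return sum(log_freq.get(g, min_log_freq) for g in ngrams(n, plaintext))
    return score

score = plaintext_score_function(2, 'THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG')
# Bigrams seen in the corpus score much higher than unseen ones.
assert score('THE DOG') > score('QXZJVKW')
```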
Solution
The first code block, with `ngram_count()`, `ngrams()`, and `plaintext_score_function()`, seems straightforward enough. The docstrings and comments were all helpful.

Since we are dealing with logarithms,
`log(count / k)` could be `log(count) - log(k)`, and `log(k)` can be precomputed. Subtraction should be faster than division, and it also happens to take care of your Python 2/3 compatibility problem.
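The identity that refactoring relies on is easy to check directly (toy numbers, not taken from the program):

```python
from math import log, isclose

k = 1000        # number of n-grams in the corpus
count = 37      # occurrences of one particular n-gram
log_k = log(k)  # computed once, reused for every n-gram

# log(count / k) == log(count) - log(k), and the right-hand side
# avoids division entirely, so Python 2's integer / is never an issue.
assert isclose(log(count / k), log(count) - log_k)
```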
`min_log_freq` is, for all practical purposes, `PENALTY = -9999999`.
- Have you tried multiplying the frequencies rather than adding the logs of the frequencies? Does it make a difference in answer quality or performance?
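On that question: multiplying raw frequencies is mathematically equivalent, but it underflows for texts of any length, which is the usual reason to sum logs instead — a quick illustration with made-up numbers:

```python
from math import log

freq = 1e-4       # a plausible frequency for one n-gram
product = 1.0
log_sum = 0.0
for _ in range(100):          # 100 n-grams in the candidate text
    product *= freq
    log_sum += log(freq)

assert product == 0.0         # underflowed: candidates become indistinguishable
assert log_sum < -900         # the log form keeps full precision
```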
In `decipher()` and `find_cipher()`, I'm not fond of the terminology. What you call the "cipher" is what I would call the "key".

The `neighbours()` closure makes this line a bit mysterious:

```
best_score, best_cipher = choice(nlargest(choices, neighbours()))
```

I would prefer a more explicit iteration feedback loop:
```
score, key = choice(nlargest(choices, neighbours(score, key)))
```

But the `(score, key)` tuple appears frequently together, and you've repeated `score_fun(decipher(…, …))`, so I think there is room for improvement.
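For what that loop might look like in full, here is a generic sketch of the suggested shape — the `(score, key)` pair threaded explicitly through each round — applied to a toy objective rather than the cipher search (`hill_climb` and the objective are illustrative, not the original program):

```python
from heapq import nlargest
from random import choice, seed

def hill_climb(score_fun, start, neighbours, choices=3, rounds=200):
    # The current (score, key) pair is explicit state, updated each round.
    score, key = score_fun(start), start
    for _ in range(rounds):
        candidates = [(score_fun(k), k) for k in neighbours(key)]
        best_score, best_key = choice(nlargest(choices, candidates))
        if best_score > score:          # only move uphill
            score, key = best_score, best_key
    return score, key

# Toy objective: maximise -(x - 7)**2 over the integers, stepping by +/-1.
seed(0)
score, x = hill_climb(lambda v: -(v - 7) ** 2, 0, lambda v: [v - 1, v + 1])
assert x == 7 and score == 0
```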
Context
StackExchange Code Review Q#93472, answer score: 5