HiveBrain v1.2.0

Markov Chain in Python

Submitted by: @import:stackexchange-codereview
Tags: python, markov-chain

Problem

For learning purposes, I'm trying to implement a Markov Chain from scratch in Python.

The goal is, given a file containing a list of words and some sequence, to predict the next letter according to the probabilities computed from the list of words.

My questions are not about Python syntax (for instance, I know that I could do better than defaultdict with Counter in some places, but I don't care about that here); rather, they are about the Markov chain algorithm itself, and in particular:

  • is it a correct/efficient way to implement a Markov chain?

  • if not, how could this algorithm be made more efficient?

  • is the method of inserting state_length - 1 blanks before each word a good practice to allow searching sequences smaller than state_length?

  • is the way of representing the end of a word with a special in-band string a good practice?



The file mdp_sequences.txt goes like this:

HELLO
CELLO
HELL
HELLO
HELL
SHELL
YELLOW
HELLO
YELLOW
...


  1. Parsing



The first part of the program reads the file and computes a count of all words:

from collections import Counter, defaultdict
from pprint import pprint

states = []
state_length = 4

# read and parse csv file into a counted dict of sequences
sequences = defaultdict(int)
with open('mdp_sequences.txt') as f:
    for sequence in f:
        seq = " " * (state_length - 1) + sequence.rstrip()
        sequences[seq] += 1


The resulting sequences dict is just a count of all the words found in mdp_sequences.txt, with the particularity that it has state_length - 1 spaces inserted before each word:

sequences = {
    '   HELLO': 342,
    '   CELLO': 117,
    '   HELL': 200,
    '   SHELL': 120,
    '   YELLOW': 250,
    ...
}


  2. Enumerating states



Next we enumerate each possible state:

for seq, count in sequences.items():
    for i in range(len(seq) - state_length + 1):
        for j in range(int(count)):
            states.append(seq[i:][:state_length])

counted_states = dict(Counter(states))


At this point, the counted_states dict maps each state_length-character window of every padded word to the number of times that window occurs across all the words.
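As a sketch of what this enumeration step produces, here is the same windowing logic run on a toy sequences dict (the counts are illustrative, not the ones from the real file):

```python
from collections import Counter

state_length = 4

# Toy stand-in for the sequences dict built above (illustrative counts).
sequences = {'   HELLO': 2, '   HELL': 1}

states = []
for seq, count in sequences.items():
    for i in range(len(seq) - state_length + 1):
        for j in range(count):
            # Equivalent to seq[i:][:state_length] in the original code.
            states.append(seq[i:i + state_length])

counted_states = dict(Counter(states))
print(counted_states)
# {'   H': 3, '  HE': 3, ' HEL': 3, 'HELL': 3, 'ELLO': 2}
```

Each window of '   HELLO' is counted twice and each window of '   HELL' once, so shared windows such as 'HELL' accumulate counts from both words.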

Solution

- Code is easier to understand, test, and reuse if you divide it into functions with well-documented inputs and outputs. For example, you might have functions build_markov_chain and apply_markov_chain.

- Instead of a defaultdict(int), you could just use a Counter.
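For instance, the parsing step could be collapsed into a single Counter construction (a sketch; here a list stands in for the lines of mdp_sequences.txt to keep it self-contained):

```python
from collections import Counter

state_length = 4
words = ['HELLO', 'CELLO', 'HELLO']  # stand-in for the lines of the file

# Counter counts each padded word directly; no manual += 1 loop needed.
sequences = Counter(' ' * (state_length - 1) + w.rstrip() for w in words)
# sequences == Counter({'   HELLO': 2, '   CELLO': 1})
```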

- There's no need to pad the words with spaces on the left — with a few tweaks to the code you can use 'H' instead of '   H', and so on.

- Representing the terminal state by a special in-band string is risky — what if that string occurred in your data? It would be better to use a distinguished Python object, for example None or a unique sentinel:

EXIT = object() # unique object representing the end of a word
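A minimal sketch of how such a sentinel might be used when walking the chain (the chain contents and the next_letter helper here are hypothetical, not from the answer):

```python
import random

EXIT = object()  # unique object representing the end of a word

# Hypothetical chain fragment: after 'HELL', the word either ends or continues with 'O'.
chain = {'HELL': {EXIT: 0.5, 'O': 0.5}}

def next_letter(chain, state, rng=random):
    """Pick a weighted-random successor of `state`; return None when the word ends."""
    successors = chain[state]
    letters = list(successors)
    weights = [successors[s] for s in letters]
    choice = rng.choices(letters, weights=weights)[0]
    return None if choice is EXIT else choice
```

Because EXIT is compared by identity (`is`), no string in the data can ever collide with it.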


- The Markov chain representation is redundant — when 'ABCD' is followed by 'BCDE', you know that the three letters BCD must be the same. So all you need to remember in the chain is the single letter 'E'.
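To illustrate (the states here are invented): if transitions store only the new letter, the successor state can always be recomputed by sliding the window, so nothing is lost:

```python
n = 4
# Store only the new letter per transition, not the whole successor state.
chain = {'ABCD': {'E': 1.0}}

state = 'ABCD'
letter = 'E'
# The successor state is recoverable: append the letter, keep the last n characters.
next_state = (state + letter)[-n:]
# next_state == 'BCDE'
```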

- It's not necessary to convert numbers to float before dividing:

probability = float(wcount) / float(scount)


Instead, write:

probability = wcount / scount


(If you are still stuck on Python 2, then use from __future__ import division.)

- The algorithm proceeds in three steps: (i) read the file and build the sequences dictionary; (ii) use the sequences dictionary to build the counted_states dictionary; and (iii) use the counted_states and sequences dictionaries to build the Markov chain. But it would be simpler to build the chain in two steps: (i) count the successors to each state as you go through the input; and (ii) convert the counts to probabilities. Like this:

from collections import Counter, defaultdict

def build_markov_chain(filename='mdp_sequences.txt', n=4):
    """Read words from a file and build a Markov chain.

    Argument:
        filename -- Name of file containing words.
        n -- Number of characters in the states.

    Returns: map from n-character states to map from successors (or
    None indicating end-of-word) to probabilities.

    """
    # Map from state to map from successor to count.
    counts = defaultdict(Counter)
    with open(filename) as f:
        for line in f:
            line = line.rstrip()
            for i in range(1, len(line)):
                counts[line[max(0, i - n):i]][line[i]] += 1
            counts[line[max(0, len(line) - n):]][None] += 1
    # Convert counts to probabilities.
    probabilities = defaultdict(lambda: defaultdict(float))
    for state, successors in counts.items():
        total = sum(successors.values())
        probabilities[state] = {s: c / total for s, c in successors.items()}
    return probabilities


And so:

>>> from pprint import pprint
>>> pprint(build_markov_chain())
defaultdict(<function build_markov_chain.<locals>.<lambda> at 0x110697950>,
            {'C': {'E': 1.0},
             'CE': {'L': 1.0},
             'CEL': {'L': 1.0},
             'CELL': {'O': 1.0},
             'ELLO': {None: 0.6666666666666666, 'W': 0.3333333333333333},
             'H': {'E': 1.0},
             'HE': {'L': 1.0},
             'HEL': {'L': 1.0},
             'HELL': {None: 0.5, 'O': 0.5},
             'LLOW': {None: 1.0},
             'S': {'H': 1.0},
             'SH': {'E': 1.0},
             'SHE': {'L': 1.0},
             'SHEL': {'L': 1.0},
             'Y': {'E': 1.0},
             'YE': {'L': 1.0},
             'YEL': {'L': 1.0},
             'YELL': {'O': 1.0}})
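The answer suggests a companion apply_markov_chain function; one possible sketch (the function name, the seeding from a first letter, and the sampling approach are my assumptions) that generates a word from a chain like the one printed above:

```python
import random

def generate_word(chain, seed, n=4, rng=random):
    """Extend `seed` letter by letter until the chain emits None (end of word)."""
    word = seed
    while True:
        successors = chain[word[-n:]]
        letters = list(successors)
        weights = [successors[s] for s in letters]
        choice = rng.choices(letters, weights=weights)[0]
        if choice is None:
            return word
        word += choice

# Fragment of the chain printed above, restricted to words starting with 'H'.
chain = {
    'H': {'E': 1.0},
    'HE': {'L': 1.0},
    'HEL': {'L': 1.0},
    'HELL': {None: 0.5, 'O': 0.5},
    'ELLO': {None: 0.6666666666666666, 'W': 0.3333333333333333},
    'LLOW': {None: 1.0},
}

print(generate_word(chain, 'H'))  # one of: HELL, HELLO, HELLOW
```

Note that 'HELLOW' is a possible output even though it never appeared in the training data — with 4-character states the chain can legitimately splice the ending of YELLOW onto HELLO.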


Context

StackExchange Code Review Q#150291, answer score: 4
