Markov Chain in Python
Problem
For learning purposes, I'm trying to implement a Markov Chain from scratch in Python.
The goal is, provided a file with a list of words and some sequence, to predict the next letter according to the probability computed from the list of words.
My questions are not related to the Python syntax (for instance I know that I can do better than defaultdict with Counter at some places but I don't care about that), rather to the Markov Chain algorithm in itself, and in particular:
- is it a correct/efficient way to implement a MC?
- if not, how could this algorithm be more efficient?
- is the method of inserting state_length - 1 blanks before each word a good practice to allow searching sequences smaller than state_length?
- is the way of representing the terminal state a good practice?
The file mdp_sequences.txt goes like this:
HELLO
CELLO
HELL
HELLO
HELL
SHELL
YELLOW
HELLO
YELLOW
...
Parsing
The first part of the program reads the file and computes a count of all words:
from collections import Counter, defaultdict
from pprint import pprint
states = []
state_length = 4
# read and parse csv file into a counted dict of sequences
sequences = defaultdict(int)
with open('mdp_sequences.txt') as f:
    for sequence in f:
        seq = " " * (state_length - 1) + sequence.rstrip()
        sequences[seq] += 1
The resulting sequences dict is just a count of all words found in mdp_sequences.txt, with the particularity that it has state_length - 1 spaces inserted before each word:
sequences = {
    '   HELLO': 342,
    '   CELLO': 117,
    '   HELL': 200,
    '   SHELL': 120,
    '   YELLOW': 250,
    ...
}
Enumerating states
Next we enumerate each possible state:
for seq, count in sequences.items():
    for i in range(len(seq) - state_length + 1):
        for j in range(int(count)):
            states.append(seq[i:][:state_length])

counted_states = dict(Counter(states))
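As an illustration of what the inner loops enumerate, here is a sketch of the sliding window for one padded word, using a list comprehension equivalent to the question's seq[i:][:state_length] slices:

```python
state_length = 4
seq = "   HELL"  # 'HELL' padded with state_length - 1 = 3 spaces

# Each window is one state of the chain.
windows = [seq[i:i + state_length] for i in range(len(seq) - state_length + 1)]
# windows is ['   H', '  HE', ' HEL', 'HELL']
```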
At this point, the counted_states dict maps each state_length-character window of each padded word to its total count across all words.

Solution
- Code is easier to understand, test, and reuse, if you divide it into functions with well-documented inputs and outputs, for example you might choose functions build_markov_chain and apply_markov_chain.
- Instead of a defaultdict(int), you could just use a Counter.
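For illustration, here is a sketch of the question's counting step using a Counter (the word list here is made up for the example, standing in for the file's lines):

```python
from collections import Counter

state_length = 4
words = ["HELLO", "CELLO", "HELL", "HELLO"]  # stand-in for the file's lines

# Counter behaves like defaultdict(int) for += on missing keys,
# and additionally provides most_common() and counter arithmetic.
sequences = Counter()
for word in words:
    sequences[" " * (state_length - 1) + word] += 1
```

sequences['   HELLO'] is then 2, matching the shape of the question's dict.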
- There's no need to pad the words with spaces at the left: with a few tweaks to the code you can use 'H' instead of '   H' and so on.
- Representing the terminal state by a special string is risky: what if this string occurred in your data? It would be better to use a distinguished Python object, for example None or a unique object:
      EXIT = object()  # unique object representing the end of a word
- The Markov chain representation is redundant: when 'ABCD' is followed by 'BCDE', you know that the three letters BCD must be the same. So all you need to remember in the chain is the single letter 'E'.
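A minimal illustration of the two representations (these dicts are invented for this example, not taken from the original post):

```python
# Redundant: the successor state 'BCDE' repeats the overlapping letters BCD.
redundant = {'ABCD': {'BCDE': 1.0}}

# Compact: each state maps to just the newly added letter.
compact = {'ABCD': {'E': 1.0}}

# The full successor state can always be rebuilt from the state and the letter.
state, letter = 'ABCD', 'E'
successor = (state + letter)[1:]  # drop the oldest letter, append the new one
```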
- It's not necessary to convert numbers to float before dividing:
      probability = float(wcount) / float(scount)
  Instead, write:
      probability = wcount / scount
  (If you are still stuck on Python 2, then use from __future__ import division.)
- The algorithm proceeds in three steps: (i) read the file and build the sequences dictionary; (ii) use the sequences dictionary to build the counted_states dictionary; and (iii) use the counted_states and sequences dictionaries to build the Markov chain. But it would be simpler to build the chain in two steps: (i) count the successors to each state as you go through the input; and (ii) convert the counts to probabilities. Like this:

from collections import Counter, defaultdict
def build_markov_chain(filename='mdp_sequences.txt', n=4):
    """Read words from a file and build a Markov chain.

    Arguments:
    filename -- Name of file containing words.
    n -- Number of characters in the states.

    Returns: map from n-character states to map from successors (or
    None indicating end-of-word) to probabilities.

    """
    # Map from state to map from successor to count.
    counts = defaultdict(Counter)
    with open(filename) as f:
        for line in f:
            line = line.rstrip()
            for i in range(1, len(line)):
                counts[line[max(0, i - n):i]][line[i]] += 1
            counts[line[max(0, len(line) - n):]][None] += 1
    # Convert counts to probabilities.
    probabilities = defaultdict(lambda: defaultdict(float))
    for state, successors in counts.items():
        total = sum(successors.values())
        probabilities[state] = {s: c / total for s, c in successors.items()}
    return probabilities

And so:
>>> from pprint import pprint
>>> pprint(build_markov_chain())
defaultdict(<function build_markov_chain.<locals>.<lambda> at 0x110697950>,
{'C': {'E': 1.0},
'CE': {'L': 1.0},
'CEL': {'L': 1.0},
'CELL': {'O': 1.0},
'ELLO': {None: 0.6666666666666666, 'W': 0.3333333333333333},
'H': {'E': 1.0},
'HE': {'L': 1.0},
'HEL': {'L': 1.0},
'HELL': {None: 0.5, 'O': 0.5},
'LLOW': {None: 1.0},
'S': {'H': 1.0},
'SH': {'E': 1.0},
'SHE': {'L': 1.0},
'SHEL': {'L': 1.0},
'Y': {'E': 1.0},
'YE': {'L': 1.0},
'YEL': {'L': 1.0},
'YELL': {'O': 1.0}})
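The answer names apply_markov_chain but never shows it; a possible sketch (this function body is my assumption, not part of the original answer) looks up the last n characters of a seed in the chain:

```python
def apply_markov_chain(chain, seed, n=4):
    """Return the successor probabilities for seed's current state.

    chain -- map from states to maps from successors (None meaning
             end-of-word) to probabilities, as built above.
    seed -- the sequence typed so far.
    Returns {} if the state has never been seen.
    """
    # States shorter than n are stored unpadded (e.g. 'H'), so use the
    # whole seed when it is shorter than n characters.
    state = seed[-n:] if len(seed) >= n else seed
    return dict(chain.get(state, {}))

# Example with a hand-built chain in the same shape as build_markov_chain's:
chain = {'H': {'E': 1.0}, 'HELL': {None: 0.5, 'O': 0.5}}
```

apply_markov_chain(chain, 'SHELL') returns {None: 0.5, 'O': 0.5}, since the last four letters of 'SHELL' are 'HELL'.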
Context
StackExchange Code Review Q#150291, answer score: 4