Markov chain text generation in Python

Submitted by: @import:stackexchange-codereview

Problem

Below is my Markov chain text generator in Python. I'm willing to listen to any advice you have for improving it.

from collections import defaultdict

import nltk
import numpy as np

seq, freq, sample = [t for ts in [[None] + nltk.casual_tokenize(line) + [None] for line in open("file.txt").readlines()] for t in ts], defaultdict(lambda: defaultdict(int)), lambda curr: np.random.choice(list(freq[curr].keys()), p=[x/sum(freq[curr].values()) for x in freq[curr].values()])
for a, b in zip(seq[:-1], seq[1:]): freq[a][b] += 0 if a is None and b is None else 1
curr = sample(None)
while curr != None: _, curr = print(curr, end=" "), sample(curr)

Solution

Holy incomprehensible comprehensions!

There is no good justification for defining seq, freq, and sample all as one statement. Each of them is already complex enough on its own.

There is also no good reason to write your last while loop as a one-liner.

With just those changes and some indentation…

from collections import defaultdict

import nltk
import numpy as np

seq = [
    t for ts in [
        [None] + nltk.casual_tokenize(line) + [None]
        for line in open("file.txt").readlines()
    ]
    for t in ts
]
freq = defaultdict(lambda: defaultdict(int))
sample = lambda curr: np.random.choice(
    list(freq[curr].keys()),
    p=[x / sum(freq[curr].values()) for x in freq[curr].values()]
)
for a, b in zip(seq[:-1], seq[1:]):
    freq[a][b] += 0 if a is None and b is None else 1
curr = sample(None)
while curr != None:
    print(curr, end=" ")
    curr = sample(curr)


… we might start to hope to understand the code.

What's [t for ts in [[…]] for t in ts]? It's just flattening a list of lists. I'd write that as list(itertools.chain(*[[…]])).

import itertools

with open("file.txt") as f:
    seq = list(itertools.chain(*[
        [None] + nltk.casual_tokenize(line) + [None]
        for line in f
    ]))
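
As a side note, itertools.chain.from_iterable does the same flattening without unpacking the outer list into arguments. The following is a minimal sketch, not the original code: str.split stands in for nltk.casual_tokenize (an assumption made only so the example runs without NLTK), and the two input lines are hypothetical.

```python
import itertools

# Hypothetical input lines; str.split stands in for nltk.casual_tokenize.
lines = ["the cat sat", "the dog ran"]
per_line = [[None] + line.split() + [None] for line in lines]

# Both spellings flatten exactly one level of nesting.
flat_comprehension = [t for ts in per_line for t in ts]
flat_chain = list(itertools.chain.from_iterable(per_line))

assert flat_comprehension == flat_chain
print(flat_chain)
# [None, 'the', 'cat', 'sat', None, None, 'the', 'dog', 'ran', None]
```

The adjacent None, None pair in the middle of the output is the line-break placeholder.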


However, once you flatten it, you get …, None, None, … as placeholders for the linebreaks. You later have to do extra work to skip the pairs where a is None and b is None when constructing the freq matrix. So why even bother creating seq as a flattened list in the first place, if all you want is a frequency matrix?

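from collections import Counter, defaultdict

import nltk

def frequency_table(file):
    # freq[a][b] counts how often token b follows token a.
    # None marks both the start and the end of a line.
    freq = defaultdict(Counter)
    for line in file:
        tokens = nltk.casual_tokenize(line)
        if not tokens:
            continue  # blank line: don't count a None -> None transition
        for a, b in zip([None] + tokens, tokens + [None]):
            freq[a][b] += 1
    return freq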
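
To check what such a table should contain, here is a small worked example. It is a sketch under stated assumptions: str.split stands in for nltk.casual_tokenize, the tokenize parameter is hypothetical, and pairs are built as (previous token, next token) so that freq[a][b] counts how often b follows a, which is the orientation sample() relies on.

```python
from collections import Counter, defaultdict

def frequency_table(lines, tokenize=str.split):
    """Count transitions: freq[a][b] is how often token b follows token a."""
    freq = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        if not tokens:
            continue  # blank line: nothing to count
        # None marks both the start and the end of a line.
        for a, b in zip([None] + tokens, tokens + [None]):
            freq[a][b] += 1
    return freq

freq = frequency_table(["the cat sat", "the dog sat", ""])
print(dict(freq[None]))   # {'the': 2}
print(dict(freq["the"]))  # {'cat': 1, 'dog': 1}
print(dict(freq["sat"]))  # {None: 2}
```

The row for None lists the tokens that can start a line, and a None key inside a row means the line can end after that token.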


sample would be clearer if written as a function. I'd make it take a freq parameter instead of using freq as a global.

def sample(freq, curr):
    return np.random.choice(
        list(freq[curr].keys()),
        p=[x / sum(freq[curr].values()) for x in freq[curr].values()]
    )


I don't know if performance is an issue here, but you might be better off normalizing the probabilities once, when the table is built, instead of recalculating [x / sum(freq[curr].values()) for x in freq[curr].values()] with each call to sample().
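
One way to do that normalization is sketched below. The names normalize and table, the rng parameter, and the toy counts are hypothetical, and random.choices from the standard library is swapped in for np.random.choice so the example runs without NumPy. The precomputed (tokens, probabilities) pairs could instead be passed to np.random.choice(tokens, p=probs).

```python
import random
from collections import Counter

def normalize(freq):
    """Convert each row of counts into parallel lists (tokens, probabilities)."""
    table = {}
    for curr, counts in freq.items():
        total = sum(counts.values())
        tokens = list(counts)
        table[curr] = (tokens, [counts[t] / total for t in tokens])
    return table

def sample(table, curr, rng=random):
    """Pick a successor of curr using the precomputed probabilities."""
    tokens, probs = table[curr]
    return rng.choices(tokens, weights=probs)[0]

# Hypothetical counts in the same shape as the freq table.
freq = {
    None: Counter({"the": 2}),
    "the": Counter({"cat": 1, "dog": 3}),
    "cat": Counter({None: 1}),
    "dog": Counter({None: 3}),
}
table = normalize(freq)
print(table["the"])  # (['cat', 'dog'], [0.25, 0.75])
```

The division now happens once per row rather than on every draw. Whether that matters depends on how many tokens are generated.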

It would be nice to convert your final loop into a generator:

def markov_chain(freq, word=None):
    while True:
        word = sample(freq, word)
        if word is None:
            break
        yield word
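
One benefit of the generator form is that the caller decides how much output to take. The sketch below caps the output with itertools.islice. It uses a simplified stand-in for sample(): a hypothetical successors dict of plain lists with a uniform rng.choice, rather than the weighted draw above.

```python
import itertools
import random

def markov_chain(successors, word=None, rng=random):
    """Yield tokens until the end marker None is drawn."""
    while True:
        word = rng.choice(successors[word])
        if word is None:
            break
        yield word

# Hypothetical toy chain: each token maps to a list of possible successors.
successors = {None: ["a"], "a": ["b"], "b": ["a", None]}

# Take at most 10 tokens, however long the random walk would otherwise run.
words = list(itertools.islice(markov_chain(successors), 10))
print(" ".join(words))
```

Because the chain can contain cycles, a cap like this guards against very long runs without changing the generator itself.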


With those three function definitions in place, the rest of the code looks pretty straightforward:

with open("file.txt") as f:
    freq = frequency_table(f)
for word in markov_chain(freq):
    print(word, end=" ")

Context

StackExchange Code Review Q#120116, answer score: 6
