patternpythonMinor
Markov chain text generation in Python
textgenerationpythonmarkovchain
Problem
I'm willing to listen to any advice you have for improving it.
from collections import defaultdict
import nltk
import numpy as np
seq, freq, sample = [t for ts in [[None] + nltk.casual_tokenize(line) + [None] for line in open("file.txt").readlines()] for t in ts], defaultdict(lambda: defaultdict(int)), lambda curr: np.random.choice(list(freq[curr].keys()), p=[x/sum(freq[curr].values()) for x in freq[curr].values()])
for a, b in zip(seq[:-1], seq[1:]): freq[a][b] += 0 if a is None and b is None else 1
curr = sample(None)
while curr != None: _, curr = print(curr, end=" "), sample(curr)
Solution
Holy incomprehensible comprehensions!
There is no good justification for defining seq, freq, and sample all as one statement. Each of them is already complex enough on its own.

There is also no good reason to write your last while loop as a one-liner.

With just those changes and some indentation…
from collections import defaultdict
import nltk
import numpy as np

seq = [
    t for ts in [
        [None] + nltk.casual_tokenize(line) + [None]
        for line in open("file.txt").readlines()
    ]
    for t in ts
]
freq = defaultdict(lambda: defaultdict(int))
sample = lambda curr: np.random.choice(
    list(freq[curr].keys()),
    p=[x / sum(freq[curr].values()) for x in freq[curr].values()]
)
for a, b in zip(seq[:-1], seq[1:]):
    freq[a][b] += 0 if a is None and b is None else 1
curr = sample(None)
while curr != None:
    print(curr, end=" ")
    curr = sample(curr)

… we might start to hope to understand the code.
What's [t for ts in [[…]] for t in ts]? It's just flattening a list of lists. I'd write that as list(itertools.chain(*[[…]])).

import itertools

with open("file.txt") as f:
    seq = list(itertools.chain(*[
        [None] + nltk.casual_tokenize(line) + [None]
        for line in f
    ]))

However, once you flatten it, you get …, None, None, … as placeholders for the line breaks. You later have to do extra work to filter out the pairs where a is None and b is None when constructing the freq matrix. So why even bother creating seq as a flattened list in the first place, if all you want is a frequency matrix?

from collections import Counter, defaultdict

def frequency_table(file):
    freq = defaultdict(Counter)
    for line in file:
        tokens = nltk.casual_tokenize(line)
        if not tokens:
            # Skip blank lines so that None is never counted as following None.
            continue
        # Pair each token with its successor; None marks the start and end of a line.
        for a, b in zip([None] + tokens, tokens + [None]):
            freq[a][b] += 1
    return freq

sample would be clearer if written as a function. I'd make it take a freq parameter instead of using freq as a global.

def sample(freq, curr):
    return np.random.choice(
        list(freq[curr].keys()),
        p=[x / sum(freq[curr].values()) for x in freq[curr].values()]
    )

I don't know if performance is an issue here, but you might be better off normalizing the probabilities once instead of recalculating [x / sum(freq[curr].values()) for x in freq[curr].values()] with each call to sample().

It would be nice to convert your final loop into a generator:
def markov_chain(freq, word=None):
    while True:
        word = sample(freq, word)
        if word is None:
            break
        yield word

With those three function definitions in place, the rest of the code looks pretty straightforward:
with open("file.txt") as f:
    freq = frequency_table(f)
for word in markov_chain(freq):
    print(word, end=" ")
Code Snippets
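The solution suggests normalizing the probabilities once rather than on every call to sample(). Here is a minimal sketch of that idea, not part of the original answer. The normalize helper and the tiny hand-built freq table are hypothetical, and the standard library's random.choices stands in for np.random.choice so the sketch has no third-party dependencies.

```python
import random
from collections import Counter, defaultdict


def normalize(freq):
    """Convert successor counts into (successors, probabilities) pairs, once."""
    table = {}
    for curr, counts in freq.items():
        total = sum(counts.values())
        successors = list(counts)
        probs = [counts[s] / total for s in successors]
        table[curr] = (successors, probs)
    return table


def sample(table, curr):
    """Draw a successor of curr using the precomputed probabilities."""
    successors, probs = table[curr]
    return random.choices(successors, weights=probs)[0]


# Hypothetical data standing in for the output of frequency_table();
# None marks the start and end of a line.
freq = defaultdict(Counter)
for a, b in [(None, "a"), ("a", "b"), ("a", "b"), ("a", None), ("b", None)]:
    freq[a][b] += 1

table = normalize(freq)
```

Assuming the rest of the code stays as in the solution, table could be passed to markov_chain in place of freq, since this sample takes the same two positional arguments. Whether the precomputation pays off depends on how many words you generate per run.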
from collections import defaultdict
import nltk
import numpy as np

seq = [
    t for ts in [
        [None] + nltk.casual_tokenize(line) + [None]
        for line in open("file.txt").readlines()
    ]
    for t in ts
]
freq = defaultdict(lambda: defaultdict(int))
sample = lambda curr: np.random.choice(
    list(freq[curr].keys()),
    p=[x / sum(freq[curr].values()) for x in freq[curr].values()]
)
for a, b in zip(seq[:-1], seq[1:]):
    freq[a][b] += 0 if a is None and b is None else 1
curr = sample(None)
while curr != None:
    print(curr, end=" ")
    curr = sample(curr)

import itertools

with open("file.txt") as f:
    seq = list(itertools.chain(*[
        [None] + nltk.casual_tokenize(line) + [None]
        for line in f
    ]))

from collections import Counter, defaultdict

def frequency_table(file):
    freq = defaultdict(Counter)
    for line in file:
        tokens = nltk.casual_tokenize(line)
        if not tokens:
            # Skip blank lines so that None is never counted as following None.
            continue
        # Pair each token with its successor; None marks the start and end of a line.
        for a, b in zip([None] + tokens, tokens + [None]):
            freq[a][b] += 1
    return freq

def sample(freq, curr):
    return np.random.choice(
        list(freq[curr].keys()),
        p=[x / sum(freq[curr].values()) for x in freq[curr].values()]
    )

def markov_chain(freq, word=None):
    while True:
        word = sample(freq, word)
        if word is None:
            break
        yield word
Context
StackExchange Code Review Q#120116, answer score: 6