patternpythonMinor
Vanilla Markov chain in 18 lines
Viewed 0 times
vanillamarkovlineschain
Problem
I am looking for advice on how I can improve this code. The objective is to make a vanilla Markov Chain text generator in the most concise way. I'm looking for advice on style, naming, 'elegance', and correctness.
import random, re
with open("corpus.txt") as file:
text = file.read().replace('\n', ' newline ')
text = re.sub(r'([\,\.\!\?\;\:])',r' \1 ',text)
text = re.sub(r'[^a-zA-Z0-9|\,|\.|\!|\?|\;|\:]',r' ',text)
text = [word.lower().strip() for word in text.split(' ') if word.strip() != '']
markov_chain = {}
for current_word, next_word in zip(text, text[1:]):
markov_chain.setdefault(current_word, []).append(next_word)
current_word = random.choice(list(markov_chain.keys()))
sentence = []
for i in range(0,15):
sentence.append(current_word)
if current_word in markov_chain and len(markov_chain[current_word]) > 0:
current_word = random.choice(markov_chain[current_word])
else:
break
print(' '.join(sentence).replace('newline', '\n').replace('\n ', '\n'))Solution
At a high level, I think you're overusing regexes. You can split stuff into words earlier and then clean them, and you don't need to treat newlines as an explicit token. If you do want to split them as an explicit token called
I would take your input text and split it into words first, but do it a line at a time so I don't need to read the whole file into memory. Also, don't use
The
After that you can strip punctuation from your words and lowercase it.
I'd probably wrap that all up in a generator that looks something like this.
More about
The markov chain construction is good.
This line selects a word uniformly at random regardless of how many times it appears in the text rather than selecting a word with probability equal to its frequency. I'm not sure what the requirements are, but it is something that jumped out at me.
Some miscellaneous comments:
You don't need to escape everything in your first character class.
In this line
newline to protect it from future regexes, you can handle that when you're constructing the markov chain by having a dedicated class, or just a string \n.class NewlineToken:
passI would take your input text and split it into words first, but do it a line at a time so I don't need to read the whole file into memory. Also, don't use
file as the name of a temporary file since it shadows a builtin.The
split method when called with no arguments splits on whitespace.with open("corpus.txt") as fh:
for line in fh:
words = line.split()
# do stuff with wordsAfter that you can strip punctuation from your words and lowercase it.
word.lower().translate(None, string.punctuation)I'd probably wrap that all up in a generator that looks something like this.
def stripped_words(path):
with open(path) as fh:
for line in fh:
words = line.split()
for word in words:
yield word.lower().translate(
None,
string.punctuation
)More about
string.punctuation here https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-pythonThe markov chain construction is good.
zip automatically truncates the longer iterator and .setdefault is really useful. I actually didn't know about that until reading this. You'd have to modify it a little bit if you're iterating over words from your file rather than working with an array.This line selects a word uniformly at random regardless of how many times it appears in the text rather than selecting a word with probability equal to its frequency. I'm not sure what the requirements are, but it is something that jumped out at me.
# more frequent words aren't more likely to be chosen here
current_word = random.choice(list(markov_chain.keys()))Some miscellaneous comments:
You don't need to escape everything in your first character class.
text = re.sub(r'([,.!?;:])',r' \1 ', text)In this line
text = re.sub(r'[^a-zA-Z0-9|\,|\.|\!|\?|\;|\:]',r' ',text)| is part of your character class, I'm not sure whether you intended it because it isn't escaped like your other non-alphanumeric characters are.Code Snippets
class NewlineToken:
passwith open("corpus.txt") as fh:
for line in fh:
words = line.split()
# do stuff with wordsword.lower().translate(None, string.punctuation)def stripped_words(path):
with open(path) as fh:
for line in fh:
words = line.split()
for word in words:
yield word.lower().translate(
None,
string.punctuation
)# more frequent words aren't more likely to be chosen here
current_word = random.choice(list(markov_chain.keys()))Context
StackExchange Code Review Q#150649, answer score: 4
Revisions (0)
No revisions yet.