HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Vanilla Markov chain in 18 lines

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
vanillamarkovlineschain

Problem

I am looking for advice on how I can improve this code. The objective is to make a vanilla Markov Chain text generator in the most concise way. I'm looking for advice on style, naming, 'elegance', and correctness.

import random, re

with open("corpus.txt") as file:
    text = file.read().replace('\n', ' newline ')

text = re.sub(r'([\,\.\!\?\;\:])',r' \1 ',text)
text = re.sub(r'[^a-zA-Z0-9|\,|\.|\!|\?|\;|\:]',r' ',text)

text = [word.lower().strip() for word in text.split(' ') if word.strip() != '']

markov_chain = {}
for current_word, next_word in zip(text, text[1:]):
    markov_chain.setdefault(current_word, []).append(next_word)

current_word = random.choice(list(markov_chain.keys()))
sentence = []

for i in range(0,15):
    sentence.append(current_word)
    if current_word in markov_chain and len(markov_chain[current_word]) > 0:
        current_word = random.choice(markov_chain[current_word])
    else:
        break

print(' '.join(sentence).replace('newline', '\n').replace('\n ', '\n'))

Solution

At a high level, I think you're overusing regexes. You can split stuff into words earlier and then clean them, and you don't need to treat newlines as an explicit token. If you do want to split them as an explicit token called newline to protect it from future regexes, you can handle that when you're constructing the markov chain by having a dedicated class, or just a string \n.

class NewlineToken:
    pass


I would take your input text and split it into words first, but do it a line at a time so I don't need to read the whole file into memory. Also, don't use file as the name of a temporary file since it shadows a builtin.

The split method when called with no arguments splits on whitespace.

with open("corpus.txt") as fh:
    for line in fh:
        words = line.split()
        # do stuff with words


After that you can strip punctuation from your words and lowercase it.

word.lower().translate(None, string.punctuation)


I'd probably wrap that all up in a generator that looks something like this.

def stripped_words(path):
    with open(path) as fh:
        for line in fh:
            words = line.split()
            for word in words:
                 yield word.lower().translate(
                         None, 
                         string.punctuation
                 )


More about string.punctuation here https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python

The markov chain construction is good. zip automatically truncates the longer iterator and .setdefault is really useful. I actually didn't know about that until reading this. You'd have to modify it a little bit if you're iterating over words from your file rather than working with an array.

This line selects a word uniformly at random regardless of how many times it appears in the text rather than selecting a word with probability equal to its frequency. I'm not sure what the requirements are, but it is something that jumped out at me.

# more frequent words aren't more likely to be chosen here
current_word = random.choice(list(markov_chain.keys()))


Some miscellaneous comments:

You don't need to escape everything in your first character class.

text = re.sub(r'([,.!?;:])',r' \1 ', text)


In this line

text = re.sub(r'[^a-zA-Z0-9|\,|\.|\!|\?|\;|\:]',r' ',text)


| is part of your character class, I'm not sure whether you intended it because it isn't escaped like your other non-alphanumeric characters are.

Code Snippets

class NewlineToken:
    pass
with open("corpus.txt") as fh:
    for line in fh:
        words = line.split()
        # do stuff with words
word.lower().translate(None, string.punctuation)
def stripped_words(path):
    with open(path) as fh:
        for line in fh:
            words = line.split()
            for word in words:
                 yield word.lower().translate(
                         None, 
                         string.punctuation
                 )
# more frequent words aren't more likely to be chosen here
current_word = random.choice(list(markov_chain.keys()))

Context

StackExchange Code Review Q#150649, answer score: 4

Revisions (0)

No revisions yet.