
A function for parsing words from a string without using whitespace


Problem

I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.

I'm curious about other methods for finding words without using whitespace.

def unique_words(string):
    words = dict()
    p1 = 0 # String slice position 1
    p2 = 1 # String slice position 2
    len_string = len(string)
    while p2 < len_string:
        p2 += 1
        sub1 = string[p1:p2] # A shorter sub
        sub2 = string[p1:(p2 + 1)] # A longer sub
        sub1_count = string.count(sub1) # Counts the frequency of the shorter sub
        sub2_count = string.count(sub2) # Counts the frequency of the longer sub
        if sub2_count * len(sub2) < sub1_count * len(sub1): # True if the frequency of sub1 * its length is greater
            words[sub1] = ('') # Add sub1 as a found word
            p1 = p2
    return words


The above code works when the number of unique words is small but fails when it is large. I've used the website TextMechanic to generate a random string like

'updownleftupdowndownleftupleftrightupdownleftup'


and the above code returns a dictionary exactly as desired:

{'up': '', 'down': '', 'left': '', 'right': ''}


Here's the problem:

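When the number of unique words increases, there comes a point where the number of occurrences of a single letter exceeds the total character count (occurrences times length) of any word in the string, so the comparison starts cutting after single letters instead of whole words.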
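
To make that failure condition concrete, here is a minimal sketch of the arithmetic. The helper name weight and the sample words are made up for illustration and are not taken from the original file; the helper only computes the same "count times length" quantity that the if statement in unique_words compares.

```python
def weight(text, sub):
    """Occurrences of `sub` in `text` multiplied by the length of `sub`."""
    return text.count(sub) * len(sub)


# Made-up sample: five words that all start with "e", each appearing twice,
# joined in alphabetical order with no separators.
sample = "".join(sorted(["ear", "egg", "elf", "emu", "end"] * 2))

print(sample)                 # eareareggeggelfelfemuemuendend
print(weight(sample, "e"))    # 10
print(weight(sample, "ea"))   # 4
print(weight(sample, "ear"))  # 6
```

In this sample the single letter "e" outweighs both its extension "ea" and the whole word "ear", so a slice that begins with a lone "e" would satisfy the comparison and be cut there.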

My current solution uses the algorithm on short slices of the original string, but this involves trial-and-error and has artifacts.

Solution

You should give your variables better names. For example, while you might know what the variables p1 and p2 do, people reading your code don't. Descriptive names also reduce the need for inline comments like # String slice position 2.

Rather than using the dict function to initialize an empty dictionary, you can just type the following: words = {}.

Finally, you don't need the parentheses around the '' in words[sub1] = (''). It can be changed to the following: words[sub1] = ''.
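
Putting the three suggestions together, the function might look something like the sketch below. The new variable names (text, word_start, word_end, candidate, extended) are only one possible choice and are not from the original post; the splitting logic itself is unchanged, so this version has the same limitations described in the question.

```python
def unique_words(text):
    """Split `text` into repeated substrings using the original heuristic."""
    words = {}  # literal instead of dict()
    word_start = 0
    word_end = 1
    text_length = len(text)
    while word_end < text_length:
        word_end += 1
        candidate = text[word_start:word_end]     # current slice
        extended = text[word_start:word_end + 1]  # slice plus one more character
        candidate_weight = text.count(candidate) * len(candidate)
        extended_weight = text.count(extended) * len(extended)
        # Cut here when extending the slice lowers count * length.
        if extended_weight < candidate_weight:
            words[candidate] = ''  # no parentheses needed
            word_start = word_end
    return words
```

With descriptive names, the position comments from the original are no longer needed, and the remaining comments can explain why a cut happens rather than what each variable holds.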

Context

StackExchange Code Review Q#62469, answer score: 3
