A function for parsing words from a string without using whitespace
Problem
I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.
I'm curious about other methods for finding words without using whitespace.
    def unique_words(string):
        words = dict()
        p1 = 0  # String slice position 1
        p2 = 1  # String slice position 2
        len_string = len(string)
        while p2 < len_string:
            p2 += 1
            sub1 = string[p1:p2]  # A shorter sub
            sub2 = string[p1:(p2 + 1)]  # A longer sub
            sub1_count = string.count(sub1)  # Counts the frequency of the shorter sub
            sub2_count = string.count(sub2)  # Counts the frequency of the longer sub
            if sub2_count * len(sub2) < sub1_count * len(sub1):  # True if the frequency of sub1 * its length is greater
                words[sub1] = ('')  # Add
                p1 = p2
        return words

The above code works when the number of unique words is small but fails when it is large. I've used the website TextMechanic to generate a random string like

    'updownleftupdowndownleftupleftrightupdownleftup'

and the above code returns a dictionary exactly as desired:

    {'up': '', 'down': '', 'left': '', 'right': ''}

Here's the problem:
When the number of unique words increases, there is a point where the occurrence of single letters outnumbers the total character count of any word in a string.
My current solution uses the algorithm on short slices of the original string, but this involves trial-and-error and has artifacts.
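The failure mode described here can be illustrated by computing the quantity the function compares, frequency times length, for a single letter and for a whole word. This is only a sketch: the helper name score and the second test string (seven made-up words starting with "a") are assumptions introduced for illustration and do not come from the original data.

```python
def score(text, sub):
    """Frequency of `sub` in `text` times its length.

    This is the quantity unique_words compares for the shorter
    and the longer substring.
    """
    return text.count(sub) * len(sub)


# Few unique words: a whole word outscores its first letter.
small = 'updownleftupdowndownleftupleftrightupdownleftup'
print(score(small, 'd'), score(small, 'down'))   # 4 16

# Many unique words (hypothetical sample): each word occurs once,
# while the letter 'a' occurs many times across different words.
large = ''.join(['apple', 'angle', 'arena', 'atlas',
                 'amber', 'alarm', 'adapt'])
print(score(large, 'a'), score(large, 'apple'))  # 11 5

# The longer substring 'app' scores lower than 'ap', so the
# comparison in unique_words would cut after 'ap', not 'apple'.
print(score(large, 'app') < score(large, 'ap'))  # True
```

In the second string no word repeats, so the frequency term no longer favors whole words and the comparison fires on short fragments instead.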
Solution
You should give your variables better names. For example, while you might know what p1 or p2 does, people reading your code don't. Giving your variables better names also reduces the need for inline comments like # String slice position 2.
Rather than using the dict function to initialize an empty dictionary, you can just type the following: words = {}.
Finally, you don't need the parentheses around the '' in words[sub1] = (''). It can be changed to the following: words[sub1] = ''.
Context
StackExchange Code Review Q#62469, answer score: 3
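As an illustration of the review comments in the Solution section, the function might look roughly like this with the three suggestions applied. The names start, end, shorter, and longer are hypothetical replacements chosen here; neither the question nor the answer proposes specific names. The logic is unchanged from the question, so the limitation with many unique words still applies.

```python
def unique_words(string):
    """Collect candidate words from a string that has no whitespace.

    Same logic as the function in the question; only the naming,
    the dict literal, and the empty-string assignment differ.
    """
    words = {}   # dict literal instead of dict()
    start = 0    # hypothetical name for p1
    end = 1      # hypothetical name for p2
    length = len(string)
    while end < length:
        end += 1
        shorter = string[start:end]
        longer = string[start:end + 1]
        shorter_count = string.count(shorter)
        longer_count = string.count(longer)
        # Cut here when extending the substring by one character
        # lowers frequency * length.
        if longer_count * len(longer) < shorter_count * len(shorter):
            words[shorter] = ''   # no parentheses around ''
            start = end
    return words


found = unique_words('updownleftupdowndownleftupleftrightupdownleftup')
print(sorted(found))
```

Because the values are always empty strings, a set would arguably express the intent more directly, but the sketch keeps the dictionary so that it stays comparable with the question's code.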