patternpythonMinor
Social Media Hashtag Splitting
Viewed 0 times
hashtagsocialsplittingmedia
Problem
I decided to try out Python (3.x) two or so weeks ago, and this is my first real script using it. The program I've written below is slow, clunky, inefficient, inaccurate, and probably poorly coded! But it produces what I want most of the time, and it was really just a project to become accustomed to different Python techniques.
What it should do is, recieve a string, one that looks like a tag from a social media site (#socialmedia, #20000leaguesunder, etc.), and then based on a dictionary, split the string into logical words.
So "#volunteeringtoday" becomes "volunteering today" and so on.
I understand my algorithm is poor, so any help with the actual Python and formatting would be preferred.
```
import re
class Splitter():
input = ''
clean = ''
words = []
def __init__(self, string):
'''
Set the input string
'''
self.input = string
def split(self):
'''
Split the input string. Get the clean version then start the separation loop.
'''
self.clean = self.cleanString(self.input)
return self.separateString(self.clean)
def separateString(self, inp):
'''
Separate the string input. Break apart and look for matches in a dictionary.
'''
# Index each character in the input string
for ind in range(len(inp)):
# Build a segment
built = self.partition(inp, ind)
#print('"'+built+'"') # Add for details
# If only one letter remains, steal a letter from the previous match
if len(built) == 1:
built = self.words[-1][-1] + self.lastLarge
self.words[-1] = self.words[-1][0:-1]
return self.separateString(built)
# Check if segment ends with a digit. Separate it if it does.
if re.match('\d+$', built):
built = re.sub("([^\d])\d", "", built)
self.words.append(built)
return self.separateString(inp.replace(built, '', 1))
#
What it should do is, recieve a string, one that looks like a tag from a social media site (#socialmedia, #20000leaguesunder, etc.), and then based on a dictionary, split the string into logical words.
So "#volunteeringtoday" becomes "volunteering today" and so on.
I understand my algorithm is poor, so any help with the actual Python and formatting would be preferred.
```
import re
class Splitter():
input = ''
clean = ''
words = []
def __init__(self, string):
'''
Set the input string
'''
self.input = string
def split(self):
'''
Split the input string. Get the clean version then start the separation loop.
'''
self.clean = self.cleanString(self.input)
return self.separateString(self.clean)
def separateString(self, inp):
'''
Separate the string input. Break apart and look for matches in a dictionary.
'''
# Index each character in the input string
for ind in range(len(inp)):
# Build a segment
built = self.partition(inp, ind)
#print('"'+built+'"') # Add for details
# If only one letter remains, steal a letter from the previous match
if len(built) == 1:
built = self.words[-1][-1] + self.lastLarge
self.words[-1] = self.words[-1][0:-1]
return self.separateString(built)
# Check if segment ends with a digit. Separate it if it does.
if re.match('\d+$', built):
built = re.sub("([^\d])\d", "", built)
self.words.append(built)
return self.separateString(inp.replace(built, '', 1))
#
Solution
Bugs
Try some simple test cases to find bugs:
These reveal some bugs:
-
Only the first call to
-
Unknown words are omitted. What should
-
Words containing numbers are not handled in any reasonable way. Perhaps any sequence of digits should be parsed as a word?
-
Brackets are treated as letters. The regex in
When modifying
The last test case (
Architecture and simplification
The
It's not a good idea for
Callers probably want a list of words, rather than a space-separated string, so
It would be clearer (and faster) to load the dictionary once, as a set, rather than reading it repeatedly:
Try some simple test cases to find bugs:
def test_split(s):
splitter = Splitter(s)
print(s, "⇒", splitter.split(s))
test_split('[') #no word, just punctuation
test_split('cat')
test_split('cattree')
test_split('cathode') #one word, not cat-hode
test_split('muscat') #one word, not mus-cat
test_split('onetwothree')
test_split('one1two2') #multiple numbers
test_split('2014') #multi-digit number
test_split('xyzzy') #unknown word
test_split('cathodemuscat') #cathode-muscat, not cathodemus-cat or cat-hodemuscatThese reveal some bugs:
-
Only the first call to
split works (even on separate instances). This is because Splitter.words is shared between instances but is modified by separateWords, and words remain in it after each call.-
Unknown words are omitted. What should
separateString return when it doesn't recognize a word? The whole input?-
Words containing numbers are not handled in any reasonable way. Perhaps any sequence of digits should be parsed as a word?
-
Brackets are treated as letters. The regex in
cleanString should be "[^\da-z]", not "[^\d[a-z]]", unless you want to keep square brackets.When modifying
separateString, be careful with the case where the whole input as one word — it's easy to break this or have it only work by accident.The last test case (
'cathodemuscat') is particularly hard. Hint: split the string, recurse on both halves, and check that both halves succeed.Architecture and simplification
The
Splitter class does nothing useful. Its methods should be top-level functions instead. (This contributes to the Splitter.words bug.)It's not a good idea for
separateString to communicate with its recursive call by modifying input and words. This is bug-prone (it contributes to the Splitter.words bug) and hard to read. Instead, separateString should pass the appropriate substring as an argument to the recursive call, and append its word to the result. It may look something like this:return separateString(inp[:-ind]) + [word]Callers probably want a list of words, rather than a space-separated string, so
separateString should return the list without joining.It would be clearer (and faster) to load the dictionary once, as a set, rather than reading it repeatedly:
dictionary = set([cleanString(w) for w in open('word.txt')])partition is unnecessary. Just write inp[len(inp)-ind:], or even inp[-ind:] if ind is never 0. (Negative indexes count from the end.)Code Snippets
def test_split(s):
splitter = Splitter(s)
print(s, "⇒", splitter.split(s))
test_split('[') #no word, just punctuation
test_split('cat')
test_split('cattree')
test_split('cathode') #one word, not cat-hode
test_split('muscat') #one word, not mus-cat
test_split('onetwothree')
test_split('one1two2') #multiple numbers
test_split('2014') #multi-digit number
test_split('xyzzy') #unknown word
test_split('cathodemuscat') #cathode-muscat, not cathodemus-cat or cat-hodemuscatreturn separateString(inp[:-ind]) + [word]dictionary = set([cleanString(w) for w in open('word.txt')])Context
StackExchange Code Review Q#51078, answer score: 6
Revisions (0)
No revisions yet.