HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Social Media Hashtag Splitting

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
hashtagsocialsplittingmedia

Problem

I decided to try out Python (3.x) two or so weeks ago, and this is my first real script using it. The program I've written below is slow, clunky, inefficient, inaccurate, and probably poorly coded! But it produces what I want most of the time, and it was really just a project to become accustomed to different Python techniques.

What it should do is, recieve a string, one that looks like a tag from a social media site (#socialmedia, #20000leaguesunder, etc.), and then based on a dictionary, split the string into logical words.

So "#volunteeringtoday" becomes "volunteering today" and so on.

I understand my algorithm is poor, so any help with the actual Python and formatting would be preferred.

```
import re

class Splitter():

input = ''
clean = ''
words = []

def __init__(self, string):
'''
Set the input string
'''
self.input = string

def split(self):
'''
Split the input string. Get the clean version then start the separation loop.
'''
self.clean = self.cleanString(self.input)
return self.separateString(self.clean)

def separateString(self, inp):
'''
Separate the string input. Break apart and look for matches in a dictionary.
'''
# Index each character in the input string
for ind in range(len(inp)):
# Build a segment
built = self.partition(inp, ind)
#print('"'+built+'"') # Add for details
# If only one letter remains, steal a letter from the previous match
if len(built) == 1:
built = self.words[-1][-1] + self.lastLarge
self.words[-1] = self.words[-1][0:-1]
return self.separateString(built)
# Check if segment ends with a digit. Separate it if it does.
if re.match('\d+$', built):
built = re.sub("([^\d])\d", "", built)
self.words.append(built)
return self.separateString(inp.replace(built, '', 1))
#

Solution

Bugs

Try some simple test cases to find bugs:

def test_split(s):
    splitter = Splitter(s)
    print(s, "⇒", splitter.split(s))

test_split('[')        #no word, just punctuation
test_split('cat')
test_split('cattree')
test_split('cathode')  #one word, not cat-hode
test_split('muscat')   #one word, not mus-cat
test_split('onetwothree')
test_split('one1two2') #multiple numbers
test_split('2014')     #multi-digit number
test_split('xyzzy')    #unknown word
test_split('cathodemuscat') #cathode-muscat, not cathodemus-cat or cat-hodemuscat


These reveal some bugs:

-
Only the first call to split works (even on separate instances). This is because Splitter.words is shared between instances but is modified by separateWords, and words remain in it after each call.

-
Unknown words are omitted. What should separateString return when it doesn't recognize a word? The whole input?

-
Words containing numbers are not handled in any reasonable way. Perhaps any sequence of digits should be parsed as a word?

-
Brackets are treated as letters. The regex in cleanString should be "[^\da-z]", not "[^\d[a-z]]", unless you want to keep square brackets.

When modifying separateString, be careful with the case where the whole input as one word — it's easy to break this or have it only work by accident.

The last test case ('cathodemuscat') is particularly hard. Hint: split the string, recurse on both halves, and check that both halves succeed.
Architecture and simplification

The Splitter class does nothing useful. Its methods should be top-level functions instead. (This contributes to the Splitter.words bug.)

It's not a good idea for separateString to communicate with its recursive call by modifying input and words. This is bug-prone (it contributes to the Splitter.words bug) and hard to read. Instead, separateString should pass the appropriate substring as an argument to the recursive call, and append its word to the result. It may look something like this:

return separateString(inp[:-ind]) + [word]


Callers probably want a list of words, rather than a space-separated string, so separateString should return the list without joining.

It would be clearer (and faster) to load the dictionary once, as a set, rather than reading it repeatedly:

dictionary = set([cleanString(w) for w in open('word.txt')])


partition is unnecessary. Just write inp[len(inp)-ind:], or even inp[-ind:] if ind is never 0. (Negative indexes count from the end.)

Code Snippets

def test_split(s):
    splitter = Splitter(s)
    print(s, "⇒", splitter.split(s))

test_split('[')        #no word, just punctuation
test_split('cat')
test_split('cattree')
test_split('cathode')  #one word, not cat-hode
test_split('muscat')   #one word, not mus-cat
test_split('onetwothree')
test_split('one1two2') #multiple numbers
test_split('2014')     #multi-digit number
test_split('xyzzy')    #unknown word
test_split('cathodemuscat') #cathode-muscat, not cathodemus-cat or cat-hodemuscat
return separateString(inp[:-ind]) + [word]
dictionary = set([cleanString(w) for w in open('word.txt')])

Context

StackExchange Code Review Q#51078, answer score: 6

Revisions (0)

No revisions yet.