Random Word Splitter
Problem
I wrote a word-splitting function. It splits a word into random pieces. For example, if the input is 'runtime', any one of the outputs below is possible:
['runtime']
['r','untime']
['r','u','n','t','i','m','e'] ...
But its runtime is very high when I want to split 100k words. Do you have any suggestions to optimize it or write it smarter?
def random_multisplitter(word):
    from numpy import mod
    spw = []
    length = len(word)
    rand = random_int(word)
    if rand == length: #probability of not splitting
        return [word]
    else:
        div = mod(rand, (length + 1))#defining division points
        bound = length - div
        spw.append(div)
        while div != 0:
            rand = random_int(word)
            div = mod(rand,(bound+1))
            bound = bound-div
            spw.append(div)
        result = spw
    b = 0
    points =[]
    for x in range(len(result)-1): #calculating splitting points
        b=b+result[x]
        points.append(b)
    xy=0
    t=[]
    for i in points:
        t.append(word[xy:i])
        xy=i
    if word[xy:len(word)]!='':
        t.append(word[xy:len(word)])
    if type(t)!=list:
        return [t]
    return t
Solution
Most of your variable names seem good, however
xy explains nothing.
You should normally put a space on each side of an operator.
b = b + result[x] is more readable than b=b+result[x], which looks at a glance like a single variable.
Why do you need to import numpy to do a simple mod?
Python comes with a modulo operator, %.
You use the name result, which I would take to imply the result of a function.
You shouldn't import things anywhere apart from the beginning of the file.
Neither I nor anyone else wants to traverse your entire program to find that you import numpy's mod.
We want to know at the beginning that you use it.
Also, this can lead to bugs where you think you have imported numpy, but you haven't.
Your program's overall readability is quite low.
I would recommend splitting the function into two:
a 'feeder' and a 'consumer'.
First, I love generators, and you can write one that reduces the complexity of this program.
The algorithm that I use is to find the start and stop of each split.
Then I replace start with stop and add a random amount to stop.
It really is that simple, and it does the majority of your program's work.
import random
def get_numbers(length):
    # Lazily yield (start, stop) pairs until stop reaches or passes length.
    start, stop = 0, 0
    while stop < length:
        start, stop = stop, stop + random.randint(1, length)
        yield start, stop
This is simple: it counts up until it reaches or passes the length, yielding each pair along the way.
You can think of it like building an array that looks like [(start, stop), (start, stop), ...].
This is a near drop-in replacement for the entire program up to for i in points:.
However, I use a lazy approach for that for loop,
and each item is a (start, stop) pair, where yours is a one-dimensional list of indexes.
If you wanted it to build a list instead, it would look like the following block.
However, it's not advised, as it uses more memory and takes slightly longer.
def get_numbers(length):
    # Eager variant: collect every (start, stop) pair before returning.
    list_ = []
    start, stop = 0, 0
    while stop < length:
        start, stop = stop, stop + random.randint(1, length)
        list_.append((start, stop))
    return list_
Now you can split the string.
To do this, I loop through the above generator and yield the split strings.
def random_multisplitter(word):
    for start, stop in get_numbers(len(word)):
        yield word[start:stop]
I use Python's amazing slice operation, just like you did.
However, you want random_multisplitter to return a list, not a generator,
so you would change it to the list version below.
But a generator is a better choice for the 100k-word input,
as it keeps memory consumption smaller.
def random_multisplitter(word):
    return [word[start:stop] for start, stop in get_numbers(len(word))]
If not faster, it will at least be smaller and nicer to look at.
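To make the memory point a little more concrete, here is a minimal sketch of feeding a large batch of words through the two-function generator version. The split_all helper and the synthetic "word%d" input are assumptions made up for illustration; they are not part of the original question or answer.

```python
import random


def get_numbers(length):
    """Lazily yield (start, stop) pairs until stop reaches or passes length."""
    start, stop = 0, 0
    while stop < length:
        start, stop = stop, stop + random.randint(1, length)
        yield start, stop


def random_multisplitter(word):
    """Yield random consecutive pieces of word."""
    for start, stop in get_numbers(len(word)):
        yield word[start:stop]


def split_all(words):
    """Hypothetical helper: yield one list of pieces per word, one at a time."""
    for word in words:
        yield list(random_multisplitter(word))


if __name__ == "__main__":
    # Stand-in for the 100k-word input. It is a generator expression, so
    # neither the words nor their pieces are all held in memory at once.
    words = ("word%d" % i for i in range(100000))
    longest = max(len(pieces) for pieces in split_all(words))
    print("most pieces in any one word:", longest)
```

Only one word's pieces exist at any moment here; wrapping split_all(words) in list() would give up that benefit.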
I expected the program to be larger; as it turned out small, a one-function version would be:
from random import randint
def random_multisplitter(word):
    length = len(word)
    start, stop = 0, 0
    while stop < length:
        start, stop = stop, stop + randint(1, length)
        yield word[start:stop]
# Generator
random_multisplitter('runtime')
# List
list(random_multisplitter('runtime'))
Some people may dislike that I allow the slice to go past the end of the string. Just to avoid any potential confusion: slicing past the end is safe, although it can seem weird, whereas indexing past the end raises an error.
>>> 'abcde'[0:20]
'abcde'
>>> 'abcde'[20]
IndexError: string index out of range
Because of this, you may wish to change the assignment of stop to the following:
start, stop = stop, randint(stop + 1, length)
Or, when you are yielding the values from get_numbers, change it to:
yield start, min(stop, length)
The former produces more splits toward the end of the string, whereas the unmodified version and the latter version will predominantly produce two or three main splits on small strings.
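One rough way to check how the variants differ is to count the average number of pieces each one produces. The sketch below is only an illustration: the names split_overshoot, split_bounded and average_pieces are made up here, and the min(stop, length) variant is left out because slicing already clamps at the end of the string, so it yields the same pieces as the unmodified version.

```python
import random


def split_overshoot(word):
    """Unmodified version: stop may run past the end of the word."""
    length = len(word)
    start, stop = 0, 0
    while stop < length:
        start, stop = stop, stop + random.randint(1, length)
        yield word[start:stop]


def split_bounded(word):
    """Alternative: pick stop inside the part of the word that is left."""
    length = len(word)
    start, stop = 0, 0
    while stop < length:
        start, stop = stop, random.randint(stop + 1, length)
        yield word[start:stop]


def average_pieces(splitter, word, trials=10000):
    """Mean number of pieces per call over a number of random trials."""
    total = sum(len(list(splitter(word))) for _ in range(trials))
    return total / float(trials)


if __name__ == "__main__":
    for splitter in (split_overshoot, split_bounded):
        print(splitter.__name__, average_pieces(splitter, "runtime"))
```

Both functions always return pieces that join back into the original word; only the distribution of cut points differs.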
As this is tagged performance, it's probably best to have a speed test.
The code I used to test the speed is:
import time
word = ' ' * 100000
def time_it(fn, name):
    n = 100
    t0 = time.time()
    for _ in range(n):
        fn(word)
    t1 = time.time()
    print('{} = {}'.format(name, t1 - t0))
This uses time, so take the results with a pinch of salt.
Also, I can't test your original code, as random_int is not defined in the example.
Finally, I used start, stop = stop, stop + random.randint(1, 2) on the 'Slow' ones, and start, stop = stop, stop + random.randint(1, length) on the 'Fast' ones.
The totals below are in seconds for 100 calls each:
Other Answer = 75.2860000134
Slow Generator = 12.1919999123
Slow List = 13.236000061
Fast Generator = 0.00200009346008
Fast List = 0.0019998550415
Keep in mind that generators perform better when they aren't converted to a list straight off the bat.
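Since a single time.time() measurement is noisy, a similar comparison could be sketched with the standard timeit module instead. This is not the harness that produced the numbers above; the consume and build_list wrappers are hypothetical names, and the figures it prints will depend on the machine.

```python
import random
import timeit


def random_multisplitter(word):
    """One-function generator version of the splitter."""
    length = len(word)
    start, stop = 0, 0
    while stop < length:
        start, stop = stop, stop + random.randint(1, length)
        yield word[start:stop]


def consume(word):
    """Exhaust the generator without keeping the pieces."""
    for _ in random_multisplitter(word):
        pass


def build_list(word):
    """Materialise all pieces in a list."""
    return list(random_multisplitter(word))


if __name__ == "__main__":
    word = " " * 100000
    for fn in (consume, build_list):
        # repeat() returns one total per run; the minimum is the least noisy.
        best = min(timeit.repeat(lambda: fn(word), number=100, repeat=5))
        print("{} = {:.6f}".format(fn.__name__, best))
```

Note that a generator function has to be iterated to be timed at all; calling it without consuming the result only measures the cost of creating the generator object.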
Context
StackExchange Code Review Q#105286, answer score: 7