patternpythonMinor
Improving Speeds of Tokenisation of large word list for distinct combinations
Viewed 0 times
distinctcombinationsspeedswordlargeforlisttokenisationimproving
Problem
I'm having trouble with a current algorithm I'm writing that takes a corpa, a list of characters in sets (1 one or larger) with a frequency number attached to it. Against another list of character sets that need to be created using the most frequent occurrences of the combined sets... if that makes sense. I'll explain with an example. So lets say your corpa looks like this. (Just note that case does not matter as the actual data is in Korean, hence each character represents a Korean character.)
The wordlist is (ignore the number):
So the output of the script would be:
Currently I am basically doing this by ordering the corpa, then grabbing each individual line, getting all the combinations of each letter (this is redundant as it produces combinations which are not positionally correct) and then pulls out which combinations can make up the word. Then adds up the frequency of the set of combinations and then outputs the one with Freq_Max_Delta.
I'm thinking that a large part of the problem is the number of nested loops that I'm using and the fact we are creating redundant data. I'm also looking into the possibility of cython and even using multiple threads for the tokenisation. The problem is some what similar to the Knapsack Problem
Below is my code, also a portion of the inputs are located on my gist if you are interested in running it with what I'm using. The wordlist and the corpa
```
from argparse import ArgumentParser
from collections import OrderedDict
from itertools import combinations
def order_worlist(corpa):
dict_corpa = dict()
print 'ordering corpa....'
with open(corpa, 'r') as open_wordlist:
for line in open_wordlist:
A 56
AB 7342
ABC 3
BC 116
C 5
CD 10
BCD 502
ABCD 23
D 132
DD 6The wordlist is (ignore the number):
AAB 1123
DCDD 83So the output of the script would be:
Original Pois Makeup Freq_Max_Delta
AAB A AB [AB, 7342][A, 56] 7398
DCDD D C DD [D, 132][DD, 6][C, 5] 143Currently I am basically doing this by ordering the corpa, then grabbing each individual line, getting all the combinations of each letter (this is redundant as it produces combinations which are not positionally correct) and then pulls out which combinations can make up the word. Then adds up the frequency of the set of combinations and then outputs the one with Freq_Max_Delta.
I'm thinking that a large part of the problem is the number of nested loops that I'm using and the fact we are creating redundant data. I'm also looking into the possibility of cython and even using multiple threads for the tokenisation. The problem is some what similar to the Knapsack Problem
Below is my code, also a portion of the inputs are located on my gist if you are interested in running it with what I'm using. The wordlist and the corpa
```
from argparse import ArgumentParser
from collections import OrderedDict
from itertools import combinations
def order_worlist(corpa):
dict_corpa = dict()
print 'ordering corpa....'
with open(corpa, 'r') as open_wordlist:
for line in open_wordlist:
Solution
Some general concerns about your program:
-
If you go past 4/5 tabs you need to refactor your code.
If this means that you have to make a new function so be it.
If it means that you have to use proper explicit guard clauses please do.
-
Make your functions do one thing.
Instead you could read from the file, do a little transforming and then get another function to do more transforms.
For example you could just return the first dictionary.
And in a different function sort the dictionary.
-
Don't use 'hacks'.
Take for example the following code block:
The first thing I thought when I saw this was WTF.
The following is much more legible:
As you probably don't know, you don't need an
What's the point in ordering it here? Increase in speed? I removed it and it seemed much faster.
And so I removed them all.
To start off, rename
And make it a dict-comprehension, and now it's an easy to read 3liner.
Next make a function like the above for
You want to make a generator to not read everything here into memory.
You also want to only return the first tsv item.
Next change
Change
Don't re-assign
I'll take a small break here to highlight the use of good variable names.
I think
We now know that we will be given a
We also know that
It's nicely laid out to us so we don't have to think what
And if we read it we can understand it without any prior knowledge of names and conventions.
And finally we come to
We want to be able to edit or use the information returned by
In
if there are no combinations
You then want to generate and filter all potential combinations.
If there are none
And then you will want to do the final work of getting the best combination.
Generating and filtering all potential combinations is the hardest part.
Using the list comprehension I explained earlier, and adding the if statements to it is how you will do this.
We can reduce this,
Which will give you:
Now we have to find the best combination. I personally would use a list comprehension, and a non abusive use of
But either way is fine.
What is bad is your lack of
Just remove the
-
If you go past 4/5 tabs you need to refactor your code.
If this means that you have to make a new function so be it.
If it means that you have to use proper explicit guard clauses please do.
-
Make your functions do one thing.
order_worlist both reads from a file, converts it to a dictionary, and then sorts the dictionary.Instead you could read from the file, do a little transforming and then get another function to do more transforms.
For example you could just return the first dictionary.
And in a different function sort the dictionary.
-
Don't use 'hacks'.
Take for example the following code block:
output = list()
for k, v in tx_dict.items():
y = lambda : sum([map(list, combinations(tx_list, i)) for i in range(len(tx_list) + 1)], [])
output.append( y())The first thing I thought when I saw this was WTF.
- Don't do
y = (lambda:...)(). Just doy = ....
- Don't abuse
sum, instead use a nested comprehension.
- Don't populate a potentially massive list to check if
tx_dictis empty.
The following is much more legible:
if not tx_dict:
return None
combinations_ = [
j
for i in range(len(tx_list) + 1)
for j in combinations(tx_list, i)
]As you probably don't know, you don't need an
OrderedDict.What's the point in ordering it here? Increase in speed? I removed it and it seemed much faster.
And so I removed them all.
To start off, rename
order_worlist to read_corpa.And make it a dict-comprehension, and now it's an easy to read 3liner.
def read_corpa(file_name):
with open(file_name, 'r') as f:
return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}Next make a function like the above for
agrs.wordlist, this time call it read_words.You want to make a generator to not read everything here into memory.
You also want to only return the first tsv item.
def read_words(file_name):
with open(file_name, 'r') as f:
for word in f:
yield word.split('\t')[0]Next change
contains to use a slice rather than an expensive for loop.def contains(small, big):
small_ = len(small)
for i in xrange(len(big) - small_ + 1):
if big[i:i + small_] == small:
return (i, i + small_)
return NoneChange
find_best to only take original_tx (word) and dict_corpa (corpas).Don't re-assign
k and v just define them as the variable you want to call them at the beginning.def find_best(word, corpas):
results = {}
for corpa, frequency in corpas.items():
c = contains(corpa, word)
if c:
results[word[c[0]:c[1]]] = frequency
return resultsI'll take a small break here to highlight the use of good variable names.
I think
find_best highlights this the best.We now know that we will be given a
word and corpas as the arguments.We also know that
corpas is a dictionary, with a corpa as the key and it's frequency as the value.It's nicely laid out to us so we don't have to think what
try_tx is.And if we read it we can understand it without any prior knowledge of names and conventions.
And finally we come to
create_cominations, this should be split into two functions create_combination and display_combination.We want to be able to edit or use the information returned by
create_combination, possibly at a later date.In
create_combination you want to return as early as possible,if there are no combinations
return None.You then want to generate and filter all potential combinations.
If there are none
return None.And then you will want to do the final work of getting the best combination.
Generating and filtering all potential combinations is the hardest part.
Using the list comprehension I explained earlier, and adding the if statements to it is how you will do this.
combinations_ = [
j
for i in range(len(tx_list) + 1)
for j in combinations(tx_list, i)
]
for i in combinations_:
if sorted(''.join(i)) == sorted(original_tx):
if len(''.join(i)) == len(original_tx):
...
# Into
combinations_ = [
j
for i in range(len(tx_list) + 1)
for j in combinations(tx_list, i)
if sorted(''.join(j)) == sorted(original_tx)
if len(''.join(j)) == len(original_tx)
]
for i in combinations_:
...We can reduce this,
'ab' != 'abc' and so the length check is pointless.Which will give you:
combinations_ = [
j
for i in range(len(tx_list) + 1)
for j in combinations(tx_list, i)
if sorted(''.join(j)) == sorted(original_tx)
]Now we have to find the best combination. I personally would use a list comprehension, and a non abusive use of
sum.But either way is fine.
What is bad is your lack of
or, a pointless else:pass, using final_li = []; final_li.append(...) and finally, [sub].Just remove the
else:pass, change [sub] to sub and change `fiCode Snippets
output = list()
for k, v in tx_dict.items():
y = lambda : sum([map(list, combinations(tx_list, i)) for i in range(len(tx_list) + 1)], [])
output.append( y())if not tx_dict:
return None
combinations_ = [
j
for i in range(len(tx_list) + 1)
for j in combinations(tx_list, i)
]def read_corpa(file_name):
with open(file_name, 'r') as f:
return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}def read_words(file_name):
with open(file_name, 'r') as f:
for word in f:
yield word.split('\t')[0]def contains(small, big):
small_ = len(small)
for i in xrange(len(big) - small_ + 1):
if big[i:i + small_] == small:
return (i, i + small_)
return NoneContext
StackExchange Code Review Q#120391, answer score: 3
Revisions (0)
No revisions yet.