patternpythonMinor

Improving Speeds of Tokenisation of large word list for distinct combinations

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

distinctcombinationsspeedswordlargeforlisttokenisationimproving

Problem

I'm having trouble with a current algorithm I'm writing that takes a corpa, a list of characters in sets (1 one or larger) with a frequency number attached to it. Against another list of character sets that need to be created using the most frequent occurrences of the combined sets... if that makes sense. I'll explain with an example. So lets say your corpa looks like this. (Just note that case does not matter as the actual data is in Korean, hence each character represents a Korean character.)

A   56
AB  7342
ABC 3
BC  116
C   5
CD  10
BCD 502
ABCD    23
D   132
DD  6

The wordlist is (ignore the number):

AAB     1123
DCDD    83

So the output of the script would be:

Original    Pois        Makeup                      Freq_Max_Delta
AAB         A AB        [AB, 7342][A, 56]           7398
DCDD        D C DD      [D, 132][DD, 6][C, 5]       143

Currently I am basically doing this by ordering the corpa, then grabbing each individual line, getting all the combinations of each letter (this is redundant as it produces combinations which are not positionally correct) and then pulls out which combinations can make up the word. Then adds up the frequency of the set of combinations and then outputs the one with Freq_Max_Delta.

I'm thinking that a large part of the problem is the number of nested loops that I'm using and the fact we are creating redundant data. I'm also looking into the possibility of cython and even using multiple threads for the tokenisation. The problem is some what similar to the Knapsack Problem

Below is my code, also a portion of the inputs are located on my gist if you are interested in running it with what I'm using. The wordlist and the corpa

```
from argparse import ArgumentParser
from collections import OrderedDict
from itertools import combinations

def order_worlist(corpa):
dict_corpa = dict()
print 'ordering corpa....'
with open(corpa, 'r') as open_wordlist:
for line in open_wordlist:

Solution

Some general concerns about your program:

-
If you go past 4/5 tabs you need to refactor your code.

If this means that you have to make a new function so be it.

If it means that you have to use proper explicit guard clauses please do.

-
Make your functions do one thing.

order_worlist both reads from a file, converts it to a dictionary, and then sorts the dictionary.

Instead you could read from the file, do a little transforming and then get another function to do more transforms.
For example you could just return the first dictionary.
And in a different function sort the dictionary.

-
Don't use 'hacks'.

Take for example the following code block:

output = list()
for k, v in tx_dict.items():
    y = lambda : sum([map(list, combinations(tx_list, i)) for i in range(len(tx_list) + 1)], [])
    output.append( y())

The first thing I thought when I saw this was WTF.

Don't do y = (lambda:...)(). Just do y = ....

Don't abuse sum, instead use a nested comprehension.

Don't populate a potentially massive list to check if tx_dict is empty.

The following is much more legible:

if not tx_dict:
    return None

combinations_ = [
    j
    for i in range(len(tx_list) + 1)
    for j in combinations(tx_list, i)
]

As you probably don't know, you don't need an OrderedDict.
What's the point in ordering it here? Increase in speed? I removed it and it seemed much faster.

And so I removed them all.

To start off, rename order_worlist to read_corpa.
And make it a dict-comprehension, and now it's an easy to read 3liner.

def read_corpa(file_name):
    with open(file_name, 'r') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}

Next make a function like the above for agrs.wordlist, this time call it read_words.
You want to make a generator to not read everything here into memory.
You also want to only return the first tsv item.

def read_words(file_name):
    with open(file_name, 'r') as f:
        for word in f:
            yield word.split('\t')[0]

Next change contains to use a slice rather than an expensive for loop.

def contains(small, big):
    small_ = len(small)
    for i in xrange(len(big) - small_ + 1):
        if big[i:i + small_] == small:
            return (i, i + small_)
    return None

Change find_best to only take original_tx (word) and dict_corpa (corpas).
Don't re-assign k and v just define them as the variable you want to call them at the beginning.

def find_best(word, corpas):
    results = {}
    for corpa, frequency in corpas.items():
        c = contains(corpa, word)
        if c:
            results[word[c[0]:c[1]]] = frequency
    return results

I'll take a small break here to highlight the use of good variable names.
I think find_best highlights this the best.
We now know that we will be given a word and corpas as the arguments.
We also know that corpas is a dictionary, with a corpa as the key and it's frequency as the value.

It's nicely laid out to us so we don't have to think what try_tx is.
And if we read it we can understand it without any prior knowledge of names and conventions.

And finally we come to create_cominations, this should be split into two functions create_combination and display_combination.
We want to be able to edit or use the information returned by create_combination, possibly at a later date.

In create_combination you want to return as early as possible,
if there are no combinations return None.

You then want to generate and filter all potential combinations.
If there are none return None.

And then you will want to do the final work of getting the best combination.

Generating and filtering all potential combinations is the hardest part.
Using the list comprehension I explained earlier, and adding the if statements to it is how you will do this.

combinations_ = [
    j
    for i in range(len(tx_list) + 1)
    for j in combinations(tx_list, i)
]

for i in combinations_:
    if sorted(''.join(i)) == sorted(original_tx):
        if len(''.join(i)) == len(original_tx):
            ...

# Into
combinations_ = [
    j
    for i in range(len(tx_list) + 1)
    for j in combinations(tx_list, i)
    if sorted(''.join(j)) == sorted(original_tx)
    if len(''.join(j)) == len(original_tx)
]

for i in combinations_:
    ...

We can reduce this, 'ab' != 'abc' and so the length check is pointless.
Which will give you:

combinations_ = [
    j
    for i in range(len(tx_list) + 1)
    for j in combinations(tx_list, i)
    if sorted(''.join(j)) == sorted(original_tx)
]

Now we have to find the best combination. I personally would use a list comprehension, and a non abusive use of sum.
But either way is fine.

What is bad is your lack of or, a pointless else:pass, using final_li = []; final_li.append(...) and finally, [sub].

Just remove the else:pass, change [sub] to sub and change `fi

Code Snippets

output = list()
for k, v in tx_dict.items():
    y = lambda : sum([map(list, combinations(tx_list, i)) for i in range(len(tx_list) + 1)], [])
    output.append( y())

if not tx_dict:
    return None

combinations_ = [
    j
    for i in range(len(tx_list) + 1)
    for j in combinations(tx_list, i)
]

def read_corpa(file_name):
    with open(file_name, 'r') as f:
        return {l[0]: int(l[-1]) for l in (line.rstrip().split('\t') for line in f)}

def read_words(file_name):
    with open(file_name, 'r') as f:
        for word in f:
            yield word.split('\t')[0]

def contains(small, big):
    small_ = len(small)
    for i in xrange(len(big) - small_ + 1):
        if big[i:i + small_] == small:
            return (i, i + small_)
    return None

Context

StackExchange Code Review Q#120391, answer score: 3

Revisions (0)

No revisions yet.