patternpythonMinor
Finding the fastest common prefix of 2 strings in Python
Viewed 0 times
thefastestpythonfindingprefixstringscommon
Problem
I'm comparing my algorithms,
Another program I'm working with spends most of its time on one line
Here's what I've found
```
times = [
... ['os.path.commonprefix', ospathcommonprefix],
... ['bit_prefix', bit_prefix],
... ['slice_prefix', slice_prefix],
... ['index_prefix', index_prefix],
... ['zip_prefix', zip_prefix],
... ['izip_prefix', izip_prefix]]
a = 'a' * 1000000
b = 'a' * 900000
for i, algorithm in enumerate(times):
times[i][1] = timeit.timeit(lambda: algorithm1, number=100)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.0408039022572666]
['bit_prefix', 6.330115954430312]
['izip_prefix', 6.653096312379603]
['os.path.commonprefix', 7.151773449750181]
['index_prefix', 8.905739173445]
['zip_prefix', 12.252280194828245]
times = [
... ['os.path.commonprefix', ospathcommonprefix],
... ['bit_prefix', bit_prefix],
... ['slice_prefix', slice_prefix],
... ['index_prefix', index_prefix],
... ['zip_prefix', zip_prefix],
... ['izip_prefix', izip_prefix]]
a = 'a' * 2
b = 'a' * 1
for i, algorithm in enumerate(times):
times[i][1] = timeit.timeit(lambda: algorithm1, number=1000000)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.7089188635036408]
['index_prefix', 0.8099312869949244]
['izip_prefix', 0.9187295778019688]
['zip_prefix', 1.115640751833098]
['os.path.commonprefix', 1.3019800865420166]
['bit_prefix', 3.7144623631063496]
a = 'a' * 1000000
b = 'a' * 900000
times = [
... ['os.path.commonprefix', ospathcommonprefix],
... ['bit_prefix', bit_prefix],
... ['slice_prefix', slice_prefix],
... ['index_prefix', index_prefix],
... ['zip_prefix', zip_pr
slice_prefix and bit_prefix, with existing ones to find the common prefix length of 2 strings as fast as possible. Another program I'm working with spends most of its time on one line
len(os.path.commonprefix((item1, item2))). I've searched for other algorithms in Python and created my own to reduce this bottleneck. Here's what I've found
```
times = [
... ['os.path.commonprefix', ospathcommonprefix],
... ['bit_prefix', bit_prefix],
... ['slice_prefix', slice_prefix],
... ['index_prefix', index_prefix],
... ['zip_prefix', zip_prefix],
... ['izip_prefix', izip_prefix]]
a = 'a' * 1000000
b = 'a' * 900000
for i, algorithm in enumerate(times):
times[i][1] = timeit.timeit(lambda: algorithm1, number=100)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.0408039022572666]
['bit_prefix', 6.330115954430312]
['izip_prefix', 6.653096312379603]
['os.path.commonprefix', 7.151773449750181]
['index_prefix', 8.905739173445]
['zip_prefix', 12.252280194828245]
times = [
... ['os.path.commonprefix', ospathcommonprefix],
... ['bit_prefix', bit_prefix],
... ['slice_prefix', slice_prefix],
... ['index_prefix', index_prefix],
... ['zip_prefix', zip_prefix],
... ['izip_prefix', izip_prefix]]
a = 'a' * 2
b = 'a' * 1
for i, algorithm in enumerate(times):
times[i][1] = timeit.timeit(lambda: algorithm1, number=1000000)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.7089188635036408]
['index_prefix', 0.8099312869949244]
['izip_prefix', 0.9187295778019688]
['zip_prefix', 1.115640751833098]
['os.path.commonprefix', 1.3019800865420166]
['bit_prefix', 3.7144623631063496]
a = 'a' * 1000000
b = 'a' * 900000
times = [
... ['os.path.commonprefix', ospathcommonprefix],
... ['bit_prefix', bit_prefix],
... ['slice_prefix', slice_prefix],
... ['index_prefix', index_prefix],
... ['zip_prefix', zip_pr
Solution
slice_prefix and index_prefix look all right algorithmically but they impose a lot of interpreter overhead for each and every character. As always in such a situtation you could unlock speed reserves by pushing some processing into the engine. In other words, the engine can probably compare hundreds of characters in the same time that your loops go through a small handful.If you cannot find an existing string compare that reports the mismatch position then you could take a long hard look at the distribution of your input strings and pick an initial block size for comparisons, like 32 or 64. Use
slice_prefix to find the first mismatching block of that size, and then use block size halving to find the mismatch in that block. When the block size gets lower than some threshold - something on the order of 8 or 16 - switch to linear scan. Run copious tests on representative inputs in order to refine the two thresholds (initial blocks size and linear scan threshold).Some additional observations. At this point there are two fruitful avenues: on one hand you could defer to a function written in C, and on the other you could create an uglified (optimised) version of
slice_prefix that is finely tuned to the specifics of your application, like average prefix length and so on. Gauging the potential of either approach requires hard data, like the raw overhead for passing two strings to C (which gives you an absolute limit of what could possibly be achieved via that route) and the actual performance of a good library
memcmp() when called from Python under production conditions on representative inputs.There are some promising avenues for tuning
slice_prefix. The first is a slight tweak of the tail end:def slice_prefix(a, b, start=0, length=64):
while 1:
while a[start:start + length] == b[start:start + length]:
start += length
length += length
if length >= 4:
length /= 4
elif length == 1:
return start
else:
length = 1Try factors between 2 and 8 on representative input data.
Your binary approach (a.k.a.
directional_prefix) needs some work yet, though. The basic idea is this: when the first while loop exits then we know that there must a a mismatch some where in the block with length length starting at start. If it isn't in the first half then it must be in the second half. So if the first half is equal then you add the current length to start, if it is different then not. Then halve and continue until the linear threshold is reached, at which point you do a linear scan of the rest. P.S.: there is no one ring to bind them all, which is why I keep mentioning representative inputs. The
memcmp() of a mature C runtime can comes close to the ideal and typically contains a sh*tload of optimisations for different cases in order to give balanced high performance (note: the standard memcmp() is no use because it doesn't return the mismatch position, but there are some that do). But the effort for bringing C code into the fold and the complications arising from it are non-negligible, as is the call overhead. Tuning slice_prefix to meet requirements - if it is possible - might be a better solution. It is already in a state of grace as is, though, and from here on can only come uglification...Code Snippets
def slice_prefix(a, b, start=0, length=64):
while 1:
while a[start:start + length] == b[start:start + length]:
start += length
length += length
if length >= 4:
length /= 4
elif length == 1:
return start
else:
length = 1Context
StackExchange Code Review Q#124144, answer score: 4
Revisions (0)
No revisions yet.