HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

A bitwise common prefix algorithm in Python

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
algorithmbitwisepythonprefixcommon

Problem

Is my algorithm slow because it has problems? Or, is any bitwise common prefix solution not going to give me the performance I'm looking for?

After profiling my algorithm, I found that over 60% of the time was spent on one line len(os.path.commonprefix((item1, item2))). I'm only looking for the length of the prefix

To solve this I tried to write a bitwise prefix solution

def bit_prefix(a, b):
    min_len = min(len(a), len(b))
    if min_len > 0:
        x = str(bin(
            int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)))
        y = x.strip('0')
        if len(y) is 1:
            return min_len
        else:
            return (len(x) - len(y)) / 8
    else:
        return 0


I've only gotten a marginal improvement in speed with long prefixes

a = 'a' * 1000000 + 'z'

b = 'a' * 900000 + 'z'

timeit.timeit(lambda: bit_prefix(a, b), number=100)
Out[34]: 6.340372534867129

timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=100)
Out[35]: 7.5483549056534684

print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
900000 900000


And my algorithm performs more poorly with short prefixes

a = 'aaz'

b = 'az'

timeit.timeit(lambda: bit_prefix(a, b), number=1000000)
Out[42]: 3.968956086175467

timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=1000000)
Out[43]: 1.1592788235707303

print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
1 1


If my algorithm isn't broken and a bitwise solution won't give me the performance boost I'm looking for, can you refer me to a common prefix solution that would outperform os.path.commonprefix?

Here's the bit_pefix profile

```
Line # Hits Time Per Hit % Time Line Contents
==============================================================
29 @profile
30 def bit_prefix(a, b):
31 36140 81099 2.2 4.2 min_len = min(len(a), len(

Solution

Is my algorithm slow because it has problems?

Yes. The fundamental problem is that it always processes every character of the input.

There are some obvious improvements which don't address this fundamental problem:

  • Since the common prefix can't be more than min_len, truncate both strings to min_len before encoding them.



  • bin returns a string, so str(bin(...)) is overkill.



-
bin(int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)) is also overkill: what you care about is the position of the lowest set bit of int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16). You can extract the lowest set bit directly from the integer as

differences = int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)
least_significant_difference = differences ^ (differences - 1)


(That actually gives a binary number which is all 1s from the lowest set bit to the least significant bit).

Then you can either convert to string and find the length, or take the logarithm base 2.

But the fundamental problem is still there: the fastest code is the code which isn't executed, and when the prefix is very short then a simple loop from the start which compares characters is hard to beat. If you really have to beat it, I think you're probably looking at writing some C which coerces a pointer to (16-bit?) characters into a pointer to 64-bit integers and uses that to compare a block of characters at a time. Beware endianness.

Code Snippets

differences = int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)
least_significant_difference = differences ^ (differences - 1)

Context

StackExchange Code Review Q#124080, answer score: 2

Revisions (0)

No revisions yet.