patternpythonMinor
A bitwise common prefix algorithm in Python
Viewed 0 times
algorithmbitwisepythonprefixcommon
Problem
Is my algorithm slow because it has problems? Or, is any bitwise common prefix solution not going to give me the performance I'm looking for?
After profiling my algorithm, I found that over 60% of the time was spent on one line
To solve this I tried to write a bitwise prefix solution
I've only gotten a marginal improvement in speed with long prefixes
And my algorithm performs more poorly with short prefixes
If my algorithm isn't broken and a bitwise solution won't give me the performance boost I'm looking for, can you refer me to a common prefix solution that would outperform os.path.commonprefix?
Here's the bit_pefix profile
```
Line # Hits Time Per Hit % Time Line Contents
==============================================================
29 @profile
30 def bit_prefix(a, b):
31 36140 81099 2.2 4.2 min_len = min(len(a), len(
After profiling my algorithm, I found that over 60% of the time was spent on one line
len(os.path.commonprefix((item1, item2))). I'm only looking for the length of the prefixTo solve this I tried to write a bitwise prefix solution
def bit_prefix(a, b):
min_len = min(len(a), len(b))
if min_len > 0:
x = str(bin(
int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)))
y = x.strip('0')
if len(y) is 1:
return min_len
else:
return (len(x) - len(y)) / 8
else:
return 0I've only gotten a marginal improvement in speed with long prefixes
a = 'a' * 1000000 + 'z'
b = 'a' * 900000 + 'z'
timeit.timeit(lambda: bit_prefix(a, b), number=100)
Out[34]: 6.340372534867129
timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=100)
Out[35]: 7.5483549056534684
print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
900000 900000And my algorithm performs more poorly with short prefixes
a = 'aaz'
b = 'az'
timeit.timeit(lambda: bit_prefix(a, b), number=1000000)
Out[42]: 3.968956086175467
timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=1000000)
Out[43]: 1.1592788235707303
print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
1 1If my algorithm isn't broken and a bitwise solution won't give me the performance boost I'm looking for, can you refer me to a common prefix solution that would outperform os.path.commonprefix?
Here's the bit_pefix profile
```
Line # Hits Time Per Hit % Time Line Contents
==============================================================
29 @profile
30 def bit_prefix(a, b):
31 36140 81099 2.2 4.2 min_len = min(len(a), len(
Solution
Is my algorithm slow because it has problems?
Yes. The fundamental problem is that it always processes every character of the input.
There are some obvious improvements which don't address this fundamental problem:
-
(That actually gives a binary number which is all
Then you can either convert to string and find the length, or take the logarithm base 2.
But the fundamental problem is still there: the fastest code is the code which isn't executed, and when the prefix is very short then a simple loop from the start which compares characters is hard to beat. If you really have to beat it, I think you're probably looking at writing some C which coerces a pointer to (16-bit?) characters into a pointer to 64-bit integers and uses that to compare a block of characters at a time. Beware endianness.
Yes. The fundamental problem is that it always processes every character of the input.
There are some obvious improvements which don't address this fundamental problem:
- Since the common prefix can't be more than
min_len, truncate both strings tomin_lenbefore encoding them.
binreturns a string, sostr(bin(...))is overkill.
-
bin(int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)) is also overkill: what you care about is the position of the lowest set bit of int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16). You can extract the lowest set bit directly from the integer asdifferences = int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)
least_significant_difference = differences ^ (differences - 1)(That actually gives a binary number which is all
1s from the lowest set bit to the least significant bit).Then you can either convert to string and find the length, or take the logarithm base 2.
But the fundamental problem is still there: the fastest code is the code which isn't executed, and when the prefix is very short then a simple loop from the start which compares characters is hard to beat. If you really have to beat it, I think you're probably looking at writing some C which coerces a pointer to (16-bit?) characters into a pointer to 64-bit integers and uses that to compare a block of characters at a time. Beware endianness.
Code Snippets
differences = int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)
least_significant_difference = differences ^ (differences - 1)Context
StackExchange Code Review Q#124080, answer score: 2
Revisions (0)
No revisions yet.