patterncModerate
Integer Log2 implemented using binary search
Viewed 0 times
searchimplementedlog2binaryusinginteger
Problem
I saw a simple method to calculate Log2(N), where N is the power of 2:
It would be not likely very necessary, but I try to improve it in a simple way, the idea comes from binary search.
I run the test in Visual Studio 2013, 32 bits release build and fully optimized. When I look into generated disassembly, both methods are inlined, and there are nothing special in both function. There is only one shr command as expected in
I test both of the method by sending 1 << n, n from 0 to 31 to them, and repeat for 1000000 times, and count the time separately.
The new method is about 1.66 times faster than the old one,
Then I make a further change, expand the while loop in the new method. (There was a word for this 'expand', but I don't remember it exactly 8-| )
To make it clear, I make a new define, and the method become:
Looking into the disassembly again, as expected, it's fully serialized, only 5 test, je shr and add commands.
Timing again, the expanded method is again about 1.66 times faster then the new one, and about 2.8 times faster than the original method.
I think there must be some more space for improvement. Any suggestion are welcome.
unsigned int Log2(unsigned int N)
{
unsigned int n = 0;
while (N >>= 1)
{
++n;
}
return n;
}It would be not likely very necessary, but I try to improve it in a simple way, the idea comes from binary search.
unsigned int Log2New(unsigned int N)
{
unsigned int bits = sizeof(N) * 4;
unsigned int n = 0;
while (N > 1)
{
if (N >> bits)
{
N >>= bits;
n += bits;
}
bits >>= 1;
}
return n;
}I run the test in Visual Studio 2013, 32 bits release build and fully optimized. When I look into generated disassembly, both methods are inlined, and there are nothing special in both function. There is only one shr command as expected in
if (N >> bits) { N >>= bits; ... }I test both of the method by sending 1 << n, n from 0 to 31 to them, and repeat for 1000000 times, and count the time separately.
The new method is about 1.66 times faster than the old one,
Then I make a further change, expand the while loop in the new method. (There was a word for this 'expand', but I don't remember it exactly 8-| )
To make it clear, I make a new define, and the method become:
#define CountShift(bits) if ((N)>>(bits)) { (N)>>=(bits); (n) += (bits); }
unsigned int Log2NewExpand(unsigned int N)
{
unsigned int n = 0;
CountShift(16);
CountShift(8);
CountShift(4);
CountShift(2);
CountShift(1);
return n;
}Looking into the disassembly again, as expected, it's fully serialized, only 5 test, je shr and add commands.
Timing again, the expanded method is again about 1.66 times faster then the new one, and about 2.8 times faster than the original method.
I think there must be some more space for improvement. Any suggestion are welcome.
Solution
I was initially skeptical of your benchmarking results because of your testing methodology:
I test both of the method by sending 1 2, it turns out that the position of the most-significant set bit is the same as the binary logarithm of the represented value. For example, consider the number
The index of the first set bit, when searching from most-significant bit (MSB) to least-significant bit (LSB), is 24—exactly the same as the binary logarithm. This hints at an exceptionally simple algorithm (at least conceptually): scan through the binary representation of the number looking for the most significant set bit! As luck would have it, there is an instruction on x86 to do precisely what we want to do, and it's been available since the 386 way back in 1986. That is the
Here's our updated function:
(Yes, code that should be a single line looks a little bit bloated here because MSVC's
And the updated leaderboard:
Stop and think about this for a second. We went from an O(log n) algorithm to what is effectively an O(1) algorithm, just by thinking about the problem in a different way, and figuring out how we can leverage the computer's internal binary representation to our advantage. It reminds me of a lesson straight out of Michael Abrash's Zen of Assembly Language. We now have the most efficient algorithm and have earned those style points!
Now, there are a couple of caveats. First off, log2(0) is mathematically undefined, so you have to decide how you want to handle that. I've simply ignored it in all of the code I've written, making it undefined there as well. The
I test both of the method by sending 1 2, it turns out that the position of the most-significant set bit is the same as the binary logarithm of the represented value. For example, consider the number
21658123. Its binary logarithm is 24, and its binary representation is:0001 0100 1010 0111 1010 0000 1011
↑ ↑ ↑
| └————— bit 24 |
bit 31 bit 0
The index of the first set bit, when searching from most-significant bit (MSB) to least-significant bit (LSB), is 24—exactly the same as the binary logarithm. This hints at an exceptionally simple algorithm (at least conceptually): scan through the binary representation of the number looking for the most significant set bit! As luck would have it, there is an instruction on x86 to do precisely what we want to do, and it's been available since the 386 way back in 1986. That is the
BSR instruction (Bit Scan Reverse), usable in MSVC via the _BitScanReverse intrinsic and on Gnu compilers (i.e., GCC and Clang) via the __builtin_clz intrinsic (although you'll need to XOR the result with 32). And even if you're not targeting the x86, the intrinsic will likely still work for you, causing the compiler to use an architecture-provided instruction if available, or emit code to generate the equivalent result.Here's our updated function:
unsigned int Log2_ViaBSR(unsigned int value)
{
unsigned long result;
_BitScanReverse(&result, static_cast(value));
return result;
}(Yes, code that should be a single line looks a little bit bloated here because MSVC's
_BitScanReverse intrinsic works only with DWORD values, which is a typedef for unsigned long, not unsigned int, so although on Windows they are both 32-bit types, they are nevertheless different types according to the language standard, so you need to declare and use an additional temporary for the output, instead of simply reusing the input parameter. Fortunately, this does not affect the object code generated by the compiler. Note, also, that the intrinsic returns a Boolean value indicating whether it was successful. This allows you to handle the case where the input is 0 and the result is undefined—see "caveats" below.)And the updated leaderboard:
Benchmark Time CPU Iterations
------------------------------------------------------------------
Log2 22 ns 21 ns 32000000
Log2New 10 ns 10 ns 64000000
Log2NewUnrolled 6 ns 6 ns 112000000
Log2Fast 7 ns 6 ns 89600000
Log2Fast_Intrinsic 4 ns 4 ns 186666667
Log2_ViaBSR 2 ns 2 ns 298666667Stop and think about this for a second. We went from an O(log n) algorithm to what is effectively an O(1) algorithm, just by thinking about the problem in a different way, and figuring out how we can leverage the computer's internal binary representation to our advantage. It reminds me of a lesson straight out of Michael Abrash's Zen of Assembly Language. We now have the most efficient algorithm and have earned those style points!
Now, there are a couple of caveats. First off, log2(0) is mathematically undefined, so you have to decide how you want to handle that. I've simply ignored it in all of the code I've written, making it undefined there as well. The
BSR instruction and its corresponding intrinsics take the same tack, formally defining the result as "undefined" when the input is 0. And although there is actually a lot more that I could say about this (including possible workarounds for "undefined" inputs, inefficient code generation when this intrinsic is used in MSVC, the similar LZCNT instruction introduced on Intel Haswell, etc.), I probably shouldn't make this answer any longer. Instead, I'll refer you to Peter Cordes's recent answer on the subject, and my comments below it.Code Snippets
Benchmark Time CPU Iterations
-----------------------------------------------------------
Log2 21 ns 21 ns 29866667
Log2New 10 ns 10 ns 74666667
Log2NewUnrolled 6 ns 6 ns 112000000unsigned int Log2Fast(unsigned int N)
{
N |= (N >> 1);
N |= (N >> 2);
N |= (N >> 4);
N |= (N >> 8);
N |= (N >> 16);
return (PopCount(N) - 1);
}Log2Fast PROC
mov ecx, DWORD PTR [N]
mov eax, ecx // N |= (N >> 1)
shr eax, 1
or ecx, eax
mov eax, ecx // N |= (N >> 2)
shr eax, 2
or ecx, eax
mov eax, ecx // N |= (N >> 4)
shr eax, 4
or ecx, eax
mov eax, ecx // N |= (N >> 8)
shr eax, 8
or ecx, eax
mov eax, ecx // N |= (N >> 16)
shr eax, 16
or eax, ecx
popcnt eax, eax // return (PopCount(N) - 1)
dec eax
ret 4
Log2Fast ENDPBenchmark Time CPU Iterations
------------------------------------------------------------------
Log2 21 ns 20 ns 34461538
Log2New 10 ns 10 ns 64000000
Log2NewUnrolled 6 ns 6 ns 112000000
Log2Fast 5 ns 5 ns 89600000
Log2Fast_Intrinsic 4 ns 4 ns 179200000unsigned int Log2_ViaBSR(unsigned int value)
{
unsigned long result;
_BitScanReverse(&result, static_cast<unsigned long>(value));
return result;
}Context
StackExchange Code Review Q#151312, answer score: 10
Revisions (0)
No revisions yet.