patterncModerate

Integer Log2 implemented using binary search

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

codereview c stackoverflow performance bitwise comparative-review

searchimplementedlog2binaryusinginteger

Problem

I saw a simple method to calculate Log2(N), where N is the power of 2:

unsigned int Log2(unsigned int N)
{
    unsigned int n = 0;
    while (N >>= 1)
    {
        ++n;
    }
    return n;
}

It would be not likely very necessary, but I try to improve it in a simple way, the idea comes from binary search.

unsigned int Log2New(unsigned int N)
{
    unsigned int bits = sizeof(N) * 4;
    unsigned int n = 0;
    while (N > 1)
    {
        if (N >> bits)
        {
            N >>= bits;
            n += bits;
        }
        bits >>= 1;
    }
    return n;
}

I run the test in Visual Studio 2013, 32 bits release build and fully optimized. When I look into generated disassembly, both methods are inlined, and there are nothing special in both function. There is only one shr command as expected in

if (N >> bits) { N >>= bits; ... }

I test both of the method by sending 1 << n, n from 0 to 31 to them, and repeat for 1000000 times, and count the time separately.

The new method is about 1.66 times faster than the old one,

Then I make a further change, expand the while loop in the new method. (There was a word for this 'expand', but I don't remember it exactly 8-| )

To make it clear, I make a new define, and the method become:

#define CountShift(bits)  if ((N)>>(bits)) { (N)>>=(bits); (n) += (bits); }
unsigned int Log2NewExpand(unsigned int N)
{
    unsigned int n = 0;    
    CountShift(16);
    CountShift(8);
    CountShift(4);
    CountShift(2);
    CountShift(1);

    return n;
}

Looking into the disassembly again, as expected, it's fully serialized, only 5 test, je shr and add commands.

Timing again, the expanded method is again about 1.66 times faster then the new one, and about 2.8 times faster than the original method.

I think there must be some more space for improvement. Any suggestion are welcome.

Solution

I was initially skeptical of your benchmarking results because of your testing methodology:

I test both of the method by sending 1 2, it turns out that the position of the most-significant set bit is the same as the binary logarithm of the represented value. For example, consider the number 21658123. Its binary logarithm is 24, and its binary representation is:

0001 0100 1010 0111 1010 0000 1011
↑  ↑                             ↑
|  └————— bit 24                 |
bit 31                           bit 0

The index of the first set bit, when searching from most-significant bit (MSB) to least-significant bit (LSB), is 24—exactly the same as the binary logarithm. This hints at an exceptionally simple algorithm (at least conceptually): scan through the binary representation of the number looking for the most significant set bit! As luck would have it, there is an instruction on x86 to do precisely what we want to do, and it's been available since the 386 way back in 1986. That is the BSR instruction (Bit Scan Reverse), usable in MSVC via the _BitScanReverse intrinsic and on Gnu compilers (i.e., GCC and Clang) via the __builtin_clz intrinsic (although you'll need to XOR the result with 32). And even if you're not targeting the x86, the intrinsic will likely still work for you, causing the compiler to use an architecture-provided instruction if available, or emit code to generate the equivalent result.

Here's our updated function:

unsigned int Log2_ViaBSR(unsigned int value)
{
    unsigned long result;
    _BitScanReverse(&result, static_cast(value));
    return result;
}

(Yes, code that should be a single line looks a little bit bloated here because MSVC's _BitScanReverse intrinsic works only with DWORD values, which is a typedef for unsigned long, not unsigned int, so although on Windows they are both 32-bit types, they are nevertheless different types according to the language standard, so you need to declare and use an additional temporary for the output, instead of simply reusing the input parameter. Fortunately, this does not affect the object code generated by the compiler. Note, also, that the intrinsic returns a Boolean value indicating whether it was successful. This allows you to handle the case where the input is 0 and the result is undefined—see "caveats" below.)

And the updated leaderboard:

Benchmark                            Time           CPU Iterations
------------------------------------------------------------------
Log2                                22 ns         21 ns   32000000
Log2New                             10 ns         10 ns   64000000
Log2NewUnrolled                      6 ns          6 ns  112000000
Log2Fast                             7 ns          6 ns   89600000
Log2Fast_Intrinsic                   4 ns          4 ns  186666667
Log2_ViaBSR                          2 ns          2 ns  298666667

Stop and think about this for a second. We went from an O(log n) algorithm to what is effectively an O(1) algorithm, just by thinking about the problem in a different way, and figuring out how we can leverage the computer's internal binary representation to our advantage. It reminds me of a lesson straight out of Michael Abrash's Zen of Assembly Language. We now have the most efficient algorithm and have earned those style points!

Now, there are a couple of caveats. First off, log2(0) is mathematically undefined, so you have to decide how you want to handle that. I've simply ignored it in all of the code I've written, making it undefined there as well. The BSR instruction and its corresponding intrinsics take the same tack, formally defining the result as "undefined" when the input is 0. And although there is actually a lot more that I could say about this (including possible workarounds for "undefined" inputs, inefficient code generation when this intrinsic is used in MSVC, the similar LZCNT instruction introduced on Intel Haswell, etc.), I probably shouldn't make this answer any longer. Instead, I'll refer you to Peter Cordes's recent answer on the subject, and my comments below it.

Code Snippets

Benchmark                     Time           CPU Iterations
-----------------------------------------------------------
Log2                         21 ns         21 ns   29866667
Log2New                      10 ns         10 ns   74666667
Log2NewUnrolled               6 ns          6 ns  112000000

unsigned int Log2Fast(unsigned int N)
{
   N |= (N >> 1);
   N |= (N >> 2);
   N |= (N >> 4);
   N |= (N >> 8);
   N |= (N >> 16);
   return (PopCount(N) - 1);
}

Log2Fast PROC
    mov ecx, DWORD PTR [N]

    mov    eax, ecx       // N |= (N >> 1)
    shr    eax, 1
    or     ecx, eax

    mov    eax, ecx       // N |= (N >> 2)
    shr    eax, 2
    or     ecx, eax

    mov    eax, ecx       // N |= (N >> 4)
    shr    eax, 4
    or     ecx, eax

    mov    eax, ecx       // N |= (N >> 8)
    shr    eax, 8
    or     ecx, eax

    mov    eax, ecx       // N |= (N >> 16)
    shr    eax, 16
    or     eax, ecx

    popcnt eax, eax       // return (PopCount(N) - 1)
    dec    eax
    ret    4
Log2Fast ENDP

Benchmark                            Time           CPU Iterations
------------------------------------------------------------------
Log2                                21 ns         20 ns   34461538
Log2New                             10 ns         10 ns   64000000
Log2NewUnrolled                      6 ns          6 ns  112000000
Log2Fast                             5 ns          5 ns   89600000
Log2Fast_Intrinsic                   4 ns          4 ns  179200000

unsigned int Log2_ViaBSR(unsigned int value)
{
    unsigned long result;
    _BitScanReverse(&result, static_cast<unsigned long>(value));
    return result;
}

Context

StackExchange Code Review Q#151312, answer score: 10

Revisions (0)

No revisions yet.