patterncppCritical
Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs
Viewed 0 times
deviationswith_mm_popcnt_u64intelreplacingintroducescpuscounterperformancecrazy
Problem
I was looking for the fastest way to
The Benchmark
As you see, we create a buffer of random data, with the size being
The (absolutely crazy) results
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running
As you see, the throughput of the
Result:
popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.The Benchmark
#include
#include
#include
int main(int argc, char* argv[]) {
using namespace std;
if (argc != 2) {
cerr (buffer);
for (unsigned i=0; i startP,endP;
{
startP = chrono::system_clock::now();
count = 0;
for( unsigned k = 0; k (endP-startP).count();
cout (endP-startP).count();
cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
free(charbuffer);
}As you see, we create a buffer of random data, with the size being
x megabytes where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times. We measure the times for the popcount. In the upper case, the inner loop variable is unsigned, in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.The (absolutely crazy) results
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):
g++ -O3 -march=native -std=c++11 test.cpp -o testHere are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running
test 1 (so 1 MB random data):- unsigned 41959360000 0.401554 sec 26.113 GB/s
- uint64_t 41959360000 0.759822 sec 13.8003 GB/s
As you see, the throughput of the
uint64_t version is only half the one of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):clang++ -O3 -march=native -std=c++11 teest.cpp -o testResult:
test 1- u
Solution
Culprit: False Data Dependency (and the compiler isn't even aware of it)
On Sandy/Ivy Bridge and Haswell processors, the instruction:
appears to have a false dependency on the destination register
Skylake fixed this for
Cannon Lake (and Ice Lake) fixed this for
(Yes, these instructions all run on the same execution unit).
This dependency doesn't just hold up the 4
The
In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.
The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the
Here are the results:
Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)
Different Registers: 18.6195 GB/s
Same Register: 8.49272 GB/s
Same Register with broken chain: 17.8869 GB/s
So what went wrong with the compiler?
It seems that neither GCC nor Visual Studio are aware that
(Update: As of version 4.9.2, GCC is aware of this false-dependency and generates code to compensate it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
Why does the CPU have such a false dependency?
We can speculate: it runs on the same execution unit as
Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output
On Sandy/Ivy Bridge and Haswell processors, the instruction:
popcnt src, destappears to have a false dependency on the destination register
dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake)Skylake fixed this for
lzcnt and tzcnt.Cannon Lake (and Ice Lake) fixed this for
popcnt.bsf/bsr have a true output dependency: output unmodified for input=0. (But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)(Yes, these instructions all run on the same execution unit).
This dependency doesn't just hold up the 4
popcnts from a single loop iteration. It can carry across loop iterations making it impossible for the processor to parallelize different loop iterations.The
unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator which assigns the registers to the variables.In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.
- 13 GB/s has a chain:
popcnt-add-popcnt-popcnt→ next iteration
- 15 GB/s has a chain:
popcnt-add-popcnt-add→ next iteration
- 20 GB/s has a chain:
popcnt-popcnt→ next iteration
- 26 GB/s has a chain:
popcnt-popcnt→ next iteration
The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the
count variable to break all other dependencies that might mess with the benchmarks.Here are the results:
Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)
- GCC 4.6.3:
g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
- Ubuntu 12
Different Registers: 18.6195 GB/s
.L4:
movq (%rbx,%rax,8), %r8
movq 8(%rbx,%rax,8), %r9
movq 16(%rbx,%rax,8), %r10
movq 24(%rbx,%rax,8), %r11
addq $4, %rax
popcnt %r8, %r8
add %r8, %rdx
popcnt %r9, %r9
add %r9, %rcx
popcnt %r10, %r10
add %r10, %rdi
popcnt %r11, %r11
add %r11, %rsi
cmpq $131072, %rax
jne .L4Same Register: 8.49272 GB/s
.L9:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# This time reuse "rax" for all the popcnts.
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L9Same Register with broken chain: 17.8869 GB/s
.L14:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# Reuse "rax" for all the popcnts.
xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L14So what went wrong with the compiler?
It seems that neither GCC nor Visual Studio are aware that
popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.(Update: As of version 4.9.2, GCC is aware of this false-dependency and generates code to compensate it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
Why does the CPU have such a false dependency?
We can speculate: it runs on the same execution unit as
bsf / bsr which do have an output dependency. (How is POPCNT implemented in hardware?). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output
Code Snippets
popcnt src, dest.L4:
movq (%rbx,%rax,8), %r8
movq 8(%rbx,%rax,8), %r9
movq 16(%rbx,%rax,8), %r10
movq 24(%rbx,%rax,8), %r11
addq $4, %rax
popcnt %r8, %r8
add %r8, %rdx
popcnt %r9, %r9
add %r9, %rcx
popcnt %r10, %r10
add %r10, %rdi
popcnt %r11, %r11
add %r11, %rsi
cmpq $131072, %rax
jne .L4.L9:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# This time reuse "rax" for all the popcnts.
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L9.L14:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# Reuse "rax" for all the popcnts.
xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L14#include <iostream>
#include <chrono>
#include <x86intrin.h>
int main(int argc, char* argv[]) {
using namespace std;
uint64_t size=1<<20;
uint64_t* buffer = new uint64_t[size/8];
char* charbuffer=reinterpret_cast<char*>(buffer);
for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;
uint64_t count,duration;
chrono::time_point<chrono::system_clock> startP,endP;
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"popcnt %4, %4 \n\t"
"add %4, %0 \n\t"
"popcnt %5, %5 \n\t"
"add %5, %1 \n\t"
"popcnt %6, %6 \n\t"
"add %6, %2 \n\t"
"popcnt %7, %7 \n\t"
"add %7, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"popcnt %4, %%rax \n\t"
"add %%rax, %0 \n\t"
"popcnt %5, %%rax \n\t"
"add %%rax, %1 \n\t"
"popcnt %6, %%rax \n\t"
"add %%rax, %2 \n\t"
"popcnt %7, %%rax \n\t"
"add %%rax, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
: "rax"
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for Context
Stack Overflow Q#25078285, score: 1764
Revisions (0)
No revisions yet.