Tags: gotcha, cpp · Severity: Critical

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

Submitted by: @import:stackoverflow-api

Problem

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement an identical brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with:
nasm -felf64 p14.asm && gcc p14.o -o p14


The C++ was compiled with:
g++ p14.cpp -o p14


Assembly, p14.asm:
section .data
fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret


C++, p14.cpp:
#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}

I know about compiler optimizations to improve speed and everything, but I don't see many ways to optimize my assembly solution further (speaking programmatically, not mathematically).

The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.

But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times

My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).

  • g++ (unoptimized): avg 1272 ms.
  • g++ -O3: avg 578 ms.
  • asm (div) (original): avg 2650 ms.
  • asm (shr): avg 679 ms.
  • @johnfound asm (a

Solution

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide, for specific details of specific CPUs. See also the x86 tag wiki for more perf links.

See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).

Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see Is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx


On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: it's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
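To see the same cost difference from the C++ side: division by 2 of an unsigned integer compiles to a single shift, while signed division needs sign correction first (a sketch; the function names are hypothetical, and the commented asm is roughly what gcc -O3 emits):

#include <cstdint>

// Unsigned: one 1-uop shift.
uint64_t half_u(uint64_t n) { return n / 2; }   // shr rax, 1

// Signed: still no div, but gcc adds the sign bit before an arithmetic
// shift, roughly: mov rax,rdi / shr rax,63 / add rax,rdi / sar rax,1
int64_t half_s(int64_t n) { return n / 2; }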

As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all.)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
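For instance, a minimal sketch of the div_by_13 case mentioned above (my own comments; the constant and shift come from ceil(2^34/13), though the exact instruction sequence varies by gcc version):

// Compiled with gcc -O3, this contains no div instruction: the compiler
// multiplies by the fixed-point inverse 0x4EC4EC4F and shifts right by 34.
unsigned div_by_13(unsigned x) {
    return x / 13;
}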

gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.

Helping the compiler

(summary for this case: use uint64_t n)

First of all, it's only interesting to look at optimized compiler output (-O3); -O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place, writing your C/C++ source in a way that guides it into making better code is usually the best approach. You have to know asm and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing (see this answer and what I did with the non-unrolled loop in @Veedrac's code below).
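A concrete sketch of that advice for this question (per the summary above): making n unsigned lets the compiler turn n/2 into a plain shift and n%2 into a bit-test, with no sign-correction code:

#include <cstdint>

// Same brute-force sequence, but with uint64_t n instead of long.
int sequence(uint64_t n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;          // compiles to shr, not a signed-division fixup
        else
            n = 3*n + 1;     // compiles to lea
        ++count;
    }
    return count;
}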

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extensions or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake; e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm on one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant.
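As a hypothetical illustration of that black-box effect: a pure C++ version can even be marked constexpr and evaluated entirely at compile time, which no asm implementation can be:

#include <cstdint>

constexpr int sequence_ct(uint64_t n) {   // hypothetical constexpr variant
    int count = 1;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3*n + 1;
        ++count;
    }
    return count;
}
// Fully constant-propagated at compile time; impossible through an asm black box.
static_assert(sequence_ct(27) == 112, "27 has a 112-term Collatz chain");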

Code Snippets

# from gcc5.4 -O3  plus my comments
# edx= count=1
# rax= uint64_t n

.L9:              # do{
    lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
    mov    rdi, rax
    shr    rdi        # rdi = n>>1;
    test   al, 1      # set flags based on n%2 (aka n&1)
    mov    rax, rcx
    cmove  rax, rdi   # n= (n%2) ? 3*n+1 : n/2;
    add    edx, 1     # ++count;
    cmp    rax, 1
    jne   .L9      #}while(n!=1)

  # cmp/branch to update max and maxi, and then do the next n

### Hand-optimized version of what gcc does
.L9:                  #do{
  lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
  shr     rax, 1       # n>>=1;    CF = n&1 = n%2
  cmovc   rax, rcx     # n= (n&1) ? 3*n+1 : n/2;
  inc     edx          # ++count;
  cmp     rax, 1
  jne     .L9         #}while(n!=1)
# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
  vpaddq    ymm1, ymm0, ymm0              # 2*n
  vpaddq    ymm1, ymm1, ymm0              # 3*n
  vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1 = 3*n + 1.  Maybe could do this more efficiently?

  vpsllq    ymm3, ymm0, 63                # shift bit 0 (n&1) to the sign bit

  vpsrlq    ymm0, ymm0, 1                 # n /= 2

  # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
  vblendvpd ymm0, ymm0, ymm1, ymm3        # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

  # ymm0 = updated n  in each element.

  vpcmpeqq ymm1, ymm0, set1_epi64(1)
  vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

  vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

  vptest   ymm4, ymm4
  jnz  .inner_loop
  # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

  vextracti128 xmm0, ymm5, 1
  vpmaxuq .... oops, requires AVX-512
  # delay doing a horizontal max until the very end.  But you need some way to record max and maxi.
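
For reference, a hedged intrinsics sketch of one step of that vector loop (my function name; it mirrors the asm above, including the FP-blend-on-sign-bit trick):

#include <immintrin.h>

// One Collatz update across 4 uint64_t lanes (AVX2): compute both 3n+1
// and n/2, then select per lane based on bit 0 of n moved to the sign bit.
static inline __m256i collatz_step(__m256i n) {
    const __m256i one = _mm256_set1_epi64x(1);
    __m256i n3p1 = _mm256_add_epi64(_mm256_add_epi64(n, n), n);  // 3*n
    n3p1 = _mm256_add_epi64(n3p1, one);                          // 3*n + 1
    __m256i half = _mm256_srli_epi64(n, 1);                      // n / 2
    __m256i mask = _mm256_slli_epi64(n, 63);  // bit 0 -> sign bit of each qword
    // blendv_pd takes the second source where the mask's sign bit is set:
    // odd lanes get 3n+1, even lanes get n/2.
    return _mm256_castpd_si256(_mm256_blendv_pd(
        _mm256_castsi256_pd(half),
        _mm256_castsi256_pd(n3p1),
        _mm256_castsi256_pd(mask)));
}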
#include <cstdint>
#include <immintrin.h>   // _tzcnt_u64 (requires BMI1)

// C++ structured like the asm, for illustration only
uint64_t count_steps(uint64_t n) {   // hypothetical wrapper so the fragment compiles
    uint64_t count = 0, shift;
    goto loop_entry;
    do {
        n = n*3 + 1;
    loop_entry:
        shift = _tzcnt_u64(n);   // trailing zero count = length of the /2 run
        n >>= shift;
        count += shift;
    } while (n != 1);
    return count;
}
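Since 3n+1 of an odd number is always even, each iteration's _tzcnt_u64 strips a whole run of trailing /2 steps with one shift, instead of paying one loop iteration per division.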

Context

Stack Overflow Q#40354978, score: 2156
