HiveBrain v1.2.0
Get Started
← Back to all entries
patternrustCritical

Why is there a large performance impact when looping over an array with 240 or more elements?

Submitted by: @import:stackoverflow-api··
0
Viewed 0 times
arraywithloopingwhyimpactmoreover240whenelements

Problem

When running a sum loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.

Is there special compilation optimization Rust is doing for "short" arrays?

Compiled with rustc -C opt-level=3.

use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}
let mut sum = 0;
let now = Instant::now();
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}
println!("sum:{} time:{:?}", sum, now.elapsed());
}

Solution

Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.

You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied with 8 bytes, assuming x86-64 CPU). In this benchmark, one specific optimization – only performed for length 239 – is responsible for the huge speed difference. But let's start slowly:

(All code in this answer is compiled with -C opt-level=3)

pub fn foo() -> usize {
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
s
}


This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:

movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0


This is what's called loop unrolling: LLVM pastes the loop body a bunch of time to avoid having to execute all those "loop management instructions", i.e. incrementing the loop variable, check if the loop has ended and the jump to the start of the loop.

In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.

Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.

LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.

But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:

const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}

let mut sum = 0;
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}

sum
}


(On Godbolt Compiler Explorer)

The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.

The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!

First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly; the comments with * are especially important):

; at the start of the function, rbx was set to 0

movq rax, xmm1 ; result of SIMD summing up stored in
rax
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in
rax


As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have ran and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the ou

Context

Stack Overflow Q#57458460, score: 398

Revisions (0)

No revisions yet.