
SSE2 assembly optimization - multiply unsigned shorts and add the result

Submitted by: @import:stackexchange-codereview

Problem

I am attempting to optimize a piece of C code that multiplies a series of pairs of unsigned shorts and adds up the results. I am only concerned with the high 16 bits of each product, and I can guarantee that the sum of the products will fit in a 32-bit value. I initially coded this in C, then rewrote it with SSE2 intrinsics (slowest), and then rewrote it in SSE2 assembler (fastest). I am not an expert at x86 assembler and would appreciate any recommendations on how to speed this code up. Speed is the priority; this is a tight inner loop. It is OK to assume that input will be valid, and portability is not a major concern: this code will only run on computers with Intel i5 or i7 processors. The three versions follow:

C

```
register int i;
uint16_t* iw_ptr = n->iw;
uint16_t* nw_ptr = n->nw;
register uint32_t accvalue = 0;

if (id < 2 * I_CNT) {
    /* Input pass: keep the high 16 bits of each product. */
    for (i = 0; i < I_CNT; i++)
        accvalue += ((uint32_t)i_ptr[i] * iw_ptr[i]) >> 16;
}
/* Neuron pass. */
for (i = 0; i < id; i++)
    accvalue += ((uint32_t)n_ptr[i] * nw_ptr[i]) >> 16;
```


C SSE2

```
#include <emmintrin.h> /* SSE2 intrinsics */

#define SSE2_I ((INPUT_COUNT+7)/8)
#define SSE2_N ((NEURON_COUNT+7)/8)
register int i;

__m128i  mm_sums;
__m128i  mm_arg1;
__m128i  mm_arg2;
__m128i  mm_accum = _mm_setzero_si128();
__m128i  mm_accum_mem[1];                 // scratch slot for the final store
__m128i* mm_iptr  = (__m128i*) i_ptr;
__m128i* mm_nptr  = (__m128i*) n_ptr;
__m128i* mm_iwptr = (__m128i*) n->iw_ptr;
__m128i* mm_nwptr = (__m128i*) n->nw_ptr;
uint32_t value = 0;
int id_8 = (id + 7)/8; // number of 8-short blocks, rounding up

if (id < 2 * I_CNT) {
    for (i = 0; i < SSE2_I; i++) {
        mm_arg1 = _mm_loadu_si128(mm_iptr+i);
        mm_arg2 = _mm_loadu_si128(mm_iwptr+i);
        mm_sums = _mm_mulhi_epu16(mm_arg1, mm_arg2);   // high 16 bits of each product
        mm_accum = _mm_adds_epu16(mm_accum, mm_sums);  // saturating 16-bit add
    }
}
for (i = 0; i < id_8; i++) {
    mm_arg1 = _mm_loadu_si128(mm_nptr+i);
    mm_arg2 = _mm_loadu_si128(mm_nwptr+i);
    mm_sums = _mm_mulhi_epu16(mm_arg1, mm_arg2);
    mm_accum = _mm_adds_epu16(mm_accum, mm_sums);
}
_mm_storeu_si128(mm_accum_mem, mm_accum);

for (i = 0; i < 8; i++) {
    value += *(((uint16_t*) mm_accum_mem) + i);
}
```
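
As an aside on the final scalar loop: the eight 16-bit lanes of mm_accum can also be reduced in-register with plain SSE2, widening to 32 bits first so the horizontal sum cannot overflow. A minimal sketch, reusing mm_accum and value from the snippet above:

```
__m128i zero = _mm_setzero_si128();
__m128i lo   = _mm_unpacklo_epi16(mm_accum, zero);   /* lanes 0-3 as u32 */
__m128i hi   = _mm_unpackhi_epi16(mm_accum, zero);   /* lanes 4-7 as u32 */
__m128i sum  = _mm_add_epi32(lo, hi);                /* four u32 partial sums */
sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));    /* fold to two lanes */
sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));    /* fold to one lane  */
value += (uint32_t) _mm_cvtsi128_si32(sum);
```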


SSE2 and x86 Assembler

```
#define SSE2_I ((I_CNT+7)/8)
#define SSE2_N ((N_CNT+7)/8)
__m128i* mm_iptr = (__m128i*) i_ptr;
…
```

Solution

Some suggestions:

- Forget asm (for now at least) and stick with SSE intrinsics: you can concentrate on the optimisation and let the compiler worry about implementation details such as register allocation, instruction scheduling, and loop unrolling.

- Use a decent compiler. Intel ICC typically generates better code than gcc, and Visual Studio typically worse (this is purely empirical; try as many different compilers as is practical and see which gives the best results).

- Use x86-64 if you can. It gives you twice as many SSE registers to play with (16 versus 8), which enables more loop unrolling; there is a sketch of a two-way unroll after this list.

- If your data contains a lot of zeroes, so that it is worth the cost of testing and branching to avoid redundant multiplies, you can use _mm_testz_si128 (PTEST) to skip all-zero blocks; see the sketch after this list. Make sure you time the loop with and without this change, as its benefit is highly data-dependent.

- You are doing very little computation relative to the number of loads and stores, so you may hit an optimisation brick wall due to finite memory bandwidth (you may have hit it already). If possible, fold some operations from before or after this loop into it to amortise the cost of the loads and stores.

- Try to ensure that your data is 16-byte aligned and use only aligned loads/stores if you possibly can; an allocation sketch follows the list. Even though Core i7 supposedly has zero performance penalty for misaligned loads, in practice there is still an overhead, probably due to the increased cache footprint.

- Don't mess with prefetch instructions (for now at least). It is hard to beat the automatic prefetcher in Core i7, and you may well end up reducing performance rather than improving it.
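
On the unrolling point: with the extra xmm registers of x86-64, several independent accumulators can stay live at once, which breaks the add-to-add dependency chain. A hedged sketch of a two-way unroll of the question's neuron loop (variable names reused from the question; assumes id_8 is even, otherwise one plain tail iteration is needed):

```
__m128i acc0 = _mm_setzero_si128();
__m128i acc1 = _mm_setzero_si128();

/* Two independent accumulators hide the latency of the adds. */
for (i = 0; i + 1 < id_8; i += 2) {
    acc0 = _mm_adds_epu16(acc0, _mm_mulhi_epu16(_mm_loadu_si128(mm_nptr  + i),
                                                _mm_loadu_si128(mm_nwptr + i)));
    acc1 = _mm_adds_epu16(acc1, _mm_mulhi_epu16(_mm_loadu_si128(mm_nptr  + i + 1),
                                                _mm_loadu_si128(mm_nwptr + i + 1)));
}
mm_accum = _mm_adds_epu16(mm_accum, _mm_adds_epu16(acc0, acc1));
```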
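On the PTEST suggestion: _mm_testz_si128 is an SSE4.1 intrinsic (fine on i5/i7, declared in smmintrin.h) that returns 1 when the AND of its two arguments is all zero. A sketch of the zero-skip applied to the question's neuron loop:

```
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 generates PTEST */

for (i = 0; i < id_8; i++) {
    mm_arg1 = _mm_loadu_si128(mm_nptr + i);
    if (_mm_testz_si128(mm_arg1, mm_arg1))   /* whole 8-lane block is zero */
        continue;                            /* skip the multiply and add  */
    mm_arg2 = _mm_loadu_si128(mm_nwptr + i);
    mm_accum = _mm_adds_epu16(mm_accum, _mm_mulhi_epu16(mm_arg1, mm_arg2));
}
```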
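On alignment: one way to guarantee 16-byte alignment is to allocate the arrays with _mm_malloc (or C11 aligned_alloc / posix_memalign) and then switch to the aligned load. The buffer name below is illustrative, not from the question:

```
#include <xmmintrin.h>  /* _mm_malloc / _mm_free */

/* Illustrative: allocate the neuron weights 16-byte aligned so the loop
   can use MOVDQA (_mm_load_si128) instead of MOVDQU (_mm_loadu_si128). */
uint16_t* nw = (uint16_t*) _mm_malloc(SSE2_N * sizeof(__m128i), 16);

for (i = 0; i < SSE2_N; i++) {
    __m128i w = _mm_load_si128((__m128i*) nw + i);  /* aligned load */
    /* ... multiply and accumulate as in the question ... */
}

_mm_free(nw);
```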

Context

StackExchange Code Review Q#7364, answer score: 4
