Optimizing multiplication of square matrices for full CPU utilization
Problem
I am learning about HPC and code optimization, and I am attempting to replicate the results in Goto's seminal matrix multiplication paper. Despite my best efforts, I cannot get above ~50% of the maximum theoretical CPU performance.
Background
See related issues here, including info about my hardware.
What I have attempted
This related paper has a good description of Goto's algorithmic structure. I provide my source code below.
My question
I am asking for general help. I have been working on this for far too long and have tried many different algorithms, inline assembly, and inner kernels of various sizes (2x2, 4x4, 2x8, ..., mxn with m and n large), yet I cannot seem to exceed 50% of peak CPU GFLOPS. This is purely for educational purposes and is not homework.
Compile Options
On 32-bit GCC:

```
gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-loops -fomit-frame-pointer -masm=intel
```
Source Code
I set up the macro structure (for loops) as described in the second paper above. I pack the matrices as discussed in either paper. My inner kernel computes 2x8 blocks, as this seems to be the optimal computation for the Nehalem architecture (see the GotoBLAS source code - kernels). The inner kernel is based on the concept of calculating rank-1 updates as described here.

```
// headers (the original include targets were lost in extraction;
// these are the ones the code below plausibly needs)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <omp.h>
#include <emmintrin.h>   // SSE2: __m128d, _mm_xor_pd
#include <pmmintrin.h>   // SSE3 (-msse3)
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_*

// define some prefetch functions
#define PREFETCHNTA(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_NTA)
#define PREFETCHT0(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T0)
#define PREFETCHT1(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T1)
#define PREFETCHT2(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T2)

// define a min function
#ifndef min
#define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif

// zero a matrix
void zeromat(double *C, int n)
{
    int i = n;
    while (i--) {               // (loop body reconstructed; the
        int j = n;              //  original excerpt breaks off here)
        while (j--)
            *(C + i*n + j) = 0.0;
    }
}
// (the rest of the source is truncated in this excerpt)
```
Solution
Without addressing performance concerns, some trivial observations:
- `#include <omp.h>` is unnecessary. (You use OpenMP, but don't call any OpenMP functions.)
- The return type of `main()` should be `int`, not `void`.
- The code also compiles with `clang` (LLVM), if you omit the `-masm=intel` option.
- `zeromat()` could simply be `memset(C, 0, n * n * sizeof(double))`.
- When compiling with `-Wall`, the code in `dgemm_2x8_sse()` that zeroes some registers causes spurious warnings:

```
matmul.c:56:21: warning: variable 'r8' is uninitialized when used here [-Wuninitialized]
    r8 = _mm_xor_pd(r8,r8); // ab
                    ^~
matmul.c:52:5: note: variable 'r8' is declared here
    register __m128d xmm1, xmm4, //
    ^
```

  I recommend disabling the warnings with a pair of pragmas:

```
#pragma GCC diagnostic ignored "-Wuninitialized"
    r8 = _mm_xor_pd(r8,r8);    // ab
    r9 = _mm_xor_pd(r9,r9);
    r10 = _mm_xor_pd(r10,r10);
    r11 = _mm_xor_pd(r11,r11);
    r12 = _mm_xor_pd(r12,r12); // ab + 8
    r13 = _mm_xor_pd(r13,r13);
    r14 = _mm_xor_pd(r14,r14);
    r15 = _mm_xor_pd(r15,r15);
#pragma GCC diagnostic warning "-Wuninitialized"
```

  You should also discard the confusing and useless comment that precedes that code: `// 10 registers declared here`
- "Gflops/s" is redundant and incorrect terminology (unless you are talking about acceleration, not speed!)
Context
StackExchange Code Review Q#52140, answer score: 6