HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Optimizing multiplication of square matrices for full CPU utilization

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fullutilizationmultiplicationoptimizingforsquarecpumatrices

Problem

Problem

I am learning about HPC and code optimization. I attempt to replicate the results in Goto's seminal matrix multiplication paper. Despite my best efforts, I cannot get over ~50% maximum theoretical CPU performance.

Background

See related issues here, including info about my hardware.

What I have attempted

This related paper has a good description of Goto's algorithmic structure. I provide my source code below.

My question

I am asking for general help. I have been working on this for far too long, have tried many different algorithms, inline assembly, inner kernels of various sizes (2x2, 4x4, 2x8, ..., mxn with m and n large), yet I cannot seem to break 50% CPU GFLOPS. This is purely for education purposes and not a homework.

Compile Options

On 32 bit GCC:


gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-loops -fomit-frame-pointer -masm=intel

Source Code

I set up the macro structure (for loops) as described in the 2nd paper above. I pack the matrices as discussed in either paper. My inner kernel computes 2x8 blocks, as this seems to be the optimal computation for Nehalem architecture (see GotoBLAS source code - kernels). The inner kernel is based on the concept of calculating rank-1 updates as described here.

```
#include
#include
#include
#include
#include
#include
#include
#include

// define some prefetch functions
#define PREFETCHNTA(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_NTA)

#define PREFETCHT0(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T0)

#define PREFETCHT1(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T1)

#define PREFETCHT2(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T2)

// define a min function
#ifndef min
#define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif

// zero a matrix
void zeromat(double *C, int n)
{
int i = n;

Solution

Without addressing performance concerns, some trivial observations:

  • #include is unnecessary. (You use OpenMP, but don't call any OpenMP functions.)



  • The return type of main() should be int, not void.



  • The code also compiles with clang (LLVM), if you omit the -masm=intel option.



  • zeromat() could simply be memset(C, 0, n * sizeof(double)).



-
When compiling with -Wall, the code in dgemm_2x8_sse() to zero some registers causes spurious warnings:

matmul.c:56:21: warning: variable 'r8' is uninitialized when used here [-Wuninitialized]
    r8 = _mm_xor_pd(r8,r8); // ab
                    ^~
matmul.c:52:5: note: variable 'r8' is declared here
    register __m128d xmm1, xmm4, //
    ^


I recommend disabling the warnings with a pair of pragmas:

#pragma GCC diagnostic ignored "-Wuninitialized"
    r8 = _mm_xor_pd(r8,r8); // ab
    r9 = _mm_xor_pd(r9,r9);
    r10 = _mm_xor_pd(r10,r10);
    r11 = _mm_xor_pd(r11,r11);

    r12 = _mm_xor_pd(r12,r12); // ab + 8
    r13 = _mm_xor_pd(r13,r13);
    r14 = _mm_xor_pd(r14,r14);
    r15 = _mm_xor_pd(r15,r15);
#pragma GCC diagnostic warning "-Wuninitialized"


You should also discard the confusing and useless comment that precedes that code:

// 10 registers declared here


  • "Gflops/s" is redundant and incorrect terminology (unless you are talking about acceleration, not speed!)

Code Snippets

matmul.c:56:21: warning: variable 'r8' is uninitialized when used here [-Wuninitialized]
    r8 = _mm_xor_pd(r8,r8); // ab
                    ^~
matmul.c:52:5: note: variable 'r8' is declared here
    register __m128d xmm1, xmm4, //
    ^
#pragma GCC diagnostic ignored "-Wuninitialized"
    r8 = _mm_xor_pd(r8,r8); // ab
    r9 = _mm_xor_pd(r9,r9);
    r10 = _mm_xor_pd(r10,r10);
    r11 = _mm_xor_pd(r11,r11);

    r12 = _mm_xor_pd(r12,r12); // ab + 8
    r13 = _mm_xor_pd(r13,r13);
    r14 = _mm_xor_pd(r14,r14);
    r15 = _mm_xor_pd(r15,r15);
#pragma GCC diagnostic warning "-Wuninitialized"
// 10 registers declared here

Context

StackExchange Code Review Q#52140, answer score: 6

Revisions (0)

No revisions yet.