Optimizing multiplication of square matrices for full CPU utilization
Problem
I am learning about HPC and code optimization, and I am attempting to replicate the results in Goto's seminal matrix multiplication paper. Despite my best efforts, I cannot get above ~50% of the maximum theoretical CPU performance.
Background
See related issues here, including info about my hardware.
What I have attempted
This related paper has a good description of Goto's algorithmic structure. I provide my source code below.
My question
I am asking for general help. I have been working on this for far too long and have tried many different algorithms, inline assembly, and inner kernels of various sizes (2x2, 4x4, 2x8, ..., mxn with m and n large), yet I cannot seem to exceed 50% of peak CPU GFLOPS. This is purely for educational purposes and is not homework.
Compile Options
On 32-bit GCC:

```
gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-loops -fomit-frame-pointer -masm=intel
```
Source Code
I set up the macro structure (for loops) as described in the second paper above. I pack the matrices as discussed in either paper. My inner kernel computes 2x8 blocks, as this seems to be the optimal computation for the Nehalem architecture (see the GotoBLAS source code - kernels). The inner kernel is based on the concept of calculating rank-1 updates as described here.

```
// headers (the original include targets were lost in extraction;
// these are the ones the code below plausibly needs)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <omp.h>
#include <emmintrin.h>   // SSE2: __m128d, _mm_xor_pd
#include <pmmintrin.h>   // SSE3 (-msse3)
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_*

// define some prefetch functions
#define PREFETCHNTA(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_NTA)
#define PREFETCHT0(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T0)
#define PREFETCHT1(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T1)
#define PREFETCHT2(addr,nrOfBytesAhead) \
_mm_prefetch(((char *)(addr))+nrOfBytesAhead,_MM_HINT_T2)

// define a min function
#ifndef min
#define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif

// zero a matrix
void zeromat(double *C, int n)
{
    int i = n;
    while (i--) {               // (loop body reconstructed; the
        int j = n;              //  original excerpt breaks off here)
        while (j--)
            *(C + i*n + j) = 0.0;
    }
}
// (the rest of the source is truncated in this excerpt)
```
Solution
Without addressing performance concerns, some trivial observations:
- `#include <omp.h>` is unnecessary. (You use OpenMP, but don't call any OpenMP functions.)
- The return type of `main()` should be `int`, not `void`.
- The code also compiles with `clang` (LLVM), if you omit the `-masm=intel` option.
- `zeromat()` could simply be `memset(C, 0, n * n * sizeof(double))`.
- When compiling with `-Wall`, the code in `dgemm_2x8_sse()` that zeroes some registers causes spurious warnings:

```
matmul.c:56:21: warning: variable 'r8' is uninitialized when used here [-Wuninitialized]
    r8 = _mm_xor_pd(r8,r8); // ab
                    ^~
matmul.c:52:5: note: variable 'r8' is declared here
    register __m128d xmm1, xmm4, //
    ^
```

  I recommend disabling the warnings with a pair of pragmas:

```
#pragma GCC diagnostic ignored "-Wuninitialized"
    r8 = _mm_xor_pd(r8,r8);    // ab
    r9 = _mm_xor_pd(r9,r9);
    r10 = _mm_xor_pd(r10,r10);
    r11 = _mm_xor_pd(r11,r11);
    r12 = _mm_xor_pd(r12,r12); // ab + 8
    r13 = _mm_xor_pd(r13,r13);
    r14 = _mm_xor_pd(r14,r14);
    r15 = _mm_xor_pd(r15,r15);
#pragma GCC diagnostic warning "-Wuninitialized"
```

  You should also discard the confusing and useless comment that precedes that code: `// 10 registers declared here`
- "Gflops/s" is redundant and incorrect terminology (unless you are talking about acceleration, not speed!)
Context
StackExchange Code Review Q#52140, answer score: 6