Multiplying square matrices with C pthreads (POSIX threads)
Problem
I'm a student, and I'm trying to compute the product of two square matrices with a few threads in the mix. I've already worked on a fast single-threaded product (with some cache-friendly tricks), and I'm now interested in multithreading.
The following code works, but I don't fully understand my outputs in terms of performance.
I'd like to improve my code a bit and, of course, improve my implementation (and, in a perfect world, achieve a multithreaded version that is faster than a naive single-threaded one).
Makefile:
```
CC=gcc
CCO=# -O3 -march=native
CCF=-std=gnu11 -Wall -Wextra -pedantic -Wno-unused
LDF=-pthread
EXTRA=-DNB_TH=8

all: modular B_bloc K_bloc bench

modular:
	$(CC) -Dmodular $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o modular

B_bloc:
	$(CC) -DB_bloc $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o B_bloc

K_bloc:
	$(CC) -DK_bloc $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o K_bloc

mr_proper:
	rm -f modular B_bloc K_bloc

bench:
	./modular 128
	./B_bloc 128
	./K_bloc 128
	./modular 208
	./B_bloc 208
	./K_bloc 208
	./modular 400
	./B_bloc 400
	./K_bloc 400
	./modular 1024
	./B_bloc 1024
	./K_bloc 1024
```
mat_mult.c:
```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
// Pick your favorite flavor
// #define modular
// #define B_bloc
// #define K_bloc
#ifdef modular
#define FOO_TH modular_
#define MALLOC_P_TH default_alloc_
#elif defined(B_bloc)
#define FOO_TH B_bloc_
#define MALLOC_P_TH default_alloc_
#elif defined(K_bloc)
#define FOO_TH K_bloc_
#define MALLOC_P_TH K_bloc_alloc_
#ifndef BLK
#define BLK 2
#endif
#endif
#ifndef BLK
#define BLK 16
#endif
#ifndef NB_TH
#define NB_TH 8
#endif
#define THREAD_WAIT 1
#define THREAD_OVER 0
// / // / // / // / // / // / // / // / // / // / // / // / // / // / // / // /
// TOOLS
// / // / // / // / // / // / // / // / // / // / // / // / // / // / // / // /
// Tool to build a matrix
void build_mat_(int*** tar
```
Solution
Wrong loop order
In your naive matrix multiplication, you have your loops in a suboptimal ordering. I compared your implementation with this one, where I swapped the second and third loops:
```
for(int i = 0; i < width; ++i)
    for(int k = 0; k < width; ++k)
        for(int j = 0; j < height; ++j)
            D[i][j] += A[i][k] * B[k][j];
```

When I ran `B_bloc 1024`, I got these numbers (no threads involved):

```
Original loops: 9.89 sec
Swapped loops : 1.07 sec
```
So clearly you need to think about cache usage a little bit more. The version I listed here is more cache friendly because A[i][k] is constant in the inner loop while the other two matrices are being iterated in memory order. With the original loops, where the inner loop iterated on k, the B[k][j] term was jumping cache lines on every loop iteration.
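For contrast, this is what that original ordering looks like (reconstructed from the description above for illustration, not copied from the post):

```
// Original i-j-k ordering (reconstructed): the inner loop runs over k, so
// B[k][j] strides down a column of B and touches a new cache line on
// every single iteration.
for(int i = 0; i < width; ++i)
    for(int j = 0; j < height; ++j)
        for(int k = 0; k < width; ++k)
            D[i][j] += A[i][k] * B[k][j];
```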
Use a struct!

The code you use to pass information between the main thread and the pthreads is not good at all. You are reinventing the wheel by constructing your own hand-coded struct:
```
// This is what MALLOC_P_TH() calls:
void* default_alloc_()
{
    return malloc(0
        + sizeof(int) * 3
        + sizeof(int*) * 1
        + sizeof(int**) * 3
    );
}

// In main():
thread_p[i] = MALLOC_P_TH();
int* int_p = (int*) thread_p[i];
int_p[0] = width;
int_p[1] = height;
int_p[2] = i;
int** vec_p = (int**) (int_p + 3);
vec_p[0] = thread_status;
int*** mat_p = (int***) (vec_p + 1);
mat_p[0] = A;
mat_p[1] = B;
mat_p[2] = C;
```

You should be using an actual struct, like this:
```
typedef struct ThreadArgs {
    int width;
    int height;
    int threadNum;
    int *pStatus;
    int **matrixA;
    int **matrixB;
    int **matrixC;
} ThreadArgs;

// In main():
thread_p[i] = malloc(sizeof(ThreadArgs));
thread_p[i]->width = width;
thread_p[i]->height = height;
thread_p[i]->threadNum = i;
thread_p[i]->pStatus = thread_status;
thread_p[i]->matrixA = A;
thread_p[i]->matrixB = B;
thread_p[i]->matrixC = C;
```

As you can see, this improves the code in many ways:
- You don't need a custom function just to compute the size of the struct.
- Your fields are actually named, so you can refer to each field by name rather than by number.
- If you need to add or remove fields, you simply modify the struct rather than fiddling with the hand-coded arrays.
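To round this out, here is a sketch of how the worker thread could unpack such a struct. The entry-point name mult_thread_ and the row-interleaved work split are assumptions for illustration; the post does not show the original thread function in full:

```
// Hypothetical pthread entry point (the name and the modulo work split
// are illustrative assumptions, not the original code).
void* mult_thread_(void* arg)
{
    ThreadArgs* a = (ThreadArgs*) arg;

    // Each thread takes every NB_TH-th row, using the cache-friendly
    // i-k-j loop order discussed above.
    for (int i = a->threadNum; i < a->height; i += NB_TH)
        for (int k = 0; k < a->width; ++k)
            for (int j = 0; j < a->width; ++j)
                a->matrixC[i][j] += a->matrixA[i][k] * a->matrixB[k][j];

    a->pStatus[a->threadNum] = THREAD_OVER;  // tell main() we are done
    return NULL;
}
```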
Using the wrong clock function
When I ran your program, I was surprised to see that none of your multithreaded versions ran faster than the single-threaded version. However, I discovered that you were using the wrong function to time your program. The clock() function measures CPU time consumed across all threads, so even if your program ran 8 times faster, you would still get the same reading from clock() as for a single-threaded program. What you want is a wall-clock measurement. I would suggest using something like clock_gettime() or gettimeofday(). After I modified your program to use clock_gettime() and fixed the loop order, I ran B_bloc and got the following output, which makes a lot more sense (I have a 4-core system, so the 0.25 on the third line is expected):

```
$ B_bloc 1024
1.102912
0.276752
0.250928
```
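A minimal timing helper along those lines could look like this (a sketch; CLOCK_MONOTONIC is one reasonable choice of clock, and error checking is omitted):

```
#include <stdio.h>
#include <time.h>

// Returns wall-clock seconds from a monotonic clock, which is immune
// to system time adjustments.
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

// Usage:
//   double t0 = wall_seconds();
//   /* ... run the multiplication ... */
//   printf("%f\n", wall_seconds() - t0);
```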
Context
StackExchange Code Review Q#121494, answer score: 3