Multiplying square matrices with C pthreads (POSIX threads)
Problem
I'm a student, and I'm trying to compute the product of two square matrices with a few threads in the mix. I've already worked on a fast single-threaded product (with some cache-friendly tricks), and I'm now interested in multithreading.
The following code works, but I don't fully understand my outputs in terms of performance.
I'd like to improve my code a bit and, of course, improve my implementation (and, in a perfect world, achieve a multithreaded version that is faster than a naive single-threaded one).
Makefile:
```
CC=gcc
CCO=# -O3 -march=native
CCF=-std=gnu11 -Wall -Wextra -pedantic -Wno-unused
LDF=-pthread
EXTRA=-DNB_TH=8

all: modular B_bloc K_bloc bench

modular:
	$(CC) -Dmodular $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o modular

B_bloc:
	$(CC) -DB_bloc $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o B_bloc

K_bloc:
	$(CC) -DK_bloc $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o K_bloc

mr_proper:
	rm -f modular B_bloc K_bloc

bench:
	./modular 128
	./B_bloc 128
	./K_bloc 128
	./modular 208
	./B_bloc 208
	./K_bloc 208
	./modular 400
	./B_bloc 400
	./K_bloc 400
	./modular 1024
	./B_bloc 1024
	./K_bloc 1024
```
mat_mult.c:
```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
// Pick your favorite flavor
// #define modular
// #define B_bloc
// #define K_bloc
#ifdef modular
#define FOO_TH modular_
#define MALLOC_P_TH default_alloc_
#elif defined(B_bloc)
#define FOO_TH B_bloc_
#define MALLOC_P_TH default_alloc_
#elif defined(K_bloc)
#define FOO_TH K_bloc_
#define MALLOC_P_TH K_bloc_alloc_
#ifndef BLK
#define BLK 2
#endif
#endif
#ifndef BLK
#define BLK 16
#endif
#ifndef NB_TH
#define NB_TH 8
#endif
#define THREAD_WAIT 1
#define THREAD_OVER 0
// / // / // / // / // / // / // / // / // / // / // / // / // / // / // / // /
// TOOLS
// / // / // / // / // / // / // / // / // / // / // / // / // / // / // / // /
// Tool to build a matrix
void build_mat_(int*** tar
```
Solution
Wrong loop order
In your naive matrix multiplication, you have your loops in a suboptimal ordering. I compared your implementation with this one, where I swapped the second and third loops:
```
for(int i = 0; i < width; ++i)
    for(int k = 0; k < width; ++k)
        for(int j = 0; j < height; ++j)
            D[i][j] += A[i][k] * B[k][j];
```

When I ran `B_bloc 1024`, I got these numbers (no threads involved):

```
Original loops: 9.89 sec
Swapped loops : 1.07 sec
```
So clearly you need to think about cache usage a little bit more. The version I listed here is more cache friendly because A[i][k] is constant in the inner loop while the other two matrices are being iterated in memory order. With the original loops, where the inner loop iterated on k, the B[k][j] term was jumping cache lines on every loop iteration.
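For contrast, this is what that original ordering looks like (reconstructed from the description above for illustration, not copied from the post):

```
// Original i-j-k ordering (reconstructed): the inner loop runs over k, so
// B[k][j] strides down a column of B and touches a new cache line on
// every single iteration.
for(int i = 0; i < width; ++i)
    for(int j = 0; j < height; ++j)
        for(int k = 0; k < width; ++k)
            D[i][j] += A[i][k] * B[k][j];
```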
Use a struct!

The code you use to pass information between the main thread and the pthreads is not good at all. You are reinventing the wheel by constructing your own hand-coded struct:
```
// This is what MALLOC_P_TH() calls:
void* default_alloc_()
{
    return malloc(0
        + sizeof(int) * 3
        + sizeof(int*) * 1
        + sizeof(int**) * 3
    );
}

// In main():
thread_p[i] = MALLOC_P_TH();
int* int_p = (int*) thread_p[i];
int_p[0] = width;
int_p[1] = height;
int_p[2] = i;
int** vec_p = (int**) (int_p + 3);
vec_p[0] = thread_status;
int*** mat_p = (int***) (vec_p + 1);
mat_p[0] = A;
mat_p[1] = B;
mat_p[2] = C;
```

You should be using an actual struct, like this:
```
typedef struct ThreadArgs {
    int width;
    int height;
    int threadNum;
    int *pStatus;
    int **matrixA;
    int **matrixB;
    int **matrixC;
} ThreadArgs;

// In main():
thread_p[i] = malloc(sizeof(ThreadArgs));
thread_p[i]->width = width;
thread_p[i]->height = height;
thread_p[i]->threadNum = i;
thread_p[i]->pStatus = thread_status;
thread_p[i]->matrixA = A;
thread_p[i]->matrixB = B;
thread_p[i]->matrixC = C;
```

As you can see, this improves the code in many ways:
- You don't need a custom function just to compute the size of the struct.
- Your fields are actually named, so you can refer to each field by name rather than by number.
- If you need to add or remove fields, you simply modify the struct rather than fiddling with the hand-coded arrays.
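To round this out, here is a sketch of how the worker thread could unpack such a struct. The entry-point name mult_thread_ and the row-interleaved work split are assumptions for illustration; the post does not show the original thread function in full:

```
// Hypothetical pthread entry point (the name and the modulo work split
// are illustrative assumptions, not the original code).
void* mult_thread_(void* arg)
{
    ThreadArgs* a = (ThreadArgs*) arg;

    // Each thread takes every NB_TH-th row, using the cache-friendly
    // i-k-j loop order discussed above.
    for (int i = a->threadNum; i < a->height; i += NB_TH)
        for (int k = 0; k < a->width; ++k)
            for (int j = 0; j < a->width; ++j)
                a->matrixC[i][j] += a->matrixA[i][k] * a->matrixB[k][j];

    a->pStatus[a->threadNum] = THREAD_OVER;  // tell main() we are done
    return NULL;
}
```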
Using the wrong clock function
When I ran your program, I was surprised to see that none of your multithreaded versions ran faster than the single-threaded version. However, I discovered that you were using the wrong function to time your program. The clock() function measures CPU time consumed across all threads, so even if your program ran 8 times faster, you would still get the same reading from clock() as for a single-threaded program. What you want is a wall-clock measurement. I would suggest using something like clock_gettime() or gettimeofday(). After I modified your program to use clock_gettime() and fixed the loop order, I ran B_bloc and got the following output, which makes a lot more sense (I have a 4-core system, so the 0.25 on the third line is expected):

```
$ B_bloc 1024
1.102912
0.276752
0.250928
```
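A minimal timing helper along those lines could look like this (a sketch; CLOCK_MONOTONIC is one reasonable choice of clock, and error checking is omitted):

```
#include <stdio.h>
#include <time.h>

// Returns wall-clock seconds from a monotonic clock, which is immune
// to system time adjustments.
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

// Usage:
//   double t0 = wall_seconds();
//   /* ... run the multiplication ... */
//   printf("%f\n", wall_seconds() - t0);
```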
Context
StackExchange Code Review Q#121494, answer score: 3