HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Multiplying square matrix with C pthreads (POSIX threads)

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
pthreadswithmultiplyingthreadsposixsquarematrix

Problem

I'm a student, and I'm trying to make the product of two square matrix with some threads in the soup. I've already worked on some fast single-threaded product (with some cache-friendly tricks), and I'm interested now on multithreading.

The following code works, but I don't fully understand my outputs in terms of performance.

I'd like to improve my code a bit and of course, improve my implementation (and in a perfect world, achieving a multithreaded version faster than a naive singlethreaded version).

Makefile:

CC=gcc
CCO=# -O3 -march=native
CCF=-std=gnu11 -Wall -Wextra -pedantic -Wno-unused
LDF=-pthread
EXTRA=-DNB_TH=8

all: modular B_bloc K_bloc bench

modular:
    $(CC) -Dmodular $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o modular

B_bloc:
    $(CC) -DB_bloc $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o B_bloc

K_bloc:
    $(CC) -DK_bloc $(EXTRA) $(LDF) $(CCF) $(CCO) ./mat_mult.c -o K_bloc

mr_proper:
    rm -f modular B_bloc K_bloc

bench:
    ./modular 128
    ./B_bloc 128
    ./K_bloc 128

    ./modular 208
    ./B_bloc 208
    ./K_bloc 208

    ./modular 400
    ./B_bloc 400
    ./K_bloc 400

    ./modular 1024
    ./B_bloc 1024
    ./K_bloc 1024


mat_mult.c:

```
#include
#include
#include
#include
#include

// Pick your favorite flavor
// #define modular
// #define B_bloc
// #define K_bloc

#ifdef modular
#define FOO_TH modular_
#define MALLOC_P_TH default_alloc_
#elif defined(B_bloc)
#define FOO_TH B_bloc_
#define MALLOC_P_TH default_alloc_
#elif defined(K_bloc)
#define FOO_TH K_bloc_
#define MALLOC_P_TH K_bloc_alloc_
#ifndef BLK
#define BLK 2
#endif
#endif

#ifndef BLK
#define BLK 16
#endif
#ifndef NB_TH
#define NB_TH 8
#endif

#define THREAD_WAIT 1
#define THREAD_OVER 0

// / // / // / // / // / // / // / // / // / // / // / // / // / // / // / // /
// TOOLS
// / // / // / // / // / // / // / // / // / // / // / // / // / // / // / // /

// Tool to build a matrix
void build_mat_(int*** tar

Solution

Wrong loop order

In your naive matrix multiplication, you have your loops in a suboptimal ordering. I compared your implementation with this one, where I swapped the second and third loops:

for(int i = 0; i < width; ++i)
    for(int k = 0; k < width; ++k)
        for(int j = 0; j < height; ++j)
            D[i][j] += A[i][k] * B[k][j];


When I ran B_bloc 1024, I got these numbers (no threads involved):
Original loops: 9.89 sec
Swapped loops : 1.07 sec


So clearly you need to think about cache usage a little bit more. The version I listed here is more cache friendly because A[i][k] is constant in the inner loop while the other two matrices are being iterated in memory order. With the original loops where the inner loop iterated on k, the B[k][j] term was jumping cache lines on every loop iteration.
Use a struct!

The code you use to pass information between the main thread and the pthreads is not good at all. You are reinventing the wheel by constructing your own handcoded struct:

// This is what MALLOC_P_TH() calls:
    void* default_alloc_()
    {
        return malloc(0
                + sizeof(int)   * 3
                + sizeof(int*)  * 1
                + sizeof(int**) * 3
                );
    }

    // In main():
    thread_p[i] = MALLOC_P_TH();
    int* int_p = (int*) thread_p[i];
    int_p[0] = width;
    int_p[1] = height;
    int_p[2] = i;
    int** vec_p = (int**) (int_p + 3);
    vec_p[0] = thread_status;
    int*** mat_p = (int***) (vec_p + 1);
    mat_p[0] = A;
    mat_p[1] = B;
    mat_p[2] = C;


You should be using an actual struct, like this:

typedef struct ThreadArgs {
    int width;
    int height;
    int threadNum;
    int *pStatus;
    int **matrixA;
    int **matrixB;
    int **matrixC;
} ThreadArgs;

// In main():
thread_p[i]            = malloc(sizeof(ThreadArgs));
thread_p[i]->width     = width;
thread_p[i]->height    = height;
thread_p[i]->threadNum = i;
thread_p[i]->pStatus   = thread_status;
thread_p[i]->matrixA   = A;
thread_p[i]->matrixB   = B;
thread_p[i]->matrixC   = C;


As you can see, this improves the code in many ways:

  • You don't have to have a custom function just to compute the size of the struct.



  • Your fields are actually named, so you can refer to each field by name rather than number.



  • If you need to add or remove fields, you simply make modifications to the struct rather than fiddling with the handcoded arrays.



Using the wrong clock function

When I ran your program, I was surprised to see that none of your multithreaded versions ran faster than the single threaded version. However, I discovered that you were using the wrong function to time your program. The clock() function measures cpu cycles used across all threads. So even if your program ran 8 times faster, you would still get the same reading from clock() as for a single threaded program. What you want is a wall clock measurement. I would suggest using something like clock_gettime() or gettimeofday(). After I modified your program to use clock_gettime() and fixed the loop orders, I ran B_bloc and got the following output, which makes a lot more sense (I have a 4 core system, so 0.25 on the third line is expected):
$ B_bloc 1024
1.102912
0.276752
0.250928

Code Snippets

for(int i = 0; i < width; ++i)
    for(int k = 0; k < width; ++k)
        for(int j = 0; j < height; ++j)
            D[i][j] += A[i][k] * B[k][j];
// This is what MALLOC_P_TH() calls:
    void* default_alloc_()
    {
        return malloc(0
                + sizeof(int)   * 3
                + sizeof(int*)  * 1
                + sizeof(int**) * 3
                );
    }

    // In main():
    thread_p[i] = MALLOC_P_TH();
    int* int_p = (int*) thread_p[i];
    int_p[0] = width;
    int_p[1] = height;
    int_p[2] = i;
    int** vec_p = (int**) (int_p + 3);
    vec_p[0] = thread_status;
    int*** mat_p = (int***) (vec_p + 1);
    mat_p[0] = A;
    mat_p[1] = B;
    mat_p[2] = C;
typedef struct ThreadArgs {
    int width;
    int height;
    int threadNum;
    int *pStatus;
    int **matrixA;
    int **matrixB;
    int **matrixC;
} ThreadArgs;

// In main():
thread_p[i]            = malloc(sizeof(ThreadArgs));
thread_p[i]->width     = width;
thread_p[i]->height    = height;
thread_p[i]->threadNum = i;
thread_p[i]->pStatus   = thread_status;
thread_p[i]->matrixA   = A;
thread_p[i]->matrixB   = B;
thread_p[i]->matrixC   = C;

Context

StackExchange Code Review Q#121494, answer score: 3

Revisions (0)

No revisions yet.