
A library to do maths with matrices written from scratch


Problem

Background

Some time ago I encountered some very good articles about neural networks that represented an ANN as a set of matrices, so that everything was done using matrix operations. These articles show code written in Python, and since I know Python well, I decided to translate it to C++ (which I'm currently learning). Very soon I understood that it's too unwieldy to represent a matrix as a vector of vectors and juggle many of these vectors. That's why I decided to write my own library to do maths with matrices. It's now a project on GitHub called Matrix.
Matrix

I know there are some libraries to work with matrices out there, like BLAS, Eigen, etc. Matrix was originally written for educational purposes, but now I'm also aiming for speed.

You can do almost any mathematical operation with an object of type Matrix as described in the Matrix Wiki.

Internally, all the data is stored in a vector of vectors of double. My attempts to make Matrix a template class were unsuccessful.

Code

This is the code responsible for matrix-by-matrix multiplication:

```
Matrix Matrix::operator*(const Matrix& right) const {
    if (cols != right.rows) {
        std::string msg = "Size mismatch while multiplying matrices: ";
        msg.append(to_string(rows)).append("X").append(to_string(cols));
        msg.append(" vs ").append(to_string(right.rows)).append("X").append(to_string(right.cols));
        throw SizeException(msg);
    }

    // multiplying by a 1x1 matrix is just scalar multiplication
    if (right.IsNum())
        return this->operator*(right.M[0][0]);

    size_t a, b, c;
    Matrix res(rows, right.cols);

    if (right.IsCol()) {
        // matrix times column vector
        for (a = 0; a < rows; a++)
            for (b = 0; b < cols; b++)
                res.M[a][0] += M[a][b] * right.M[b][0];
    } else if (IsSquare(2) && right.IsSquare(2)) {
        // loop unrolling for 2x2 matrices
        res.M[0][0] = M[0][0] * right.M[0][0] + M[0][1] * right.M[1][0];
        res.M[0][1] = M[0][0] * right.M[0][1] + M[0][1] * right.M[1][1];
        res.M[1][0] = M[1][0] * right.M[0][0] + M[1][1] * right.M[1][0];
        res.M[1][1] = M[1][0] * right.M[0][1] + M[1][1] * right.M[1][1];
    } else {
        // general case: naive triple loop
        for (a = 0; a < rows; a++)
            for (b = 0; b < right.cols; b++)
                for (c = 0; c < cols; c++)
                    res.M[a][b] += M[a][c] * right.M[c][b];
    }

    return res;
}
```

Solution

Compared to Python's NumPy, my matrix multiplication is quite a bit slower. Can it be made faster without using sophisticated matrix multiplication algorithms like Strassen's?

One issue is that you use a vector of vectors, so your memory accesses aren't going to be contiguous. Typically one represents a matrix as a single really long vector and then indexes into it like this: `matrix[rowNum * numColumns + colNum]`. This will greatly improve your memory access patterns and caching behavior; with good enough caching you'll be able to start actually fighting against the memory-boundedness of your program. You'll also have to decide whether to represent the matrices as row-major or column-major.
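A minimal sketch of that layout (FlatMatrix and its members are made-up names, not anything from your repo):

```
#include <cstddef>
#include <vector>

// Hypothetical flat, row-major matrix: element (r, c) lives at
// data[r * cols + c], so walking along a row touches adjacent memory.
struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<double> data;

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    double  at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};
```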

Secondly, think about the way matrix multiplication happens: \$\text{row} \times \text{column}\$. I haven't had a chance to look at your GitHub repo, so I don't know whether you're representing a Matrix as row-major or column-major. The best way to multiply matrices on a CPU is with a row-major left matrix and a column-major right matrix. Obviously you can't always predict how a matrix is going to be used, so you can use a tiled algorithm - essentially you multiply small internal matrices together and then put them all together. This is a more complicated algorithm, but you'll get much better caching behavior for large matrices. It'll look something like this (no guarantees that this is exactly right; I can never remember it off the top of my head). Note that this assumes the tile size divides the matrices' side lengths and that the side lengths are equal, but both can be adjusted for (or you can pad the matrices with 0s).

```
for tile in result matrix
    tile = 0  // zero out the result tiles, if they aren't initialized to 0
    for tileNum in tilesPerSide
        // These will both have to be adjusted - you'll have to
        // do some computation to actually get these tiles
        leftTile = leftMatrix.getTile(tileNum)
        rightTile = rightMatrix.getTile(tileNum)
        tile += leftTile * rightTile
```


The real algorithm is quite a bit more complicated than that, and involves an unfortunate number of nested loops, but it'll be faster (on a CPU).
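To make the tiling concrete, here's an untested C++ sketch of those nested loops over the flat row-major layout from above. TILE, the function name, and the requirement that the caller zero-initializes the result are all my assumptions:

```
#include <cstddef>
#include <vector>

// Untested sketch: tiled multiply of two square n x n matrices stored flat in
// row-major order. n must be a multiple of TILE, and `result` must be
// zero-initialized by the caller.
constexpr std::size_t TILE = 64;

void tiled_multiply(const std::vector<double>& left,
                    const std::vector<double>& right,
                    std::vector<double>& result, std::size_t n) {
    for (std::size_t i0 = 0; i0 < n; i0 += TILE)          // tile row in the result
        for (std::size_t k0 = 0; k0 < n; k0 += TILE)      // tile along the shared dimension
            for (std::size_t j0 = 0; j0 < n; j0 += TILE)  // tile column in the result
                // multiply one TILE x TILE block pair, accumulate into the result tile
                for (std::size_t i = i0; i < i0 + TILE; ++i)
                    for (std::size_t k = k0; k < k0 + TILE; ++k) {
                        const double lik = left[i * n + k];
                        for (std::size_t j = j0; j < j0 + TILE; ++j)
                            result[i * n + j] += lik * right[k * n + j];
                    }
}
```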

You can also look at things like SIMD vectorization, prefetching, avoiding malloc, and instruction-level parallelism.

SIMD vectorization (or just vectorization) is when you perform a Single Instruction on Multiple Data - basically you perform the same instruction on several (hopefully contiguous in memory) pieces of data at once. This can improve speed by a factor of (usually) 4 or 8, depending on the size of your computer's vector registers and whether you're using floats or doubles. I recommend using Agner Fog's vector class library instead of raw intrinsics - it's much cleaner and more OS-independent.
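Just to illustrate the idea with raw x86 intrinsics instead (an untested sketch, assuming an AVX-capable CPU and an array length divisible by 4):

```
#include <immintrin.h>
#include <cstddef>

// Untested AVX sketch: out[i] = a[i] * b[i], four doubles per instruction.
// Assumes AVX support (compile with -mavx) and n divisible by 4.
void multiply_arrays(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);               // load 4 doubles from a
        __m256d vb = _mm256_loadu_pd(b + i);               // load 4 doubles from b
        _mm256_storeu_pd(out + i, _mm256_mul_pd(va, vb));  // multiply, store 4 results
    }
}
```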

Prefetching is just what it sounds like: you request data from memory before you actually need it, so the expensive cache miss is already resolved by the time you use the data. This generally needs some tuning to find a good prefetch distance; I won't go into all of the specifics here, but there are decent tutorials online.
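A tiny sketch of the idea (GCC/Clang specific; the distance of 16 elements ahead is an arbitrary placeholder you'd tune, not a recommendation):

```
#include <cstddef>

// Untested sketch using GCC/Clang's __builtin_prefetch.
double sum_with_prefetch(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16]);  // ask for a future cache line early
        sum += data[i];
    }
    return sum;
}
```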

You also want to avoid malloc - it's generally a rather expensive operation. While you may not explicitly malloc or new anything, things like std::vector and std::string use malloc under the hood. A malloc can end up in an operating system call (super slow) or reuse previously malloc'd and freed memory (less slow, but not fantastic). If, for example, you know that messages will stay below a certain size, consider using a statically allocated char array instead of a std::string - arrays are a little trickier (don't forget the null \0 terminator), but you won't need to allocate as much on the heap.
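Applied to the error message in your operator*, a sketch of that idea (the buffer size and function name are assumptions of mine):

```
#include <cstdio>
#include <cstddef>

// Untested sketch: build the size-mismatch message in a fixed-size stack
// buffer instead of a heap-allocating std::string. Assumes 96 bytes is
// enough; std::snprintf truncates (and still null-terminates) otherwise.
void format_size_mismatch(char (&buf)[96],
                          std::size_t rows, std::size_t cols,
                          std::size_t right_rows, std::size_t right_cols) {
    std::snprintf(buf, sizeof buf,
                  "Size mismatch while multiplying matrices: %zuX%zu vs %zuX%zu",
                  rows, cols, right_rows, right_cols);
}
```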

Lastly, instruction-level parallelism (I've always called this bucketing, though most people call it loop unrolling) takes advantage of your computer's pipeline and out-of-order execution to do things faster. If you have multiple subsequent and independent operations, you can interleave them so that several are in flight at once.

This is how you would normally sum a vector (well, now it isn't, but bear with me):

```
double result = 0;
for (unsigned i = 0; i < vector.size(); i++)
    result += vector[i];
```

Each iteration of the loop is independent of the others, so you can do something like this:

```
/*
    Each r# is a "bucket", and the number you can have depends on your machine.
    I've never actually been able to figure out how many a given architecture
    supports, however generally 4-8 is pretty safe. If you want to do some
    macro voodoo you can set this with a #define and compile differently
    depending on architecture.
*/
double r1 = 0; double r2 = 0; double r3 = 0; double r4 = 0;
unsigned int i = 0;
/*
    This upper bound is nasty, but it is safer (what if you have a size of 3
    and subtract 4?) and marginally (read as - probably not measurably)
    faster. You can replace it with i < vector.size() - 4 if you'd rather.
*/
for (; i + 4 <= vector.size(); i += 4) {
    r1 += vector[i];
    r2 += vector[i + 1];
    r3 += vector[i + 2];
    r4 += vector[i + 3];
}
// pick up any leftover elements
for (; i < vector.size(); ++i)
    r1 += vector[i];
double result = r1 + r2 + r3 + r4;
```

This is ugly, but it lets the CPU work on the four independent buckets at once instead of waiting on a single chain of dependent additions.

Context

StackExchange Code Review Q#115194, answer score: 20
