patterncppMinor

Compute weighted average around a point in a matrix quickly

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

codereview cpp stackoverflow c++c++11 performance matrix

aroundquicklypointcomputeweightedaveragematrix

Problem

I have a short snippet of code that simply computes a weighted average for surrounding elements in a square matrix. The actual implementation that I'm working on is not an average (more complex equation), but I am using this one to figure out how to handle various performance hinderances.

A few notes about my system:

Windows 7, x64

Visual Studio 2010, with the Intel C++ Compiler

Profiling it using Intel VTune Amplifier

Running in Release, with optimizations on (Intel compiler doesn't specify which level, but with comparison I did I think it's O2 or O3).

The timing values I am mentioning I got from running it here with -O2

DIM=512 and ITERATIONS=1000

Complete code is provided at the end, but for the next few explanations I will just focus on this loop of interest (it's the only loop, so...). I started off with a pretty straightforward, basic implementation:

for (int iter = 0; iter < ITERATIONS; iter++)
    {
        for (int x = 1; x < DIM-1; x++) // avoid boundary cases for this example
        {
            for (int y = 1; y < DIM-1; y++)
            {
                f0 = d_matrix[x][y];                
                f1 = d_matrix[x-1][y];
                f2 = d_matrix[x+1][y];
                f3 = d_matrix[x][y-1];
                f4 = d_matrix[x][y+1];

                d_res_matrix[x][y] = f0*0.6 + f1*0.1 + f2*0.1 + f3*0.1 + f4*0.1;                    
            }
        }
        for (int x = 0; x < DIM; x++)
        {
            for (int y = 0; y < (DIM); y++)
            {
                d_matrix[x][y] = d_res_matrix[x][y];                    
            }
        }
    }

This attempt took ~1.9s to execute. VTune suggested that I had problems with 4K Aliasing (read-before-write memory situations) specifically on lines for the y-loop and the memory writes that preceed them (in both loops). It also identified back-end bound, core-bound problems. I figured that the 4K aliasing problems might be causing the core-bound is

Solution

Since I do not know what the actual implementation will be I can only comment on the code as it is presented. From the algorithm it appears that you are applying a filter multiple times in order to monitor some propagation effect - as opposed to applying the filter multiple times to stabilize time profiling. From here on out I will assume that running ITERATIONS times is part of the core algorithm.

General C++ concepts

-
Initializing memory

Are you sure that it is legal to memset double types to zero to obtain logical \$0.0\$? Me neither. Use std::fill instead or better yet if you have C++03 or newer use value initialization so that you can initialize the arrays to zero with double *array = new double[n]().

If T is an array type, each element of the array is value-initialized

...

double f = double(); // non-class value-init, value is 0.0

-
Use std::vector so that you do not have to do the memory allocation yourself.

Just be sure to use the -O2 or higher compiler flag alongside or else vector is slower than raw arrays (in my experience).

Bugs

Your optimized routine has a couple major bugs. I always recommend checking the output between original code and optimized code should you need to perform optimization.

-
Shift bug

save_write_loc = &d_res_matrix[x][0];

You are writing output values from column 1 to column 0. This would be fine if you ran the algorithm once but since you are running this algorithm ITERATIONS times, it results in shifting the output out multiple times. In general, on iteration \$i\$ you are writing the \$i^{th}\$ column of the original d_matrix to column 0 of d_res_matrix. To fix this simply keep the matrices aligned:

save_write_loc = &d_res_matrix[x][1];

-
Skipped output indices bug

for (int y = 0; y < (DIM/8); y=y+8)

When you unrolled the loop you both divided the bounds by \$8\$ and increased the step to \$8\$. You only wanted to do one or the other. You should prefer changing the bounds since DIM/8 can be computed at compile time:

for (int y = 0; y < (DIM/8); y++)

The code runs much slower with these bug fixes.

Quick Optimization

The algorithm cannot be performed in-place so you are using d_res_matrix to store the output of applying your filter to d_matrix. But then you want the output to be placed back in d_matrix so you perform a deep copy.

However a swap of the pointers would suffice so you could use this instead:

std::swap(d_matrix, d_res_matrix);

This has a major impact on the performance of your original code, but as you will see we can do better.

Problematic Optimization

Flattening the 2D array to 1D sounds like a good idea until you factor in the cost of post-processing. When the array is processed as 1D the border pixels change values and, in fact, exhibit border-wrap effects.

We often fix problems like this by wrapping the matrix inside a false border - i.e. we add a \$1\times1\$ border all around the matrix. Unfortunately this would not work here. Instead, you can fix this by applying the filter and then going back and zeroing out border pixels each iteration.

More Optimizations

-
Use the -O3 flag before attempting generic optimizations.

Readability often goes out the window when applying optimizations to source code. The built-in compiler flags can be used instead. As you will see below, applying the -O3 flag gives me comparable results to unrolling loops, using pointer-based indexing, and flattening the array to 1D.

-
Consider a generic matrix element \$\textrm{matrix}[a][b]\$. Let's determine how many times \$\textrm{matrix}[a][b]\$ is multiplied by \$0.1\$.

\$\textrm{matrix}[a][b]\$ has four 4-connected neighbors: $$\textrm{matrix}[a-1][b],\enspace \textrm{matrix}[a+1][b],\enspace \textrm{matrix}[a][b-1],\enspace \textrm{matrix}[a][b+1]$$
Each of these neighbors multiplies \$\textrm{matrix}[a][b]\$ by \$0.1\$. Therefore, \$\textrm{matrix}[a][b]\$ is multiplied by \$0.1\$ four times. Instead we could cache this computation once and reuse it when we need it. This requires \$\mathcal{O}(n)\$ space, however, since the overall algorithm cannot be done in-place, we can use some space we already were going to need.

(Technically the original algorithm only needs one row of extra space and the cached multiply algorithm only needs three rows of extra space.)

-
The real optimization here is parallelization. If we had \$ \mathrm{DIM} \cdot \mathrm{DIM}\$ processors then each processor could be charged with applying the filter to its pixel in any order we choose (or rather don't choose). Hence there is zero serial code per iteration. Of course we would not actually want that many processors/threads due to communication costs. If you are interested in running this algorithm in parallel you could use OpenMP, MPI, boost::thread, etc.

Code and Timings

Test machine - CentOS 6.5, libstdc++-4.4.7, -O3, Intel Core i7-2670QM (2.20 GHz)

-
Your

Code Snippets

save_write_loc = &d_res_matrix[x][0];

save_write_loc = &d_res_matrix[x][1];

for (int y = 0; y < (DIM/8); y=y+8)

for (int y = 0; y < (DIM/8); y++)

std::swap(d_matrix, d_res_matrix);

Context

StackExchange Code Review Q#86538, answer score: 7

Revisions (0)

No revisions yet.