Neural Network Simulator with OpenMP
Problem
I wrote a simple neural network simulator (the biophysical kind) from scratch, and was hoping to get some feedback on how I can speed things up, or any C++ / compilation best practices that I can improve on.
The code is at this repository.
Main Problem:
OpenMP doesn't seem to be conferring speedup.
The performance-critical section of the code is in src/networks/spikingnet.cpp, but for additional context, see the rest of the code in the repository.
    #pragma omp parallel for
    for (size_t li=0; li<net->layers.size(); li++) {
        SpikingLayer *layer = net->layers[li];
        Stim *stim = rs->stimuli[li];
        boolvec doSpike = stim->yield();
        conn_vec pre_arr = net->pre[li];
        updateLayer(layer, pre_arr, doSpike, t);
        recordSpikes(results.mutable_spikes(li), layer, i);
    }
    // update transmission & stdp
    #pragma omp parallel for
    for (size_t li=0; li<net->layers.size(); li++) {
        SpikingLayer *layer = net->layers[li];
        // transmission
        for (SpikingConnection *conn : net->post[li]) {
            for (SpikingSynapse* syn : conn->synapses) {
                updateTransmission(syn, layer->units[syn->s]);
            }
        }
        // STDP
        for (SpikingConnection *conn : net->pre[li]) {
            SpikingLayer *source = net->layers[conn->s];
            SpikingLayer *target = net->layers[conn->t];
            if (conn->stdp_enabled) {
                #pragma omp parallel for
                for (SpikingSynapse* syn : conn->synapses) {
                    updateSTDP(syn, source->units[syn->s], target->units[syn->t]);
                }
            }
        } // end STDP
    } // end for
GProf Trace
Architectural details:
The network consists of L=9 "layers", each with 100-900 units. There are on the order of L^2 "connections" (bundles of synapses between layers), and 2000 synapses per "connection" (synapses are sparse).
During each update cycle, all the layers (neurons) are updated (conditioned on connections), and then all the connections (synapses) are updated (conditioned on layers). That is to say, layer updating is independent conditioned on connections, and connection updating is independent conditioned on layers.
Solution
It is generally advisable to parallelize loops at the absolute outermost level possible. Creating new OS threads, dividing up the relevant loop, and allocating data in the private address space for each thread all take up time, as does synchronization once the parallel task is over. Often, this overhead can outweigh any benefit that parallelism might have conferred in the first place. Whenever you use OpenMP to parallelize a task or loop, it has to be large enough to amortize the cost of spawning threads.
I would hoist as many parallel directives to the outermost loop as possible, and avoid having nested parallel directives. You may have to switch from using omp parallel for to just using omp parallel at the outermost level, then having some logic to explicitly decide which part of the data each thread works on depending on its thread number.
You'll have to be the judge of whether this is actually the problem or not; I only did a shallow read of your code, and the work that gets parallelized in the inner loops might actually be large enough that this is not the consideration at all.
Finally, there are alternatives to OpenMP for shared-memory parallel programming, e.g. Intel's Threading Building Blocks. OpenMP is pretty crude in the grand scheme of things, so you might find task-based parallelism easier or more fun to work with. You get to pretend like you have as many parallel threads of execution as you want, and the library manages farming out the tasks you define to the existing OS threads.
Context
StackExchange Code Review Q#87395, answer score: 3