
Multithreading algorithm for high-performance parallel computing and a producer-consumer queue using POSIX threads


Problem

The following is the producer-consumer algorithm for a piece of software that I'm upgrading to take advantage of multi-core processing. The intended platform is some flavor of Linux running on an EC2 HPC cc1.4xlarge cluster, which features 2 × Intel Xeon X5570 quad-core "Nehalem" processors (2 × 4 = 8 cores). The software runs a genetic algorithm for optimizing artificial neural networks.

My dominating concern is performance. RAM and HD capacity are not an issue, but CPU time and anything else that delays processing is. Right now I've made a few, hopefully trivial, compromises (noted below) to keep the program portable to Mac OS X, which is what I use at home for coding/debugging. I note a few other minor issues in the comments, most notably an uncertainty about thread safety in the consumer function. This is my first time working with threads; advice/criticism of any kind is much appreciated.

```
#include <vector>
#include <cstdlib>
#include <ctime>
#include <pthread.h>
#include <semaphore.h>
#include <fcntl.h>

#define NUM_THREADS 3 //will be >= to # of cores
#define N 30

//globals
sem_t* producer_gate;
sem_t* consumer_gate;
pthread_mutex_t access_queued = PTHREAD_MUTEX_INITIALIZER;
int queued;
int completed;

//a dummy class for testing thread-safeness
class the_class
{
public:

    void find_num()
    {
        //make sure completion is non-instant and variable
        double num = rand();
        for (double q = 0; 1; q++)
            if (num == q)
                return;
    }
};

//the consumer function for handling the parallelizable code
void* Consumer(void* argument)
{
    std::vector<the_class>* p_v_work = (std::vector<the_class>*) argument;

    while (1)
    {
        sem_wait(consumer_gate);

        pthread_mutex_lock(&access_queued);
        int index = queued;
        queued--;
        pthread_mutex_unlock(&access_queued);

        (*p_v_work)[index - 1].find_num();

        completed++; //thread-safe? not sure
        if (completed == N)
            sem_post(producer_gate); //wake the producer once all work is done
    }
    return NULL;
}

int main()
{
    srand(time(NULL));

    std::vector<the_class> work(N);
    std::vector<the_class>* p_work = &work;

    //Mac OS X doesn't support unnamed semaphores, so use named ones
    //(the portability compromise noted above)
    sem_unlink("producer_gate");
    sem_unlink("consumer_gate");
    producer_gate = sem_open("producer_gate", O_CREAT, 0644, 0);
    consumer_gate = sem_open("consumer_gate", O_CREAT, 0644, 0);

    queued = N;
    completed = 0;

    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, Consumer, (void*) p_work);

    //producer: release one consumer pass per queued job, then wait
    for (int i = 0; i < N; i++)
        sem_post(consumer_gate);
    sem_wait(producer_gate);

    sem_unlink("producer_gate");
    sem_unlink("consumer_gate");
    return 0;
}
```

Solution

Style

I see something that looks like a thread-safe queue abstraction, which is mysteriously implemented as a collection of globals and one argument. These are a fairly well-known subspecies of container, so write one (or find an existing one that suits your purposes):

```
template <typename T>
class MTQueue
{
public:
  // simple operations will lock & unlock on every call
  void push(T const &);
  T pop();

  // ...
};
```


Just pass a pointer to this to your thread function, and your threads can pop at will.
Note that this could trivially use a std::queue or any other container internally (and you may prefer FIFO to LIFO): all you're really doing is bundling some container with a mutex and enforcing the lock semantics whenever you change (or inspect) it.
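
For what it's worth, here's a minimal pthread-based sketch of such a queue (the member names are illustrative, not from the original code): a std::queue bundled with a mutex, plus a condition variable so pop() can block until work arrives.

```
#include <queue>
#include <pthread.h>

template <typename T>
class MTQueue
{
public:
    MTQueue()
    {
        pthread_mutex_init(&mutex_, NULL);
        pthread_cond_init(&not_empty_, NULL);
    }
    ~MTQueue()
    {
        pthread_mutex_destroy(&mutex_);
        pthread_cond_destroy(&not_empty_);
    }

    void push(T const &item)
    {
        pthread_mutex_lock(&mutex_);
        queue_.push(item);
        pthread_cond_signal(&not_empty_); //wake one waiting consumer
        pthread_mutex_unlock(&mutex_);
    }

    T pop()
    {
        pthread_mutex_lock(&mutex_);
        while (queue_.empty())            //loop guards against spurious wakeups
            pthread_cond_wait(&not_empty_, &mutex_);
        T item = queue_.front();
        queue_.pop();
        pthread_mutex_unlock(&mutex_);
        return item;
    }

private:
    std::queue<T> queue_;   //FIFO, as suggested above
    pthread_mutex_t mutex_;
    pthread_cond_t not_empty_;
};
```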

I also see something that looks like a barrier (your use of semaphores), but I suspect your consumers can start consuming as soon as there's something in the queue, so you really just need a way for the producer to wait until they're all done. We could add a method to wait until the work queue is empty, but that doesn't tell us the work is complete. Should there be a results queue? If so, use another instance of the same type: the producer can pop results in parallel; it doesn't need to wait until the last one is calculated.
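
To sketch how that wiring might look (the names jobs, results, produce, and the dummy result value are illustrative, assuming the MTQueue above and the question's the_class):

```
MTQueue<the_class*> jobs;
MTQueue<double> results;

void* Consumer(void*)
{
    while (true)
    {
        the_class* job = jobs.pop(); //blocks until work is available
        job->find_num();
        results.push(1.0);           //push whatever the real work produces
    }
}

//producer: push N jobs, then pop N results; popping the N-th result
//proves all work is complete, so no barrier or completed counter is needed
void produce(std::vector<the_class>& work)
{
    for (size_t i = 0; i < work.size(); i++)
        jobs.push(&work[i]);
    for (size_t i = 0; i < work.size(); i++)
        results.pop();
}
```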

Performance

There are a couple of approaches to making it fast:

Batching

Consider the relative cost of the synchronization involved in popping an item from the queue against the cost of actually doing the work: it might be faster overall to add a more elaborate interface so you can push/pop multiple items with the same synchronization overhead, e.g.:

```
// more elaborate interface to amortise synchronization cost?
  void push(std::vector<T> const &);
  void pop(std::vector<T> &out);
```


or you could just switch to pushing/popping a batch-of-work object instead of single jobs: same effect, and it may leave you with a simpler and more reusable queue.
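
As an illustration of the batched interface, here's what pop might look like inside the MTQueue sketch above (the max_items cap is my addition, not part of the interface suggested earlier):

```
//inside MTQueue<T>: drain up to max_items queued items with a single
//lock/unlock round-trip instead of one per item
void pop(std::vector<T> &out, size_t max_items)
{
    pthread_mutex_lock(&mutex_);
    while (queue_.empty())               //still block until there is work at all
        pthread_cond_wait(&not_empty_, &mutex_);
    while (!queue_.empty() && out.size() < max_items)
    {
        out.push_back(queue_.front());
        queue_.pop();
    }
    pthread_mutex_unlock(&mutex_);
}
```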

Alternative Synchronization Strategies

Mutual exclusion is only one way of sharing data across multiple cores, and to be honest it's probably a good fit here (especially if you calibrate the batch size correctly).
However, if you really want to get stuck in, look up lock-free or non-blocking queues as well; for some workloads and core counts, they might suit you better.
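
For a taste of the idea, here's a classic single-producer/single-consumer ring buffer. Note the restrictions: one producer, one consumer, fixed capacity, and it assumes C++11 std::atomic (unlike the pthread code above), so it wouldn't replace the multi-consumer queue as-is.

```
#include <atomic>
#include <cstddef>

//single-producer/single-consumer lock-free ring buffer: the producer only
//writes tail_, the consumer only writes head_, so no mutex is needed
template <typename T, size_t Capacity>
class SPSCQueue
{
public:
    SPSCQueue() : head_(0), tail_(0) {}

    bool try_push(T const &item)
    {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire))
            return false;                //full
        buffer_[tail] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool try_pop(T &item)
    {
        size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;                //empty
        item = buffer_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<size_t> head_;  //read position, advanced by the consumer
    std::atomic<size_t> tail_;  //write position, advanced by the producer
};
```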


Context

StackExchange Code Review Q#5963, answer score: 5
