HiveBrain v1.2.0

Parallel factorial algorithm using std::thread

Submitted by: @import:stackexchange-codereview

Problem

This code calculates the factorial of a number on multiple threads. My issue: it is only a little bit faster than the sequential version (and I think I know why, I just can't find a way to solve it).

I use boost::multiprecision::cpp_int so the limits of default integers are not a problem, the size of integers is only limited by memory.

Only showing the relevant parts:

// ... other includes ...
#include <boost/multiprecision/cpp_int.hpp>

#define THREAD_COUNT 4
std::atomic<int> thread_num(1);    // global variable
std::mutex mu;                     // guards bigInt

// stuff...

void threaded_factorial(unsigned long long int num, boost::multiprecision::cpp_int& bigInt)
{
    int threadid = thread_num++;     // thread_num is atomic, so this is safe
    boost::multiprecision::cpp_int N = 1;
    for (unsigned long long int i = threadid; i <= num; i += THREAD_COUNT)
    {
        N *= i;
    }
    std::lock_guard<std::mutex> lock(mu);      // race condition --> mutex needed
    bigInt *= N;
}

// more stuff ...


And here is how the function is called:

// ...

boost::multiprecision::cpp_int result = 1;
std::thread workers[THREAD_COUNT];

for (int i = 0; i < THREAD_COUNT; ++i)
{
    workers[i] = std::thread(threaded_factorial, num, std::ref(result));
}
for (int i = 0; i < THREAD_COUNT; ++i)
{
    workers[i].join();
}

// ...


The results seem correct, but as I said, this is not much faster than the sequential code.

For example, calculating the factorial of 325253 took:

  • 67586 ms on 4 threads
  • 76226 ms on a single thread

That is some really poor performance.

The reason, I think, is that the for loop in the threaded_factorial function takes roughly the same amount of time to complete in each thread, so when the std::mutex mu is locked, (THREAD_COUNT-1) threads have to wait for the one that holds it.

This way, most of the work (by far the largest multiplications) is happening in a sequential manner, so the algorithm is really slow.

How can I work around this issue and make this work efficiently?

Solution

First, some issues with API usage. Using raw std::thread is generally not the way you want to go for this sort of thing: prefer std::async. This also means you don't need to pass the result in by reference, as the task can return a value. The other big win is that you don't need to lock and perform an update inside the thread running the calculation; the combining can be done independently.

Now, let's modify the threaded_factorial function:

constexpr static auto threads = 4U;
constexpr static auto test = 325253U;

namespace mp = boost::multiprecision;

mp::cpp_int thread_fact(unsigned num, int start)
{
    mp::cpp_int n = start;
    for (auto i = start + threads; i <= num; i += threads) {
        n *= i;
    }
    return n;
}


To call this, we set up an array for the std::futures that will be returned, as well as an array of partial results.

std::array<std::future<mp::cpp_int>, threads> futures;
std::array<mp::cpp_int, threads> results;
for (auto i = 1; i <= threads; ++i) {
    futures[i - 1] = std::async(std::launch::async, thread_fact, test, i);
}
for (auto i = 0; i < threads; ++i) {
    results[i] = futures[i].get();
}


Now, the step where you combine these is actually pretty expensive. Multiplying two numbers that are of this magnitude will be time consuming; let's launch the multiplications in separate threads as well:

std::future<mp::cpp_int> x = std::async([&results]() -> mp::cpp_int { return results[0] * results[1]; });
std::future<mp::cpp_int> y = std::async([&results]() -> mp::cpp_int { return results[2] * results[3]; });
auto x_val = x.get();
auto y_val = y.get();
auto z = x_val * y_val;


Making these changes, this runs in a bit under 10 seconds for me. In fact, from the profile graph, most of that time is spent doing the combining.

Others have pointed out that further algorithmic improvements are possible if you need more speed.


Context

StackExchange Code Review Q#94574, answer score: 7
