Parallel factorial algorithm using std::thread
Problem
This code calculates the factorial of a number on multiple threads. My issue: it is only a little faster than the sequential version (and I think I know why; I just can't find a way to solve it).
I use boost::multiprecision::cpp_int, so the limits of the built-in integer types are not a problem: the size of the integers is only limited by memory.

Only showing the relevant parts:
// ... other includes ...
#include <boost/multiprecision/cpp_int.hpp>

#define THREAD_COUNT 4

std::atomic<int> thread_num(1); // global variable
std::mutex mu;

// stuff...

void threaded_factorial(unsigned long long int num, boost::multiprecision::cpp_int& bigInt)
{
    int threadid = thread_num++; // thread_num is atomic, so this is safe
    boost::multiprecision::cpp_int N = 1;
    for (unsigned long long int i = threadid; i <= num; i += THREAD_COUNT)
    {
        N *= i;
    }
    std::lock_guard<std::mutex> lock(mu); // race condition --> mutex needed
    bigInt *= N;
}
// more stuff ...

And the call of the function:
// ...
boost::multiprecision::cpp_int result = 1;
std::thread workers[THREAD_COUNT];

for (int i = 0; i < THREAD_COUNT; ++i)
{
    workers[i] = std::thread(threaded_factorial, num, std::ref(result));
}
for (int i = 0; i < THREAD_COUNT; ++i)
{
    workers[i].join();
}
// ...

The results seem correct but, as I said, this is not much faster than the sequential code.
For example, calculating the factorial of 325253 took:
- 67586 ms on 4 threads
- 76226 ms on a single thread
That is some really poor performance.
The reason, I think, is that the for cycle in the threaded_factorial function takes roughly the same amount of time to complete in each thread, so when std::mutex mu is locked, (THREAD_COUNT - 1) threads have to wait for the one holding it.

This way, most of the work (by far the largest multiplications) happens sequentially, so the algorithm is really slow.
How can I work around this issue and make this work efficiently?
Solution
Firstly, some issues with API usage. Using raw std::thread is generally not the way you want to go for this sort of thing: prefer std::async. This also means you don't need to pass in by reference for updates, as the task can return a value. The other big win is that you don't need to lock and perform an update in the thread that is running the calculation; the combining can be done independently.

First, let's modify the
threaded_factorial function:

constexpr static auto threads = 4U;
constexpr static auto test = 325253U;

namespace mp = boost::multiprecision;

mp::cpp_int thread_fact(unsigned num, int start)
{
    mp::cpp_int n = start;
    for (auto i = start + threads; i <= num; i += threads) {
        n *= i;
    }
    return n;
}

To call this, we set up some arrays for the
std::futures that will be returned, as well as an array of partial results:

std::array<std::future<mp::cpp_int>, threads> futures;
std::array<mp::cpp_int, threads> results;

for (auto i = 1; i <= threads; ++i) {
    futures[i - 1] = std::async(std::launch::async, thread_fact, test, i);
}
for (auto i = 0; i < threads; ++i) {
    results[i] = futures[i].get();
}

Now, the step where you combine these is actually pretty expensive. Multiplying two numbers of this magnitude is time-consuming, so let's launch the multiplications in separate threads as well:
std::future<mp::cpp_int> x = std::async([&results]() -> mp::cpp_int { return results[0] * results[1]; });
std::future<mp::cpp_int> y = std::async([&results]() -> mp::cpp_int { return results[2] * results[3]; });

auto x_val = x.get();
auto y_val = y.get();
auto z = x_val * y_val;

Making these changes, this runs in a bit under 10 seconds for me. In fact, the profile graph shows that most of that time is spent doing the combining.
Others have pointed out that further algorithmic improvements are possible if you need more speed.
Context
StackExchange Code Review Q#94574, answer score: 7