HiveBrain v1.2.0
Get Started
← Back to all entries
snippetcppMinor

Multithreaded bottom up merge sort

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
bottomsortmultithreadedmerge

Problem

Before I tried this, my impression was that (bottom up) merge sort would be memory (actual RAM) bound, and not affected much by multithreading, but a 4 thread sort is about 3.0 times as fast as a single thread sort, and an 8 thread sort is about 3.9 times as fast. This is on a 4 core Intel 3770k (3.5 ghz), Win 7, 64 bit mode, Visual Studio 2015. One gain of the 4/8 threads is using all 4 L1 and L2 caches local to each core, but L3 and main memory are shared. This example sorts 16 million 32 bit positive integers, taking about 0.36 seconds for 8 thread version, 0.47 seconds for 4 thread version, 1.41 seconds for 1 thread version.

Most of the time is spent in the // merge data loop in BottomUpMerge() (compare two elements from two sub-arrays, move smaller to output array). My assumption was that this relatively small loop would be fast enough to be main memory bandwidth limited, but this isn't true. For the 16 million 32 bit integers, the 24 passes to merge sort read/writes 3221225472 (3GB) bytes of data, so the single thread bandwidth is ~2.2845 GB/sec, the 4 thread bandwidth is ~6.8535 GB/sec, 8 thread is ~8.9478 GB/sec. Max bandwidth for 3770k is 25.6 GB/sec. Actual bandwidth for an assembly move memory (rep movsd ;to move 16MB data), is about 18 GB/sec.

So the issue is that the merge sort is CPU bound, not memory (RAM) bound, even with a relatively tight loop, at least on my system.

I tried a similar comparison on an older (2004) system, Intel Pentium 4 EE (3.73 ghz, 2 cores) on a Intel 975 motherboard, and a 2 thread merge sort is about 1.65 times as fast as a single thread sort. Single threaded bandwidth is ~1.6433 GB/sec, two threaded is ~2.7115 GB/sec, assembly move is ~3.1236 GB/sec. Four threaded is a bit slower than two threaded with this system.

Example code for the 4 thread version, the single threaded sort uses the same functions for the merge sort. Size is assumed to be multiple of 4 (else a minor change to handling the last "quarter" size coul

Solution

You've tagged the question with C, but your code includes a lot of C++ features that are not available in C:

  • new operator



  • std::swap



  • std::cout



As Visual Studio has no problem with C++ (even defaults to it), embracing it's facilities can make your code much cleaner:

  • RAII and smart pointers remove the need for remembering to delete buffers or close handles



  • Threading support in the C++(11+) standard library makes your code portable and cleaner (you can pass functions with any types of parameters for starting threads)



  • Classes and lambdas allow for better encapsulation, no more global state



  • Templates, standard iterator and comparison objects allow you to more easily integrate your code with other code bases and make it more generic (sort any kind of objects with arbitrarily complex comparisons)



  • #include provides better pseudo-random generators.

Context

StackExchange Code Review Q#148025, answer score: 2

Revisions (0)

No revisions yet.