Double-Pivot Quicksort in C
Tags: pivot, double, quicksort
Problem
I've added a sort function to my library of utility functions. I will need to extend its behaviour at a later time, but for now it's just a replacement for the qsort() function.

I've implemented Yaroslavskiy's Double Pivot QuickSort, which is fast and easy to implement (and has been selected as the sort of choice for Java 7).

I've removed recursion via a fixed stack and randomized the choice of the pivot elements to avoid O(n^2) behaviour when the array is already sorted. I've also tested the difference of having a final insertion sort run on the entire array instead of many insertion sort runs on shorter arrays, but I've seen no noticeable difference.

As for the stack depth, I've tested that the average depth increases logarithmically (as expected). For example, for 100 000 elements the average is 22; for 1 000 000 (10x) it is 55. The chosen size should be more than enough for any realistic array.

The swap operation is dominant: any improvement will directly impact the overall speed. I tried copying one byte at a time and memcpy(), but they were slower.

The random number generator utl_rnd() is surely not the best PRNG around, but it is fast and should be good enough for the purpose.

I'd be especially interested in any comment about speed. Currently this is just a little slower than gcc's qsort() (e.g. 1.6ms vs 1.2ms to sort an array of 10 million random integers).

The code is below (alternatively as a GitHub gist: https://gist.github.com/rdentato/8cf8007525feacc8df519bc91e98f20d):

```
/* Quick and dirty PRNG (xorshift) */
static uint32_t utl_rnd()
{
  static uint32_t rnd = 0;
  while (rnd == 0) rnd = (uint32_t)time(0);
  rnd ^= rnd << 13;
  rnd ^= rnd >> 17;
  rnd ^= rnd << 5;
  return rnd;
}

/* ... (excerpt of the swap macro; the full code is in the gist) ... */
  while (sz >= 4) { \
    tmp32 = *(uint32_t *)pa; \
    *(uint32_t *)pa = *(uint32_t *)pb; \
    *(uint32_t *)pb = tmp32; \
    sz -= 4; pa += 4; pb += 4; \
  } \
```
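For readers unfamiliar with the scheme, here is a minimal recursive sketch of Yaroslavskiy's dual-pivot partitioning for plain int arrays (the names dp_qsort and swap_int are illustrative; the actual implementation under review is iterative and type-generic):

```c
#include <assert.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Sort a[lo..hi] (inclusive) with Yaroslavskiy's dual-pivot scheme. */
static void dp_qsort(int *a, int lo, int hi)
{
    if (lo >= hi) return;
    if (a[lo] > a[hi]) swap_int(&a[lo], &a[hi]);
    int p = a[lo], q = a[hi];           /* p <= q are the two pivots  */
    int l = lo + 1, g = hi - 1, k = l;
    while (k <= g) {
        if (a[k] < p) {                 /* belongs left of p          */
            swap_int(&a[k], &a[l]); l++;
        } else if (a[k] >= q) {         /* belongs right of q         */
            while (a[g] > q && k < g) g--;
            swap_int(&a[k], &a[g]); g--;
            if (a[k] < p) { swap_int(&a[k], &a[l]); l++; }
        }
        k++;
    }
    l--; g++;
    swap_int(&a[lo], &a[l]);            /* move pivots into place     */
    swap_int(&a[hi], &a[g]);
    dp_qsort(a, lo, l - 1);             /* elements  < p              */
    dp_qsort(a, l + 1, g - 1);          /* p <= elements <= q         */
    dp_qsort(a, g + 1, hi);             /* elements  > q              */
}
```

Each pass splits the array into three parts, so the recursion (here plain; a fixed stack in the real code) is a little shallower than classic quicksort's.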
Solution
Tips on general sorting performance
I recently implemented a sorting library (in C++) for use in my novel suffix array construction algorithm (SACA) and did a lot of research for it. You asked how to improve performance, and as the algorithm usually beats the implementation, I want to give you some tips on the quicksort algorithm in general:
- Branch misses:
The paper "How Branch Mispredictions Affect Quicksort" by Kaligosi and Sanders showed that when sorting simple types (e.g. integers), branch mispredictions are a huge slowdown. I've implemented and use the ideas from the really groundbreaking paper "BlockQuicksort: How Branch Mispredictions don't affect Quicksort" by Edelkamp and Weiss. It's almost twice as fast as std::sort() on integer data.
Note that regular quicksort favors slightly skewed distributions to achieve better branch prediction. This is not the case for BlockQuicksort: it really favors good pivots (I use up to 65 elements to find the pivots; from an information-theoretic viewpoint, median-of-O(sqrt(n)) would be optimal for BlockQuicksort).
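To make the trick concrete, here is a minimal sketch (not the full algorithm) of BlockQuicksort's core idea in C: instead of branching on each comparison, the comparison result is consumed as an integer, which compilers typically turn into branch-free code. The names collect_offsets and BLOCK are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 64  /* block size; the paper tunes this empirically */

/* Collect the offsets of elements in a[0..n) that are >= pivot.
 * The loop body has no data-dependent branch: the comparison result
 * (0 or 1) simply advances the output index. A later pass swaps the
 * collected elements in pairs. */
static size_t collect_offsets(const int *a, size_t n, int pivot,
                              unsigned char *offs)
{
    size_t num = 0;
    for (size_t i = 0; i < n; i++) {
        offs[num] = (unsigned char)i;  /* speculatively record offset */
        num += (a[i] >= pivot);        /* keep it only if test is true */
    }
    return num;
}
```

The partition itself then swaps elements between two such offset buffers (one per scan direction), so the only mispredictable branches left are the loop bounds.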
- Cache misses:
Cache misses are a huge problem when sorting pointers/indices. The number of cache misses can be greatly reduced by increasing the number of pivots. I currently use 3 to balance between swaps and cache thrashing. Note that this is still faster than vanilla std::sort() on integer data. The extreme form of this is called super scalar sample sort.
The paper "How Good is Multi-Pivot Quicksort?" by Aumueller, Dietzfelbinger and Klaue is very interesting on this topic. It also shows that 3 or 5 pivots result in a minimum number of scanned elements, so if your memory access is slow you want to use 3 or 5 pivots.
Number of scanned elements in the whole sort on average:
3 pivots: 1.385 * n * log2(n)
5 pivots: 1.379 * n * log2(n)
Number of copied elements in the whole sort on average:
3 pivots: 1.108 * n * log2(n)
5 pivots: 1.248 * n * log2(n)
In my implementation I use a heuristic to choose between copying together key & value + BlockQuicksort and three-pivot quicksort when sorting pointers to integers, but this is specially tuned towards my SACA. Generally copying + BlockQuicksort is favored for bigger lists.
- Complexity
Sorting takes \$O(n log(n))\$ time in general. But there are corner cases like almost-sorted lists or lists with a lot of duplicates. My implementation of BlockQuicksort switches to three-way quicksort (3 partitions with a single pivot) when the sample I use to find the pivots has a lot of duplicates. This way a performance of \$O(n log(m))\$ can be achieved, where m is the number of different values.
pdqsort nicely tackles not-so-random permutations. I greatly encourage you to look at the source code on GitHub for info and benchmark data (it also uses BlockQuicksort). It is able to detect partially sorted subsequences and sometimes sorts them in \$O(n)\$ time.
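The three-way switch mentioned above is the classic Dutch-national-flag partition; a minimal C sketch (illustrative names) shows why duplicates stop costing anything:

```c
#include <assert.h>

static void swap3(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Three-way partition around a single pivot: all duplicates of the
 * pivot end up in the middle band and are never recursed into, which
 * is what gives O(n log m) for m distinct values. On return:
 *   a[0..*lt_out)      < pivot
 *   a[*lt_out..*gt_out) == pivot
 *   a[*gt_out..n)      > pivot */
static void partition3(int *a, int n, int pivot, int *lt_out, int *gt_out)
{
    int lt = 0, gt = n, i = 0;
    while (i < gt) {
        if (a[i] < pivot)      swap3(&a[i++], &a[lt++]);
        else if (a[i] > pivot) swap3(&a[i], &a[--gt]);
        else                   i++;
    }
    *lt_out = lt;
    *gt_out = gt;
}
```

A quicksort built on this only recurses on the `< pivot` and `> pivot` bands, so an array with m distinct values bottoms out after about log(m) levels.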
- Indirections
The standard qsort() interface has a major drawback compared to C++'s std::sort(): it uses function pointers. Calling them almost always results in a branch misprediction and is a huge slowdown. This bypasses the idea of BlockQuicksort to get rid of branches. Sadly, it's quite complex to get a generic sorting routine in C without function pointers because of the lack of templates. So if you really need a high-performance generic sorting routine in C, you probably need to plug in some preprocessor tricks (one of the reasons I switched to C++).
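One common flavor of those preprocessor tricks is to generate a type-specialized sort with a macro so the comparison is inlined instead of called through a pointer. A small sketch (the macro and function names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Generate an insertion sort specialized for a given element type and
 * comparison expression; the comparison is inlined, so there is no
 * function-pointer call like in qsort(). */
#define DEFINE_INSERTION_SORT(name, type, less)     \
    static void name(type *a, size_t n)             \
    {                                               \
        for (size_t i = 1; i < n; i++) {            \
            type v = a[i];                          \
            size_t j = i;                           \
            while (j > 0 && less(v, a[j - 1])) {    \
                a[j] = a[j - 1];                    \
                j--;                                \
            }                                       \
            a[j] = v;                               \
        }                                           \
    }

#define INT_LESS(x, y) ((x) < (y))
DEFINE_INSERTION_SORT(sort_int, int, INT_LESS)
```

The same pattern scales up to a full quicksort; it trades a macro-heavy header for the inlining that templates give you for free in C++.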
- Insertion sort
Insertion sort is also the finisher in my implementation. I use it as soon as the list is small enough (somewhere between 16 and 64 elements). This is simply because, when sorting pointers, the values are still in cache from the last partition call.
Using it afterwards on the whole array results in fewer branches but requires loading everything another time, which isn't a big problem when sorting integers but is rather heavy for pointers.
- Sorting networks
I use sorting networks to find the pivots, which is really fast, but I have yet to find a way that outperforms insertion sort when sorting pointers. This is due to the compare-and-swap operation being much slower when the values being compared differ from the values being swapped (conditional move + XOR swap doesn't work).
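As a tiny illustration of the technique, here is a 3-element sorting network used to pick a median pivot; each compare-exchange often compiles to conditional moves rather than branches (CSWAP and median3 are illustrative names, not from the answer's library):

```c
#include <assert.h>

/* Branch-free-friendly compare-exchange of two int variables.
 * Note: arguments are evaluated more than once, so only pass
 * plain variables. */
#define CSWAP(x, y) do {                       \
        int lo_ = (x) < (y) ? (x) : (y);       \
        int hi_ = (x) < (y) ? (y) : (x);       \
        (x) = lo_; (y) = hi_;                  \
    } while (0)

/* 3-element sorting network; returns the median, e.g. for pivot
 * selection. The three CSWAPs fully sort (a, b, c). */
static int median3(int a, int b, int c)
{
    CSWAP(a, b);
    CSWAP(b, c);
    CSWAP(a, b);
    return b;
}
```

Networks for larger sizes (5, 9, ...) follow the same shape and sort a pivot sample without any data-dependent branches, which is exactly where they beat insertion sort on integers.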
- Heap sort
Another real performance improvement was to switch to heapsort on deep recursion. I don't know why I ran into deep recursions, but adding in heapsort made my sort around 3-4% faster in my tests.
- Stack usage
I found using recursive calls to be faster than using a separate stack, so you should probably test both. Stack overflow is not a problem if you limit recursion depth by switching to heapsort after a limited number of recursions. Sadly, I don't understand exactly why, but from looking at the assembly it seems the compiler had problems optimizing the separate stack.
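The depth-limit-plus-heapsort combination from the last two points is the classic introsort guard. A self-contained sketch in C (simple first-element pivot and Hoare partition for brevity; a randomized or median pivot would be used in practice, and all names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

static void isw(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Minimal heapsort, used as the fallback. */
static void heap_sift(int *a, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, big = i;
        if (l < n && a[l] > a[big]) big = l;
        if (l + 1 < n && a[l + 1] > a[big]) big = l + 1;
        if (big == i) return;
        isw(&a[i], &a[big]);
        i = big;
    }
}
static void heap_sort(int *a, size_t n)
{
    for (size_t i = n / 2; i-- > 0; ) heap_sift(a, n, i);
    for (size_t i = n; i-- > 1; ) { isw(&a[0], &a[i]); heap_sift(a, i, 0); }
}

/* Quicksort with a depth budget: when the budget runs out on unlucky
 * input, switch to heapsort. This bounds both recursion depth and
 * worst-case time to O(n log n). Recurses on one half, loops on the
 * other to keep the call depth down. */
static void intro_rec(int *a, size_t n, int depth)
{
    while (n > 1) {
        if (depth-- == 0) { heap_sort(a, n); return; }
        int pivot = a[0];                 /* simple pivot choice */
        size_t i = (size_t)-1, j = n;     /* Hoare partition     */
        for (;;) {
            do i++; while (a[i] < pivot);
            do j--; while (a[j] > pivot);
            if (i >= j) break;
            isw(&a[i], &a[j]);
        }
        intro_rec(a, j + 1, depth);       /* left part           */
        a += j + 1; n -= j + 1;           /* iterate on the rest */
    }
}

static void intro_sort(int *a, size_t n)
{
    int depth = 0;
    for (size_t m = n; m > 1; m >>= 1) depth++;  /* ~log2(n) */
    intro_rec(a, n, 2 * depth);                  /* usual 2*log2(n) budget */
}
```

With the budget set to roughly 2*log2(n), well-behaved inputs never hit the heapsort path, so its only cost is one counter decrement per partition.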
Rather special info:
My library also differs from the standard std::sort() interface in the comparison function. When sorting pointers a compa…
Code Snippets
```
// Example for sorting struct data, e.g. points:
// with a comparison function:
std::sort(elements.begin(), elements.end(), [](auto a, auto b) {
    return std::tie(a.x, a.y) < std::tie(b.x, b.y);
    // this would be a simple bug:
    // return std::tie(a.x, a.y) < std::tie(b.y, b.x);
});
// with a key (index) function:
sort_index(elements.begin(), elements.end(), [](auto a) {
    return std::tie(a.x, a.y);
});
```
Context
StackExchange Code Review Q#151731, answer score: 7