patterncppMinor

RadixSort implementation design and performance

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

designradixsortperformanceandimplementation

Problem

Credits for the original implementation I based my code on:

Quora - What is the most efficient way to sort a million 32-bit integers?

Implementation

// RadixSort - works for values up to unsigned_int_max (32-bit)
template
void radix_sort(Iter __first, Iter __last){
    typedef typename iterator_traits::value_type value_type;
    vector out(__last - __first);

    // calculate most-significant-digit (256-base)
    value_type __mx = *max_element(__first, __last); //O(n)
    int __msb = 0;
    do {
        __msb += 8;
        __mx = __mx >> 8;
    } while(__mx);

    Iter __i, __j, __s;
    bool __swapped = false;
    for (int __shift = 0; __shift > __shift) & 0xFF]++;

        // prefix-sum
        size_t __m, __q = 0;
        for (int i = 0; i > __shift) & 0xFF]++;
            *__v = *__p;
        }
        __swapped = !__swapped;
    }

    // if ended on auxiliar vector, copy to input vector
    if (__swapped) copy(out.begin(), out.end(), __first); //O(n)
}

Discussion

In order to implement a more general algorithm than the one used as reference, I tried to refactor the code using template and iterators. This is the main difference between my code and the original.

I can't simply swap references when using iterators, so I approached this problem by cycling the iterators used in each counting-sort loop. Doing so prevents the need to copy the output array on every loop.

Another different aspect of my code is that I find max-element and calculate the most-significant-digit on base-256. I use this information to determine how many counting-sort loops are necessary, instead of hardcoding 32 (unsigned_max) and always running the loop 4-times even if all values are less than 255. This actually adds unnecessary overhead if 4-loops are necessary, but it should increase execution time otherwise.

The temporary container used is a vector and this is something I'm not quite sure about, I feel like this is a point where I could improve my code. I'd like to

Solution

Improvement to radix_sort_32

One of the things that you were wondering about was whether radix_sort_msd or radix_sort_32 was better. The former called max_element() in order to determine the max width of the values, and the latter always did 4 passes even when they weren't necessary.

I took radix_sort_32 and added this code after generating the counts:

// If this byte is zero for every value, skip the byte entirely.
if (count[0] == (size_t)(__last - __first))
    continue;

This is a quick check to find out if the entire input was full of zeroes for the current byte. If so, it moves on to the next byte instead of wasting time making an exact copy of the input to the temp area. It still uses more time than radix_sort_msd in the "values ` is an appropriate container for your auxiliary array. I can't think of anything better than that.

As far as copy vs move, they should be equivalent for numeric types. See this Stackoverflow question and its answers for some good explanations on why that is.

Code Snippets

// If this byte is zero for every value, skip the byte entirely.
if (count[0] == (size_t)(__last - __first))
    continue;

// LSD RadixSort helper function for radix_sort() below.
template<typename Iter>
static void radix_sort_lsd(Iter __first, Iter __last, Iter __out, Iter __outEnd,
        int __msb, bool needsSwap)
{
    Iter __i, __j, __s;
    bool __swapped = false;

    for (int __shift = 0; __shift < __msb; __shift += 8) {

        // cycle input/auxiliar vectors
        if (__swapped) {
            __i = __out;
            __j = __outEnd;
            __s = __first;
        }
        else {
            __i = __first;
            __j = __last;
            __s = __out;
        }

        // counting_sort
        size_t count[0x100] = {};
        for (Iter __p = __i; __p != __j; __p++)
            count[(*__p >> __shift) & 0xFF]++;

        if (count[0] == (size_t)(__last - __first))
            continue;

        // prefix-sum
        size_t __m, __q = 0;
        for (int i = 0; i < 0x100; i++) {
            __m = count[i];
            count[i] = __q;
            __q += __m;
        }

        // filling result
        for (Iter __p = __i; __p != __j; __p++) {
            *(__s + count[(*__p >> __shift) & 0xFF]++) = *__p;
        }
        __swapped = !__swapped;
    }

    // if ended on auxiliar vector, copy to input vector
    if (__swapped != needsSwap) copy(__out, __outEnd, __first); //O(n)
}

// If the input exceeds this threshold (in bytes), we do one pass of MSD
// followed by the rest of the passes done by LSD.  This number should be
// an estimate of the cache size.
#define THRESHOLD 8000000

// RadixSort - works for values up to unsigned_int_max (32-bit)
template<typename Iter>
void radix_sort(Iter __first, Iter __last)
{
    typedef typename iterator_traits<Iter>::value_type value_type;
    size_t len = (size_t) (__last - __first);
    vector<value_type> out(len);

    // First, test if the input exceeds the caching threshold.  If the input
    // is smaller than the threshold, just do a straight LSD radix sort.
    if (len * sizeof(value_type) < THRESHOLD) {
        radix_sort_lsd(__first, __last, out.begin(), out.end(),
                sizeof(value_type) * 8, false);
        return;
    }

    // Set __shift to the most significant byte.
    int  __shift = (sizeof(value_type)-1) * 8;
    Iter __s     = out.begin();
    Iter __p     = __first;

    // counting_sort
    size_t count[0x100] = {};

    for (size_t i = 0; i < len; i++) {
        count[(*__p++ >> __shift) & 0xFF]++;
    }

    // prefix-sum
    size_t __m, __q = 0;
    for (int i = 0; i < 0x100; i++) {
        __m = count[i];
        count[i] = __q;
        __q += __m;
    }

    // filling result
    __p = __first;
    for (size_t i = 0; i < len; i++) {
        *(__s + count[(*__p >> __shift) & 0xFF]++) = *__p++;
    }

    // For each of the 256 bins, do a LSD radix sort on the bin.  The input
    // and auxiliary vectors have been swapped, so we pass needsSwap = true to
    // indicate that the LSD sort should end on the auxiliary vector instead.
    int startIndex = 0;
    for (int i

Context

StackExchange Code Review Q#156275, answer score: 3

Revisions (0)

No revisions yet.