HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Speed optimization for block XOR

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
xorblockoptimizationforspeed

Problem

In code I'm currently maintaining, there is a need to do very many repeated XOR operations of blocks of memory. The block size in my case is always 16 bytes. Because the code is executed very frequently, and because it's speed critical, I would like it to be as fast as possible. Ordinarily I'd write this in assembly language for that reason, but I'd prefer that the code stay in C++ to maximize portability. This is what I'm currently using, but I would like a review to address particularly the following questions:

  • Can this code be made faster?



  • Can the portability be improved? In particular, I'm interested in avoiding problems on architectures for which there is a penalty for misaligned memory access.



memxor.c

const size_t BLKSIZE = 16;
typedef unsigned bigsize_t;
inline void memxor(uint8_t dst[BLKSIZE], const uint8_t *a, const uint8_t *b)
{
    bigsize_t *s=reinterpret_cast(dst);
    const bigsize_t *au = reinterpret_cast(a);
    const bigsize_t *bu = reinterpret_cast(b);
    for (size_t i=BLKSIZE/sizeof(*s); i; --i)
        *s++ = *au++ ^ *bu++;
}

Solution

You're stuck at a point where your readability is good, and any performance improvements are going to come at the expense of making the code more ugly. Additionally, you're already
making architecture choices based on the compiler's interpretation of 'unsigned'. It's already not pretty.

Oh, and the un-braced 1-line for-loop is a problem for readability too.

I don't know of any optimizations that are available at this point that will not come with the portability, or readability cost. For a block-size of 16 bytes, and with a likely 4-byte unsigned value, the odds are that your loop will iterate 4 times only anyway.

The compiler may unroll that loop, and move on.

The above is just a waffle that really means: from here on, you cannot accomplish all three: clean code, portability, and performance. You need to compromise somewhere.

Profiling would be the logical thing to do. profile your current code, and establish a baseline.

I would then suggest that you investigate whether you can have a larger block size. Any alignment fiddling you may have to do will pay off better if you are working with larger blocks. Can you batch these blocks in to 1MB contiguous regions?

With larger contiguous blocks you could then consider vectorization, alignment, and other optimizations that have an increased, and fixed setup cost, but improved throughput.

Additionally, I would consider forcing bigsize_t to be unsigned long long, and then manually unroll the loop. If you find an exact type that is 64 bits (there has to be one? and, if not, you can have an alternate implementation...), then you can force your type to match:

#include 

#if UINT_MAX == 18446744073709551615ULL
typedef unsigned int big64_t;
#elif ULONG_MAX == 18446744073709551615ULL
typedef unsigned long big64_t ;
#elif ULLONG_MAX == 18446744073709551615ULL
typedef unsigned long long big64_t;
#else
#error "Cannot find unsigned 64bit integer."
#endif

big64_t *s = reinterpret_cast(dst);
const big64_t *au = reinterpret_cast(a);
const big64_t *bu = reinterpret_cast(b);

s[0] = au[0] ^ bu[0];
s[1] = au[1] ^ bu[1];


With systems that are not natively 64-bit, the compiler will adjust the operation to be relatively efficient anyway.

In the event that this is the year 2100, and all datatypes are 128 bit, or more, then you should possibly add the code that does a single 128-bit XOR for your input.

I see the above as being as portable, and equally readable. As for the performance, that will require a profile and benchmark.

Code Snippets

#include <limits.h>

#if UINT_MAX == 18446744073709551615ULL
typedef unsigned int big64_t;
#elif ULONG_MAX == 18446744073709551615ULL
typedef unsigned long big64_t ;
#elif ULLONG_MAX == 18446744073709551615ULL
typedef unsigned long long big64_t;
#else
#error "Cannot find unsigned 64bit integer."
#endif

big64_t *s = reinterpret_cast<big64_t *>(dst);
const big64_t *au = reinterpret_cast<const big64_t *>(a);
const big64_t *bu = reinterpret_cast<const big64_t *>(b);

s[0] = au[0] ^ bu[0];
s[1] = au[1] ^ bu[1];

Context

StackExchange Code Review Q#77093, answer score: 3

Revisions (0)

No revisions yet.