patterncMinor
Fastest StringToLowerCase()
Viewed 0 times
fasteststringtolowercasestackoverflow
Problem
I want to optimize the following code:
#ifdef USE_SSE
#define STRING_PREFETCH_TBL(ptr) \
_mm_prefetch(ptr, _MM_HINT_T0); \
_mm_prefetch(ptr+64, _MM_HINT_T0); \
_mm_prefetch(ptr+128, _MM_HINT_T0); \
_mm_prefetch(ptr+192, _MM_HINT_T0)
#else
#define STRING_PREFETCH_TBL(ptr)
#endif
__declspec(align(128)) const char TblToLower[] =
{
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,
35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,97,98,
99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,91,
92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,
119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,
144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,
170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,
197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,
225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,
};
void StrToLowerCase( const char* src, char* dst )
{
STRING_PREFETCH_TBL(TblToLower);
while(*src)
{
*dst = TblToLower[*src];
dst++;
src++;
}
*dst = '\0';
}Solution
As far as optimizing the tolower logic itself for a single ASCII character, your table is about as fast as it gets.
Alternatives would be:
A) the standard tolower function
B) this bit trick if you have only ASCII alpha characters
C) this bit trick for all ASCII characters
I ran 300 iterations on a pre-generated random string of one-billion bytes (including the null terminator) of some x64 code under full optimization and no debugger attached from Visual Studio 2013.
For whatever it's worth, replacing the tolower logic with a NOP and simply copying the source string took 15.0 seconds, so that's what's being spent in loop overhead.
But no matter, your real opportunity for optimization here is in vectorization and parallelization.
However this cannot happen as the signature and contract to the
So, for the sake of moving forward. Let's assume our signature now looks like this:
where for the sake simplicity,
Lets explore vectorized approaches:
If you have a processor that supports AVX2 instructions, then you can use the VGATHERDPS instruction which performs 8 vectorized lookups. However I don't have this instruction so I couldn't try it out. In fact few CPUs will since it first showed up in 2013. So moving on...
Almost everyone has SSE2 now days, and luckily the above bit trick can be vectorized with SSE2. Here is an implementation that operates on 128-bit chunks:
The above SSE2 implementation finishes the test in just 5.3 seconds, roughly 4.5 times faster than the table lookup code that was previously that fastest. However, keep in mind some of this is in the loop overhead we eliminated by effectively unrolling the loop 16 times (each iteration of the SSE2 loop lower-cases 16 ASCII chars). If we unroll the loop 16 times for the lookup code, it finishes in 15.3 seconds. So loop overhead aside, the SSE2 implementation is about 3 times faster. Furthermore, if we employee the alpha-character only tolower in the unrolled loop it finishes in 8.9 seconds, and the SSE2 code that handles non-alpha ASCII characters as well as extended ASCII (codes greater than 127) is still faster.
The SSE2 implementation will be the fastest presented here, but lets say you don't want to use SSE2, you have 64-bit support, and you only have standard ASCII in the range [0, 127] and not extended ASCII in the range [0, 255] (the most significant bit is always zero). In that case you can use a 64-bit vectorized approach that hijacks the 64-valued bit in the same way the above bit trick operations hijack the 128-valued bit. The subtractions can be done on 8 characters shoved into a 64-bit integer, but the trick
Alternatives would be:
A) the standard tolower function
B) this bit trick if you have only ASCII alpha characters
(unsigned char)((c) | (unsigned char)0x20)C) this bit trick for all ASCII characters
// set the most significant bit in value X if greater than or equal to 'A'
// set the most significant bit in value Y if less than or equal to 'Z'
// bitwise AND X and Y to determine if both are true and mask out all other bits
// shift the set bit into the 0x20 place and bitwise OR in the bit
(unsigned char)(((((unsigned char)0x40 - (c)) & ((c) - (unsigned char)0x5b) & (unsigned char)0x80) >> 2) | (c))I ran 300 iterations on a pre-generated random string of one-billion bytes (including the null terminator) of some x64 code under full optimization and no debugger attached from Visual Studio 2013.
- your table lookup took about 22.4 seconds
- alternative A took about 106.0 seconds (the call was not inlined)
- alternative B took about 15.1 seconds (but it only works for strings only containing letters)
- alternative C took about 31.9 seconds (so the table beats the bit tricks)
For whatever it's worth, replacing the tolower logic with a NOP and simply copying the source string took 15.0 seconds, so that's what's being spent in loop overhead.
But no matter, your real opportunity for optimization here is in vectorization and parallelization.
However this cannot happen as the signature and contract to the
StrToLowerCase function exists now. This is for two reasons:- Any vectorized instructions will need to access data in larger chunks than one byte, and reading (and writing) past the null is bad if you don't know the arrays are padded out the appropriate amount.
- You don't know the length of the string without doing a byte-by-byte scan for the null terminator. So you can't break the string into chunks easily and spin off threads to process separate chunks, and you're going to have to check for the null terminator in the middle of vectorized code (if you go that route), and you will lose much of if not all of what you gain.
So, for the sake of moving forward. Let's assume our signature now looks like this:
void StrToLowerCase(const char src, char dst, unsigned len)where for the sake simplicity,
len is however many chunks of data the function must process whatever its size may be for the given implementation.Lets explore vectorized approaches:
If you have a processor that supports AVX2 instructions, then you can use the VGATHERDPS instruction which performs 8 vectorized lookups. However I don't have this instruction so I couldn't try it out. In fact few CPUs will since it first showed up in 2013. So moving on...
Almost everyone has SSE2 now days, and luckily the above bit trick can be vectorized with SSE2. Here is an implementation that operates on 128-bit chunks:
__forceinline void StrToLowerCaseSSE2(const char *src, char *dst, unsigned len)
{
__m128i const *src128 = (__m128i const*)src;
__m128i *dst128 = (__m128i*)dst;
while (len--)
{
__m128i const c16 = _mm_load_si128(src128++);
__m128i const abv = _mm_sub_epi8(_mm_set_epi32(0x40404040, 0x40404040, 0x40404040, 0x40404040), c16);
__m128i const blw = _mm_sub_epi8(c16, _mm_set_epi32(0x5B5B5B5B, 0x5B5B5B5B, 0x5B5B5B5B, 0x5B5B5B5B));
__m128i msk = _mm_and_si128(abv, blw);
msk = _mm_and_si128(msk, _mm_set_epi32(0x80808080, 0x80808080, 0x80808080, 0x80808080));
msk = _mm_srli_epi16(msk, 2);
msk = _mm_or_si128(msk, c16);
_mm_store_si128(dst128++, msk);
}
}The above SSE2 implementation finishes the test in just 5.3 seconds, roughly 4.5 times faster than the table lookup code that was previously that fastest. However, keep in mind some of this is in the loop overhead we eliminated by effectively unrolling the loop 16 times (each iteration of the SSE2 loop lower-cases 16 ASCII chars). If we unroll the loop 16 times for the lookup code, it finishes in 15.3 seconds. So loop overhead aside, the SSE2 implementation is about 3 times faster. Furthermore, if we employee the alpha-character only tolower in the unrolled loop it finishes in 8.9 seconds, and the SSE2 code that handles non-alpha ASCII characters as well as extended ASCII (codes greater than 127) is still faster.
The SSE2 implementation will be the fastest presented here, but lets say you don't want to use SSE2, you have 64-bit support, and you only have standard ASCII in the range [0, 127] and not extended ASCII in the range [0, 255] (the most significant bit is always zero). In that case you can use a 64-bit vectorized approach that hijacks the 64-valued bit in the same way the above bit trick operations hijack the 128-valued bit. The subtractions can be done on 8 characters shoved into a 64-bit integer, but the trick
Code Snippets
(unsigned char)((c) | (unsigned char)0x20)// set the most significant bit in value X if greater than or equal to 'A'
// set the most significant bit in value Y if less than or equal to 'Z'
// bitwise AND X and Y to determine if both are true and mask out all other bits
// shift the set bit into the 0x20 place and bitwise OR in the bit
(unsigned char)(((((unsigned char)0x40 - (c)) & ((c) - (unsigned char)0x5b) & (unsigned char)0x80) >> 2) | (c))__forceinline void StrToLowerCaseSSE2(const char *src, char *dst, unsigned len)
{
__m128i const *src128 = (__m128i const*)src;
__m128i *dst128 = (__m128i*)dst;
while (len--)
{
__m128i const c16 = _mm_load_si128(src128++);
__m128i const abv = _mm_sub_epi8(_mm_set_epi32(0x40404040, 0x40404040, 0x40404040, 0x40404040), c16);
__m128i const blw = _mm_sub_epi8(c16, _mm_set_epi32(0x5B5B5B5B, 0x5B5B5B5B, 0x5B5B5B5B, 0x5B5B5B5B));
__m128i msk = _mm_and_si128(abv, blw);
msk = _mm_and_si128(msk, _mm_set_epi32(0x80808080, 0x80808080, 0x80808080, 0x80808080));
msk = _mm_srli_epi16(msk, 2);
msk = _mm_or_si128(msk, c16);
_mm_store_si128(dst128++, msk);
}
}__forceinline void StrToLowerCase64(const char *src, char *dst, unsigned len)
{
unsigned const __int64 *src64 = (unsigned const __int64 *)src;
unsigned __int64 *dst64 = (unsigned __int64 *)dst;
while (len--)
{
unsigned const __int64 x = *src64++;
*dst64++ = (((0xC0C0C0C0C0C0C0C0 - x) & (x + 0x2525252525252525) & 0x4040404040404040) >> 1) | x;
}
}__forceinline void StrToLowerCaseSSE(const char* src, char* dst)
{
__m128i const *src128 = (__m128i const*)src;
__m128i *dst128 = (__m128i*)dst;
for (;;)
{
__m128i const c16 = _mm_load_si128(src128);
__m128i const null = _mm_cmpeq_epi8(c16, _mm_set_epi32(0, 0, 0, 0));
if (_mm_movemask_epi8(null))
break;
++src128;
__m128i const abv = _mm_sub_epi8(_mm_set_epi32(0x40404040, 0x40404040, 0x40404040, 0x40404040), c16);
__m128i const blw = _mm_sub_epi8(c16, _mm_set_epi32(0x5B5B5B5B, 0x5B5B5B5B, 0x5B5B5B5B, 0x5B5B5B5B));
__m128i msk = _mm_and_si128(abv, blw);
msk = _mm_and_si128(msk, _mm_set_epi32(0x80808080, 0x80808080, 0x80808080, 0x80808080));
msk = _mm_srli_epi16(msk, 2);
msk = _mm_or_si128(msk, c16);
_mm_store_si128(dst128++, msk);
}
src = (const char*)src128;
dst = (char*)dst128;
while (*src)
{
const unsigned char c = *src++;
*dst++ = (unsigned char)(((((unsigned char)0x40 - (c)) & ((c)-(unsigned char)0x5b) & (unsigned char)0x80) >> 2) | (c));
}
*dst = '\0';
}Context
StackExchange Code Review Q#31036, answer score: 7
Revisions (0)
No revisions yet.