UTF8 Codepoint decode and length
Problem
I needed a function that would:
- Decode and return the first character in a UTF-8 encoded string
- Return the length of the encoding, with the special case that the length of '\0' must be 0
- Performance is important
I had no special requirement on what to do with invalid sequences so I opted for the following behaviour:
- The first byte of an invalid sequence is treated as a single "character" (e.g. for the sequence "\xFF\x2F" it would return '\xFF' as value and 1 as length).
- Overlong encodings are accepted
I wrote the following function:
    static uint8_t LEN[] = {1,1,1,1,2,2,3,0};
    static uint8_t MSK[] = {0,0,3,4,5,0,0,0};

    static int utf8_cp(char *txt, int32_t *ch)
    {
        int len = 0;
        int32_t val = 0;
        uint8_t first = (uint8_t)(*txt);

        len = (first > 0) * (1 + ((first & 0xC0) == 0xC0) * LEN[(first >> 3) & 7]);
        val = first & (0xFF >> MSK[len]);
        for (int k=len; k>1; k--) {
            if ((*++txt & 0xC0) != 0x80) {
                val = first;
                len = 1;
                break;
            }
            val = (val << 6) | (*txt & 0x3F);
        }
        *ch = val;
        return len;
    }

So that code like:
    char *t; int l; int32_t c;
    t = "aàも𫀔";
    while(1) {
        l = utf8_cp(t, &c);
        printf("'%s' len:%d cp:0x%05x\n", t, l, c);
        if (*t == 0) break;
        t += l;
    }

Produces:

    'aàも𫀔' len:1 cp:0x00061
    'àも𫀔' len:2 cp:0x000e0
    'も𫀔' len:3 cp:0x03082
    '𫀔' len:4 cp:0x2b014
    '' len:0 cp:0x00000

To make it faster I thought about unrolling the for loop (but I wonder how much I could gain) and introducing, at the beginning, an if to handle ASCII characters (but I fear that branching could be more costly than just doing a bunch of operations).

I will appreciate any comments you may have and any suggestions for improvement.
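The ASCII shortcut mentioned above could be tried with a thin wrapper like the one below. This is only a sketch under assumptions: utf8_cp_fast and utf8_cp_general are hypothetical names, the general path simply repeats the logic of utf8_cp so the fragment compiles on its own, and whether the extra branch pays off depends on the input mix and has to be measured.

```c
#include <stdint.h>

/* General path: same logic as utf8_cp in the question, repeated here so the
   sketch compiles on its own. */
static const uint8_t LEN[] = {1,1,1,1,2,2,3,0};
static const uint8_t MSK[] = {0,0,3,4,5,0,0,0};

static int utf8_cp_general(const char *txt, int32_t *ch)
{
    uint8_t first = (uint8_t)*txt;
    int len = (first > 0) * (1 + ((first & 0xC0) == 0xC0) * LEN[(first >> 3) & 7]);
    int32_t val = first & (0xFF >> MSK[len]);

    for (int k = len; k > 1; k--) {
        if ((*++txt & 0xC0) != 0x80) {   /* not a continuation byte */
            val = first;
            len = 1;
            break;
        }
        val = (val << 6) | (*txt & 0x3F);
    }
    *ch = val;
    return len;
}

/* Hypothetical wrapper: a single branch handles ASCII, including the
   terminating '\0', which reports length 0. */
static int utf8_cp_fast(const char *txt, int32_t *ch)
{
    uint8_t first = (uint8_t)*txt;
    if (first < 0x80) {
        *ch = first;
        return first != 0;
    }
    return utf8_cp_general(txt, ch);
}
```

On mostly-ASCII text the branch is likely to be well predicted; on mixed text it may not be, which is why timing both versions on representative data is the only reliable answer.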
Solution
Rather than using narrow types, use the fastest ones.
    // uint8_t first = (uint8_t)(*txt);
    unsigned first = (uint8_t)(*txt);
    // or
    uint_fast8_t first = (uint8_t)(*txt);

Rather than lookup a value to shift, look up the shifted value.
    // static uint8_t MSK[] = {0,0,3,4,5,0,0,0};
    // val = first & (0xFF >> MSK[len]);
    static const uint8_t FF_MSK[] = {0xFF >>0, 0xFF >>0, 0xFF >>3,
        0xFF >>4, 0xFF >>5, 0xFF >>0, 0xFF >>0, 0xFF >>0};
    val = first & FF_MSK[len];

Some modern compilers can make additional optimizations if the pointers are known to not overlap - use restrict and const where applicable.

    // int utf8_cp(char *txt, int32_t *ch)
    int utf8_cp(const char * restrict txt, int32_t *restrict ch)

Coding the companion function would aid in testing for both functions.

    int utf8_cp_encode(int32_t *ch, char *txt);

As code does not detect invalid encoding like surrogates, redundant patterns and values above max_Unicode, I see little value in handling only a subset of invalid sequences. Either detect them all (maybe in debug mode) or skip detection.
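A possible shape for the companion encoder suggested above is sketched here. It is an assumption, not the answer's own code: utf8_cp_encode_sketch is a hypothetical name, it takes the code point by value rather than through a pointer as in the prototype, it expects the caller to supply a buffer of at least 4 bytes, and it rejects only values outside 0..0x10FFFF (surrogates pass through unchecked).

```c
#include <stdint.h>

/* Hypothetical companion encoder: writes the UTF-8 form of cp into out
   (room for 4 bytes required) and returns the number of bytes written,
   or 0 if cp is outside 0..0x10FFFF. Surrogates are not rejected. */
static int utf8_cp_encode_sketch(int32_t cp, char *out)
{
    if (cp < 0 || cp > 0x10FFFF)
        return 0;
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```

A round-trip test can then encode every code point and check that the decoder returns the same value and length. Note one asymmetry: this sketch encodes 0 as a single byte, while the decoder in the question reports length 0 for '\0', so that case needs its own check.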
Suggest doing a 32-byte (or 256-byte) lookup for performance. Profile to find optimal.
    // len = (first > 0) * (1 + ((first & 0xC0) == 0xC0) * LEN[(first >> 3) & 7]);
    len = (first > 0) * LEN_32[first >> 3];
    // or
    len = LEN_256[first];

Could extend the above to do one lookup for both the len and val.
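The single lookup for both len and the value mask could look roughly like this. It is a sketch under assumptions: the struct lead_info, the table LEAD_32 and the function utf8_cp_tbl are hypothetical names, the table is indexed by the top five bits of the first byte, and the invalid-sequence behaviour of the question (first byte returned alone with length 1) is kept unchanged.

```c
#include <stdint.h>

/* Hypothetical table entry: sequence length and the mask that extracts the
   payload bits of the lead byte. */
struct lead_info { uint8_t len; uint8_t mask; };

#define LEAD1 {1, 0xFF}   /* ASCII, stray continuation byte, or 0xF8..0xFF */
#define LEAD2 {2, 0x1F}   /* 110xxxxx */
#define LEAD3 {3, 0x0F}   /* 1110xxxx */
#define LEAD4 {4, 0x07}   /* 11110xxx */

/* Indexed by first >> 3, so each entry covers 8 consecutive byte values. */
static const struct lead_info LEAD_32[32] = {
    LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1,  /* 0x00..0x3F */
    LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1,  /* 0x40..0x7F */
    LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1, LEAD1,  /* 0x80..0xBF */
    LEAD2, LEAD2, LEAD2, LEAD2,                              /* 0xC0..0xDF */
    LEAD3, LEAD3,                                            /* 0xE0..0xEF */
    LEAD4,                                                   /* 0xF0..0xF7 */
    LEAD1                                                    /* 0xF8..0xFF */
};

static int utf8_cp_tbl(const char *restrict txt, int32_t *restrict ch)
{
    unsigned first = (uint8_t)*txt;
    struct lead_info info = LEAD_32[first >> 3];
    int len = (first > 0) * info.len;          /* '\0' keeps length 0 */
    int32_t val = (int32_t)(first & info.mask);

    for (int k = len; k > 1; k--) {
        unsigned next = (uint8_t)*++txt;
        if ((next & 0xC0) != 0x80) {           /* not a continuation byte */
            val = (int32_t)first;
            len = 1;
            break;
        }
        val = (val << 6) | (int32_t)(next & 0x3F);
    }
    *ch = val;
    return len;
}
```

A 256-entry version would drop the shift at the cost of a larger table; which one is faster depends on cache behaviour, so profiling is still needed.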
Context
StackExchange Code Review Q#141975, answer score: 4