patterncppMinor
UTF-8 character reader function
Viewed 0 times
functioncharacterreaderutf
Problem
You may see full code here (note that the link points to the specific commit).
Language is "clean C" (that is, a subset of C89, C99 and C++98 — it is intended to compile under all of these standards). The code must be portable between x86 and x86_64.
UTF-8 handling implemented based on information here and tested with data in this file.
First and foremost I'm interested in correctness. But I'm a bit worried about the length and readability of the code and its speed (I did not profile this thing yet though, so that particular worry is not motivated). I will accept any comments, including constructive nitpicks.
The function itself:
```
/*
Increments* len_bytes by the number of bytes read.
* Fails on invalid UTF-8 characters.
*/
static int ltsLS_eatutf8char(lts_LoadState ls, size_t len_bytes)
{
unsigned char b = 0;
signed char expected_length = 0;
int i = 0;
const unsigned char * origin = ls->pos;
/ Check if we have any data in the buffer /
if (!ltsLS_good(ls) || ltsLS_unread(ls) unread = 0;
ls->pos = NULL;
return LUATEXTS_ECLIPPED;
}
/ We have at least one byte in the buffer, let's check it out. /
b = *ls->pos;
/ We did just eat a byte, no matter what happens next. /
++ls->pos;
--ls->unread;
/ Get an expected length of a character. /
expected_length = utf8_char_len[b];
/ Check if it was a valid first byte. /
if (expected_length unread = 0;
ls->pos = NULL;
return LUATEXTS_EBADUTF8;
}
/ If it was a single-byte ASCII character, return right away. /
if (expected_length == 1)
{
*len_bytes += expected_length;
return LUATEXTS_ESUCCESS;
}
/*
* It was a multi-byte character. Check if we have enough bytes unread.
* Note that we've eaten one byte already.
*/
if (ltsLS_unread(ls) + 1 unread = 0;
ls->pos = NULL;
return LUATEXTS_ECLIPPED; / Assuming it is buffer's fault. /
}
/ Let's eat the rest of characters /
for (i = 1; i pos;
/* We did just eat
Language is "clean C" (that is, a subset of C89, C99 and C++98 — it is intended to compile under all of these standards). The code must be portable between x86 and x86_64.
UTF-8 handling implemented based on information here and tested with data in this file.
First and foremost I'm interested in correctness. But I'm a bit worried about the length and readability of the code and its speed (I did not profile this thing yet though, so that particular worry is not motivated). I will accept any comments, including constructive nitpicks.
The function itself:
```
/*
Increments* len_bytes by the number of bytes read.
* Fails on invalid UTF-8 characters.
*/
static int ltsLS_eatutf8char(lts_LoadState ls, size_t len_bytes)
{
unsigned char b = 0;
signed char expected_length = 0;
int i = 0;
const unsigned char * origin = ls->pos;
/ Check if we have any data in the buffer /
if (!ltsLS_good(ls) || ltsLS_unread(ls) unread = 0;
ls->pos = NULL;
return LUATEXTS_ECLIPPED;
}
/ We have at least one byte in the buffer, let's check it out. /
b = *ls->pos;
/ We did just eat a byte, no matter what happens next. /
++ls->pos;
--ls->unread;
/ Get an expected length of a character. /
expected_length = utf8_char_len[b];
/ Check if it was a valid first byte. /
if (expected_length unread = 0;
ls->pos = NULL;
return LUATEXTS_EBADUTF8;
}
/ If it was a single-byte ASCII character, return right away. /
if (expected_length == 1)
{
*len_bytes += expected_length;
return LUATEXTS_ESUCCESS;
}
/*
* It was a multi-byte character. Check if we have enough bytes unread.
* Note that we've eaten one byte already.
*/
if (ltsLS_unread(ls) + 1 unread = 0;
ls->pos = NULL;
return LUATEXTS_ECLIPPED; / Assuming it is buffer's fault. /
}
/ Let's eat the rest of characters /
for (i = 1; i pos;
/* We did just eat
Solution
One obvious issue - the longest valid Unicode character is represented by 4 bytes in UTF-8. Although it is possible to extend the logic to 5 or 6 bytes, there is no need since Unicode is a 21-bit codeset.
Be careful about using version 3.2 of the Unicode standard; the current version is 6.0. Granted, they keep them as backwards compatible as possible, but it is as well to use the latest version. Paragraph C5 in 6.0.0 bears no resemblance to the paragraph you quote. The bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8.
In view of that, you can modify your lookup table to:
The zeroes for the forbidden bytes will trigger an early exit.
The loadstate structure is odd; you have to have something else that keeps a record of where the start of the buffer is. However, since the code zaps the loadstate on encountering an error, there has to be another place where the information is stored, so maybe it isn't too critical.
A good compiler might well optimize:
so that it is the same as:
The overlong and surrogate testing is interesting. Table 3.7 from Chapter 3 of the 6.0.0 Unicode standard lists:
You need to do some extra testing for 0xF4 as the first byte; your code allows through quite a lot of invalid characters. Otherwise, the tests you have do work, allowing through the surrogates, which you test for separately.
The test for the surrogates is incorrect. The bit masking operation is wrong, allowing through many surrogates. I think you could write:
The conditions for overlong forms and surrogates can be reduced to:
This is vastly shorter and therefore more readable than the original. The near symmetry of the terms is made evident by the layout. It is wider than 80 characters; if that's a problem, I'd shorten the name
Be careful about using version 3.2 of the Unicode standard; the current version is 6.0. Granted, they keep them as backwards compatible as possible, but it is as well to use the latest version. Paragraph C5 in 6.0.0 bears no resemblance to the paragraph you quote. The bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8.
In view of that, you can modify your lookup table to:
static const signed char utf8_char_len[256] =
{
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};The zeroes for the forbidden bytes will trigger an early exit.
The loadstate structure is odd; you have to have something else that keeps a record of where the start of the buffer is. However, since the code zaps the loadstate on encountering an error, there has to be another place where the information is stored, so maybe it isn't too critical.
A good compiler might well optimize:
if (expected_length == 1)
{
*len_bytes += expected_length;
return LUATEXTS_ESUCCESS;
}so that it is the same as:
if (expected_length == 1)
{
*len_bytes++;
return LUATEXTS_ESUCCESS;
}The overlong and surrogate testing is interesting. Table 3.7 from Chapter 3 of the 6.0.0 Unicode standard lists:
Table 3-7. Well-Formed UTF-8 Byte Sequences
Code Points First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BFYou need to do some extra testing for 0xF4 as the first byte; your code allows through quite a lot of invalid characters. Otherwise, the tests you have do work, allowing through the surrogates, which you test for separately.
The test for the surrogates is incorrect. The bit masking operation is wrong, allowing through many surrogates. I think you could write:
if (expected_length == 3 && (origin[0] == 0xED && (origin[1] & 0xE0) != 0x80))The conditions for overlong forms and surrogates can be reduced to:
/* All bytes are correct; check out for overlong forms and surrogates */
if ((expected_length == 2 && ((origin[0] & 0xFE) == 0xC0)) ||
(expected_length == 3 && (origin[0] == 0xE0 && (origin[1] & 0xE0) == 0x80)) ||
(expected_length == 4 && (origin[0] == 0xF0 && (origin[1] & 0xF0) == 0x80)) ||
(expected_length == 4 && (origin[0] == 0xF4 && (origin[1] > 0x8F))) ||
(expected_length == 3 && (origin[0] == 0xED && (origin[1] & 0xE0) != 0x80)))
{
ls->unread = 0;
ls->pos = NULL;
return LUATEXTS_EBADUTF8;
}This is vastly shorter and therefore more readable than the original. The near symmetry of the terms is made evident by the layout. It is wider than 80 characters; if that's a problem, I'd shorten the name
expected_length (to maybe exp_len).Code Snippets
static const signed char utf8_char_len[256] =
{
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};if (expected_length == 1)
{
*len_bytes += expected_length;
return LUATEXTS_ESUCCESS;
}if (expected_length == 1)
{
*len_bytes++;
return LUATEXTS_ESUCCESS;
}Table 3-7. Well-Formed UTF-8 Byte Sequences
Code Points First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BFif (expected_length == 3 && (origin[0] == 0xED && (origin[1] & 0xE0) != 0x80))Context
StackExchange Code Review Q#1624, answer score: 8
Revisions (0)
No revisions yet.