Macros to detect UTF-8
Problem
I'm working on a program that handles UTF-8 characters. I've made the following macros to detect UTF-8. I've tested them with a few thousand words and they seem to work.
I'll add another one to do error-checking later, but for now I would like to know what mistakes I've made and how these macros can be improved.
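//check if value is in range of leading byte
#define IS_UTF8_LEADING_BYTE(b) (((unsigned char)(b) >= 192) && (unsigned char)(b) < 248)
//check if value is in range of sequence byte
#define IS_UTF8_SEQUENCE_BYTE(b) ((unsigned char)(b) >= 128 && (unsigned char)(b) < 192)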
//can be any utf8 byte, first, last...
#define IS_UTF8_BYTE(b) (IS_UTF8_LEADING_BYTE(b) || IS_UTF8_SEQUENCE_BYTE(b))
//no error checking, it must be used only on leading byte
#define HOW_MANY_UTF8_SEQUENCE_BYTES(b) ((((unsigned char)(b) & 64) == 64) + (((unsigned char)(b) & 32) == 32) + (((unsigned char)(b) & 16) == 16))
Solution
There are a lot of magic numbers. They are numbers that would be familiar to any good programmer, but magic numbers nonetheless. Furthermore, it's more useful to think of the problem using bitwise operations, since they are more closely related to the conceptual design behind UTF-8.
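To make the bitwise view concrete, here is a minimal sketch in which each class of UTF-8 byte is described by a mask and the bit pattern expected under that mask. All constant and function names are hypothetical, invented for this illustration; they are not part of the macros under review.

```c
#include <assert.h>

/* Hypothetical names: each byte class is a (mask, expected bits) pair. */
enum {
    UTF8_CONT_MASK  = 0xC0, UTF8_CONT_BITS  = 0x80, /* 0b10?????? */
    UTF8_LEAD2_MASK = 0xE0, UTF8_LEAD2_BITS = 0xC0, /* 0b110????? */
    UTF8_LEAD3_MASK = 0xF0, UTF8_LEAD3_BITS = 0xE0, /* 0b1110???? */
    UTF8_LEAD4_MASK = 0xF8, UTF8_LEAD4_BITS = 0xF0  /* 0b11110??? */
};

/* Returns 1 when the bits of b selected by mask equal bits, else 0. */
static int utf8_matches(unsigned char b, unsigned mask, unsigned bits)
{
    return (b & mask) == bits;
}

/* Continuation (sequence) byte: 0b10?????? */
static int utf8_is_continuation(unsigned char b)
{
    return utf8_matches(b, UTF8_CONT_MASK, UTF8_CONT_BITS);
}

/* Leading byte of a two-, three-, or four-byte sequence. */
static int utf8_is_leading(unsigned char b)
{
    return utf8_matches(b, UTF8_LEAD2_MASK, UTF8_LEAD2_BITS)
        || utf8_matches(b, UTF8_LEAD3_MASK, UTF8_LEAD3_BITS)
        || utf8_matches(b, UTF8_LEAD4_MASK, UTF8_LEAD4_BITS);
}

/* Self-check: U+00E9 is encoded as the two bytes 0xC3 0xA9. */
static void utf8_self_check(void)
{
    assert(utf8_is_leading(0xC3));
    assert(utf8_is_continuation(0xA9));
    assert(!utf8_is_leading(0x41));      /* ASCII 'A' is neither kind */
    assert(!utf8_is_continuation(0x41));
}
```

Calling utf8_self_check() from main() runs the assertions. Functions are used here instead of macros only to keep the sketch easy to read; the same mask and pattern pairs would work inside macros.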
You only accept two-, three-, and four-byte sequences, i.e., only code points from U+0080 up to the four-byte bit-pattern limit of U+1FFFFF (Unicode itself ends at U+10FFFF). That covers the Basic Multilingual Plane (minus the ASCII set) plus the supplementary planes. That should be made clear in comments.
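The upper end of that range comes from counting payload bits: the leading byte of an n-byte sequence spends n + 1 bits on its prefix, and every continuation byte carries 6 payload bits. A small sketch of that arithmetic, using a hypothetical helper name that is not part of the reviewed code:

```c
#include <assert.h>

/* Hypothetical helper: largest value that fits in the payload bits of an
   n-byte UTF-8 bit pattern, for n = 2, 3 or 4. This is a pattern-level
   limit only; it does not check whether the value is a valid code point. */
static unsigned long utf8_pattern_max(int n)
{
    /* The leading byte keeps 7 - n payload bits; each of the n - 1
       continuation bytes adds 6 more. */
    int payload_bits = (7 - n) + 6 * (n - 1);
    return (1UL << payload_bits) - 1;
}

/* Self-check for the three sequence lengths. */
static void utf8_pattern_max_check(void)
{
    assert(utf8_pattern_max(2) == 0x7FFUL);    /* 11 payload bits */
    assert(utf8_pattern_max(3) == 0xFFFFUL);   /* 16 payload bits */
    assert(utf8_pattern_max(4) == 0x1FFFFFUL); /* 21 payload bits */
}
```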
I suggest renaming IS_UTF8_BYTE() to IS_UTF8_MULTIBYTE(), since ASCII characters ≤ 127 are valid UTF-8 bytes too, just not members of a multibyte sequence.
//check if value is in range of leading byte (0b11?????? but not 0b11111???)
#define IS_UTF8_LEADING_BYTE(b) ((((b) & 0xc0) == 0xc0) && (((b) & 0xf8) != 0xf8))
//check if value is in range of sequence byte (0b10??????)
#define IS_UTF8_SEQUENCE_BYTE(b) (((b) & 0x80) && !((b) & 0x40))
//can be any byte within a UTF-8 multibyte sequence
#define IS_UTF8_MULTIBYTE(b) (IS_UTF8_LEADING_BYTE(b) || IS_UTF8_SEQUENCE_BYTE(b))
//no error checking; it must be used only on leading byte
//each term also tests the prefix bits above it, so payload bits in 0b110????? are never counted
#define HOW_MANY_UTF8_SEQUENCE_BYTES(b) ((((b) & 0x40) == 0x40) + (((b) & 0x60) == 0x60) + (((b) & 0x70) == 0x70))
IS_UTF8_SEQUENCE_BYTE() could be more clever.
//check if value is in range of sequence byte (0b10??????)
#define IS_UTF8_SEQUENCE_BYTE(b) (((b) & 0xc0) == 0x80)
Code Snippets
//check if value is in range of leading byte (0b11?????? but not 0b11111???)
#define IS_UTF8_LEADING_BYTE(b) ((((b) & 0xc0) == 0xc0) && (((b) & 0xf8) != 0xf8))
//check if value is in range of sequence byte (0b10??????)
#define IS_UTF8_SEQUENCE_BYTE(b) (((b) & 0x80) && !((b) & 0x40))
//can be any byte within a UTF-8 multibyte sequence
#define IS_UTF8_MULTIBYTE(b) (IS_UTF8_LEADING_BYTE(b) || IS_UTF8_SEQUENCE_BYTE(b))
//no error checking; it must be used only on leading byte
//each term also tests the prefix bits above it, so payload bits in 0b110????? are never counted
#define HOW_MANY_UTF8_SEQUENCE_BYTES(b) ((((b) & 0x40) == 0x40) + (((b) & 0x60) == 0x60) + (((b) & 0x70) == 0x70))

//check if value is in range of sequence byte (0b10??????)
#define IS_UTF8_SEQUENCE_BYTE(b) (((b) & 0xc0) == 0x80)
Context
StackExchange Code Review Q#32567, answer score: 3