Validating UTF-8 byte array
Problem
I'm writing a validator function that receives a byte[] and checks whether it represents a valid UTF-8 byte sequence, according to this table.
Is my approach correct? So far it has worked fine with my tests, but I'm worried that I might be missing some edge case, or that the way I'm handling bytes is off.
And more importantly, how can I improve the conditions using bit-twiddling? I'm a bit rusty here, and I think that a comparison such as this:
    if ((b & 0xFF) >> 1 == (byte) 0b1111110)
can be expressed in a simpler, more idiomatic way.
public static boolean validate(byte[] bytes) {
    int length = bytes.length;
    if (length < 1 || length > 6)
        return false;
    byte b = bytes[0];
    if (length == 1)
        return (b & (1 << 7)) == 0;
    int n;
    if ((b & 0xFF) >> 1 == (byte) 0b1111110)
        n = 5;
    else if ((b & 0xFF) >> 2 == (byte) 0b111110)
        n = 4;
    else if ((b & 0xFF) >> 3 == (byte) 0b11110)
        n = 3;
    else if ((b & 0xFF) >> 4 == (byte) 0b1110)
        n = 2;
    else if ((b & 0xFF) >> 5 == (byte) 0b110)
        n = 1;
    else
        return false;
    if (length-1 != n)
        return false;
    for (int i = 1; i < length; i++)
        if ((bytes[i] & 0xFF) >> 6 != (byte) 0b10)
            return false;
    return true;
}
And these are some tests to try it out:
byte[] bytes1 = {(byte) 0b11001111, (byte) 0b10111111};
System.out.println(validate(bytes1)); // true
byte[] bytes2 = {(byte) 0b11101111, (byte) 0b10101010, (byte) 0b10111111};
System.out.println(validate(bytes2)); // true
byte[] bytes3 = {(byte) 0b10001111, (byte) 0b10111111};
System.out.println(validate(bytes3)); // false
byte[] bytes4 = {(byte) 0b11101111, (byte) 0b10101010, (byte) 0b00111111};
System.out.println(validate(bytes4)); // false
Solution
Logic
Never omit the optional braces like that. Think of yourself as a contributing factor to a future coding accident. If you really want to omit braces, then put the statement on the same line, so that there is no possibility of misinterpretation.
The function does not check for overlong encodings, invalid byte sequences, or invalid code points. Those caveats should be declared in JavaDoc.
I would eliminate the length > 6 check, as 6 is a magic number. You don't need that special case anyway.
I would also incorporate the length == 1 special case into the regular logic.
In general, text is more likely to contain shorter UTF-8 sequences than longer ones, so you might as well handle the shorter cases first to save a few CPU cycles.
Rather than setting n to be the number of trailing bytes, set it to be the expected length of the array. I'd rename n → expectedLen to be more descriptive. That makes the code more readable (and saves one pointless subtraction).
Eliminate the bit-shifting. Just AND with the bitmask to specify which bits you are interested in inspecting.
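As a quick sanity check on that last point, here is a minimal sketch that compares the shift form from the question with the mask form for every possible byte value. The class and method names are hypothetical and exist only for this illustration; only the two-byte lead pattern is checked, on the assumption that the other patterns behave the same way.

```java
class MaskVersusShift {
    // Shift form (as in the question): move the top three bits down, then compare.
    static boolean isTwoByteLeadShift(byte b) {
        return (b & 0xFF) >> 5 == 0b110;
    }

    // Mask form: keep the top three bits where they are and compare in place.
    // No "& 0xFF" is needed, because the mask itself discards the sign-extended bits.
    static boolean isTwoByteLeadMask(byte b) {
        return (b & 0b11100000) == 0b11000000;
    }

    public static void main(String[] args) {
        int mismatches = 0;
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            byte b = (byte) v;
            if (isTwoByteLeadShift(b) != isTwoByteLeadMask(b)) {
                mismatches++;
            }
        }
        System.out.println("mismatches: " + mismatches); // prints "mismatches: 0"
    }
}
```

The two forms agree on all 256 inputs, so the mask version is a drop-in replacement that also reads closer to the bit patterns in the UTF-8 table.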
public static boolean validate(byte[] bytes) {
    int expectedLen;
    if      (bytes.length == 0)                        return false;
    else if ((bytes[0] & 0b10000000) == 0b00000000)    expectedLen = 1;
    else if ((bytes[0] & 0b11100000) == 0b11000000)    expectedLen = 2;
    else if ((bytes[0] & 0b11110000) == 0b11100000)    expectedLen = 3;
    else if ((bytes[0] & 0b11111000) == 0b11110000)    expectedLen = 4;
    else if ((bytes[0] & 0b11111100) == 0b11111000)    expectedLen = 5;
    else if ((bytes[0] & 0b11111110) == 0b11111100)    expectedLen = 6;
    else                                               return false;
    if (expectedLen != bytes.length) return false;
    for (int i = 1; i < bytes.length; i++) {
        if ((bytes[i] & 0b11000000) != 0b10000000) {
            return false;
        }
    }
    return true;
}
Interface design
It's rarely useful to validate a single character: usually, you'll want to validate a whole string. You should name your function to avoid giving the impression that it checks for multi-character strings.
In Java, functions that perform a test and return a boolean are conventionally named isSomething() or hasSomething(). A function named validate() suggests that it performs an action as a side-effect, perhaps throwing an exception on failure.
Therefore, I'd rename your function to isValidChar(byte[] bytes).
So, what if I need to validate a string? I'd have to somehow chunk it up into right-sized byte arrays first. That's not really possible without examining the string using similar logic to what is in the function itself. Even then, it would be wasteful to construct a byte array just for the function call. Therefore, I think that it would be more useful to provide a function that validates a string. To go further, you could make such a function return something more informative than just a boolean.
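/**
 * Returns the number of UTF-8 characters, or -1 if the array
 * does not contain a valid UTF-8 string. Overlong encodings,
 * null characters, invalid Unicode values, and surrogates are
 * accepted.
 */
public static int charLength(byte[] bytes) {
    int charCount = 0, expectedLen;
    for (int i = 0; i < bytes.length; i++) {
        charCount++;
        // Lead byte analysis
        if      ((bytes[i] & 0b10000000) == 0b00000000) continue;
        else if ((bytes[i] & 0b11100000) == 0b11000000) expectedLen = 2;
        else if ((bytes[i] & 0b11110000) == 0b11100000) expectedLen = 3;
        else if ((bytes[i] & 0b11111000) == 0b11110000) expectedLen = 4;
        else if ((bytes[i] & 0b11111100) == 0b11111000) expectedLen = 5;
        else if ((bytes[i] & 0b11111110) == 0b11111100) expectedLen = 6;
        else    return -1;
        // Count trailing bytes
        while (--expectedLen > 0) {
            if (++i >= bytes.length) {
                return -1;
            }
            if ((bytes[i] & 0b11000000) != 0b10000000) {
                return -1;
            }
        }
    }
    return charCount;
}
That's a more versatile function, for about the same amount of code.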
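If the cases that the JavaDoc lists as accepted (overlong encodings, surrogates, and so on) must be rejected, one option is to hand the bytes to the JDK's own UTF-8 decoder and ask it to report errors instead of silently replacing them. The sketch below is not part of the review above; the class and method names are hypothetical, and it assumes that allocating a decoded buffer per call is acceptable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Hypothetical helper: true only if the JDK's UTF-8 decoder accepts the whole array.
    static boolean isStrictlyValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on the first bad sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] twoByte   = {(byte) 0b11001111, (byte) 0b10111111};  // U+03FF
        byte[] overlong  = {(byte) 0b11000000, (byte) 0b10000000};  // overlong form of U+0000
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // encodes surrogate U+D800
        System.out.println(isStrictlyValidUtf8(twoByte));   // true
        System.out.println(isStrictlyValidUtf8(overlong));  // false
        System.out.println(isStrictlyValidUtf8(surrogate)); // false
    }
}
```

The hand-written bit tests accept the last two arrays because they only look at the lead and continuation bit patterns, which is exactly the caveat the JavaDoc spells out.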
Code Snippets
public static boolean validate(byte[] bytes) {
    int expectedLen;
    if      (bytes.length == 0)                        return false;
    else if ((bytes[0] & 0b10000000) == 0b00000000)    expectedLen = 1;
    else if ((bytes[0] & 0b11100000) == 0b11000000)    expectedLen = 2;
    else if ((bytes[0] & 0b11110000) == 0b11100000)    expectedLen = 3;
    else if ((bytes[0] & 0b11111000) == 0b11110000)    expectedLen = 4;
    else if ((bytes[0] & 0b11111100) == 0b11111000)    expectedLen = 5;
    else if ((bytes[0] & 0b11111110) == 0b11111100)    expectedLen = 6;
    else                                               return false;
    if (expectedLen != bytes.length) return false;
    for (int i = 1; i < bytes.length; i++) {
        if ((bytes[i] & 0b11000000) != 0b10000000) {
            return false;
        }
    }
    return true;
}

/**
 * Returns the number of UTF-8 characters, or -1 if the array
 * does not contain a valid UTF-8 string. Overlong encodings,
 * null characters, invalid Unicode values, and surrogates are
 * accepted.
 */
public static int charLength(byte[] bytes) {
    int charCount = 0, expectedLen;
    for (int i = 0; i < bytes.length; i++) {
        charCount++;
        // Lead byte analysis
        if      ((bytes[i] & 0b10000000) == 0b00000000) continue;
        else if ((bytes[i] & 0b11100000) == 0b11000000) expectedLen = 2;
        else if ((bytes[i] & 0b11110000) == 0b11100000) expectedLen = 3;
        else if ((bytes[i] & 0b11111000) == 0b11110000) expectedLen = 4;
        else if ((bytes[i] & 0b11111100) == 0b11111000) expectedLen = 5;
        else if ((bytes[i] & 0b11111110) == 0b11111100) expectedLen = 6;
        else    return -1;
        // Count trailing bytes
        while (--expectedLen > 0) {
            if (++i >= bytes.length) {
                return -1;
            }
            if ((bytes[i] & 0b11000000) != 0b10000000) {
                return -1;
            }
        }
    }
    return charCount;
}
Context
StackExchange Code Review Q#59428, answer score: 14