Validating UTF-8 byte array

Submitted by: @import:stackexchange-codereview

Problem

I'm writing a validator function that receives a byte[] and checks whether it represents a valid UTF-8 byte sequence, according to this table.

Is my approach correct? So far it has worked fine with my tests, but I'm worried that I might be missing some edge case, or that the way I'm handling bytes is off.

And more importantly, how can I improve the conditions using bit-twiddling? I'm a bit rusty here, and I think that a comparison such as this:

if ((b & 0xFF) >> 1 == (byte) 0b1111110)


can be expressed in a simpler, more idiomatic way.

public static boolean validate(byte[] bytes) {

    int length = bytes.length;

    if (length < 1 || length > 6)
        return false;

    byte b = bytes[0];

    if (length == 1)
        return (b & (1 << 7)) == 0;

    int n;

    if ((b & 0xFF) >> 1 == (byte) 0b1111110)
        n = 5;
    else if ((b & 0xFF) >> 2 == (byte) 0b111110)
        n = 4;
    else if ((b & 0xFF) >> 3 == (byte) 0b11110)
        n = 3;
    else if ((b & 0xFF) >> 4 == (byte) 0b1110)
        n = 2;
    else if ((b & 0xFF) >> 5 == (byte) 0b110)
        n = 1;
    else
        return false;

    if (length-1 != n)
        return false;

    for (int i = 1; i < length; i++)
        if ((bytes[i] & 0xFF) >> 6 != (byte) 0b10)
            return false;

    return true;

}


And these are some tests to try it out:

byte[] bytes1 = {(byte) 0b11001111, (byte) 0b10111111};
System.out.println(validate(bytes1)); // true

byte[] bytes2 = {(byte) 0b11101111, (byte) 0b10101010, (byte) 0b10111111};
System.out.println(validate(bytes2)); // true

byte[] bytes3 = {(byte) 0b10001111, (byte) 0b10111111};
System.out.println(validate(bytes3)); // false

byte[] bytes4 = {(byte) 0b11101111, (byte) 0b10101010, (byte) 0b00111111};
System.out.println(validate(bytes4)); // false

Solution

Logic

Never omit the optional braces like that. Think of yourself as a contributing factor to a future coding accident. If you really want to omit braces, then put the statement on the same line, so that there is no possibility of misinterpretation.
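
To make the risk concrete, here is a small sketch (the class and method names are made up for illustration). The indentation suggests that two statements are guarded by the if, but only the first one is:

```java
public class BraceDemo {
    // Indentation suggests both statements belong to the if; only the first does.
    static int misleading(boolean invalid) {
        int errors = 0;
        if (invalid)
            errors++;
            errors += 10; // not part of the if: runs unconditionally
        return errors;
    }

    // With braces, the guarded block is unambiguous.
    static int braced(boolean invalid) {
        int errors = 0;
        if (invalid) {
            errors++;
            errors += 10;
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(misleading(false)); // 10
        System.out.println(braced(false));     // 0
    }
}
```

Both versions compile without complaint, which is exactly why the unbraced form is easy to get wrong during a later edit.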

The function does not check for overlong encodings, invalid byte sequences, or invalid code points. Those caveats should be declared in JavaDoc.
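
To see what those caveats mean in practice, a stricter check can be delegated to the JDK's own decoder. This is only a sketch for comparison, not a replacement for your function: the class name StrictUtf8Check and the method isStrictlyValid are hypothetical, and it validates a whole byte sequence rather than a single character. It assumes that the standard UTF-8 CharsetDecoder reports overlong forms and encoded surrogates as malformed input. Note also that current UTF-8 (RFC 3629) stops at four bytes, so 5- and 6-byte sequences are rejected as well.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    /** Returns true if the JDK's UTF-8 decoder accepts the whole array. */
    static boolean isStrictlyValid(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};      // U+20AC
        byte[] overlongNul = {(byte) 0xC0, (byte) 0x80};            // overlong U+0000
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // U+D800

        System.out.println(isStrictlyValid(euro));        // true
        System.out.println(isStrictlyValid(overlongNul)); // false
        System.out.println(isStrictlyValid(surrogate));   // false
    }
}
```

The last two inputs have the right lead-byte and continuation-byte bit patterns, so a purely structural check like yours accepts them.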

I would eliminate the length > 6 check, as 6 is a magic number. You don't need that special case anyway.

I would also incorporate the length == 1 special case into the regular logic.

In general, text is more likely to contain shorter UTF-8 sequences than longer ones, so you might as well handle the shorter cases first to save a few CPU cycles.

Rather than setting n to be the number of trailing bytes, set it to be the expected length of the array. I'd also rename n to expectedLen to be more descriptive. That makes the code more readable (and saves one pointless subtraction).

Eliminate the bit-shifting. Just AND with the bitmask to specify which bits you are interested in inspecting.

public static boolean validate(byte[] bytes) {
    int expectedLen;
    if      (bytes.length == 0)                     return false;
    else if ((bytes[0] & 0b10000000) == 0b00000000) expectedLen = 1;
    else if ((bytes[0] & 0b11100000) == 0b11000000) expectedLen = 2;
    else if ((bytes[0] & 0b11110000) == 0b11100000) expectedLen = 3;
    else if ((bytes[0] & 0b11111000) == 0b11110000) expectedLen = 4;
    else if ((bytes[0] & 0b11111100) == 0b11111000) expectedLen = 5;
    else if ((bytes[0] & 0b11111110) == 0b11111100) expectedLen = 6;
    else    return false;

    if (expectedLen != bytes.length) return false;

    for (int i = 1; i < bytes.length; i++) {
        if ((bytes[i] & 0b11000000) != 0b10000000) {
            return false;
        }
    }

    return true;
}
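
If you are worried that Java's signed bytes make the mask comparison unsafe without & 0xFF, the following sketch checks it exhaustively for the two-byte lead pattern. The class and method names are hypothetical. The point is that ANDing with an int mask whose set bits all lie in the low eight bits discards the sign-extended upper bits, so the result matches the shift-based form for every byte value.

```java
public class LeadByteMaskDemo {
    // Shift-based form: widen to an unsigned value, shift, then compare.
    static boolean isTwoByteLeadShift(byte b) {
        return (b & 0xFF) >> 5 == 0b110;
    }

    // Mask-based form: select the top three bits and compare in place.
    static boolean isTwoByteLeadMask(byte b) {
        return (b & 0b11100000) == 0b11000000;
    }

    // Counts the byte values on which the two forms disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int v = 0; v < 256; v++) {
            byte b = (byte) v;
            if (isTwoByteLeadShift(b) != isTwoByteLeadMask(b)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xC3; // negative as a Java byte
        // Sign extension yields 0xFFFFFFC3; the mask keeps only bits 5 to 7.
        System.out.println(Integer.toHexString(b & 0b11100000)); // c0
        System.out.println(countMismatches());                   // 0
    }
}
```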


Interface design

It's rarely useful to validate a single character: usually, you'll want to validate a whole string. You should name your function so that it doesn't give the impression that it checks multi-character strings.

In Java, functions that perform a test and return a boolean are conventionally named isSomething() or hasSomething(). A function named validate() suggests that it performs an action as a side-effect, perhaps throwing an exception on failure.

Therefore, I'd rename your function to isValidChar(byte[] bytes).

So, what if I need to validate a string? I'd have to somehow chunk it up into right-sized byte arrays first. That's not really possible without examining the string using similar logic to what is in the function itself. Even then, it would be wasteful to construct a byte array just for the function call. Therefore, I think that it would be more useful to provide a function that validates a string. To go further, you could make such a function return something more informative than just a boolean.

/**
 * Returns the number of UTF-8 characters, or -1 if the array
 * does not contain a valid UTF-8 string.  Overlong encodings,
 * null characters, invalid Unicode values, and surrogates are
 * accepted.
 */
public static int charLength(byte[] bytes) {
    int charCount = 0, expectedLen;

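    for (int i = 0; i < bytes.length; i++) {
        charCount++;
        // Lead byte analysis
        if      ((bytes[i] & 0b10000000) == 0b00000000) continue;
        else if ((bytes[i] & 0b11100000) == 0b11000000) expectedLen = 2;
        else if ((bytes[i] & 0b11110000) == 0b11100000) expectedLen = 3;
        else if ((bytes[i] & 0b11111000) == 0b11110000) expectedLen = 4;
        else if ((bytes[i] & 0b11111100) == 0b11111000) expectedLen = 5;
        else if ((bytes[i] & 0b11111110) == 0b11111100) expectedLen = 6;
        else    return -1;

        // Count trailing bytes
        while (--expectedLen > 0) {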
            if (++i >= bytes.length) {
                return -1;
            }
            if ((bytes[i] & 0b11000000) != 0b10000000) {
                return -1;
            }
        }
    }
    return charCount;
}


That's a more versatile function, for about the same amount of code.
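
As a rough usage sketch, here is the same function wrapped in a hypothetical Utf8Length class and run on a few inputs. The sample inputs are mine, chosen to exercise the normal, truncated, and stray-continuation paths. Note that the result counts encoded characters (code points), which can differ from String.length() for characters outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    /** Same structural logic as charLength above; -1 means malformed. */
    public static int charLength(byte[] bytes) {
        int charCount = 0, expectedLen;

        for (int i = 0; i < bytes.length; i++) {
            charCount++;
            // Lead byte analysis
            if      ((bytes[i] & 0b10000000) == 0b00000000) continue;
            else if ((bytes[i] & 0b11100000) == 0b11000000) expectedLen = 2;
            else if ((bytes[i] & 0b11110000) == 0b11100000) expectedLen = 3;
            else if ((bytes[i] & 0b11111000) == 0b11110000) expectedLen = 4;
            else if ((bytes[i] & 0b11111100) == 0b11111000) expectedLen = 5;
            else if ((bytes[i] & 0b11111110) == 0b11111100) expectedLen = 6;
            else    return -1;

            // Count trailing bytes
            while (--expectedLen > 0) {
                if (++i >= bytes.length) {
                    return -1;
                }
                if ((bytes[i] & 0b11000000) != 0b10000000) {
                    return -1;
                }
            }
        }
        return charCount;
    }

    public static void main(String[] args) {
        byte[] hello = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);      // 6 bytes
        byte[] emoji = "\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);    // 4 bytes
        byte[] truncated = {(byte) 0xE2, (byte) 0x82};                     // missing third byte
        byte[] strayContinuation = {(byte) 0x80};                          // no lead byte

        System.out.println(charLength(hello));             // 5
        System.out.println(charLength(emoji));             // 1 (String.length() is 2)
        System.out.println(charLength(truncated));         // -1
        System.out.println(charLength(strayContinuation)); // -1
    }
}
```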

Context

StackExchange Code Review Q#59428, answer score: 14
