patternpythonMinor

UTF-8 Validation

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

validationutfstackoverflow

Problem

I've recently solved the "UTF-8 Validation" LeetCode problem:

A character in UTF8 can be from 1 to 4 bytes long, subjected to the
following rules:

For 1-byte character, the first bit is a 0, followed by its unicode
code.

For n-bytes character, the first n-bits are all one's, the n+1
bit is 0, followed by n-1 bytes with most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Given an array of integers representing the data, return whether it is
a valid utf-8 encoding.

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This
means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence:

11000101
  10000010 00000001

.

Return true. It is a valid utf-8 encoding for a 2-bytes character
followed by a 1-byte character.

Example 2:

data = [235, 140, 4], which represented the octet sequence:
11101011 10001100 00000100.

Return false. The first 3 bits are all one's and the 4th bit is 0
means it is a 3-bytes character. The next byte is a continuation byte
which starts with 10 and that's correct. But the second continuation
byte does not start with 10, so it is invalid.

The code works and was accepted by the OJ:

```
NUMBER_OF_BITS_PER_BLOCK = 8
MAX_NUMBER_OF_ONES = 4

class Solution(object):
def validUtf8(self, data):
"""
:type data: List[int]
:rtype: bool
"""
index = 0
while index >= (NUMBER_OF_BITS_PER_BLOCK - 1)
if

Solution

Bugs

I'm surprised that the code was accepted by the online judge. This part of the code makes no sense to me:

# check for out of bounds and exit early
if index >= len(data) or index >= (index + number_of_ones - 1):
    return False  

# every next byte has to start with "10"
for i in range(index, index + number_of_ones - 1):
    number = data[i]

    number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
    if number != 1:
        return False
    number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
    if number != 0:
        return False

    index += 1

What does index >= (index + number_of_ones - 1) mean? It's the same as number_of_ones = len(data): return False. That verifies that the data are long enough to contain one trailing byte, but it does not ensure that it is long enough to contain all of the expected trailing bytes. That is, number = data[i] could crash if data is a prematurely truncated sequence.

I'm displeased with the way you reassign number with number >>= … here and in the # get the number of significant ones loop. Such mutation makes your code fragile and hard to understand.

This loop does not actually validate that the trailing bytes start with "10", as the comment claims. You do verify that the most-significant bit is 1, with if number != 1: return False. That's all. If you right-shift number (which is an integer ≤ 255) by 14 bits, then of course you will always get 0!

Elegance

Looping by keeping incrementing an index is awkward in Python: you have index = 0, while index < len(data), and several index += 1. Python does offer more expressive iteration techniques, and here I would recommend using an iterator.

The code would be more readable if the # get the number of significant ones loop were moved into a helper function.

I don't think that you need a special case for # single byte char. Just treat it as if you expect zero trailing bytes.

I would rename number to byte, because that's what it represents.

A docstring with a doctest would be appropriate for this function.

Suggested solution

class Solution(object):
    def validUtf8(self, data):
        """
        Check that a sequence of byte values follows the UTF-8 encoding
        rules.  Does not check for canonicalization (i.e. overlong encodings
        are acceptable).

        >>> s = Solution()
        >>> s.validUtf8([197, 130, 1])
        True
        >>> s.validUtf8([235, 140, 4])
        False
        """
        data = iter(data)
        for leading_byte in data:
            leading_ones = self._count_leading_ones(leading_byte)
            if leading_ones in [1, 7, 8]:
                return False        # Illegal leading byte
            for _ in range(leading_ones - 1):
                trailing_byte = next(data, None)
                if trailing_byte is None or trailing_byte >> 6 != 0b10:
                    return False    # Missing or illegal trailing byte
        return True

    @staticmethod
    def _count_leading_ones(byte):
        for i in range(8):
            if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
                return i
        return 8

Code Snippets

# check for out of bounds and exit early
if index >= len(data) or index >= (index + number_of_ones - 1):
    return False  

# every next byte has to start with "10"
for i in range(index, index + number_of_ones - 1):
    number = data[i]

    number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
    if number != 1:
        return False
    number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
    if number != 0:
        return False

    index += 1

class Solution(object):
    def validUtf8(self, data):
        """
        Check that a sequence of byte values follows the UTF-8 encoding
        rules.  Does not check for canonicalization (i.e. overlong encodings
        are acceptable).

        >>> s = Solution()
        >>> s.validUtf8([197, 130, 1])
        True
        >>> s.validUtf8([235, 140, 4])
        False
        """
        data = iter(data)
        for leading_byte in data:
            leading_ones = self._count_leading_ones(leading_byte)
            if leading_ones in [1, 7, 8]:
                return False        # Illegal leading byte
            for _ in range(leading_ones - 1):
                trailing_byte = next(data, None)
                if trailing_byte is None or trailing_byte >> 6 != 0b10:
                    return False    # Missing or illegal trailing byte
        return True

    @staticmethod
    def _count_leading_ones(byte):
        for i in range(8):
            if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
                return i
        return 8

Context

StackExchange Code Review Q#159814, answer score: 8

Revisions (0)

No revisions yet.