patternpythonMinor
UTF-8 Validation
Viewed 0 times
validationutfstackoverflow
Problem
I've recently solved the "UTF-8 Validation" LeetCode problem:
A character in UTF8 can be from 1 to 4 bytes long, subjected to the
following rules:
For 1-byte character, the first bit is a 0, followed by its unicode
code.
For n-bytes character, the first n-bits are all one's, the n+1
bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
This is how the UTF-8 encoding would work:
Given an array of integers representing the data, return whether it is
a valid utf-8 encoding.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This
means each integer represents only 1 byte of data.
Example 1:
Return true. It is a valid utf-8 encoding for a 2-bytes character
followed by a 1-byte character.
Example 2:
Return false. The first 3 bits are all one's and the 4th bit is 0
means it is a 3-bytes character. The next byte is a continuation byte
which starts with 10 and that's correct. But the second continuation
byte does not start with 10, so it is invalid.
The code works and was accepted by the OJ:
```
NUMBER_OF_BITS_PER_BLOCK = 8
MAX_NUMBER_OF_ONES = 4
class Solution(object):
def validUtf8(self, data):
"""
:type data: List[int]
:rtype: bool
"""
index = 0
while index >= (NUMBER_OF_BITS_PER_BLOCK - 1)
if
A character in UTF8 can be from 1 to 4 bytes long, subjected to the
following rules:
For 1-byte character, the first bit is a 0, followed by its unicode
code.
For n-bytes character, the first n-bits are all one's, the n+1
bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
This is how the UTF-8 encoding would work:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxxGiven an array of integers representing the data, return whether it is
a valid utf-8 encoding.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This
means each integer represents only 1 byte of data.
Example 1:
data = [197, 130, 1], which represents the octet sequence: 11000101
10000010 00000001.Return true. It is a valid utf-8 encoding for a 2-bytes character
followed by a 1-byte character.
Example 2:
data = [235, 140, 4], which represented the octet sequence:11101011 10001100 00000100.Return false. The first 3 bits are all one's and the 4th bit is 0
means it is a 3-bytes character. The next byte is a continuation byte
which starts with 10 and that's correct. But the second continuation
byte does not start with 10, so it is invalid.
The code works and was accepted by the OJ:
```
NUMBER_OF_BITS_PER_BLOCK = 8
MAX_NUMBER_OF_ONES = 4
class Solution(object):
def validUtf8(self, data):
"""
:type data: List[int]
:rtype: bool
"""
index = 0
while index >= (NUMBER_OF_BITS_PER_BLOCK - 1)
if
Solution
Bugs
I'm surprised that the code was accepted by the online judge. This part of the code makes no sense to me:
What does
I'm displeased with the way you reassign
This loop does not actually validate that the trailing bytes start with
Elegance
Looping by keeping incrementing an
The code would be more readable if the
I don't think that you need a special case for
I would rename
A docstring with a doctest would be appropriate for this function.
Suggested solution
I'm surprised that the code was accepted by the online judge. This part of the code makes no sense to me:
# check for out of bounds and exit early
if index >= len(data) or index >= (index + number_of_ones - 1):
return False
# every next byte has to start with "10"
for i in range(index, index + number_of_ones - 1):
number = data[i]
number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
if number != 1:
return False
number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
if number != 0:
return False
index += 1What does
index >= (index + number_of_ones - 1) mean? It's the same as number_of_ones = len(data): return False. That verifies that the data are long enough to contain one trailing byte, but it does not ensure that it is long enough to contain all of the expected trailing bytes. That is, number = data[i] could crash if data is a prematurely truncated sequence.I'm displeased with the way you reassign
number with number >>= … here and in the # get the number of significant ones loop. Such mutation makes your code fragile and hard to understand.This loop does not actually validate that the trailing bytes start with
"10", as the comment claims. You do verify that the most-significant bit is 1, with if number != 1: return False. That's all. If you right-shift number (which is an integer ≤ 255) by 14 bits, then of course you will always get 0!Elegance
Looping by keeping incrementing an
index is awkward in Python: you have index = 0, while index < len(data), and several index += 1. Python does offer more expressive iteration techniques, and here I would recommend using an iterator.The code would be more readable if the
# get the number of significant ones loop were moved into a helper function.I don't think that you need a special case for
# single byte char. Just treat it as if you expect zero trailing bytes.I would rename
number to byte, because that's what it represents.A docstring with a doctest would be appropriate for this function.
Suggested solution
class Solution(object):
def validUtf8(self, data):
"""
Check that a sequence of byte values follows the UTF-8 encoding
rules. Does not check for canonicalization (i.e. overlong encodings
are acceptable).
>>> s = Solution()
>>> s.validUtf8([197, 130, 1])
True
>>> s.validUtf8([235, 140, 4])
False
"""
data = iter(data)
for leading_byte in data:
leading_ones = self._count_leading_ones(leading_byte)
if leading_ones in [1, 7, 8]:
return False # Illegal leading byte
for _ in range(leading_ones - 1):
trailing_byte = next(data, None)
if trailing_byte is None or trailing_byte >> 6 != 0b10:
return False # Missing or illegal trailing byte
return True
@staticmethod
def _count_leading_ones(byte):
for i in range(8):
if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
return i
return 8Code Snippets
# check for out of bounds and exit early
if index >= len(data) or index >= (index + number_of_ones - 1):
return False
# every next byte has to start with "10"
for i in range(index, index + number_of_ones - 1):
number = data[i]
number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
if number != 1:
return False
number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
if number != 0:
return False
index += 1class Solution(object):
def validUtf8(self, data):
"""
Check that a sequence of byte values follows the UTF-8 encoding
rules. Does not check for canonicalization (i.e. overlong encodings
are acceptable).
>>> s = Solution()
>>> s.validUtf8([197, 130, 1])
True
>>> s.validUtf8([235, 140, 4])
False
"""
data = iter(data)
for leading_byte in data:
leading_ones = self._count_leading_ones(leading_byte)
if leading_ones in [1, 7, 8]:
return False # Illegal leading byte
for _ in range(leading_ones - 1):
trailing_byte = next(data, None)
if trailing_byte is None or trailing_byte >> 6 != 0b10:
return False # Missing or illegal trailing byte
return True
@staticmethod
def _count_leading_ones(byte):
for i in range(8):
if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
return i
return 8Context
StackExchange Code Review Q#159814, answer score: 8
Revisions (0)
No revisions yet.