patternjavaModerate
Customised Java UTF-8
Problem
I have implemented a customized UTF-8 encoding mechanism. The code works fine, but I have a lot of concerns regarding the code.
```
public class Utf8Encoding {
    public static void main(String[] args) {
        byte[] arr = new byte[1000];
        // Or any value > 65535: up to 65535 a code point fits in one char; above that it becomes a surrogate pair.
        int iStr = 95000;
        String str = new String(Character.toChars(iStr));
        encode(arr, 0, str);
        System.out.println(decode(arr, utf8Length(str)));
        System.out.println(decode(arr, utf8Length(str)).equals(str));
    }

    public static byte[] encode(byte[] aByteArray, int offset, String str) {
        int len = str.length();
        int j = offset;
        try {
            for (int k = 0; k < len; k++) {
                int l = str.charAt(k);
                if ((l >= 1) && (l <= 127)) {
                    // One byte: U+0001..U+007F.
                    aByteArray[(j++)] = (byte) l;
                } else if ((l == 0) || ((l >= 128) && (l <= 2047))) {
                    // Two bytes: U+0080..U+07FF, plus NUL.
                    aByteArray[(j++)] = (byte) (192 + (l >> 6));
                    aByteArray[(j++)] = (byte) (128 + (l & 0x3F));
                } else {
                    // Three bytes: every other char, including each surrogate half.
                    aByteArray[(j++)] = (byte) (224 + (l >> 12));
                    aByteArray[(j++)] = (byte) (128 + (l >> 6 & 0x3F));
                    aByteArray[(j++)] = (byte) (128 + (l & 0x3F));
                }
            }
        } catch (ArrayIndexOutOfBoundsException localArrayIndexOutOfBoundsException) {
            throw new InternalError(
                    "Cannot encode the character " + str);
        }
        return aByteArray;
    }
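    public static String decode(byte[] aByteArray, int len) {
        int j = 0;
        int aOffset = 0;
        int i = len;
        char[] charArray = new char[len];
        // Fast path: copy leading single-byte (ASCII) values straight across.
        while ((j < i) && (aByteArray[aOffset] >= 0)) {
            charArray[(j++)] = (char) aByteArray[(aOffset++)];
        }
        while (aOffset < i) {
            int l = aByteArray[(aOffset++)];
            if (l >= 0) {
                charArray[(j++)] = (char) l;
            } else {
                int i1;
                if (l >> 5 == -2) {
                    // Lead byte 110xxxxx: two-byte sequence.
                    if (aOffset < i) {
                        // ...
                    }
                } else if (l >> 4 == -2) {
                    // Lead byte 1110xxxx: three-byte sequence.
                    if (aOffset + 1 < i) {
                        // ...
                    }
                } else if (l >> 3 == -2) {
                    // Lead byte 11110xxx: four-byte sequence.
                    if (aOffset + 2 < i) {
                        // ...
                        // ... (the listing is cut off here; utf8Length() is not shown)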
```
Solution
Handling of supplementary characters
No, your code does not handle supplementary characters correctly. A string encoded in UTF-8 must not contain high and low surrogate halves. Instead, supplementary characters (in the range U+10000 to U+10FFFF) should be encoded as 4-byte UTF-8 sequences. Your `encode()` function only has three cases, covering 1-, 2-, and 3-byte sequences.
As mentioned in a previous answer, Java stores its strings internally using UTF-16. Characters that don't fit into a 16-bit `char` are split into a high surrogate and a low surrogate `char`. You'll need to detect such surrogate pairs in the string and merge them before encoding.
Overlong encoding of NUL
You encode the NUL byte as `11000000 10000000`. According to the specification, that's an overlong encoding, and it's illegal.
In short, what you have implemented is Modified UTF-8.
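The two fixes described in this answer (merging surrogate pairs into 4-byte sequences, and writing NUL as a single zero byte) could look roughly like the sketch below. The class name `StandardUtf8Sketch`, the decision to return a fresh array instead of filling a caller-supplied one, and the replacement of unpaired surrogates with `?` are assumptions made for illustration; none of them come from the original code.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: a minimal standard UTF-8 encoder that walks code points, not chars.
final class StandardUtf8Sketch {

    static byte[] encode(String str) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(str.length());
        for (int k = 0; k < str.length(); ) {
            // codePointAt() merges a valid high/low surrogate pair into one code point.
            int cp = str.codePointAt(k);
            k += Character.charCount(cp);

            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                // Assumption: an unpaired surrogate has no UTF-8 form, so substitute '?'.
                cp = '?';
            }

            if (cp <= 0x7F) {
                // One byte: U+0000..U+007F. NUL is a plain 0x00, not C0 80.
                out.write(cp);
            } else if (cp <= 0x7FF) {
                // Two bytes: U+0080..U+07FF.
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp <= 0xFFFF) {
                // Three bytes: U+0800..U+FFFF.
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                // Four bytes: supplementary characters U+10000..U+10FFFF.
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

For the code point 95000 (U+17318) used in the question's `main()`, this yields the four bytes `F0 97 8C 98`, which matches `str.getBytes(StandardCharsets.UTF_8)`.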
Context
StackExchange Code Review Q#42863, answer score: 16