Customised Java UTF-8

Submitted by: @import:stackexchange-codereview · Mar 10, 2026

Problem

I have implemented a customized UTF-8 encoding mechanism. The code works fine, but I have a lot of concerns regarding the code.

```
public class Utf8Encoding {

    public static void main(String[] args) {
        byte[] arr = new byte[1000];
        // Any value above 65535 (U+FFFF) is a supplementary code point,
        // which a Java String stores as a surrogate pair.
        int iStr = 95000;
        String str = new String(Character.toChars(iStr));
        encode(arr, 0, str);
        System.out.println(decode(arr, utf8Length(str)));
        System.out.println(decode(arr, utf8Length(str)).equals(str));
    }

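    public static byte[] encode(byte[] aByteArray, int offset, String str) {
        int len = str.length();
        int j = offset;

        try {
            for (int k = 0; k < len; k++) {
                int l = str.charAt(k);
                if ((l >= 1) && (l <= 127)) {
                    // U+0001..U+007F: one byte
                    aByteArray[(j++)] = (byte) l;
                } else if ((l == 0) || ((l >= 128) && (l <= 2047))) {
                    // NUL and U+0080..U+07FF: two bytes
                    aByteArray[(j++)] = (byte) (192 + (l >> 6));
                    aByteArray[(j++)] = (byte) (128 + (l & 0x3F));
                } else {
                    // every other char, including surrogate halves: three bytes
                    aByteArray[(j++)] = (byte) (224 + (l >> 12));
                    aByteArray[(j++)] = (byte) (128 + (l >> 6 & 0x3F));
                    aByteArray[(j++)] = (byte) (128 + (l & 0x3F));
                }
            }
        } catch (ArrayIndexOutOfBoundsException localArrayIndexOutOfBoundsException) {
            throw new InternalError("Cannot encode the character " + str);
        }

        return aByteArray;
    }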
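    public static String decode(byte[] aByteArray, int len) {
        int j = 0;
        int aOffset = 0;
        int i = len;
        char[] charArray = new char[len];

        // Fast path: copy leading ASCII bytes (non-negative as Java bytes).
        while ((j < i) && (aByteArray[aOffset] >= 0)) {
            charArray[(j++)] = (char) aByteArray[(aOffset++)];
        }
        while (aOffset < i) {
            int l = aByteArray[(aOffset++)];
            if (l >= 0) {
                charArray[(j++)] = (char) l;
            } else {
                int i1;
                if (l >> 5 == -2) {        // 110xxxxx: lead byte of a 2-byte sequence
                    if (aOffset < i) {
                        // ...
                    }
                } else if (l >> 4 == -2) { // 1110xxxx: lead byte of a 3-byte sequence
                    if (aOffset + 1 < i) {
                        // ...
                    }
                } else if (l >> 3 == -2) { // 11110xxx: lead byte of a 4-byte sequence
                    if (aOffset + 2 < i) {
                        // ...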
```

Solution

Handling of supplementary characters

No, your code does not handle supplementary characters correctly. A string encoded in UTF-8 must not contain high and low surrogate halves. Instead, supplementary characters (in the range U+10000 to U+10FFFF) should be encoded as 4-byte UTF-8 sequences. Your encode() function only has three cases, covering 1-, 2-, and 3-byte sequences.

As mentioned in a previous answer, Java stores its strings internally using UTF-16. Characters that don't fit into a 16-bit char are split into a high surrogate and a low surrogate char. You'll need to detect such surrogate pairs in the string and merge them into a single code point before encoding.
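
To illustrate the merging step, here is a minimal sketch of an encoder that walks the string by code point instead of by char, so a surrogate pair becomes one 4-byte sequence. The class name `Utf8Sketch`, the single-argument `encode` signature, and the choice to reject unpaired surrogates with an `IllegalArgumentException` are assumptions made for this example, not part of your code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only.
final class Utf8Sketch {

    /** Encodes str as standard UTF-8, merging surrogate pairs into code points first. */
    static byte[] encode(String str) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int k = 0; k < str.length(); ) {
            int cp = str.codePointAt(k);   // combines a valid high/low surrogate pair
            k += Character.charCount(cp);  // advance by 1 or 2 chars
            if (cp < 0x80) {
                out.write(cp);                          // 1 byte
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));            // 2 bytes
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                if (cp >= 0xD800 && cp <= 0xDFFF) {
                    // codePointAt returns a lone surrogate unchanged; it has no UTF-8 form.
                    throw new IllegalArgumentException("unpaired surrogate at index " + (k - 1));
                }
                out.write(0xE0 | (cp >> 12));           // 3 bytes
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));           // 4 bytes, U+10000..U+10FFFF
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(95000)); // the test value from the question
        byte[] bytes = encode(s);
        System.out.println(bytes.length);                                            // 4
        System.out.println(Arrays.equals(bytes, s.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

Comparing against `String.getBytes(StandardCharsets.UTF_8)` is a convenient cross-check while you develop your own encoder.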

Overlong encoding of NUL

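You encode the NUL character (U+0000) as the two bytes 11000000 10000000 (0xC0 0x80). According to the specification, that's an overlong encoding, and it's illegal: UTF-8 requires the shortest possible form, which for NUL is the single byte 0x00.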

In short, what you have implemented is Modified UTF-8.
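
You can see both differences with the JDK itself: `DataOutputStream.writeUTF` writes Modified UTF-8 (after a 2-byte length prefix), while `String.getBytes(StandardCharsets.UTF_8)` writes standard UTF-8. The sketch below compares the two; the class name `ModifiedUtf8Demo` and the helpers `modifiedUtf8` and `hex` are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
final class ModifiedUtf8Demo {

    /** Modified UTF-8 bytes of s as written by writeUTF, without the 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // drop the length prefix
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\0";
        String supplementary = new String(Character.toChars(95000));

        System.out.println(hex(modifiedUtf8(nul)));                                // C0 80
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));             // 00
        System.out.println(modifiedUtf8(supplementary).length);                    // 6
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

The 6-byte result is the surrogate pair encoded as two separate 3-byte sequences, which is what your three-case `encode()` produces as well.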

Context

StackExchange Code Review Q#42863, answer score: 16