Customised Java UTF-16

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I have implemented a custom encoding mechanism for Java's UTF-16 strings. Does this implementation support all characters?

public class Encoding {
    public static void main(String[] args) {
        byte[] arr = new byte[1000];

        String str = "abcde"; // this encoding also works for supplementary characters
        Encode(arr, 0, str);
        System.out.println(Decode(arr, str.length()));
    }

    // Writes each char of str as two bytes, high byte first (big-endian).
    public static byte[] Encode(byte[] ByteArray, int offset, String str) {
        char[] ch = str.toCharArray();
        for (char c : ch) {
            ByteArray[offset++] = (byte) (c >>> 8);
            ByteArray[offset++] = (byte) (c & 0xff);
        }
        return ByteArray;
    }

    // Reads len chars (2 * len bytes) starting at index 0 of ByteArray.
    public static String Decode(byte[] ByteArray, int len) {
        // One char per byte pair, so the result holds exactly len chars.
        char[] res = new char[len];
        int i = 0;
        int offset = 0;
        while (i < len) {
            res[i] = (char) ((ByteArray[offset++] << 8) | (ByteArray[offset++] & 0xff));
            i++;
        }

        return new String(res);
    }
}

Solution

Question of Completeness

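Yes, your code covers all Unicode characters, including the supplementary characters U+10000 to U+10FFFF, because you "inherit" that functionality from the way such characters are stored in Java's String class. Your code copies each 16-bit char unchanged, so a surrogate pair survives the round trip as two ordinary char values. The documentation of the Character class explains the representation: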

Unicode Character Representations

The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which
defined characters as fixed-width 16-bit entities. The Unicode
Standard has since been changed to allow for characters whose
representation requires more than 16 bits. The range of legal code
points is now U+0000 to U+10FFFF, known as Unicode scalar value.
(Refer to the definition of the U+n notation in the Unicode Standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to
as the Basic Multilingual Plane (BMP). Characters whose code points
are greater than U+FFFF are called supplementary characters. The Java
platform uses the UTF-16 representation in char arrays and in the
String and StringBuffer classes. In this representation, supplementary
characters are represented as a pair of char values, the first from
the high-surrogates range, (\uD800-\uDBFF), the second from the
low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP)
code points, including the surrogate code points, or code units of the
UTF-16 encoding. An int value represents all Unicode code points,
including supplementary code points. […]
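To make the quoted description concrete, here is a small sketch of how one supplementary code point turns into two char values. The class name SurrogatePairDemo, the helper codeUnits and the choice of U+1F600 are illustrative assumptions, not part of the original question or answer.

```java
class SurrogatePairDemo {
    // Returns the UTF-16 code units for one code point:
    // one char for a BMP code point, two chars (a surrogate pair) otherwise.
    static char[] codeUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a supplementary code point, chosen for illustration
        String s = new String(codeUnits(codePoint));

        System.out.println(s.length());                      // counts char values: 2
        System.out.println(s.codePointCount(0, s.length())); // counts code points: 1
        // Prints the high and low surrogate: D83D DE00
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));
    }
}
```

Because str.length() counts char values rather than code points, passing it to Decode as the length still covers both halves of every surrogate pair.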

Reinventing the Wheel

Since you did not tag your question as reinventing-the-wheel, I'm obligated to mention that you could accomplish the task more simply using the built-in support for charsets.

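import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Big-endian UTF-16 without a byte-order mark, matching the hand-written byte layout.
private static final Charset UTF_16 = StandardCharsets.UTF_16BE;

public static byte[] Encode(byte[] ByteArray, int offset, String str) {
    byte[] bytes = str.getBytes(UTF_16);
    System.arraycopy(bytes, 0, ByteArray, offset, bytes.length);
    return ByteArray;
}

// len is the number of chars to decode, so 2 * len bytes are read from index 0.
public static String Decode(byte[] ByteArray, int len) {
    return new String(ByteArray, 0, 2 * len, UTF_16);
}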
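One way to check that the built-in charset really is a drop-in replacement is to compare its output with the hand-written byte layout. The sketch below is only illustrative: the class RoundTripCheck and its methods manualEncode, sameAsBuiltIn and roundTrip are hypothetical names, and manualEncode is a compact restatement of the question's loop rather than the original method.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    // Two bytes per char, high byte first, as in the question's Encode loop.
    static byte[] manualEncode(String str) {
        byte[] out = new byte[2 * str.length()];
        int offset = 0;
        for (char c : str.toCharArray()) {
            out[offset++] = (byte) (c >>> 8);
            out[offset++] = (byte) (c & 0xff);
        }
        return out;
    }

    // True if the hand-written layout matches what UTF-16BE produces.
    static boolean sameAsBuiltIn(String str) {
        return Arrays.equals(manualEncode(str), str.getBytes(StandardCharsets.UTF_16BE));
    }

    // Encodes and decodes with the built-in charset only.
    static String roundTrip(String str) {
        byte[] bytes = str.getBytes(StandardCharsets.UTF_16BE);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // ASCII, a Latin-1 letter, and one supplementary character (U+1F600).
        String sample = "abc\u00e9\uD83D\uDE00";
        System.out.println(sameAsBuiltIn(sample));
        System.out.println(roundTrip(sample).equals(sample));
    }
}
```

The comparison holds for well-formed strings. A string containing an unpaired surrogate is a case where the two can differ, because String.getBytes substitutes the charset's replacement bytes for malformed input, whereas the hand-written loop copies the char as-is.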

Context

StackExchange Code Review Q#42649, answer score: 17
