Customised Java UTF-16
Problem
I have implemented a custom encoding mechanism for Java's UTF-16 strings. Does this implementation support all characters?
public class Encoding {

    public static void main(String[] args) {
        byte[] arr = new byte[1000];
        String str = "abcde"; // this encoding also works for supplementary characters
        Encode(arr, 0, str);
        System.out.println(Decode(arr, str.length()));
    }

    // Writes each char (UTF-16 code unit) as two bytes, high byte first.
    public static byte[] Encode(byte[] ByteArray, int offset, String str) {
        char[] ch = str.toCharArray();
        for (char c : ch) {
            ByteArray[offset++] = (byte) (c >>> 8);
            ByteArray[offset++] = (byte) (c & 0xff);
        }
        return ByteArray;
    }

    // Reads len chars back, starting at the beginning of the array.
    public static String Decode(byte[] ByteArray, int len) {
        // One char per two bytes; an array of len * 2 would pad the result with '\0' chars.
        char[] res = new char[len];
        int i = 0;
        int offset = 0;
        while (i < len) {
            res[i] = (char) ((ByteArray[offset++] << 8) | (ByteArray[offset++] & 0xff));
            i++;
        }
        return new String(res);
    }
}
Solution
Question of Completeness
Yes, your code covers all Unicode characters, including the supplementary characters U+10000 to U+10FFFF, because you "inherit" that functionality from the way such characters would be stored in Java's String class:
Unicode Character Representations
The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which
defined characters as fixed-width 16-bit entities. The Unicode
Standard has since been changed to allow for characters whose
representation requires more than 16 bits. The range of legal code
points is now U+0000 to U+10FFFF, known as Unicode scalar value.
(Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to
as the Basic Multilingual Plane (BMP). Characters whose code points
are greater than U+FFFF are called supplementary characters. The Java
platform uses the UTF-16 representation in char arrays and in the
String and StringBuffer classes. In this representation, supplementary
characters are represented as a pair of char values, the first from
the high-surrogates range, (\uD800-\uDBFF), the second from the
low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP)
code points, including the surrogate code points, or code units of the
UTF-16 encoding. An int value represents all Unicode code points,
including supplementary code points. […]
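To make the surrogate-pair point concrete, here is a minimal sketch. The class name SurrogatePairDemo and the helpers encode and decode are made up for illustration; they mirror the question's two-bytes-per-char approach but size their arrays from the input instead of using a fixed buffer. It suggests that a supplementary character is simply two char values, so a char-by-char encoder handles it without any special casing.

```java
class SurrogatePairDemo {

    // Hypothetical helper: two big-endian bytes per char (UTF-16 code unit),
    // following the same idea as the question's Encode method.
    static byte[] encode(String s) {
        byte[] out = new byte[s.length() * 2];
        int offset = 0;
        for (char c : s.toCharArray()) {
            out[offset++] = (byte) (c >>> 8);
            out[offset++] = (byte) (c & 0xff);
        }
        return out;
    }

    // Hypothetical helper: rebuilds one char from every two bytes.
    static String decode(byte[] bytes) {
        char[] chars = new char[bytes.length / 2];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = (char) ((bytes[2 * i] << 8) | (bytes[2 * i + 1] & 0xff));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so Java stores it as a surrogate pair.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());                              // 2 (char values)
        System.out.println(s.codePointCount(0, s.length()));         // 1 (code point)
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true

        // The round trip preserves both halves of the pair.
        System.out.println(decode(encode(s)).equals(s));             // true
    }
}
```

Note that the length passed to the question's Decode is a count of char values (what String.length() returns), not a count of code points, which is why str.length() is the right argument there.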
Reinventing the Wheel
Since you did not tag your question as reinventing-the-wheel, I'm obligated to mention that you could accomplish the task more simply using the built-in support for charsets.
// Needs: import java.nio.charset.Charset;
private static final Charset UTF_16 = Charset.forName("UTF-16BE");

public static byte[] Encode(byte[] ByteArray, int offset, String str) {
    byte[] bytes = str.getBytes(UTF_16);
    System.arraycopy(bytes, 0, ByteArray, offset, bytes.length);
    return ByteArray;
}

public static String Decode(byte[] ByteArray, int len) {
    // len counts chars, so the byte count is twice that.
    return new String(ByteArray, 0, 2 * len, UTF_16);
}
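As a quick sanity check on the charset-based version above, the following sketch compares the hand-rolled byte layout with what the built-in UTF-16BE charset produces. The class name CharsetRoundTrip and the helper manualEncode are assumptions made for this example, and it uses StandardCharsets.UTF_16BE (available since Java 7) in place of Charset.forName("UTF-16BE"); the two should refer to the same charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {

    // Hypothetical helper: the question's approach, two big-endian bytes per char.
    static byte[] manualEncode(String s) {
        byte[] out = new byte[s.length() * 2];
        int offset = 0;
        for (char c : s.toCharArray()) {
            out[offset++] = (byte) (c >>> 8);
            out[offset++] = (byte) (c & 0xff);
        }
        return out;
    }

    public static void main(String[] args) {
        // "abc" followed by U+10400, a supplementary character.
        String s = "abc" + new String(Character.toChars(0x10400));

        byte[] builtIn = s.getBytes(StandardCharsets.UTF_16BE);
        byte[] manual = manualEncode(s);

        // For well-formed strings both approaches yield the same bytes.
        System.out.println(Arrays.equals(builtIn, manual));  // true

        // Decoding with the built-in charset restores the original string.
        System.out.println(new String(builtIn, StandardCharsets.UTF_16BE).equals(s));  // true
    }
}
```

One difference worth keeping in mind: String.getBytes(Charset) is documented to replace malformed input with the charset's replacement bytes, so a string containing an unpaired surrogate would likely not survive the built-in round trip unchanged, whereas the char-by-char version copies it through as-is.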
Context
StackExchange Code Review Q#42649, answer score: 17