Encode Unicode code points to UTF-8 manually
Problem
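I want to encode Unicode code points to UTF-8 manually. I wrote the following C# code. I tested it with some cases I know, but I would like to know if it's correct for all inputs. I know that Unicode code points are undefined beyond 0x10FFFF, but I don't care about that. Therefore the output of my method might be more than 4 bytes.

    private byte[] CodePointToUtf8(int codepoint)
    {
        // Shift an unsigned copy so that >> fills with zeros, not the sign bit.
        uint bits = (uint)codepoint;

        if (codepoint < 0x80) {
            // Up to 7 bits: a single byte.
            return new byte[] { (byte)codepoint };
        } else if (codepoint < 0x800) {
            // Up to 11 bits: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte)(0xC0 | (bits << 21 >> 27)),
                (byte)(0x80 | (bits << 26 >> 26))
            };
        } else if (codepoint < 0x10000) {
            // Up to 16 bits: 1110xxxx + 2 continuation bytes
            return new byte[] {
                (byte)(0xE0 | (bits << 16 >> 28)),
                (byte)(0x80 | (bits << 20 >> 26)),
                (byte)(0x80 | (bits << 26 >> 26))
            };
        } else if (codepoint < 0x200000) {
            // Up to 21 bits: 11110xxx + 3 continuation bytes
            return new byte[] {
                (byte)(0xF0 | (bits << 11 >> 29)),
                (byte)(0x80 | (bits << 14 >> 26)),
                (byte)(0x80 | (bits << 20 >> 26)),
                (byte)(0x80 | (bits << 26 >> 26))
            };
        } else if (codepoint < 0x4000000) {
            // Up to 26 bits: 111110xx + 4 continuation bytes
            return new byte[] {
                (byte)(0xF8 | (bits << 6 >> 30)),
                (byte)(0x80 | (bits << 8 >> 26)),
                (byte)(0x80 | (bits << 14 >> 26)),
                (byte)(0x80 | (bits << 20 >> 26)),
                (byte)(0x80 | (bits << 26 >> 26))
            };
        } else {
            // Up to 31 bits: 1111110x + 5 continuation bytes
            return new byte[] {
                (byte)(0xFC | (bits << 1 >> 31)),
                (byte)(0x80 | (bits << 2 >> 26)),
                (byte)(0x80 | (bits << 8 >> 26)),
                (byte)(0x80 | (bits << 14 >> 26)),
                (byte)(0x80 | (bits << 20 >> 26)),
                (byte)(0x80 | (bits << 26 >> 26))
            };
        }
    }

Bonus question: Is there a built-in way to do that?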
Solution
Yes, the code is correct for all valid code points. I was confused at first by the double shifts, since I had never seen them before, but they do their job well. Other authors typically do a single >> followed by a bit mask, e.g. (codepoint >> 12) & 0x3F to skip the 12 bits to the right and take the next 6 bits. That way, the numbers can be verified more easily, since they are smaller. Plus, all the 10xxxxxx bytes have the same bitmask.

Your code omits some validity checks:
- codepoint could be negative
- codepoint could be between 0xD800 and 0xDFFF (the UTF-16 surrogate range, which has no valid UTF-8 encoding)
Other than these, it is perfect.
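To make the shift-and-mask style and the missing checks concrete, here is a minimal sketch. It is not the asker's method: the names Utf8Sketch, Encode and Continuation are made up for illustration, and I assume that rejecting everything above 0x10FFFF is acceptable, which the question explicitly does not require.

```csharp
using System;

public static class Utf8Sketch
{
    // Hypothetical helper: encodes one Unicode scalar value as 1 to 4 bytes.
    public static byte[] Encode(int codepoint)
    {
        // Assumption: values above 0x10FFFF are rejected instead of encoded as 5 or 6 bytes.
        if (codepoint < 0 || codepoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codepoint), "Outside 0..0x10FFFF.");
        if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
            throw new ArgumentOutOfRangeException(nameof(codepoint), "Surrogates cannot be encoded.");

        if (codepoint < 0x80)
            return new[] { (byte)codepoint };
        if (codepoint < 0x800)
            return new[]
            {
                (byte)(0xC0 | (codepoint >> 6)),
                Continuation(codepoint, 0)
            };
        if (codepoint < 0x10000)
            return new[]
            {
                (byte)(0xE0 | (codepoint >> 12)),
                Continuation(codepoint, 6),
                Continuation(codepoint, 0)
            };
        return new[]
        {
            (byte)(0xF0 | (codepoint >> 18)),
            Continuation(codepoint, 12),
            Continuation(codepoint, 6),
            Continuation(codepoint, 0)
        };
    }

    // Every continuation byte is 10xxxxxx: skip `shift` bits, keep the next six.
    private static byte Continuation(int codepoint, int shift)
    {
        return (byte)(0x80 | ((codepoint >> shift) & 0x3F));
    }
}
```

For example, Utf8Sketch.Encode(0x20AC) yields the bytes E2 82 AC. The lead bytes need no mask here because the range checks already guarantee that the shifted value fits into the free bits.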
I know for sure that this conversion is built into C#, I just don't know where. Try loading a file into a string using the UTF-8 encoding. During that loading, the built-in conversion code gets called.
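For what it's worth, one built-in route that should work is to go through a string: char.ConvertFromUtf32 turns the code point into a one- or two-char string, and Encoding.UTF8.GetBytes encodes that string. The sketch below only wraps those two framework calls. The class and method names are hypothetical, and unlike the code in the question it never produces more than 4 bytes.

```csharp
using System;
using System.Text;

public static class BuiltInUtf8
{
    // Hypothetical wrapper name; the two framework calls inside are the point.
    public static byte[] Encode(int codepoint)
    {
        // Throws ArgumentOutOfRangeException for surrogates and for values
        // outside 0..0x10FFFF, so the validity checks come for free.
        string text = char.ConvertFromUtf32(codepoint);
        return Encoding.UTF8.GetBytes(text);
    }
}
```

This also gives a cheap way to test a hand-written encoder: compare its output with BuiltInUtf8.Encode for every valid code point.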
Context
StackExchange Code Review Q#149549, answer score: 3