HiveBrain v1.2.0
pattern · csharp · Minor

Encode unicode codepoints to UTF-8 manually

Submitted by: @import:stackexchange-codereview
utf · unicode · manually · encode · codepoints

Problem

I want to encode Unicode codepoints to UTF-8 manually, so I wrote the following C# code. I tested it with some cases I know, but I would like to know whether it is correct for all inputs. I know that Unicode codepoints are undefined beyond 0x10FFFF, but I don't care about that; therefore the output of my method may be more than 4 bytes.

private byte[] CodePointToUtf8(int codepoint)
{
    // Work on an unsigned copy so the double shifts cannot sign-extend.
    uint cp = (uint)codepoint;
    if (cp < 0x80) {
        // 0xxxxxxx: one byte (ASCII)
        return new byte[] {
            (byte)cp
        };
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        return new byte[] {
            (byte)(0xC0 | (cp << 21 >> 27)),
            (byte)(0x80 | (cp << 26 >> 26))
        };
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte)(0xE0 | (cp << 16 >> 28)),
            (byte)(0x80 | (cp << 20 >> 26)),
            (byte)(0x80 | (cp << 26 >> 26))
        };
    } else if (cp < 0x200000) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte)(0xF0 | (cp << 11 >> 29)),
            (byte)(0x80 | (cp << 14 >> 26)),
            (byte)(0x80 | (cp << 20 >> 26)),
            (byte)(0x80 | (cp << 26 >> 26))
        };
    } else if (cp < 0x4000000) {
        // 111110xx 10xxxxxx ... (obsolete five-byte form)
        return new byte[] {
            (byte)(0xF8 | (cp << 6 >> 30)),
            (byte)(0x80 | (cp << 8 >> 26)),
            (byte)(0x80 | (cp << 14 >> 26)),
            (byte)(0x80 | (cp << 20 >> 26)),
            (byte)(0x80 | (cp << 26 >> 26))
        };
    } else {
        // 1111110x 10xxxxxx ... (obsolete six-byte form)
        return new byte[] {
            (byte)(0xFC | (cp << 1 >> 31)),
            (byte)(0x80 | (cp << 2 >> 26)),
            (byte)(0x80 | (cp << 8 >> 26)),
            (byte)(0x80 | (cp << 14 >> 26)),
            (byte)(0x80 | (cp << 20 >> 26)),
            (byte)(0x80 | (cp << 26 >> 26))
        };
    }
}
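
As a quick plausibility check (a hypothetical driver, not part of the original question), the output can be compared against well-known encodings such as U+20AC, the euro sign, which must come out as E2 82 AC:

// Hypothetical check; assumes it runs inside the same class as CodePointToUtf8.
Console.WriteLine(string.Join(" ",
    Array.ConvertAll(CodePointToUtf8(0x20AC), b => b.ToString("X2"))));
// prints: E2 82 AC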


Bonus question: Is there a built-in way to do that?

Solution

Yes, the code is correct for all valid code points. I was at first confused by the double shifts, since I had never seen them before, but they do their job well. Other authors typically use a single >> followed by a bit mask, e.g. (codepoint >> 12) & 0x3F, to drop the 12 bits to the right and take the next 6 bits. That way the numbers can be verified more easily, since they are smaller. Plus, all the 10xxxxxx continuation bytes get the same bit mask.
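
For example (my sketch of that conventional style, not code from the review), the three-byte branch would read:

// Single shift plus mask; every continuation byte uses the same 0x3F mask.
return new byte[] {
    (byte)(0xE0 | ((codepoint >> 12) & 0x0F)),  // top 4 of 16 bits
    (byte)(0x80 | ((codepoint >> 6) & 0x3F)),   // bits 11..6
    (byte)(0x80 | (codepoint & 0x3F))           // bits 5..0
};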

Your code omits some validity checks:

  • codepoint could be negative

  • codepoint could be between 0xD800 and 0xDFFF, the range reserved for UTF-16 surrogates, which are not valid code points (see the sketch below for both checks)

Other than these, it is perfect.
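
One possible shape for those guard clauses (my wording, a sketch only):

// Hypothetical validity checks for the start of CodePointToUtf8:
if (codepoint < 0)
    throw new ArgumentOutOfRangeException(nameof(codepoint), "code point must not be negative");
if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
    throw new ArgumentOutOfRangeException(nameof(codepoint), "UTF-16 surrogates are not valid code points");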

I know for sure that this conversion is built into C#; I just don't know where. Try loading a file into a string using the UTF-8 encoding; during that loading, the built-in conversion code gets called.
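
For what it's worth, two built-in routes that exist in the .NET base class library (a sketch, assuming the input is a valid scalar value):

using System;
using System.Text;

// Via an intermediate string:
byte[] utf8 = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(0x20AC));  // E2 82 AC

// Via System.Text.Rune (.NET Core 3.0 and later), without the intermediate string:
Span<byte> buffer = stackalloc byte[4];
int written = new Rune(0x20AC).EncodeToUtf8(buffer);  // written == 3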

Context

StackExchange Code Review Q#149549, answer score: 3

Revisions (0)

No revisions yet.