patterncMinor
UTF-8 encoding/decoding
Viewed 0 times
encodingdecodingutf
Problem
I have two functions, a collection of possible error codes, and a unit-testing framework.
The parsing of a character into its unary prefix and payload is handled by a few named functions and macros
I want to make these conversion functions really nice so I can use them as the basis of a language interpreter that's Unicode-capable.
It isn't factored into a library/header yet, but that's the obvious next step (and would complicate the presentation, I think).
minunit.h:
io.c:
```
//cf. http://www.ietf.org/rfc/rfc3629.txt p.3
#include
#include
//#include // ilog2
#include // log2
/*
>6),
*p++=msb(1)| (x & lsb(6));
else if (x >12),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else if (x >18),
*p++=msb(1)| ((x>>12) & lsb(6)),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else
if (errinfo) *errinfo |= code_point_out_of_range;
}
*p++=0;
}
return buf;
}
#include "minunit.h"
int tests_run = 0;
#define test_case(c) if(c)return #c;
static char *test_nlz(){
//int i;for(i=0;i, nlz %d, nlz~ %d\n",i,i,nlz(i),nlz(i^0xFF));
test_case(nlz(0)!=8)
test_case(nlz(1)!=7)
test_case(nlz(2)!=6)
test_case(nlz(4)!=5)
test_case(nlz(8)!=4)
test_case(nlz(1
The parsing of a character into its unary prefix and payload is handled by a few named functions and macros
nlz, nlo, lsb and msb for counting leading zeros or ones and generating masks of ones on the right or left. The nlz function could be re-written for speed or (maybe) for clarity; the current version is optimized for size to help abstract it away from the code that uses it.I want to make these conversion functions really nice so I can use them as the basis of a language interpreter that's Unicode-capable.
It isn't factored into a library/header yet, but that's the obvious next step (and would complicate the presentation, I think).
minunit.h:
/* file: minunit.h
cf.http://www.jera.com/techinfo/jtns/jtn002.html */
#define mu_assert(message, test) do { if (!(test)) return message; } while (0)
#define mu_run_test(test) do { char *message = test(); tests_run++; \
if (message) return message; } while (0)
extern int tests_run;io.c:
```
//cf. http://www.ietf.org/rfc/rfc3629.txt p.3
#include
#include
//#include // ilog2
#include // log2
/*
>6),
*p++=msb(1)| (x & lsb(6));
else if (x >12),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else if (x >18),
*p++=msb(1)| ((x>>12) & lsb(6)),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else
if (errinfo) *errinfo |= code_point_out_of_range;
}
*p++=0;
}
return buf;
}
#include "minunit.h"
int tests_run = 0;
#define test_case(c) if(c)return #c;
static char *test_nlz(){
//int i;for(i=0;i, nlz %d, nlz~ %d\n",i,i,nlz(i),nlz(i^0xFF));
test_case(nlz(0)!=8)
test_case(nlz(1)!=7)
test_case(nlz(2)!=6)
test_case(nlz(4)!=5)
test_case(nlz(8)!=4)
test_case(nlz(1
Solution
Bug - encoding negative values
During encoding to UTF-8, your input array is defined as a signed 32 bit type. If you have negative input values (which should be illegal), your code will go into the
Summary
Other than the minor bugs pointed out above, the program seems to work fairly well. Mainly, the problems I have with the code are with its readability and clarity rather than its functionality. Your testing is minimal, and you should write tests that cover all the edge cases and error conditions.
During encoding to UTF-8, your input array is defined as a signed 32 bit type. If you have negative input values (which should be illegal), your code will go into the
x int32_t
Why use int32_least_t? Your code operates on UCS-4 which are 4 byte values. So you should use int32_t to use exactly 4 byte codes instead of using "4 bytes or more".
Unused argument to utf8()
Your function utf8() has an argument an but it is never used. What is it for? Also, the input buffers to both your conversion functions should be marked const.
nlz()
I know you said you prefer this to be short but using floating point to compute this pains me. If you are using gnu, you can use __builtin_clz which is just as short. If you aren't using gnu, using your own handwritten function would end up being smaller than pulling in log2() from a math library.
Also, this function should take a uint8_t argument because it only works for one byte values.
Reduce indentation
Often, there are error cases in your code that you check for with an if statement. Instead of nesting your code one level deeper per error case, you can often return from the error case instead to prevent unnecessary indentation. For example, in your code:
if (!buf) if (errinfo) *errinfo |= buffer_alloc_fail;
if (buf) {
// Rest of function indented.
to this:
if (!buf) {
if (errinfo)
*errinfo |= buffer_alloc_fail;
return NULL;
}
// Rest of function not indented
Checking for the pointer repeatedly
In ucs4(), you do this a lot:
if (errinfo) *errinfo |= bad_following_character;
Rather than check if the pointer is non-NULL every time, just use a local to hold the error and set the output value at the end:
{
enum errinfo error = 0;
// ...
if (nlo(*p)!=1)
error |= bad_following_character;
// ...
if (errinfo)
*errinfo = error;
return buf;
}
Code style
There's a lot that I don't like about your code style. First of all, I feel like you sacrifice clarity for brevity a lot. For example, here:
// Nested assignments?
switch(pre=nlo(x=*p++)){
and here:
// Massively long and indecipherable line:
test_case(memcmp((int[]){97,98,99},ucs4(utf8((int[])
{97,98,99},3,NULL,NULL),3,NULL),3*sizeof(int)))
Also, you use nonstandard style, like here with the comma operator:
// Comma operator instead of brace?
else if (x >6),
*p++=msb(1)| (x & lsb(6));
and here with the inline table:
// Inline table?
if (x < ((int[]){0,0,0x80,0x800,0x10000})[pre])
Futhermore, you use short cryptic names that might mean the right thing to you, but to others might mean something else. For example, lsb(n) to you means "a mask of n least significant bits". To me, it means "the least significant byte of n". Plus, it's strange that lsb works with int sized values, but msb` only works with 8-bit values. On the other hand, you did comment these macros, so at least there was some explanation.Summary
Other than the minor bugs pointed out above, the program seems to work fairly well. Mainly, the problems I have with the code are with its readability and clarity rather than its functionality. Your testing is minimal, and you should write tests that cover all the edge cases and error conditions.
Code Snippets
if (!buf) if (errinfo) *errinfo |= buffer_alloc_fail;
if (buf) {
// Rest of function indented.if (!buf) {
if (errinfo)
*errinfo |= buffer_alloc_fail;
return NULL;
}
// Rest of function not indentedif (errinfo) *errinfo |= bad_following_character;{
enum errinfo error = 0;
// ...
if (nlo(*p)!=1)
error |= bad_following_character;
// ...
if (errinfo)
*errinfo = error;
return buf;
}// Nested assignments?
switch(pre=nlo(x=*p++)){Context
StackExchange Code Review Q#98838, answer score: 6
Revisions (0)
No revisions yet.