patterncMinor

UTF-8 encoding/decoding

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

encodingdecodingutf

Problem

I have two functions, a collection of possible error codes, and a unit-testing framework.

The parsing of a character into its unary prefix and payload is handled by a few named functions and macros nlz, nlo, lsb and msb for counting leading zeros or ones and generating masks of ones on the right or left. The nlz function could be re-written for speed or (maybe) for clarity; the current version is optimized for size to help abstract it away from the code that uses it.

I want to make these conversion functions really nice so I can use them as the basis of a language interpreter that's Unicode-capable.

It isn't factored into a library/header yet, but that's the obvious next step (and would complicate the presentation, I think).

minunit.h:

/* file: minunit.h
   cf.http://www.jera.com/techinfo/jtns/jtn002.html */
 #define mu_assert(message, test) do { if (!(test)) return message; } while (0)
 #define mu_run_test(test) do { char *message = test(); tests_run++; \
                                 if (message) return message; } while (0)
 extern int tests_run;

io.c:

```
//cf. http://www.ietf.org/rfc/rfc3629.txt p.3
#include
#include
//#include // ilog2
#include // log2

/*
>6),
*p++=msb(1)| (x & lsb(6));
else if (x >12),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else if (x >18),
*p++=msb(1)| ((x>>12) & lsb(6)),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else
if (errinfo) *errinfo |= code_point_out_of_range;
}
*p++=0;
}

return buf;
}

#include "minunit.h"
int tests_run = 0;

#define test_case(c) if(c)return #c;
static char *test_nlz(){
//int i;for(i=0;i, nlz %d, nlz~ %d\n",i,i,nlz(i),nlz(i^0xFF));
test_case(nlz(0)!=8)
test_case(nlz(1)!=7)
test_case(nlz(2)!=6)
test_case(nlz(4)!=5)
test_case(nlz(8)!=4)
test_case(nlz(1

Solution

Bug - encoding negative values

During encoding to UTF-8, your input array is defined as a signed 32 bit type. If you have negative input values (which should be illegal), your code will go into the

x  int32_t

Why use

int32_least_t? Your code operates on UCS-4 which are 4 byte values. So you should use int32_t

 to use exactly 4 byte codes instead of using "4 bytes or more".

Unused argument to utf8()

Your function

utf8() has an argument an but it is never used. What is it for? Also, the input buffers to both your conversion functions should be marked const

.

nlz()

I know you said you prefer this to be short but using floating point to compute this pains me.  If you are using gnu, you can use

__builtin_clz

 which is just as short.  If you aren't using gnu, using your own handwritten function would end up being smaller than pulling in

log2()

 from a math library.

Also, this function should take a

uint8_t

 argument because it only works for one byte values.

Reduce indentation

Often, there are error cases in your code that you check for with an

 statement.  Instead of nesting your code one level deeper per error case, you can often return from the error case instead to prevent unnecessary indentation.  For example, in your code:

if (!buf) if (errinfo) *errinfo |= buffer_alloc_fail;

if (buf) {
    // Rest of function indented.


to this:

if (!buf) {
    if (errinfo)
         *errinfo |= buffer_alloc_fail;
    return NULL;
}

// Rest of function not indented


Checking for the pointer repeatedly

In

ucs4()

, you do this a lot:

if (errinfo) *errinfo |= bad_following_character;


Rather than check if the pointer is non-NULL every time, just use a local to hold the error and set the output value at the end:

{
      enum errinfo error = 0;

      // ...
      if (nlo(*p)!=1)
          error |= bad_following_character;

      // ...
      if (errinfo)
          *errinfo = error;
      return buf;
  }


Code style

There's a lot that I don't like about your code style.  First of all, I feel like you sacrifice clarity for brevity a lot.  For example, here:

// Nested assignments?
       switch(pre=nlo(x=*p++)){


and here:

// Massively long and indecipherable line:
      test_case(memcmp((int[]){97,98,99},ucs4(utf8((int[])
              {97,98,99},3,NULL,NULL),3,NULL),3*sizeof(int)))


Also, you use nonstandard style, like here with the comma operator:

// Comma operator instead of brace?
        else if (x >6),
            *p++=msb(1)| (x & lsb(6));


and here with the inline table:

// Inline table?
       if (x < ((int[]){0,0,0x80,0x800,0x10000})[pre])


Futhermore, you use short cryptic names that might mean the right thing to you, but to others might mean something else.  For example,

lsb(n)

 to you means "a mask of n least significant bits".  To me, it means "the least significant byte of n".  Plus, it's strange that

lsb works with int sized values, but msb` only works with 8-bit values. On the other hand, you did comment these macros, so at least there was some explanation.

Summary

Other than the minor bugs pointed out above, the program seems to work fairly well. Mainly, the problems I have with the code are with its readability and clarity rather than its functionality. Your testing is minimal, and you should write tests that cover all the edge cases and error conditions.

Code Snippets

if (!buf) if (errinfo) *errinfo |= buffer_alloc_fail;

if (buf) {
    // Rest of function indented.

if (!buf) {
    if (errinfo)
         *errinfo |= buffer_alloc_fail;
    return NULL;
}

// Rest of function not indented

if (errinfo) *errinfo |= bad_following_character;

{
      enum errinfo error = 0;

      // ...
      if (nlo(*p)!=1)
          error |= bad_following_character;

      // ...
      if (errinfo)
          *errinfo = error;
      return buf;
  }

// Nested assignments?
       switch(pre=nlo(x=*p++)){

Context

StackExchange Code Review Q#98838, answer score: 6

Revisions (0)

No revisions yet.