Function to convert ISO-8859-1 to UTF-8

Submitted by: @import:stackexchange-codereview

Problem

I wrote this function last year to convert ISO-8859-1 text to UTF-8 and just found it again. It takes a text buffer and the buffer's size, then converts the text to UTF-8 in place if there's enough space.

What should be changed to improve quality?

int iso88951_to_utf8(unsigned char *content, size_t max_size)
{
    unsigned char *copy;
    size_t conversion_count; //number of chars to convert / bytes to add

    copy = content;
    conversion_count = 0;

    //first run to see if there's enough space for the new bytes
    while(*content)
    {
        if(*content >= 0x80)
        {
            ++conversion_count;
        }
        ++content;
    }
    if(content - copy + conversion_count >= max_size)
    {
        return ERROR;
    }

    while(content >= copy && conversion_count)
    {
        //repositioning current characters to make room for new bytes
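        if(*content < 0x80)
        {
            *(content + conversion_count) = *content;
        }
        else
        {
            *(content + conversion_count) = 0x80 | (*content & 0x3f); //last byte
            *(content + --conversion_count) = 0xc0 | *content >> 6;    //first byte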
        }
        --content;
    }
    return SUCCESS;

}

Solution

The character set is named ISO-8859-1, not ISO-8895-1. Rename your function accordingly.

Change the return value to be more informative:

  • Return 0 on success.
  • If max_size is too small, return the minimum value of max_size that would be sufficient to accommodate the output (including the trailing \0).
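
A caller could use this return convention to size its buffer on demand. The following is only a sketch under assumptions: latin1_to_utf8_inplace and latin1_to_utf8_dup are hypothetical names, and the converter is an index-based variant written for this example, not the code from the question or the answer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-place converter following the convention above:
 * returns 0 on success, otherwise the buffer size (including the
 * trailing NUL) that would be needed. The buffer is left untouched
 * when it is too small. */
size_t latin1_to_utf8_inplace(unsigned char *buf, size_t max_size)
{
    size_t len = 0, extra = 0;

    assert(buf != NULL);

    /* First pass: count the bytes and how many of them need a second byte. */
    for (; buf[len] != '\0'; len++) {
        if (buf[len] >= 0x80) {
            extra++;
        }
    }

    size_t needed = len + extra + 1;
    if (needed > max_size) {
        return needed;              /* tell the caller how much to allocate */
    }

    /* Second pass: expand from the end so no unread byte is overwritten. */
    size_t src = len, dst = len + extra;
    buf[dst] = '\0';
    while (src > 0) {
        unsigned char c = buf[--src];
        if (c >= 0x80) {
            buf[--dst] = (unsigned char)(0x80 | (c & 0x3f));  /* trailing byte */
            buf[--dst] = (unsigned char)(0xc0 | (c >> 6));    /* leading byte */
        } else {
            buf[--dst] = c;
        }
    }
    return 0;
}

/* Hypothetical caller: copies a Latin-1 string into a heap buffer and
 * grows it to the reported size if the first attempt does not fit.
 * Returns NULL on allocation failure; the caller frees the result. */
char *latin1_to_utf8_dup(const char *latin1)
{
    size_t size = strlen(latin1) + 1;
    unsigned char *buf = malloc(size);
    if (buf == NULL) {
        return NULL;
    }
    memcpy(buf, latin1, size);

    size_t needed = latin1_to_utf8_inplace(buf, size);
    if (needed != 0) {
        unsigned char *bigger = realloc(buf, needed);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        needed = latin1_to_utf8_inplace(buf, needed);
        assert(needed == 0);        /* the reported size must be sufficient */
    }
    return (char *)buf;
}
```

With this shape the caller never has to guess a worst-case size: a pure-ASCII string succeeds on the first call, and anything else costs one realloc of exactly the reported size.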

I would also change the parameter to take a plain char *, which is more natural for C strings.

I think that the implementation could look tidier if you dealt with pointers instead of offsets.

It would be nice if you NUL-terminated the result, so that the caller does not have to zero out the entire buffer before calling this function.

size_t iso8859_1_to_utf8(char *content, size_t max_size)
{
    char *src, *dst;

    //first run to see if there's enough space for the new bytes
    for (src = dst = content; *src; src++, dst++)
    {
        if (*src & 0x80)
        {
            // If the high bit is set in the ISO-8859-1 representation, then
            // the UTF-8 representation requires two bytes (one more than usual).
            ++dst;
        }
    }

    if (dst - content + 1 > max_size)
    {
        // Inform caller of the space required
        return dst - content + 1;
    }

    *dst = '\0';  // dst already points at the terminator's final position
    while (dst > src)
    {
        if (*src & 0x80)
        {
            *dst-- = 0x80 | (*src & 0x3f);                     // trailing byte
            *dst-- = 0xc0 | (*((unsigned char *)src--) >> 6);  // leading byte
        }
        else
        {
            *dst-- = *src--;
        }
    }
    return 0;  // SUCCESS
}
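
The bit arithmetic in the second loop can be checked on a single byte. ISO-8859-1 values 0x80 to 0xFF map to the code points U+0080 to U+00FF, which UTF-8 encodes as two bytes of the form 110xxxxx 10xxxxxx: the top two bits of the input go into the leading byte and the low six bits into the trailing byte. The helper below is a hypothetical illustration (latin1_byte_to_utf8 is not part of the answer) that isolates just that step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the answer above: encodes one
 * ISO-8859-1 byte into out[] and returns the number of UTF-8 bytes
 * written (1 or 2). */
size_t latin1_byte_to_utf8(unsigned char c, unsigned char out[2])
{
    assert(out != NULL);

    if (c < 0x80) {
        out[0] = c;                               /* ASCII range: unchanged */
        return 1;
    }
    out[0] = (unsigned char)(0xc0 | (c >> 6));    /* 110000xx: top two bits */
    out[1] = (unsigned char)(0x80 | (c & 0x3f));  /* 10xxxxxx: low six bits */
    return 2;
}
```

For example, 0xE9 (é) is 11101001 in binary: the top two bits give 0xC0 | 0x03 = 0xC3 and the low six bits give 0x80 | 0x29 = 0xA9, so the output is C3 A9. Taking the byte as unsigned char sidesteps the question of whether plain char is signed, which is why the answer's code casts before shifting.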

Context

StackExchange Code Review Q#40780, answer score: 12
