Function to convert ISO-8859-1 to UTF-8
Problem
I wrote this function last year to convert ISO-8859-1 text to UTF-8 and just found it again. It takes a text buffer and its size, then converts the text to UTF-8 in place if there is enough space.
What should be changed to improve quality?
int iso88951_to_utf8(unsigned char *content, size_t max_size)
{
    unsigned char *copy;
    size_t conversion_count; //number of chars to convert / bytes to add

    copy = content;
    conversion_count = 0;

    //first run to see if there's enough space for the new bytes
    while(*content)
    {
        if(*content >= 0x80)
        {
            ++conversion_count;
        }
        ++content;
    }

    if(content - copy + conversion_count >= max_size)
    {
        return ERROR;
    }
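    while(content >= copy && conversion_count)
    {
        //repositioning current characters to make room for new bytes
        if(*content < 0x80)
        {
            *(content + conversion_count) = *content;
        }
        else
        {
            *(content + conversion_count) = 0x80 | (*content & 0x3f); //last byte
            *(content + --conversion_count) = 0xc0 | *content >> 6; //first byte
        }
        --content;
    }

    return SUCCESS;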
}
Solution
The character set is named ISO-8859-1, not ISO-8895-1. Rename your function accordingly.
Change the return value to be more informative:
- Return 0 on success.
- If max_size is too small, return the minimum value of max_size that would be sufficient to accommodate the output (including the trailing \0).
I would also change the parameter to take a signed char * to be a bit more natural.
I think that the implementation could look tidier if you dealt with pointers instead of offsets.
It would be nice if you NUL-terminated the result, so that the caller does not have to zero out the entire buffer before calling this function.
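To make the suggested return convention concrete, here is a minimal sketch of a converter that follows it. The name latin1_to_utf8_inplace and the index-based loops are assumptions made for illustration and are not part of the original post. It assumes the buffer holds a NUL-terminated string, returns 0 on success, and otherwise returns the required size while leaving the buffer untouched.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): converts the NUL-terminated
 * ISO-8859-1 string in buf to UTF-8 in place.
 * Returns 0 on success. If buf_size is too small, returns the number of
 * bytes required (including the trailing NUL) and leaves buf unchanged. */
size_t latin1_to_utf8_inplace(char *buf, size_t buf_size)
{
    unsigned char *p = (unsigned char *)buf;
    size_t in_len = strlen(buf);
    size_t out_len = in_len;

    /* Every byte with the high bit set grows from one byte to two. */
    for (size_t i = 0; i < in_len; i++) {
        if (p[i] & 0x80) {
            out_len++;
        }
    }

    if (out_len + 1 > buf_size) {
        return out_len + 1;   /* tell the caller how much space is needed */
    }

    /* Fill the output from the end so unread input is never overwritten. */
    size_t dst = out_len;
    p[dst] = '\0';
    for (size_t src = in_len; src-- > 0; ) {
        if (p[src] & 0x80) {
            p[--dst] = 0x80 | (p[src] & 0x3f);   /* trailing byte */
            p[--dst] = 0xc0 | (p[src] >> 6);     /* leading byte */
        } else {
            p[--dst] = p[src];
        }
    }
    return 0;
}
```

A caller that gets a non-zero result can allocate a buffer of that size, copy the string into it, and call the function again. Because the result is always NUL-terminated, the caller never has to zero the buffer beforehand.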
size_t iso8859_1_to_utf8(char *content, size_t max_size)
{
    char *src, *dst;

    // First pass: find the end of the input (src) and the position where
    // the terminating NUL of the UTF-8 output will land (dst).
    for (src = dst = content; *src; src++, dst++)
    {
        if (*src & 0x80)
        {
            // If the high bit is set in the ISO-8859-1 representation, then
            // the UTF-8 representation requires two bytes (one more than usual).
            ++dst;
        }
    }

    if ((size_t)(dst - content) + 1 > max_size)
    {
        // Inform caller of the space required
        return (size_t)(dst - content) + 1;
    }

    // dst already points at the slot for the terminating NUL.
    *dst = '\0';

    // Second pass: copy backwards so that unread input is never overwritten.
    while (dst > src)
    {
        if (*src & 0x80)
        {
            *dst-- = 0x80 | (*src & 0x3f);                     // trailing byte
            *dst-- = 0xc0 | (*((unsigned char *)src--) >> 6);  // leading byte
        }
        else
        {
            *dst-- = *src--;
        }
    }
    return 0;   // SUCCESS
}
Context
StackExchange Code Review Q#40780, answer score: 12