Count byte length of string

Submitted by: @import:stackexchange-codereview

Problem

I am looking for some guidance and optimization pointers for my custom JavaScript function which counts the bytes in a string rather than just chars. The website uses UTF-8 and I am looking to maintain IE8 compatibility.

```
/**
* Count bytes in string
*
* Count and return the number of bytes in a given string
*
* @access public
* @param string
* @return int
*/
function getByteLen(normal_val)
{
// Force string type
normal_val = String(normal_val);

// Split original string into array
var normal_pieces = normal_val.split('');
// Get length of original array
var normal_length = normal_pieces.length;

// Declare array for encoded normal array
var encoded_pieces = new Array();

// Declare array for individual byte pieces
var byte_pieces = new Array();

// Loop through normal pieces and convert to URL friendly format
for(var i = 0; i <= normal_length; i++)
{
if(normal_pieces[i] && normal_pieces[i] != '')
{
encoded_pieces[i] = encodeURI(normal_pieces[i]);
}
}

// Get length of encoded array
var encoded_length = encoded_pieces.length;

// Loop through encoded array
// Scan individual items for a %
// Split on % and add to byte array
// If no % exists then add to byte array
for(var i = 0; i <= encoded_length; i++)
{
if(encoded_pieces[i] && encoded_pieces[i] != '')
{
// % exists
if(encoded_pieces[i].indexOf('%') != -1)
{
// Split on %
var split_code = encoded_pieces[i].split('%');
// Get length
var split_length = split_code.length;

// Loop through pieces
for(var j = 0; j <= split_length; j++)
{
if(split_code[j] && split_code[j] != '')
{
// Push to byte array
byte_pieces.push(split_code[j]);
```

Solution

It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().

```
/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param   string
 * @return  int
 */
function getByteLen(normal_val) {
    // Force string type
    normal_val = String(normal_val);

    var byteLen = 0;
    for (var i = 0; i < normal_val.length; i++) {
        var c = normal_val.charCodeAt(i);
        byteLen += (c & 0xf800) == 0xd800 ? 2 :  // Code unit is half of a surrogate pair
                   c < (1 <<  7) ? 1 :           // U+0000 to U+007F
                   c < (1 << 11) ? 2 : 3;        // U+0080 to U+07FF, else U+0800 to U+FFFF
    }
    return byteLen;
}
```
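As a sanity check, the result can be compared against a count derived from percent-encoding, which is closer in spirit to the approach in the question. The sketch below is only an illustration: `getByteLenViaEncode` is a hypothetical helper, not part of the original answer. It relies on `unescape`, which is deprecated but still widely available, and it assumes well-formed input because `encodeURIComponent` throws a `URIError` on a lone surrogate.

```javascript
// Same logic as getByteLen above, repeated so that this snippet runs on its own.
function getByteLen(normal_val) {
    normal_val = String(normal_val);
    var byteLen = 0;
    for (var i = 0; i < normal_val.length; i++) {
        var c = normal_val.charCodeAt(i);
        byteLen += (c & 0xf800) == 0xd800 ? 2 :
                   c < (1 <<  7) ? 1 :
                   c < (1 << 11) ? 2 : 3;
    }
    return byteLen;
}

// Hypothetical cross-check: percent-encode the string as UTF-8, then turn each
// %XX escape back into a single character, so the length equals the byte count.
// Assumption: the input contains no lone surrogates (those raise a URIError).
function getByteLenViaEncode(val) {
    return unescape(encodeURIComponent(String(val))).length;
}

// ASCII, a 2-byte character, a 3-byte character, and a surrogate pair (U+1F600).
var inputs = ['hello', 'na\u00efve', '\u20ac100', '\ud83d\ude00'];
var allAgree = true;
for (var i = 0; i < inputs.length; i++) {
    if (getByteLen(inputs[i]) !== getByteLenViaEncode(inputs[i])) {
        allAgree = false;
    }
}
```

For well-formed strings the two counts should agree; the loop-based version has the advantage of not throwing on malformed input.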


JavaScript implementations may use either UCS-2 or UTF-16 to represent strings.

UCS-2 only supports Unicode code points up to U+FFFF, and such characters occupy 1, 2, or 3 bytes in their UTF-8 representation: 1 byte up to U+007F, 2 bytes up to U+07FF, and 3 bytes beyond that. This is not too tricky to handle.
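To make the three size classes concrete, here is a minimal sketch of the thresholds for a single code unit outside the surrogate range. The helper name `utf8LenOfBmpUnit` is hypothetical and exists only for this illustration; the boundaries 0x80 and 0x800 are the same values as `1 << 7` and `1 << 11` in the answer's code.

```javascript
// Hypothetical helper (not part of the original answer): UTF-8 length of a
// single non-surrogate code unit in the range U+0000 to U+FFFF.
function utf8LenOfBmpUnit(c) {
    if (c < 0x80)  { return 1; }  // U+0000..U+007F: 7 bits fit in one byte
    if (c < 0x800) { return 2; }  // U+0080..U+07FF: 11 bits fit in two bytes
    return 3;                     // U+0800..U+FFFF: 16 bits need three bytes
}

var samples = ['A', '\u00e9', '\u20ac'];  // "A", "é", "€"
var lengths = [];
for (var i = 0; i < samples.length; i++) {
    lengths.push(utf8LenOfBmpUnit(samples[i].charCodeAt(0)));
}
// lengths is [1, 2, 3]
```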

However, as @Mac points out, UTF-16 surrogate pairs are a tricky special case. UTF-16 extends UCS-2 by adding support for code points U+10000 to U+10FFFF, which it encodes using a pair of 16-bit code units. The first code unit of such a pair (called the "high surrogate") is in the range D800 to DBFF; it should always be followed by a second code unit (called the "low surrogate") in the range DC00 to DFFF. Observe that the UTF-8 representation of any character in the range U+10000 to U+10FFFF takes 4 bytes. Therefore, any surrogate pair in UTF-16 translates to a 4-byte UTF-8 representation. Put another way, whenever we encounter half of a surrogate pair (i.e., a code unit in the range D800 to DFFF), we can just add two bytes to the UTF-8 length.
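The surrogate rule can be seen on a single character. The sketch below uses U+1F600 as an arbitrary example of a code point above U+FFFF; `isSurrogate` is a hypothetical helper name chosen for this illustration. The mask `0xf800` keeps the top five bits, which are `11011` for every value from D800 to DFFF and for no other 16-bit value.

```javascript
// U+1F600 (an emoji) lies outside the range UCS-2 can represent, so a
// JavaScript string stores it as two 16-bit code units.
var s = '\ud83d\ude00';

var high = s.charCodeAt(0);  // 0xD83D, in the high-surrogate range D800..DBFF
var low  = s.charCodeAt(1);  // 0xDE00, in the low-surrogate range DC00..DFFF

// Hypothetical helper: true for either half of a surrogate pair (D800..DFFF).
function isSurrogate(c) {
    return (c & 0xf800) === 0xd800;
}

// Counting 2 bytes per half gives 4, the length of the UTF-8 encoding
// F0 9F 98 80 of U+1F600.
var bytes = (isSurrogate(high) ? 2 : 0) + (isSurrogate(low) ? 2 : 0);
```

Note that `s.length` is 2 here, which is why counting characters with `length` alone underestimates the byte count.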

Context

StackExchange Code Review Q#37512, answer score: 15