Count byte length of string
Problem
I am looking for some guidance and optimization pointers for my custom JavaScript function which counts the bytes in a string rather than just chars. The website uses UTF-8 and I am looking to maintain IE8 compatibility.
```
/**
 * Count bytes in string
 *
 * Count and return the number of bytes in a given string
 *
 * @access public
 * @param string
 * @return int
 */
function getByteLen(normal_val)
{
    // Force string type
    normal_val = String(normal_val);

    // Split original string into array
    var normal_pieces = normal_val.split('');
    // Get length of original array
    var normal_length = normal_pieces.length;
    // Declare array for encoded normal array
    var encoded_pieces = new Array();
    // Declare array for individual byte pieces
    var byte_pieces = new Array();

    // Loop through normal pieces and convert to URL friendly format
    for(var i = 0; i <= normal_length; i++)
    {
        if(normal_pieces[i] && normal_pieces[i] != '')
        {
            encoded_pieces[i] = encodeURI(normal_pieces[i]);
        }
    }

    // Get length of encoded array
    var encoded_length = encoded_pieces.length;

    // Loop through encoded array
    // Scan individual items for a %
    // Split on % and add to byte array
    // If no % exists then add to byte array
    for(var i = 0; i <= encoded_length; i++)
    {
        if(encoded_pieces[i] && encoded_pieces[i] != '')
        {
            // % exists
            if(encoded_pieces[i].indexOf('%') != -1)
            {
                // Split on %
                var split_code = encoded_pieces[i].split('%');
                // Get length
                var split_length = split_code.length;
                // Loop through pieces
                for(var j = 0; j <= split_length; j++)
                {
                    if(split_code[j] && split_code[j] != '')
                    {
                        // Push to byte array
                        byte_pieces.push(split_code[j]);
```
Solution
It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().
```
/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param string
 * @return int
 */
function getByteLen(normal_val) {
    // Force string type
    normal_val = String(normal_val);

    var byteLen = 0;
    for (var i = 0; i < normal_val.length; i++) {
        var c = normal_val.charCodeAt(i);
        byteLen += (c & 0xf800) == 0xd800 ? 2 :  // Code unit is half of a surrogate pair
                   c < (1 <<  7) ? 1 :
                   c < (1 << 11) ? 2 : 3;
    }
    return byteLen;
}
```
JavaScript implementations may use either UCS-2 or UTF-16 to represent strings.
UCS-2 only supports Unicode code points up to U+FFFF, and such Unicode characters occupy 1, 2, or 3 bytes in their UTF-8 representation. This is not too tricky to handle.
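To make the 1, 2, or 3 byte boundaries concrete, here is a minimal sketch. The helper name `utf8BytesForBmpUnit` is hypothetical (it is not part of the answer's code), and it assumes its argument is a non-surrogate code unit in the range U+0000 to U+FFFF.

```javascript
// Hypothetical helper (not from the original answer): UTF-8 byte count for a
// single code unit in U+0000..U+FFFF, assuming it is not a surrogate.
function utf8BytesForBmpUnit(c) {
    if (c < 0x80) { return 1; }   // U+0000..U+007F: 7 bits fit in one byte
    if (c < 0x800) { return 2; }  // U+0080..U+07FF: 11 bits fit in two bytes
    return 3;                     // U+0800..U+FFFF: 16 bits need three bytes
}

// One sample from each range: "A" (U+0041), "é" (U+00E9), "€" (U+20AC)
var bmpSamples = ['A', '\u00e9', '\u20ac'];
var bmpCounts = [];
for (var k = 0; k < bmpSamples.length; k++) {
    bmpCounts.push(utf8BytesForBmpUnit(bmpSamples[k].charCodeAt(0)));
}
// bmpCounts is now [1, 2, 3]
```

The thresholds 0x80 and 0x800 are the same values as `1 << 7` and `1 << 11` in the function above.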
However, as @Mac points out, UTF-16 surrogate pairs are a tricky special case. UTF-16 extends UCS-2 by adding support for code points U+10000 to U+10FFFF, which it encodes as a pair of 16-bit code units. The first unit of such a pair (the "high surrogate") is in the range D800 to DBFF; it should always be followed by a second unit (the "low surrogate") in the range DC00 to DFFF. The UTF-8 representation of any character in the range U+10000 to U+10FFFF takes 4 bytes, so any surrogate pair in UTF-16 translates to a 4-byte UTF-8 sequence. Equivalently, whenever we encounter half of a surrogate pair (i.e., a code unit in the range D800 to DFFF), we just add two bytes to the UTF-8 length.
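The surrogate arithmetic can be sketched as follows. The helper names `toSurrogatePair` and `isSurrogate` are hypothetical and only illustrate the reasoning; the sketch assumes the input is a valid code point in the range U+10000 to U+10FFFF.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;     // 20-bit value
    var high = 0xd800 + (offset >> 10);   // top 10 bits    -> D800..DBFF
    var low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> DC00..DFFF
    return [high, low];
}

// Same mask test as in the answer's code: true for any unit in D800..DFFF.
function isSurrogate(c) {
    return (c & 0xf800) === 0xd800;
}

// U+1F600 becomes the pair D83D DE00.
var pair = toSurrogatePair(0x1f600);

// Each half contributes 2 bytes, so the pair contributes 4 in total.
var pairBytes = 0;
for (var m = 0; m < pair.length; m++) {
    pairBytes += isSurrogate(pair[m]) ? 2 : 0;
}
// pairBytes is now 4
```

Counting 2 bytes per half avoids having to look ahead for the matching low surrogate, which keeps the loop a single pass over `charCodeAt()` values.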
Code Snippets
```
/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param string
 * @return int
 */
function getByteLen(normal_val) {
    // Force string type
    normal_val = String(normal_val);

    var byteLen = 0;
    for (var i = 0; i < normal_val.length; i++) {
        var c = normal_val.charCodeAt(i);
        byteLen += (c & 0xf800) == 0xd800 ? 2 :  // Code unit is half of a surrogate pair
                   c < (1 <<  7) ? 1 :
                   c < (1 << 11) ? 2 : 3;
    }
    return byteLen;
}
```
Context
StackExchange Code Review Q#37512, answer score: 15