
Method to return a string of max length (in bytes vs. characters)

Submitted by: @import:stackexchange-codereview

Problem

In my C# code, I need to generate a string (from a longer string) which, when UTF-8 encoded, is no longer than a given maximum length in bytes.

public static string OfMaxBytes(this string str, int maxByteLength)
{
    return str.Aggregate("", (s, c) =>
    {
        if (Encoding.UTF8.GetByteCount(s + c) <= maxByteLength)
        {
            s += c;
        }
        return s;
    });
}


Usage looks like:

var shortName = longName.OfMaxBytes(32);


Does this look like a correct and sensible implementation?

Solution

It depends on what you mean by correct. Consider this program:

const string Input = "a\u0304\u0308bc\u0327";
var bytes = Encoding.UTF8.GetByteCount(Input);
Console.WriteLine("{0} ({1} bytes in UTF-8)", Input, bytes);
for (var i = 0; i <= bytes; i++)
{
    var result = Input.OfMaxBytes(i);
    Console.WriteLine("{0} \"{1}\" {2}", i, result, Input.StartsWith(result, StringComparison.Ordinal));
}


Here is what your solution gives:

ā̈bç (9 bytes in UTF-8)
0 "" True
1 "a" True
2 "ab" False
3 "ā" True
4 "āb" False
5 "ā̈" True
6 "ā̈b" True
7 "ā̈bc" True
8 "ā̈bc" True
9 "ā̈bç" True

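Granted, you might not come across such an input very often, but I don't think that's the result you want. When a character does not fit, the lambda skips it but keeps scanning, so a later, narrower character can still be appended: limits 2 and 4 produce "ab" and "āb", which are not prefixes of the input. Even where the result is a prefix, combining marks can be cut off from their base character, as with the bare "a" at limit 1 or the "c" without its cedilla at limits 7 and 8.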
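To make the trace above easier to follow, here is a minimal sketch that reports how many UTF-8 bytes each UTF-16 char of a string needs. The class and method names (CharByteSizes, PerChar) are hypothetical and not part of the original question or answer, and the sketch assumes the input contains no surrogate pairs, which holds for the sample string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharByteSizes
{
    // Returns the UTF-8 byte count of each UTF-16 char in the input, in order.
    // Assumption: no surrogate pairs; a lone surrogate would be counted as
    // the encoder's replacement character instead of its real encoding.
    public static List<int> PerChar(string input)
    {
        var sizes = new List<int>();
        foreach (var c in input)
        {
            sizes.Add(Encoding.UTF8.GetByteCount(new[] { c }));
        }
        return sizes;
    }
}
```

For the sample input, CharByteSizes.PerChar("a\u0304\u0308bc\u0327") yields 1, 2, 2, 1, 1, 2. With a limit of 2, the "a" fits, both combining marks are skipped because each would push the total past the limit, and the one-byte "b" then fits, which is how "ab" comes about.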

Two other points:

  • You are building up a lot of intermediate strings: every call to the lambda allocates s + c. When you catch yourself doing that, see if you can use a StringBuilder instead.
  • You are iterating through the entire string, regardless of the value of maxByteLength, even after the byte limit has been reached.

Here is what I would suggest:

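// Requires: using System.Globalization; using System.Text;
// Like any extension method, this must be declared inside a static class.
public static string OfMaxBytes(this string input, int maxBytes)
{
    // Nothing fits in zero bytes, and null or empty input has nothing to keep.
    if (maxBytes == 0 || string.IsNullOrEmpty(input))
    {
        return string.Empty;
    }

    // Fast path: the whole string already fits.
    var encoding = Encoding.UTF8;
    if (encoding.GetByteCount(input) <= maxBytes)
    {
        return input;
    }

    var sb = new StringBuilder();
    var bytes = 0;

    // Walk text elements (a base character plus its combining marks)
    // so that a mark is never separated from its base character.
    var enumerator = StringInfo.GetTextElementEnumerator(input);
    while (enumerator.MoveNext())
    {
        var textElement = enumerator.GetTextElement();
        bytes += encoding.GetByteCount(textElement);
        if (bytes <= maxBytes)
        {
            sb.Append(textElement);
        }
        else
        {
            // Stop at the first element that does not fit, so the result is always a prefix.
            break;
        }
    }

    return sb.ToString();
}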


Which gives this output:

ā̈bç (9 bytes in UTF-8)
0 "" True
1 "" True
2 "" True
3 "" True
4 "" True
5 "ā̈" True
6 "ā̈b" True
7 "ā̈b" True
8 "ā̈b" True
9 "ā̈bç" True
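The empty results for limits 1 through 4 follow from how the input splits into text elements. As a rough sketch, the helper below lists the UTF-8 byte count of each text element; the names TextElementSizes and PerTextElement are hypothetical and used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElementSizes
{
    // Returns the UTF-8 byte count of each text element in the input, in order.
    // A text element here is whatever StringInfo.GetTextElementEnumerator yields,
    // for example a base character together with its combining marks.
    public static List<int> PerTextElement(string input)
    {
        var sizes = new List<int>();
        var enumerator = StringInfo.GetTextElementEnumerator(input);
        while (enumerator.MoveNext())
        {
            sizes.Add(Encoding.UTF8.GetByteCount(enumerator.GetTextElement()));
        }
        return sizes;
    }
}
```

For the sample input, TextElementSizes.PerTextElement("a\u0304\u0308bc\u0327") yields 5, 1, 3: the "a" with its two marks, the "b", and the "c" with its cedilla. The first element alone needs 5 bytes, so nothing fits below that limit, and the last element only fits once all 9 bytes are allowed.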


Context

StackExchange Code Review Q#55103, answer score: 8
