Method to return a string of max length (in bytes vs. characters)
Problem
In my C# code, I need to generate a string (from a longer string) which, when UTF-8 encoded, is no longer than a given maximum length in bytes.
public static string OfMaxBytes(this string str, int maxByteLength)
{
    return str.Aggregate("", (s, c) =>
    {
        if (Encoding.UTF8.GetByteCount(s + c) <= maxByteLength)
        {
            s += c;
        }
        return s;
    });
}
Usage looks like:
var shortName = longName.OfMaxBytes(32);
Does this look like a correct and sensible implementation?
Solution
It depends what you mean by correct. Consider this program:
const string Input = "a\u0304\u0308bc\u0327";
var bytes = Encoding.UTF8.GetByteCount(Input);
Console.WriteLine("{0} ({1} bytes in UTF-8)", Input, bytes);
for (var i = 0; i <= bytes; i++)
{
    var result = Input.OfMaxBytes(i);
    Console.WriteLine("{0} \"{1}\" {2}", i, result, Input.StartsWith(result, StringComparison.Ordinal));
}
Here is what your solution gives:
ā̈bç (9 bytes in UTF-8)
0 "" True
1 "a" True
2 "ab" False
3 "ā" True
4 "āb" False
5 "ā̈" True
6 "ā̈b" True
7 "ā̈bc" True
8 "ā̈bc" True
9 "ā̈bç" True
Granted, you might not come across such an input very often, but I don't think that's the result you want.
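The False rows come from Aggregate visiting every char: a char that does not fit is skipped, but later, smaller chars are still appended. With a limit of 2, "a" fits, both combining marks are rejected, and then "b" is appended, giving "ab", which is not a prefix of the input. A minimal sketch of how the sample input breaks down into text elements and their UTF-8 sizes may make this easier to see. The class and method names here are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElementSizes
{
    // Hypothetical helper: returns the UTF-8 byte count of each text element
    // (a base character plus any combining marks), in order.
    public static List<int> Utf8SizesByTextElement(string input)
    {
        var sizes = new List<int>();
        var enumerator = StringInfo.GetTextElementEnumerator(input);
        while (enumerator.MoveNext())
        {
            sizes.Add(Encoding.UTF8.GetByteCount(enumerator.GetTextElement()));
        }
        return sizes;
    }

    public static void Main()
    {
        // 'a' + two combining marks, then 'b', then 'c' + one combining mark.
        var sizes = Utf8SizesByTextElement("a\u0304\u0308bc\u0327");
        Console.WriteLine(string.Join(", ", sizes)); // prints "5, 1, 3"
    }
}
```

The three elements take 5, 1 and 3 bytes, which adds up to the 9 bytes reported above. Any limit that falls inside one of those elements has to cut before it if combining sequences are to stay intact.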
Two other points:
- You are building up a lot of intermediate strings. When you catch yourself doing that, see if you can use a StringBuilder instead.
- You are iterating through the entire string, regardless of the value of maxBytes.
Here is what I would suggest:
public static string OfMaxBytes(this string input, int maxBytes)
{
    if (maxBytes == 0 || string.IsNullOrEmpty(input))
    {
        return string.Empty;
    }
    var encoding = Encoding.UTF8;
    if (encoding.GetByteCount(input) <= maxBytes)
    {
        return input;
    }
    var sb = new StringBuilder();
    var bytes = 0;
    var enumerator = StringInfo.GetTextElementEnumerator(input);
    while (enumerator.MoveNext())
    {
        var textElement = enumerator.GetTextElement();
        bytes += encoding.GetByteCount(textElement);
        if (bytes <= maxBytes)
        {
            sb.Append(textElement);
        }
        else
        {
            break;
        }
    }
    return sb.ToString();
}
Which gives this output:
ā̈bç (9 bytes in UTF-8)
0 "" True
1 "" True
2 "" True
3 "" True
4 "" True
5 "ā̈" True
6 "ā̈b" True
7 "ā̈b" True
8 "ā̈b" True
9 "ā̈bç" True
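As a possible variation on the same idea (not part of the suggestion above), the cut position can be located first and the result taken with a single Substring call, so no StringBuilder is needed. This is only a sketch: the class name StringByteLimitSketch and the method name OfMaxBytesBySubstring are hypothetical, and it still allocates a short substring per text element while counting bytes.

```csharp
using System.Globalization;
using System.Text;

public static class StringByteLimitSketch
{
    // Hypothetical variant: find where to cut, then take one Substring.
    public static string OfMaxBytesBySubstring(this string input, int maxBytes)
    {
        if (maxBytes <= 0 || string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }
        // Start index (in chars) of every text element in the input.
        var starts = StringInfo.ParseCombiningCharacters(input);
        var bytes = 0;
        for (var i = 0; i < starts.Length; i++)
        {
            // The element ends where the next one starts, or at the end of the string.
            var end = i + 1 < starts.Length ? starts[i + 1] : input.Length;
            bytes += Encoding.UTF8.GetByteCount(input.Substring(starts[i], end - starts[i]));
            if (bytes > maxBytes)
            {
                // This element does not fit: keep everything before it.
                return input.Substring(0, starts[i]);
            }
        }
        return input;
    }
}
```

For the sample input, limits 0 to 4 should give an empty string, 5 the first element, 6 to 8 the first two, and 9 the whole string, matching the table above.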
Context
StackExchange Code Review Q#55103, answer score: 8