patterncsharpMinor
Optimal string literal tokenizing algorithm
Viewed 0 times
optimaltokenizingliteralalgorithmstring
Problem
I have a tokenizer, where one of the token types happens to be a C# string type. It takes a TextReader as its input, which is created from a StringReader, and it needs to support the basic stuff C# strings supports, such as "jo dude", "joe \"dudes\"", "this is a backslash \\ token" and of course "this is carriage return \r\nthis is next line"
Currently my method looks like this;
It works I think, however, I suspect this is far from optimal, among other things, I hate that
The
Can anyone figure out the optimal solution for parsing a C# type of string literal, and optimize my method?
Please don't tell me to use some library, since I don't want to bring in the overhead of any libraries to perform a task I think should be easily possible to implement with some 10-20 lines of code.
For bonus points, I'd love to also have a solution to my "multiline string literal algo" too. This method should read string the same way C# reads them when
Currently my method looks like this;
private string ReadSingleLineStringLiteral ()
{
StringBuilder buffer = new StringBuilder ();
int nextChar = _reader.Read ();
while (nextChar != -1) {
buffer.Append ((char)nextChar);
if (nextChar == '"' &&
(buffer.Length == 1 ||
buffer [buffer.Length - 1] != '\\' ||
(buffer.Length - buffer.ToString ().TrimEnd ('\\').Length) % 2 == 0)) {
break;
}
nextChar = _reader.Read ();
}
if (buffer [buffer.Length - 1] != '"')
throw new ArgumentException ("unclosed string literal in hyperlisp file");
return buffer.ToString ().Substring (0, buffer.Length - 1)
.Replace ("\n", "\r\n") // normalizing carriage returns
.Replace ("\r\r\n", "\r\n")
.Replace ("\\\"", "\"")
.Replace ("\\\\", "\\");
}It works I think, however, I suspect this is far from optimal, among other things, I hate that
buffer.ToString ().TrimEnd ('\\').Length) % 2 part of it.The
_reader is the TextReader object, that at the entry of my methods are at the first character within the content of my string literal.Can anyone figure out the optimal solution for parsing a C# type of string literal, and optimize my method?
Please don't tell me to use some library, since I don't want to bring in the overhead of any libraries to perform a task I think should be easily possible to implement with some 10-20 lines of code.
For bonus points, I'd love to also have a solution to my "multiline string literal algo" too. This method should read string the same way C# reads them when
Solution
A rule of thumb for performance in C# is to avoid allocations. In this case, we should try to remove
Of course you need to set performance goals and profile the code yourself on realistic input. Single line strings can reasonably be expected to be quite short (say 200 chars?), so what you have already might be performant enough, but I'll illustrate an alternative way.
First let's make this a static method so it will be easier to unit test. As I mentioned in the comments, there's a bug in the code -- unit testing can help you catch such bugs.
And a convenience method for testing
Now let's just handle the simple case, where there are no escape characters. Let's write some unit tests
And write the simplest thing that gets them to pass
Single line strings shouldn't contain new lines, so let's add a new test case
And a matching case to our method
Now let's move on to escape sequences.
We add another case
And the corresponding method
There are more cases to handle, but this should be enough to get you started.
buffer.ToString ().TrimEnd ('\\'), as well as the calls to Substring and Replace that appear at the end.Of course you need to set performance goals and profile the code yourself on realistic input. Single line strings can reasonably be expected to be quite short (say 200 chars?), so what you have already might be performant enough, but I'll illustrate an alternative way.
First let's make this a static method so it will be easier to unit test. As I mentioned in the comments, there's a bug in the code -- unit testing can help you catch such bugs.
private static string ReadSingleLineStringLiteral(TextReader reader)And a convenience method for testing
private static string ReadSingleLineStringLiteral(string line)
{
using (var reader = new StringReader(line))
{
return ReadSingleLineStringLiteral(reader);
}
}Now let's just handle the simple case, where there are no escape characters. Let's write some unit tests
Assert.AreEqual("", ReadSingleLineStringLiteral("\""));
Assert.AreEqual("jo dude", ReadSingleLineStringLiteral("jo dude\""));
Assert.Throws(() => ReadSingleLineStringLiteral(""));And write the simplest thing that gets them to pass
var sb = new StringBuilder();
for (var c = reader.Read(); c != -1; c = reader.Read())
{
switch (c)
{
case '"':
return sb.ToString();
default:
sb.Append((char)c);
break;
}
}
throw new ArgumentException("Unexpected end of input");Single line strings shouldn't contain new lines, so let's add a new test case
Assert.Throws(() => ReadSingleLineStringLiteral("\n\""));And a matching case to our method
switch (c)
{
case '"':
return sb.ToString();
case '\n':
throw new ArgumentException("Single line string contains new line");
default:
sb.Append((char)c);
break;
}Now let's move on to escape sequences.
Assert.AreEqual("jo \"dude\"", ReadSingleLineStringLiteral("jo \\\"dude\\\"\""));
Assert.AreEqual("this is a backslash \\ token", ReadSingleLineStringLiteral("this is a backslash \\\\ token\""));
Assert.AreEqual("this is a carriage return \r\nthis is next line", ReadSingleLineStringLiteral("this is a carriage return \\r\\nthis is next line\""));We add another case
switch (c)
{
case '"':
return sb.ToString();
case '\\':
AppendEscapeCharacter(reader, sb);
break;
case '\n':
throw new ArgumentException("Single line string contains new line");
default:
sb.Append((char)c);
break;
}And the corresponding method
private static void AppendEscapeCharacter(TextReader reader, StringBuilder sb)
{
var c = reader.Read();
switch (c)
{
case -1:
throw new ArgumentException("Unexpected end of input");
case '"':
case '\'':
case '\\':
sb.Append((char)c);
break;
case 'n':
sb.Append('\n');
break;
case 'r':
sb.Append('\r');
break;
default:
throw new ArgumentException(string.Format("Invalid escape sequence '\\{0}'", (char)c));
}
}There are more cases to handle, but this should be enough to get you started.
Code Snippets
private static string ReadSingleLineStringLiteral(TextReader reader)private static string ReadSingleLineStringLiteral(string line)
{
using (var reader = new StringReader(line))
{
return ReadSingleLineStringLiteral(reader);
}
}Assert.AreEqual("", ReadSingleLineStringLiteral("\""));
Assert.AreEqual("jo dude", ReadSingleLineStringLiteral("jo dude\""));
Assert.Throws<ArgumentException>(() => ReadSingleLineStringLiteral(""));var sb = new StringBuilder();
for (var c = reader.Read(); c != -1; c = reader.Read())
{
switch (c)
{
case '"':
return sb.ToString();
default:
sb.Append((char)c);
break;
}
}
throw new ArgumentException("Unexpected end of input");Assert.Throws<ArgumentException>(() => ReadSingleLineStringLiteral("\n\""));Context
StackExchange Code Review Q#70671, answer score: 4
Revisions (0)
No revisions yet.