HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Optimal string literal tokenizing algorithm

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
optimaltokenizingliteralalgorithmstring

Problem

I have a tokenizer, where one of the token types happens to be a C# string type. It takes a TextReader as its input, which is created from a StringReader, and it needs to support the basic stuff C# strings supports, such as "jo dude", "joe \"dudes\"", "this is a backslash \\ token" and of course "this is carriage return \r\nthis is next line"

Currently my method looks like this;

private string ReadSingleLineStringLiteral ()
{
    StringBuilder buffer = new StringBuilder ();
    int nextChar = _reader.Read ();
    while (nextChar != -1) {
        buffer.Append ((char)nextChar);
        if (nextChar == '"' && 
            (buffer.Length == 1 || 
                 buffer [buffer.Length - 1] != '\\' || 
                 (buffer.Length - buffer.ToString ().TrimEnd ('\\').Length) % 2 == 0)) {
            break;
        }
        nextChar = _reader.Read ();
    }
    if (buffer [buffer.Length - 1] != '"')
        throw new ArgumentException ("unclosed string literal in hyperlisp file");
    return buffer.ToString ().Substring (0, buffer.Length - 1)
        .Replace ("\n", "\r\n") // normalizing carriage returns
        .Replace ("\r\r\n", "\r\n")
        .Replace ("\\\"", "\"")
        .Replace ("\\\\", "\\");
}


It works I think, however, I suspect this is far from optimal, among other things, I hate that buffer.ToString ().TrimEnd ('\\').Length) % 2 part of it.

The _reader is the TextReader object, that at the entry of my methods are at the first character within the content of my string literal.

Can anyone figure out the optimal solution for parsing a C# type of string literal, and optimize my method?

Please don't tell me to use some library, since I don't want to bring in the overhead of any libraries to perform a task I think should be easily possible to implement with some 10-20 lines of code.

For bonus points, I'd love to also have a solution to my "multiline string literal algo" too. This method should read string the same way C# reads them when

Solution

A rule of thumb for performance in C# is to avoid allocations. In this case, we should try to remove buffer.ToString ().TrimEnd ('\\'), as well as the calls to Substring and Replace that appear at the end.

Of course you need to set performance goals and profile the code yourself on realistic input. Single line strings can reasonably be expected to be quite short (say 200 chars?), so what you have already might be performant enough, but I'll illustrate an alternative way.

First let's make this a static method so it will be easier to unit test. As I mentioned in the comments, there's a bug in the code -- unit testing can help you catch such bugs.

private static string ReadSingleLineStringLiteral(TextReader reader)


And a convenience method for testing

private static string ReadSingleLineStringLiteral(string line)
{
    using (var reader = new StringReader(line))
    {
        return ReadSingleLineStringLiteral(reader);
    }
}


Now let's just handle the simple case, where there are no escape characters. Let's write some unit tests

Assert.AreEqual("", ReadSingleLineStringLiteral("\""));
Assert.AreEqual("jo dude", ReadSingleLineStringLiteral("jo dude\""));
Assert.Throws(() => ReadSingleLineStringLiteral(""));


And write the simplest thing that gets them to pass

var sb = new StringBuilder();
for (var c = reader.Read(); c != -1; c = reader.Read())
{
    switch (c)
    {
    case '"':
        return sb.ToString();
    default:
        sb.Append((char)c);
        break;
    }
}

throw new ArgumentException("Unexpected end of input");


Single line strings shouldn't contain new lines, so let's add a new test case

Assert.Throws(() => ReadSingleLineStringLiteral("\n\""));


And a matching case to our method

switch (c)
    {
    case '"':
        return sb.ToString();
    case '\n':
        throw new ArgumentException("Single line string contains new line");
    default:
        sb.Append((char)c);
        break;
    }


Now let's move on to escape sequences.

Assert.AreEqual("jo \"dude\"", ReadSingleLineStringLiteral("jo \\\"dude\\\"\""));
Assert.AreEqual("this is a backslash \\ token", ReadSingleLineStringLiteral("this is a backslash \\\\ token\""));
Assert.AreEqual("this is a carriage return \r\nthis is next line", ReadSingleLineStringLiteral("this is a carriage return \\r\\nthis is next line\""));


We add another case

switch (c)
    {
    case '"':
        return sb.ToString();
    case '\\':
        AppendEscapeCharacter(reader, sb);
        break;
    case '\n':
        throw new ArgumentException("Single line string contains new line");
    default:
        sb.Append((char)c);
        break;
    }


And the corresponding method

private static void AppendEscapeCharacter(TextReader reader, StringBuilder sb)
{
    var c = reader.Read(); 
    switch (c)
    {
    case -1:
        throw new ArgumentException("Unexpected end of input");
    case '"':
    case '\'':
    case '\\':
        sb.Append((char)c);
        break;
    case 'n':
        sb.Append('\n');
        break;
    case 'r':
        sb.Append('\r');
        break;
    default:
        throw new ArgumentException(string.Format("Invalid escape sequence '\\{0}'", (char)c));
    }
}


There are more cases to handle, but this should be enough to get you started.

Code Snippets

private static string ReadSingleLineStringLiteral(TextReader reader)
private static string ReadSingleLineStringLiteral(string line)
{
    using (var reader = new StringReader(line))
    {
        return ReadSingleLineStringLiteral(reader);
    }
}
Assert.AreEqual("", ReadSingleLineStringLiteral("\""));
Assert.AreEqual("jo dude", ReadSingleLineStringLiteral("jo dude\""));
Assert.Throws<ArgumentException>(() => ReadSingleLineStringLiteral(""));
var sb = new StringBuilder();
for (var c = reader.Read(); c != -1; c = reader.Read())
{
    switch (c)
    {
    case '"':
        return sb.ToString();
    default:
        sb.Append((char)c);
        break;
    }
}

throw new ArgumentException("Unexpected end of input");
Assert.Throws<ArgumentException>(() => ReadSingleLineStringLiteral("\n\""));

Context

StackExchange Code Review Q#70671, answer score: 4

Revisions (0)

No revisions yet.