
Tokenizer for my programming language

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


codereview cpp stackoverflow c++ c++11 lexical-analysis language-design


Problem

Here's my attempt at porting the Lua codebase for my programming language to C++(11). This is just the first step, the tokenizer, and I wanted to remove all the bad performance / practices / code before moving on to the next steps.

I'm also still learning C++ as I go through this experience, so I wanted to get it reviewed to get feedback on how I am doing and to learn more.

Here's a formal definition of what a token is in my programming language in a syntax I hope looks like EBNF:

token ::= symbol | string | number | name;

symbol ::= '{' | '}' | '[' | ']' | '(' | ')' | '.' | ',' | ';' | ':' | '$' | '?' | '!' | '#' | '_' | '\'';

string ::= '"' {(any_character | string_escape)} '"';

string_escape ::= c_escape | ('\\' digit [digit] [digit]);

number ::= [('+' | '-')] {digit} ('.' [digit] {digit});

digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

name ::= name_char {(name_char | digit)};

name_char ::= //all printable characters which aren't a symbol, a digit or " and ~

A single line comment starts with a ~ and ends with a new line character. A block comment instead starts with ~{ and ends with ~}. Each opening bracket must have a matching closing bracket (they can be nested): as an example, a string like ~{ ~{ ~{ ~} ~{ ~{ ~} ~} won't be accepted because there are some unmatched opening brackets.

Strings aren't single-line: they can span multiple lines without the need to escape the newline with \ like in most languages.

But here's my actual code:

Path/include/error.hpp

#ifndef ERROR_HPP_INCLUDED
#define ERROR_HPP_INCLUDED

#include <string>
#include <sstream>

namespace patch {
    template <typename T>
    std::string to_string(const T &n) {
        std::stringstream stm;
        stm << n;
        return stm.str();
    }
}

class Error {
    public:
        std::string message;

        Error(std::string);
};

#endif//ERROR_HPP_INCLUDED

Path/src/error.cpp

```
#include "../include/error.hpp"

#include

Error::Error(std::string new_mess
```

Solution

Issues with EBNF

string ::= '"' {(any_character | string_escape)} '"';

Here, by "any_character" you probably meant any character except ", but that is not made explicit.

number ::= [('+' | '-')] {digit} ('.' [digit] {digit});
                         ^^^A^^^  ^^^^^^^B^^^^^^^^^^^^

For A: This means zero or more digits. That's fine.

For B: That's a mandatory '.', then zero or one digit followed by zero or more digits.

So the string +. (a sign followed by a lone decimal point) is a valid number.

I believe you meant:

number ::= ['+' | '-'] {digit} ['.' digit {digit}];

But this still allows a bare + or - to be parsed as a number. You really need to break this up into a couple of expressions to fully parse numbers.

number        ::= ['+' | '-'] NumberPart

NumberPart    ::= NumberInteger | NumberFloat
NumberInteger ::= digit {digit}
NumberFloat   ::= {digit} '.' digit {digit};

You can do it in one line if you really must, but I find it easier to read when you split it up a bit. Note: This is still not as comprehensive as the number syntax of the C language, because here a decimal point must be followed by a digit, but it's pretty good.

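
As a rough illustration of the split grammar, here is a minimal hand-written recognizer, assuming C++11. The function name match_number and its return convention (length of the matched prefix, 0 for no match) are made up for this sketch and do not come from the original code.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length of the longest prefix of `s`
// (starting at `pos`) that matches
//     number ::= ['+' | '-'] (NumberInteger | NumberFloat)
// and 0 when no number starts at `pos`.
std::size_t match_number(const std::string& s, std::size_t pos = 0) {
    auto is_digit = [&s](std::size_t i) {
        return i < s.size() &&
               std::isdigit(static_cast<unsigned char>(s[i])) != 0;
    };

    std::size_t i = pos;
    if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;  // optional sign

    const std::size_t int_start = i;
    while (is_digit(i)) ++i;                                 // {digit}
    const bool has_int = i > int_start;

    // NumberFloat: the '.' only counts when at least one digit follows it.
    if (i < s.size() && s[i] == '.' && is_digit(i + 1)) {
        ++i;                                                 // consume '.'
        while (is_digit(i)) ++i;                             // digit {digit}
        return i - pos;
    }

    // NumberInteger needs at least one digit; a lone sign is not a number.
    return has_int ? i - pos : 0;
}
```

With this sketch, match_number("-3.14") consumes all five characters, while "+" and "+." are rejected because neither NumberInteger nor NumberFloat can match without a digit.
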
An equivalent FLEX file

%x BLOCKCOMMENT
%x LINECOMMENT

/* You probably meant any character except " */
AnyStringCharacter          [^"]
Digit                       [0-9]
CEscape                     \\.

StringEscape                {CEscape}|\\{Digit}{Digit}{Digit}
Character                   {AnyStringCharacter}|{StringEscape}
LiteralString               \"{Character}*\"

Sign                        [+-]
NumberInteger               {Digit}+
NumberFloat                 {Digit}*\.{Digit}+
NumberPart                  {NumberInteger}|{NumberFloat}
LiteralNumber               {Sign}?{NumberPart}

IdentifierChar_First        [^]{}().,;:$?!#_\\[0123456789~"]
IdentifierChar              {IdentifierChar_First}|{Digit}
Identifier                  {IdentifierChar_First}{IdentifierChar}*

LineComment                 [^\n]*
BlockComment                [^~\n]*
EndOfLine                   \n

%%

<INITIAL>\~                     {BEGIN(LINECOMMENT);}
<INITIAL>\~\{                   {BEGIN(BLOCKCOMMENT);}

<BLOCKCOMMENT>\~\}              {BEGIN(INITIAL);}
<BLOCKCOMMENT>{EndOfLine}       {/*++line;*/}
<LINECOMMENT>{EndOfLine}        {BEGIN(INITIAL);/*++line;*/}

<BLOCKCOMMENT>{BlockComment}    {/* Ignore Comment */}
<BLOCKCOMMENT>\~                {/* Ignore ~ not followed by { */}
<LINECOMMENT>{LineComment}      {/* Ignore Comment */}

\{                          {return '{';}
\}                          {return '}';}
\[                          {return '[';}
\]                          {return ']';}
\(                          {return '(';}
\)                          {return ')';}
\.                          {return '.';}
\,                          {return ',';}
\;                          {return ';';}
\:                          {return ':';}
\$                          {return '$';}
\?                          {return '?';}
\!                          {return '!';}
\#                          {return '#';}
\_                          {return '_';}
\\                          {return '\\';}

{LiteralString}             {return yy::lex::literal_string;}
{LiteralNumber}             {return yy::lex::literal_number;}
{Identifier}                {return yy::lex::identifier;}

.                           {/* ERROR */}

That's 67 lines compared to the nearly 500 for writing it yourself. And I am being generous, as I could collapse all the symbols into a single line. This code is basically readable BNF, so any computer scientist should be able to maintain it.

Code Review

There are so many of these lying around. You could have picked up a nearly standard one from Boost: boost::lexical_cast<>

namespace patch {
    template <typename T>
    std::string to_string(const T &n) {
        std::stringstream stm;
        stm << n;
        return stm.str();
    }
}


If this is an exception, you should probably inherit from one of the standard exceptions (like std::runtime_error).

class Error {
    public:
        std::string message;

        Error(std::string);  // Pass by const reference.
                             // If it needs building from a literal it works.
                             // But if already a string it will prevent the copy.
};


Seriously. That could have been inlined in the header file.

Error::Error(std::string new_message):
    message(new_message)
{}


Rather than use a void* to store your data, use a union.

class Token {
    public:
        void* value = nullptr;
};


It expresses intent more clearly and will also remove all the casting issues that you are going to have in the rest of your code.

Never use C casts. Always use a C++ cast. They are easier to spot in the code and express intent much better.

delete (double*) value;
delete (std::string*) value;


Being easy to spot is a good thing, because I want to check the dangerous casts more closely but ignore the simpler ones.
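
The two review remarks about Error (inherit from a standard exception, and define the constructor in the header) can be combined in a few lines. The following is only a sketch of how that might look, assuming the class is used purely as an exception type; it is not the original author's code.

```cpp
#include <stdexcept>
#include <string>

// Sketch of a possible replacement for the Error class in error.hpp.
// std::runtime_error already stores the message, so no extra member is
// needed and what() returns the text.
class Error : public std::runtime_error {
public:
    // Taking a const reference avoids a copy when the caller already has
    // a std::string; a string literal still converts implicitly.
    explicit Error(const std::string& message)
        : std::runtime_error(message) {}
};
```

Because the constructor is defined inside the class, error.cpp is no longer needed for it, and any handler written as catch (const std::exception& e) will also see these errors through e.what().
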

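
To make the union suggestion concrete, here is a minimal sketch of a tagged union, assuming a token only ever carries a double or a std::string (the two types the quoted delete statements cast to). The names Kind, number_ and text_ are hypothetical and do not appear in the original code. C++11 allows a std::string inside a union, but the enclosing class must then construct and destroy that member by hand.

```cpp
#include <new>
#include <string>
#include <utility>

// Hypothetical Token that stores its payload in a tagged union instead of
// a void*. No casts and no manual delete are needed by the callers.
class Token {
public:
    enum class Kind { Number, Text };

    explicit Token(double n) : kind_(Kind::Number), number_(n) {}
    explicit Token(std::string s) : kind_(Kind::Text), text_(std::move(s)) {}

    Token(const Token& other) : kind_(other.kind_) {
        if (kind_ == Kind::Text) {
            new (&text_) std::string(other.text_);  // placement new activates text_
        } else {
            number_ = other.number_;
        }
    }

    // Assignment is left out to keep the sketch short.
    Token& operator=(const Token&) = delete;

    ~Token() {
        // Only the string member has a non-trivial destructor.
        if (kind_ == Kind::Text) text_.~basic_string();
    }

    Kind kind() const { return kind_; }
    double number() const { return number_; }          // valid only for Kind::Number
    const std::string& text() const { return text_; }  // valid only for Kind::Text

private:
    Kind kind_;
    union {
        double number_;
        std::string text_;
    };
};
```

The tag makes the active member explicit, so the destructor knows what to clean up without guessing the type behind a pointer. If Boost is acceptable, boost::variant covers the same ground without the hand-written lifetime management.
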

Code Snippets

string ::= '"' {(any_character | string_escape)} '"';

number ::= [('+' | '-')] {digit} ('.' [digit] {digit});
                         ^^^A^^^  ^^^^^^^B^^^^^^^^^^^^

number ::= ['+' | '-'] {digit} ['.' digit {digit}];

number        ::= ['+' | '-'] NumberPart

NumberPart    ::= NumberInteger | NumberFloat
NumberInteger ::= digit {digit}
NumberFloat   ::= {digit} '.' digit {digit};

%x BLOCKCOMMENT
%x LINECOMMENT

/* You probably meant any character except " */
AnyStringCharacter          [^"]
Digit                       [0-9]
CEscape                     \\.

StringEscape                {CEscape}|\\{Digit}{Digit}{Digit}
Character                   {AnyStringCharacter}|{StringEscape}
LiteralString               \"{Character}*\"

Sign                        [+-]
NumberInteger               {Digit}+
NumberFloat                 {Digit}*\.{Digit}+
NumberPart                  {NumberInteger}|{NumberFloat}
LiteralNumber               {Sign}?{NumberPart}

IdentifierChar_First        [^]{}().,;:$?!#_\\[0123456789~"]
IdentifierChar              {IdentifierChar_First}|{Digit}
Identifier                  {IdentifierChar_First}{IdentifierChar}*

LineComment                 [^\n]*
BlockComment                [^~\n]*
EndOfLine                   \n

%%

<INITIAL>\~                     {BEGIN(LINECOMMENT);}
<INITIAL>\~\{                   {BEGIN(BLOCKCOMMENT);}

<BLOCKCOMMENT>\~\}              {BEGIN(INITIAL);}
<BLOCKCOMMENT>{EndOfLine}       {/*++line;*/}
<LINECOMMENT>{EndOfLine}        {BEGIN(INITIAL);/*++line;*/}

<BLOCKCOMMENT>{BlockComment}    {/* Ignore Comment */}
<BLOCKCOMMENT>\~                {/* Ignore ~ not followed by { */}
<LINECOMMENT>{LineComment}      {/* Ignore Comment */}

\{                          {return '{';}
\}                          {return '}';}
\[                          {return '[';}
\]                          {return ']';}
\(                          {return '(';}
\)                          {return ')';}
\.                          {return '.';}
\,                          {return ',';}
\;                          {return ';';}
\:                          {return ':';}
\$                          {return '$';}
\?                          {return '?';}
\!                          {return '!';}
\#                          {return '#';}
\_                          {return '_';}
\\                          {return '\\';}

{LiteralString}             {return yy::lex::literal_string;}
{LiteralNumber}             {return yy::lex::literal_number;}
{Identifier}                {return yy::lex::identifier;}

.                           {/* ERROR */}

Context

StackExchange Code Review Q#141710, answer score: 5
