patternjavaModerate

Yet another lexical analyser

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

yetlexicalanalyseranother

Problem

The following program is a lexical analyser for a simple and small grammar.

Please read my code and answer me these questions:

Is my code easy to understand?

Is my code well organized?

Are there bugs in my code?

Is the lack of comments a problem?

Can you make my code more performatic without modifying all the structure?

A general criticism is also welcome.

A note about performance: The program is really bad in performance. I already coded lexical analysers much better in this point, but this one is easier to extend and understand, so I do not want to lose the current structure.

Token.java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {
    LET_KEYWORD ("let"),
    IF_KEYWORD ("if"),
    WHILE_KEYWORD ("while"),
    ELSE_KEYWORD ("else"),

    SEMICOLON (";"),
    COMMA (","),

    OPEN_BRACKET ("\\{"),
    CLOSE_BRACKET ("\\}"),
    OPEN_PARENTHESIS ("\\("),
    CLOSE_PARENTHESIS ("\\)"),

    EQUAL ("=="),
    DIFFERENT ("<>"),
    GREATER_EQUAL (">="),
    LESSER_EQUAL (""),
    LESSER ("<"),

    ADDITION ("\\+"),
    SUBTRACTION ("-"),
    MULTIPLICATION ("\\*"),
    DIVISION ("/"),
    MODULUS ("%"),

    STRING ("\"[^\"]+\""),
    NUMBER ("\\d+(\\.\\d+)?"),
    IDENTIFIER ("\\w+");    

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);

        if (m.find()) {
            return m.end();
        }

        return -1;
    }
}

Lexer.java

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {
private StringBuilder input = new StringBuilder();
private Token token;
private String lexema;
private boolean exausthed = false;
private String errorMessage = "";
private Set blankChars = new HashSet();

public L

Solution

Nitpick

private boolean exausthed = false;

Pedantic, but this should be spelled exhausted.

Class vs. Object

private Set blankChars = new HashSet();

And then later in the constructor

blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);

This creates a separate one of these for each instance of the class, but all of them have the same values. Instead

private static Set blankChars = new HashSet();

    static {
        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);
    }

Now we just have one copy for all instances of the class. The static block handles initialization.

Iterator

while (!lexer.isExausthed()) {
            System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
            lexer.moveAhead();
        }

This pattern almost matches an iterator, albeit with different names. Consider

while (lexer.hasNext()) {
            System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
            lexer.next();
        }

Now it's the same names but the behavior is slightly off.

while (lexer.hasNext()) {
            System.out.printf("%-18s %s\n", lexer.next(), lexer.currentLexema());
        }

And you'd drop the moveAhead from the constructor, as it is no longer necessary to prime the pump that way. Actually implementing Iterable would allow you to say

for (Token token : lexer) {
            System.out.printf("%-18s %s\n", token, lexer.currentLexema());
        }

Build for unit testing

private boolean findNextToken() {

This would be difficult to unit test.

static boolean findNextToken(StringBuilder input) {

This would be easier. Its visibility is less restricted and it is possible to call it as

findNextToken(new StringBuilder("foo"));

It doesn't rely on object state.

Performance

The thing that looks non-performant to me is

for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());

            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }

This seems inefficient. Rather than trying each token, consider building a data structure that goes the other way.

for (Token t : possibleTokens.get(input.charAt(0))) {

So for a given character, what tokens could possibly match? For example, if the first character is a "w", then the token might be while or it might be an identifier. It's not going to be a comma or relational operator. So that drops twenty-five comparisons down to just two. And in many cases, there would be only one.

Bug?

I didn't try it, but I think that you might mishandle an identifier like

let letter = "a";

I think that this would get tokenized as

let

let

ter

=

"a"

;

Presumably you'd prefer that the second lexema be letter.

You may want to make "let" be "let\\b" instead. That should force it not to match if the word continues. You should also do this for other keywords, although this would be a less common problem for while or else.

Are numbers valid identifiers?

IDENTIFIER ("\\w+");

This would match, e.g. 9f. But you'll never get a chance to see, as the 9 will get parsed as a number.

If that's an intentional behavior, you should comment it. Even better, unit test it. Then you are protected from a later edit putting IDENTIFIER before NUMBER.

Personally, I'd prefer that the regular expression only accept what it is supposed to accept. So if an identifier must start with a letter or underscore, the regular expression should capture that.

Code Snippets

private boolean exausthed = false;

private Set<Character> blankChars = new HashSet<Character>();

blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);

private static Set<Character> blankChars = new HashSet<Character>();

    static {
        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);
    }

while (!lexer.isExausthed()) {
            System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
            lexer.moveAhead();
        }

Context

StackExchange Code Review Q#138725, answer score: 11

Revisions (0)

No revisions yet.