patternjavaModerate
Yet another lexical analyser
Viewed 0 times
yetlexicalanalyseranother
Problem
The following program is a lexical analyser for a simple and small grammar.
Please read my code and answer me these questions:
A general criticism is also welcome.
A note about performance: The program is really bad in performance. I already coded lexical analysers much better in this point, but this one is easier to extend and understand, so I do not want to lose the current structure.
Token.java
Lexer.java
```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;
public class Lexer {
private StringBuilder input = new StringBuilder();
private Token token;
private String lexema;
private boolean exausthed = false;
private String errorMessage = "";
private Set blankChars = new HashSet();
public L
Please read my code and answer me these questions:
- Is my code easy to understand?
- Is my code well organized?
- Are there bugs in my code?
- Is the lack of comments a problem?
- Can you make my code more performatic without modifying all the structure?
A general criticism is also welcome.
A note about performance: The program is really bad in performance. I already coded lexical analysers much better in this point, but this one is easier to extend and understand, so I do not want to lose the current structure.
Token.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public enum Token {
LET_KEYWORD ("let"),
IF_KEYWORD ("if"),
WHILE_KEYWORD ("while"),
ELSE_KEYWORD ("else"),
SEMICOLON (";"),
COMMA (","),
OPEN_BRACKET ("\\{"),
CLOSE_BRACKET ("\\}"),
OPEN_PARENTHESIS ("\\("),
CLOSE_PARENTHESIS ("\\)"),
EQUAL ("=="),
DIFFERENT ("<>"),
GREATER_EQUAL (">="),
LESSER_EQUAL (""),
LESSER ("<"),
ADDITION ("\\+"),
SUBTRACTION ("-"),
MULTIPLICATION ("\\*"),
DIVISION ("/"),
MODULUS ("%"),
STRING ("\"[^\"]+\""),
NUMBER ("\\d+(\\.\\d+)?"),
IDENTIFIER ("\\w+");
private final Pattern pattern;
Token(String regex) {
pattern = Pattern.compile("^" + regex);
}
int endOfMatch(String s) {
Matcher m = pattern.matcher(s);
if (m.find()) {
return m.end();
}
return -1;
}
}Lexer.java
```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;
public class Lexer {
private StringBuilder input = new StringBuilder();
private Token token;
private String lexema;
private boolean exausthed = false;
private String errorMessage = "";
private Set blankChars = new HashSet();
public L
Solution
Nitpick
Pedantic, but this should be spelled
Class vs. Object
And then later in the constructor
This creates a separate one of these for each instance of the class, but all of them have the same values. Instead
Now we just have one copy for all instances of the class. The static block handles initialization.
Iterator
This pattern almost matches an iterator, albeit with different names. Consider
Now it's the same names but the behavior is slightly off.
And you'd drop the
Build for unit testing
This would be difficult to unit test.
This would be easier. Its visibility is less restricted and it is possible to call it as
It doesn't rely on object state.
Performance
The thing that looks non-performant to me is
This seems inefficient. Rather than trying each token, consider building a data structure that goes the other way.
So for a given character, what tokens could possibly match? For example, if the first character is a "w", then the token might be
Bug?
I didn't try it, but I think that you might mishandle an identifier like
I think that this would get tokenized as
let
let
ter
=
"a"
;
Presumably you'd prefer that the second lexema be
You may want to make
Are numbers valid identifiers?
This would match, e.g.
If that's an intentional behavior, you should comment it. Even better, unit test it. Then you are protected from a later edit putting
Personally, I'd prefer that the regular expression only accept what it is supposed to accept. So if an identifier must start with a letter or underscore, the regular expression should capture that.
private boolean exausthed = false;Pedantic, but this should be spelled
exhausted. Class vs. Object
private Set blankChars = new HashSet();And then later in the constructor
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);This creates a separate one of these for each instance of the class, but all of them have the same values. Instead
private static Set blankChars = new HashSet();
static {
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);
}Now we just have one copy for all instances of the class. The static block handles initialization.
Iterator
while (!lexer.isExausthed()) {
System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
lexer.moveAhead();
}This pattern almost matches an iterator, albeit with different names. Consider
while (lexer.hasNext()) {
System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
lexer.next();
}Now it's the same names but the behavior is slightly off.
while (lexer.hasNext()) {
System.out.printf("%-18s %s\n", lexer.next(), lexer.currentLexema());
}And you'd drop the
moveAhead from the constructor, as it is no longer necessary to prime the pump that way. Actually implementing Iterable would allow you to say for (Token token : lexer) {
System.out.printf("%-18s %s\n", token, lexer.currentLexema());
}Build for unit testing
private boolean findNextToken() {This would be difficult to unit test.
static boolean findNextToken(StringBuilder input) {This would be easier. Its visibility is less restricted and it is possible to call it as
findNextToken(new StringBuilder("foo"));It doesn't rely on object state.
Performance
The thing that looks non-performant to me is
for (Token t : Token.values()) {
int end = t.endOfMatch(input.toString());
if (end != -1) {
token = t;
lexema = input.substring(0, end);
input.delete(0, end);
return true;
}
}This seems inefficient. Rather than trying each token, consider building a data structure that goes the other way.
for (Token t : possibleTokens.get(input.charAt(0))) {So for a given character, what tokens could possibly match? For example, if the first character is a "w", then the token might be
while or it might be an identifier. It's not going to be a comma or relational operator. So that drops twenty-five comparisons down to just two. And in many cases, there would be only one. Bug?
I didn't try it, but I think that you might mishandle an identifier like
let letter = "a";I think that this would get tokenized as
let
let
ter
=
"a"
;
Presumably you'd prefer that the second lexema be
letter. You may want to make
"let" be "let\\b" instead. That should force it not to match if the word continues. You should also do this for other keywords, although this would be a less common problem for while or else. Are numbers valid identifiers?
IDENTIFIER ("\\w+");This would match, e.g.
9f. But you'll never get a chance to see, as the 9 will get parsed as a number. If that's an intentional behavior, you should comment it. Even better, unit test it. Then you are protected from a later edit putting
IDENTIFIER before NUMBER. Personally, I'd prefer that the regular expression only accept what it is supposed to accept. So if an identifier must start with a letter or underscore, the regular expression should capture that.
Code Snippets
private boolean exausthed = false;private Set<Character> blankChars = new HashSet<Character>();blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);private static Set<Character> blankChars = new HashSet<Character>();
static {
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);
}while (!lexer.isExausthed()) {
System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
lexer.moveAhead();
}Context
StackExchange Code Review Q#138725, answer score: 11
Revisions (0)
No revisions yet.