Simple flex-based lexer
Problem
I am trying to learn flex and have created this simple program. The rule for comments works correctly for single-line comments such as:
// this is a comment
and this:
/* this is also a comment */
My code:
ID [A-Z][A-Za-z0-9]
KEYWORD if|else|then|for|fi|loop|pool|proc|func
OPERATOR "+"|"-"|"/"|"*"|"&"|"%"
PUNCTUATION ":"|","
%%
{ID} printf("An Id found:%s\n",yytext);
{KEYWORD} printf("An Keyword found:%s\n",yytext);
{OPERATOR} printf("An Operator found:%s\n",yytext);
{PUNCTUATION} printf("An Punctuation found:%s\n",yytext);
[/][/].*
"/*".*"*/"
%%
int yywrap(){
return 1;
}
main(){
yylex();
}
I'd be interested in comments on this, particularly ways to improve the code, such as being able to detect multi-line comments.
Solution
The code looks OK for what it does so far, but there are some things you might want to do to improve it:
Always use {} for production rules
It's not technically wrong to simply have printf(...) to the right of a rule, but when your lexer gets more complex (and when you start also using a parser) you may find it easier to troubleshoot if you always use {} to enclose production rules, even empty ones.
Think about explicitly handling whitespace
It's very common for a parser to need to ignore whitespace. If that's the case, it's usually good to do so explicitly with a rule just above the error-handling rule(s) I mention below.
[ \t\n]+ { /* ignore whitespace */ }
Consider a "catch-all" rule for illegal tokens
Right now, any character that no rule matches falls through to flex's default rule, which simply echoes it to the output. This might be fine, but especially while you're learning, you may find it useful to put a catch-all rule at the bottom of your list of rules:
. { printf("Bad character: %s\n", yytext); }
Consider adding support for multiline comments
As your original (pre-edit) code had it, handling multiline comments is different but not too difficult. You can add this to your definitions (the first part of a flex file):
%x c_comment
Then add these rules to the rules section (second part of a flex file):
"/*" { BEGIN(c_comment); }
<c_comment>[^*]* { }
<c_comment>"*"+[^*/]* { }
<c_comment>"*"+"/" { printf("Ignored a multiline comment\n"); BEGIN(INITIAL); }
This defines an exclusive start condition called c_comment and switches into that condition when it finds the opening pair of characters for a comment. The next rule ignores everything that is not a * character. The rule after that ignores runs of * characters that are not followed by a /, along with the text after them. The point of these two rules is to match as many characters as possible. For performance reasons, you generally want each rule to match strings that are as long as possible, because the lexer then performs fewer matches and runs fewer actions. Finally, the last rule finds the closing sequence (one or more * followed by /, so that a comment ending in **/ is closed as well) and switches back into the initial context. You will also often see BEGIN(0) for that. The statements are identical in function, but I prefer the more verbose BEGIN(INITIAL) form because I think it's easier to understand.
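If it helps to picture what the start condition is doing, the rules above behave like a small two-state machine. The following is only a hand-written C sketch of that idea, not the code flex generates. The names strip_c_comments, scan_state, STATE_INITIAL and STATE_C_COMMENT are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical states mirroring flex's INITIAL and the c_comment
 * start condition. Nothing here is produced by flex itself. */
enum scan_state { STATE_INITIAL, STATE_C_COMMENT };

/* Copies src into dst while dropping C-style comments.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns the number of comments that were properly closed. */
int strip_c_comments(const char *src, char *dst)
{
    enum scan_state state = STATE_INITIAL;
    int closed = 0;
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);

    while (src[i] != '\0') {
        if (state == STATE_INITIAL) {
            if (strncmp(src + i, "/*", 2) == 0) {
                state = STATE_C_COMMENT;  /* plays the role of BEGIN(c_comment) */
                i += 2;
            } else {
                dst[out++] = src[i++];    /* ordinary input is kept */
            }
        } else {
            if (strncmp(src + i, "*/", 2) == 0) {
                state = STATE_INITIAL;    /* plays the role of BEGIN(INITIAL) */
                closed++;
                i += 2;
            } else {
                i++;                      /* comment text is swallowed */
            }
        }
    }
    dst[out] = '\0';
    return closed;
}
```

Calling strip_c_comments("a /* x */ b", buf) with a large enough buf leaves "a  b" in buf and returns 1. Unlike the flex rules, this sketch looks at one character at a time, which is exactly the per-character work the long-match rules are meant to avoid.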
Context
StackExchange Code Review Q#73842, answer score: 7