Simple flex-based lexer

Submitted by: @import:stackexchange-codereview

Problem

I am trying to learn flex and have created this simple program. The rule for comments works correctly for single line comments such as:

// this is a comment


and this:

/* this is also a comment */


My code:

ID [A-Z][A-Za-z0-9]
KEYWORD if|else|then|for|fi|loop|pool|proc|func
OPERATOR "+"|"-"|"/"|"*"|"&"|"%"
PUNCTUATION ":"|","
%%
{ID} printf("An Id found:%s\n",yytext);
{KEYWORD} printf("An Keyword found:%s\n",yytext);
{OPERATOR} printf("An Operator found:%s\n",yytext);
{PUNCTUATION} printf("An Punctuation found:%s\n",yytext);
[/][/].* 

"/*".*"*/"

%%
int yywrap(){
  return 1;
}
main(){
  yylex();
}


I'd be interested in comments on this, particularly ways to improve the code, such as detecting multi-line comments.

Solution

The code looks OK for what it does so far, but there are some things you might want to do to improve it:

Always use {} for rule actions

It's not technically wrong to simply have printf(...) to the right of a pattern, but when your lexer gets more complex (and when you start also using a parser) you may find it easier to troubleshoot if you always use {} to enclose the action for each rule -- even empty ones.

Think about explicitly handling whitespace

It's very common for a parser to need to ignore whitespace. If that's the case, it's usually good to do so explicitly with a rule just above the error-handling rule(s) I mention below.

[ \t\n]+   { /* ignore whitespace */ }


Consider a "catch-all" rule for illegal tokens

Right now, pretty much any random character will be accepted: flex's default rule simply echoes unmatched input to the output. This might be fine, but especially while you're learning, you may find it useful to put a catch-all rule at the bottom of your list of rules:

.   { printf("Bad character: %s\n", yytext); }


Consider adding support for multiline comments

As your original (pre-edit) code had it, handling multiline comments is different but not too difficult. You can add this to your definitions (the first part of a flex file):

%x c_comment


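Then add these rules to the rules section (second part of a flex file):

"/*"                    { BEGIN(c_comment); }
<c_comment>[^*]*        { /* skip everything up to the next '*' */ }
<c_comment>"*"+[^*/]*   { /* skip '*'s not followed by '/' */ }
<c_comment>"*"+"/"      { printf("Ignored a multiline comment\n"); BEGIN(INITIAL); }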


The %x line declares an exclusive start condition called c_comment, and the first rule switches into that condition when it finds the opening pair of characters for a comment. The next rule ignores everything that is not a * character. The rule after that ignores any run of * characters that is not followed by a /, together with the non-* text after it. The point of these two rules is to match as many characters as possible. For performance reasons, you would generally want to write your lexer so that it matches strings that are as long as possible for each rule: fewer, longer matches mean less per-match overhead, which helps the lexer go faster.

Finally, the last rule finds the closing pair of characters and switches back into the initial context. You will also often see BEGIN(0) for that -- the statements are identical in function, but I prefer the more verbose BEGIN(INITIAL) form because I think it's easier to understand.
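
To make the start-condition idea more concrete, here is a hand-written C sketch of roughly the same two-state logic. It is not the code flex generates (flex emits a table-driven scanner), and the names strip_block_comments and lexer_state are made up for this illustration. It assumes the whole input is already in memory as a NUL-terminated string, and it treats any */ as the end of a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Two states, mirroring flex's INITIAL and the exclusive c_comment condition. */
enum lexer_state { STATE_INITIAL, STATE_C_COMMENT };

/*
 * Copy src to dst with every block comment removed.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns the number of complete comments that were skipped;
 * the text of an unterminated comment is dropped but not counted.
 */
size_t strip_block_comments(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);

    enum lexer_state state = STATE_INITIAL;
    size_t comments = 0;
    size_t i = 0;

    while (src[i] != '\0') {
        if (state == STATE_INITIAL) {
            if (src[i] == '/' && src[i + 1] == '*') {
                state = STATE_C_COMMENT;   /* like BEGIN(c_comment) */
                i += 2;
            } else {
                *dst++ = src[i++];         /* ordinary text is kept */
            }
        } else {
            if (src[i] == '*' && src[i + 1] == '/') {
                state = STATE_INITIAL;     /* like BEGIN(INITIAL) */
                comments++;
                i += 2;
            } else {
                i++;                       /* comment body, newlines included */
            }
        }
    }
    *dst = '\0';
    return comments;
}
```

Calling strip_block_comments("a /* x\ny */ b", out) with a large enough out buffer leaves "a  b" in out and returns 1. Unlike the flex rules, this sketch advances one character at a time inside a comment; the flex version consumes long runs per match, which is the performance point made above.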

Code Snippets

[ \t\n]+   { /* ignore whitespace */ }
.   { printf("Bad character: %s\n", yytext); }
%x c_comment
"/*"   { BEGIN(c_comment); }
<c_comment>[^*]*        { }
<c_comment>"*"+[^*/]*   { }
<c_comment>"*"+"/"      { printf("Ignored a multiline comment\n"); BEGIN(INITIAL); }

Context

StackExchange Code Review Q#73842, answer score: 7
