Handwritten lexer
Problem
I'm in the process of designing my own programming language, and I have absolutely no experience in computer science other than the code I look at.
I created my first lexer, and I've improved it enough that it can actually lex its own source code.
What I'm asking here is: how can I improve this lexer?
GitHub
lex.h
```
#ifndef __LEX_H__
#define __LEX_H__
#define is_alphabetic(c) (\
	((c) >= 'a' && (c) <= 'z') || ((c) >= 'A' && (c) <= 'Z'))
#define is_alphanumeric(c) (\
	((c) >= 'a' && (c) <= 'z') || ((c) >= 'A' && (c) <= 'Z') || ((c) >= '0' && (c) <= '9'))
#define is_numeric(c) ((c) >= '0' && (c) <= '9')
#define is_hexadecimal(c) (\
	is_numeric(c) || ((c) >= 'a' && (c) <= 'f') || ((c) >= 'A' && (c) <= 'F'))
enum n_token {
RightBitShift, // >>
RightBitShiftEqual, // >>=
GreaterEqual, // >=
NumConstantReal, // 453.54354
QuestionMark, // ?
HashSym, // #
Ampersand, // &
AndEqual, // &=
BoolAnd, // &&
Carot, // ^
XorEqual, // ^=
Percent, // %
ModuloEqual, // %=
ExclamationMark, // !
NotEqual, // !=
VerticalBar, // |
OrEqual, // |=
BoolOr, // ||
Tilde, // ~
StringConstant, // "352dfsgnj34"
CharConstant, // 's'
LeftSlash, // '\\'
Keyword, // struct
AtSign, // @
Ellipses, // ...
Arrow, // ->
NumConstantHexFloat, // 0x0.3p10
DollarSign, // $
};
struct s_token {
char word[512];
enum n_token toktype;
};
struct lexer {
unsigned int count, size;
struct s_token *array;
};
void tokenize_string(char*, struct lexer*);
void print_tokens_colored(struct lexer*);
#endif
```
lex.c
```
#include "lex.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static int is_array_full(struct lexer *vec)
{
return (vec->count >= vec->size);
}
static void resize_array(struct lexer *vec)
{
	vec->size *= 2;
	vec->array = realloc(vec->array, sizeof(struct s_token) * vec->size);
}
static void add_to_array(struct lexer *vec, enum n_token tok, char *codestring, unsigned int size)
{
	if (is_array_full(vec))
		resize_array(vec);
	vec->array[vec->count].toktype = tok;
	strncpy(vec->array[vec->count].word, codestring, size);
	vec->array[vec->count].word[size] = '\0';
	vec->count++;
}
```
Solution
This is a fairly straightforward way to write a lexer, and as such is really easy to read. Nice work! Here are my thoughts on how it could be improved:
Don't Reinvent the Wheel
In general, it's best not to write your own lexer unless you have a very simple domain-specific language. It's a good exercise, but in the real world, you should use a tool like lex. The input files are significantly smaller and easier to understand, and less likely to contain bugs. But I assume you're doing it to teach yourself about the process, which is a great idea!
Avoid macros
You use a lot of macros. There are 2 problems with this:
- Macros are dangerous - If someone calls one of these macros thinking it's a function, and they do a pre or post increment, for example, very weird things will happen and it will be very difficult to debug.
- Many of your macros mimic standard library functions, which is confusing. Just use the standard functions, and if you need additional functionality, write an inline or static wrapper function.
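The hazard in the first bullet can be made concrete. Below is a sketch: the macro mirrors the character-class macros in lex.h, and is_numeric_fn is a hypothetical inline replacement. Because && short-circuits, a macro argument with side effects may be evaluated once or twice depending on the character it sees.

```c
#include <assert.h>

/* A character-class macro in the style of the ones in lex.h. */
#define IS_NUMERIC(c) ((c) >= '0' && (c) <= '9')

/* A static inline replacement evaluates its argument exactly once,
   so side effects in the argument behave predictably. */
static inline int is_numeric_fn(int c)
{
	return c >= '0' && c <= '9';
}

/* Report how far a cursor advances when the macro is handed *p++.
   For a digit, the first comparison succeeds, so *p++ runs twice. */
static int macro_advance(const char *s)
{
	const char *p = s;
	(void)IS_NUMERIC(*p++);
	return (int)(p - s);
}

static int function_advance(const char *s)
{
	const char *p = s;
	(void)is_numeric_fn(*p++);
	return (int)(p - s);
}
```

A caller who believes IS_NUMERIC is a function would never expect the cursor to move by two.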
Long Functions Are Hard To Understand
In general, really long functions, and in particular really long loops, can be hard to understand. It takes significantly more effort to keep all the moving pieces in one's mind while reading the code. I recommend breaking up tokenize_string() into dozens of smaller functions. For example, I might change the while loop to look something like this:
```
while ( *iter != '\0' ) {
	trim_leading_whitespace(&iter);
	remove_C_comments(&iter);
	remove_Cpp_comments(&iter);
	handle_newline(&iter, vec);
	//... etc.
}
```
This makes the intent of the code clearer and easier to follow. Chances are, those short functions will get inlined automatically and it will be just as fast as it is now.
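One of those helpers could look like this. A sketch only: the answer just names the functions, so the pointer-to-pointer signature is an assumption about how the iterator would be shared with the main loop.

```c
#include <ctype.h>

/* Advance the caller's cursor past any leading whitespace.
   Takes a pointer-to-pointer so the iterator in the main loop moves too. */
static void trim_leading_whitespace(const char **iter)
{
	while (**iter != '\0' && isspace((unsigned char)**iter))
		(*iter)++;
}
```

Each helper stays small enough to verify at a glance, which is the point of the refactoring.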
Handle Issues When They Occur
In add_to_array() you first check to see if the array is full and if so resize it. Then you do some things which may fill up the array. This seems backwards to me. I'd put the check and resize immediately after incrementing vec->count. It's true that you may end up allocating extra memory you don't need on very rare occasions where you have allocated exactly as much memory as you need up-front, but this situation is rare and unlikely to be problematic.
Handle Errors
What is problematic is that you don't handle the potential error from realloc(). If it can't allocate more memory, it will return NULL and vec->array will be invalid. (Some sources say it will remain pointing to the old location - that's also a problem, as you didn't allocate any more memory but vec->size says it's larger. Either way, it's a problem.) In addition, there isn't a single call to free() anywhere in your code. You're going to be leaking a lot of memory!
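A sketch of how both points could be addressed, using stand-in definitions for the structs from lex.h. grow() stores realloc's result in a temporary so vec->array is never clobbered by NULL, and free_lexer() is a hypothetical cleanup routine the posted code currently lacks.

```c
#include <stdlib.h>

/* Stand-ins for the definitions in lex.h. */
struct s_token { char word[512]; int toktype; };
struct lexer { unsigned int count, size; struct s_token *array; };

/* Double the capacity, but keep the old block on failure: realloc's
   result goes into a temporary before vec->array is overwritten. */
static int grow(struct lexer *vec)
{
	unsigned int new_size = vec->size * 2;
	struct s_token *tmp = realloc(vec->array, sizeof *tmp * new_size);
	if (tmp == NULL)
		return -1;	/* vec->array and vec->size are unchanged */
	vec->array = tmp;
	vec->size = new_size;
	return 0;
}

/* Pair every allocation with a release, called once the tokens are done. */
static void free_lexer(struct lexer *vec)
{
	free(vec->array);
	vec->array = NULL;
	vec->count = vec->size = 0;
}
```

On allocation failure the caller still holds a valid, fully populated array and can report the error or flush what it has, instead of crashing on the next access.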
Why Have Arbitrary Limits?
I notice that the s_token.word member is limited to 512 bytes. Why limit it? And if you are going to limit it, why is the limit 512? If there's some significance to 512, give it a named constant, as you've done with KEYWORDS, for example.
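For instance, a sketch with a hypothetical name for the constant (the real header just hard-codes 512):

```c
/* Name the magic number so its purpose, and any future change,
   live in exactly one place. */
#define MAX_WORD_LEN 512

struct s_token {
	char word[MAX_WORD_LEN];
	int toktype;	/* enum n_token in the real header */
};
```

Every bounds check and copy can then refer to MAX_WORD_LEN instead of repeating 512.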
Use const Where Appropriate
Speaking of constants, you should mark function arguments as const if they aren't going to be changed by the function. For example, in add_to_array(), the token is copied and the string is also copied. That means that the originals are not changed, so you should mark tok, codestring and size as const. It tells both the compiler and the reader that they don't need to worry about those values changing in that function.
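Applied to the function under review, the signature might read as follows. This is a sketch with stand-in struct definitions; note the string parameter also gains a const on the pointed-to characters, which is the more valuable promise.

```c
#include <string.h>

/* Stand-ins for the definitions in lex.h. */
struct s_token { char word[512]; int toktype; };
struct lexer { unsigned int count, size; struct s_token *array; };

/* const on tok, codestring and size tells both the compiler and the
   reader that this function only reads them. */
static void add_to_array(struct lexer *vec, const int tok,
                         const char *codestring, const unsigned int size)
{
	vec->array[vec->count].toktype = tok;
	memcpy(vec->array[vec->count].word, codestring, size);
	vec->array[vec->count].word[size] = '\0';
	vec->count++;
}
```

Any accidental write to codestring or size inside the body now becomes a compile-time error.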
Use The Appropriate #defines
I notice that you're defining KNRM, KRED, etc. Shouldn't those be defined by some standard header? If not, the names could be a little clearer, or there could at least be a comment explaining what they are.
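KNRM, KRED and friends look like the widely circulated ANSI terminal escape snippet rather than anything from a standard header, so a comment is the right fix. The values below are assumed to match that common snippet, not taken from the post's code:

```c
/* ANSI terminal colour escape sequences, used by print_tokens_colored().
   These are not provided by any standard header, hence defined here. */
#define KNRM "\x1B[0m"   /* reset attributes to normal */
#define KRED "\x1B[31m"  /* red foreground */
#define KGRN "\x1B[32m"  /* green foreground */
#define KYEL "\x1B[33m"  /* yellow foreground */
```

With the comment in place, a reader no longer has to guess whether these come from some platform header.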
Context
StackExchange Code Review Q#151427, answer score: 4