
Handwritten lexer

Submitted by: @import:stackexchange-codereview

Problem

I'm in the process of designing my own programming language and I have absolutely no experience in computer science other than the code I look at.

I created my first lexer and have improved it enough that it can actually lex its own source code.

What I'm asking here is how can I improve this lexer?

GitHub

lex.h

#ifndef __LEX_H__
    #define __LEX_H__

/* character-class macros (reconstructed; the extraction ate the '<=' comparisons) */
#define is_alphabetic(c) (\
        ((c) >= 'a' && (c) <= 'z') || ((c) >= 'A' && (c) <= 'Z'))
#define is_numeric(c) (\
        (c) >= '0' && (c) <= '9')
#define is_alphanumeric(c) (\
        is_alphabetic(c) || is_numeric(c))

enum n_token {
    /* ... earlier token kinds lost in extraction ... */
    RightBitShift,      // >>
    RightBitShiftEqual, // >>=
    GreaterEqual,       // >=
    NumConstantReal,    // 453.54354
    QuestionMark,       // ?
    HashSym,        // #
    Ampersand,      // &
    AndEqual,       // &=
    BoolAnd,        // &&
    Carot,          // ^
    XorEqual,       // ^=
    Percent,        // %
    ModuloEqual,        // %=
    ExclamationMark,    // !
    NotEqual,       // !=
    VerticalBar,        // |
    OrEqual,        // |=
    BoolOr,         // ||
    Tilde,          // ~
    StringConstant,     // "352dfsgnj34"
    CharConstant,       // 's'
    LeftSlash,      // '\\'
    Keyword,        // struct
    AtSign,         // @
    Ellipses,       // ...
    Arrow,          // ->
    NumConstantHexFloat,    // 0x0.3p10
    DollarSign,     // $
};

struct s_token {
    char        word[512];
    enum n_token    toktype;
};

struct lexer {
    unsigned int count, size;
    struct s_token *array;
};

void tokenize_string(char*, struct lexer*);
void print_tokens_colored(struct lexer*);

#endif


lex.c

```
#include "lex.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int is_array_full(struct lexer *vec)
{
    return (vec->count >= vec->size);
}

static void resize_array(struct lexer *vec)
{
    vec->size *= 2;
    vec->array = realloc(vec->array, sizeof(struct s_token) * vec->size);
}

static void add_to_array(struct lexer *vec, enum n_token tok, char *codestring, unsigned int size)
{
    if (is_array_full(vec))
        resize_array(vec);

    /* (body reconstructed from the reviewer's description; the original
       was truncated in extraction) */
    memcpy(vec->array[vec->count].word, codestring, size);
    vec->array[vec->count].word[size] = '\0';
    vec->array[vec->count].toktype = tok;
    vec->count++;
}

/* ... tokenize_string() and print_tokens_colored() truncated in extraction ... */
```

Solution

This is a fairly straightforward way to write a lexer, and as such is really easy to read. Nice work! Here are my thoughts on how it could be improved:
Don't Reinvent the Wheel

In general, it's best not to write your own lexer unless you have a very simple domain-specific language. It's a good exercise, but in the real world, you should use a tool like lex. The input files are significantly smaller and easier to understand, and less likely to contain bugs. But I assume you're doing it to teach yourself about the process, which is a great idea!
Avoid macros

You use a lot of macros. There are two problems with this:

  • Macros are dangerous - if someone calls one of these macros thinking it's a function and passes an argument with a side effect (a pre- or post-increment, for example), the argument is evaluated more than once, very weird things will happen, and it will be very difficult to debug.

  • Many of your macros mimic standard library functions, which is confusing. Just use the standard functions, and if you need additional functionality, write an inline or static wrapper function.

Long Functions Are Hard To Understand

In general, really long functions, and in particular really long loops, can be hard to understand. It takes significantly more effort to keep all the moving pieces in one's mind while reading the code. I recommend breaking up tokenize_string() into dozens of smaller functions. For example, I might change the while loop to look something like this:

while ( *iter != '\0' ) {
    trim_leading_whitespace(&iter);
    remove_C_comments(&iter);
    remove_Cpp_comments(&iter);
    handle_newline(&iter, vec);
    //... etc.
}

This makes the intent of the code clearer and easier to follow. Chances are, those short functions will get inlined automatically and the result will be just as fast as it is now.
Handle Issues When They Occur

In add_to_array() you first check to see if the array is full and if so resize it. Then you do some things which may fill up the array. This seems backwards to me. I'd put the check and resize immediately after incrementing vec->count. It's true that you may occasionally allocate extra memory you never use - when you happened to allocate exactly as much as you needed up front - but that situation is rare and unlikely to be problematic.
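A sketch of the suggested ordering, with the struct members simplified (`toktype` as a plain `int`) and error handling omitted; as long as the lexer starts with `size >= 1`, the invariant "count < size before each write" holds:

```c
#include <stdlib.h>
#include <string.h>

struct s_token { char word[512]; int toktype; };
struct lexer { unsigned int count, size; struct s_token *array; };

/* Write the new token first, bump count, and only then grow the
   array once it is actually full. */
static void add_to_array(struct lexer *vec, int tok, const char *codestring)
{
    struct s_token *slot = &vec->array[vec->count];
    strncpy(slot->word, codestring, sizeof slot->word - 1);
    slot->word[sizeof slot->word - 1] = '\0';
    slot->toktype = tok;

    vec->count++;
    if (vec->count >= vec->size) {   /* is_array_full() inlined */
        vec->size *= 2;
        /* realloc error handling omitted here for brevity */
        vec->array = realloc(vec->array, sizeof(struct s_token) * vec->size);
    }
}
```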
Handle Errors

What is problematic is that you don't handle the potential error from realloc(). If it can't allocate more memory, it will return NULL and vec->array will be invalid. (Some sources say the old pointer remains valid on failure - that's also a problem, because you didn't allocate any more memory but vec->size says it's larger. Either way, it's a problem.) In addition, there isn't a single call to free() anywhere in your code. You're going to be leaking a lot of memory!
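The standard fix is to assign realloc()'s result to a temporary, so the original pointer survives a failure, and to pair the allocation with a destructor. A sketch, with the struct members simplified:

```c
#include <stdlib.h>

struct s_token { char word[512]; int toktype; };
struct lexer { unsigned int count, size; struct s_token *array; };

/* On failure the old block is untouched and still owned by
   vec->array, so nothing is leaked and nothing is corrupted. */
static int resize_array(struct lexer *vec)
{
    unsigned int new_size = vec->size * 2;
    struct s_token *tmp = realloc(vec->array,
                                  sizeof(struct s_token) * new_size);
    if (tmp == NULL)
        return -1;          /* caller decides how to recover */
    vec->array = tmp;
    vec->size = new_size;
    return 0;
}

/* A matching destructor, so the memory is eventually freed. */
static void free_lexer(struct lexer *vec)
{
    free(vec->array);
    vec->array = NULL;
    vec->count = vec->size = 0;
}
```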
Why Have Arbitrary Limits?

I notice that the s_token.word member is limited to 512 bytes. Why limit it? And if you are going to limit it, why is the limit 512? If there's some significance to 512, give it a named constant, as you've done with KEYWORDS for example.
Use const Where Appropriate

Speaking of constants, you should mark function arguments as const if they aren't going to be changed by the function. For example, in add_to_array(), the token is copied and the string is also copied. That means that the originals are not changed, so you should mark tok, codestring and size as const. It tells both the compiler and the reader that they don't need to worry about those values changing in that function.
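The const-qualified signature might look like this (adapted from the original; `toktype` simplified to `int` to keep the sketch self-contained). `const char *codestring` promises the string is only read; `const` on the value parameters documents that they are never reassigned inside the function:

```c
#include <string.h>

struct s_token { char word[512]; int toktype; };
struct lexer { unsigned int count, size; struct s_token *array; };

static void add_to_array(struct lexer *vec, const int tok,
                         const char *codestring, const unsigned int size)
{
    /* The token and string are copied; the originals are untouched. */
    memcpy(vec->array[vec->count].word, codestring, size);
    vec->array[vec->count].word[size] = '\0';
    vec->array[vec->count].toktype = tok;
    vec->count++;
}
```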
Use The Appropriate #defines

I notice that you're defining KNRM, KRED, etc. Shouldn't those be defined by some standard header? If not, the names could be a little clearer, or there could at least be a comment explaining what they are.

Context

StackExchange Code Review Q#151427, answer score: 4
