Lexer for a language I'm working on

Submitted by: @import:stackexchange-codereview

Problem

I've recently started working on my own programming language and I just wrapped up its lexer. I'm too young to take any official training in C, compiler construction, or computer science, so I have mixed feelings about the quality of my code. It seems rather sluggish when printing, but I haven't actually measured how long it takes from start to finish. I'm using Visual Studio 2015 and, I believe, C11 (if I'm not, VS is stupid).

Here is an example of the grammar:

int (#nested comment # double nest?? # #

int meme) {
    double v_a-r = 10.2123 + 200 * 1.;
    #symbol testing; also, comment!!!
    {} [] () - + * / -= += *= /= ^ %
    return var;
}


I included the token header instead of the lexer header because the lexer_t type should be fairly obvious and I didn't want to scare anyone off with more code than there already is.

Token.h

#include "token.h"
#include "lexer.h"
#include <stdio.h>
#include <stdlib.h>

token_t* token_new(lexer_t* lexer, tk_type type) {
    token_t* token = malloc(sizeof(token_t));
    token->line = lexer->line;
    token->pos = lexer->pos;
    token->type = type;

    return token;
}

void token_print(token_t* token) {
    printf("\ntype: %i", token->line);
    printf("\tline: %i", token->line);
    printf("\tpos: %i", token->pos);

    if (token->type == _int)
        printf("\tint val: %i", token->num);
    else if (token->type == _dbl)
        printf("\tflt val: %d", token->flt);
    else
        printf("\tstr val: %s", token->str);
}

void token_free(token_t* token) {
    if (token->str != NULL)
        free(token->str);
    free(token->str);
}


Lexer.c:

```
#define _CRT_SECURE_NO_WARNINGS
#define DEBUG 1

#include "error.h"
#include "token.h"
#include
#include
#include
#include
#include

static const keyword_t keywords[] = {
// Primitive data types
{"int", _int},
{"double", _dbl},
{"enum", _enum},
{"void", _void},
{"char", _char},
{"string", _str},
{"bool", _bool},
{"const", _const},
{
```

Solution

It is very hard to review a lexer without a formal definition of the language (honestly, I have only a very vague understanding of how the nested comments are supposed to be structured). However, even without such a definition, certain things are surely bugs. For example, in

    case '%':
        token = token_new(lexer, _mod);
        token->str = "%";
        lexer_adv(lexer, 1);
        break;
    case '^':
        token = token_new(lexer, _mod);
        token->str = "^";
        lexer_adv(lexer, 1);
        break;


only the first case should create a _mod token. The

case '>':
        if (lexer_look(lexer, 1) == '<') {


also looks extremely suspicious.

In general, instead of a huge (and very error-prone) case statement, it is recommended to extend the keywords table with operators and punctuation (make sure that long operators come before short ones) and loop over it the same way you do for keywords.
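A minimal sketch of such a table-driven match might look like the following. The `op_type` enum, the `op_entry` struct and `match_operator` are hypothetical names made up for illustration, since the real `tk_type` and `keyword_t` definitions are not shown in the post. The `^` entry is named neutrally because the post does not say what that operator means.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; the original tk_type enum is not shown. */
typedef enum {
    OP_NONE, OP_MINUS_EQ, OP_PLUS_EQ, OP_MINUS, OP_PLUS, OP_MOD, OP_CARET
} op_type;

typedef struct {
    const char *text;
    op_type type;
} op_entry;

/* Longer operators come first so "-=" is not matched as "-" then "=". */
static const op_entry operators[] = {
    {"-=", OP_MINUS_EQ},
    {"+=", OP_PLUS_EQ},
    {"-",  OP_MINUS},
    {"+",  OP_PLUS},
    {"%",  OP_MOD},
    {"^",  OP_CARET},
};

/* Returns the operator type found at the start of src and stores its
 * length in *len, or returns OP_NONE (with *len == 0) if nothing matches. */
static op_type match_operator(const char *src, size_t *len) {
    for (size_t i = 0; i < sizeof operators / sizeof operators[0]; i++) {
        size_t n = strlen(operators[i].text);
        if (strncmp(src, operators[i].text, n) == 0) {
            *len = n;
            return operators[i].type;
        }
    }
    *len = 0;
    return OP_NONE;
}
```

With a table like this, adding an operator is a one-line change, and a copy-paste slip such as two entries sharing one token type is much easier to spot than in a long switch.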

I presume token.h is really token.c. An actual token.h with a token_t definition is missing.

I don't see how token_free is called, but I expect problems. It blindly attempts to free(token->str), even though some token strings were never allocated and instead point to string literals.

At the same time you may notice that a textual representation of a keyword, operator, or punctuation adds zero information to a token (it can be trivially recovered from the token type), and for them you can safely make token->str a null pointer.
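One way to make the ownership rule explicit is sketched below, assuming that `str` is either NULL or a heap copy owned by the token. The `owned_token` struct and both function names are hypothetical stand-ins, because the real `token_t` definition is missing from the post. Note that the `token_free` in the question also calls `free(token->str)` twice and never frees the token itself; the sketch frees each allocation exactly once.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for token_t, which the post does not show. */
typedef struct {
    int type;
    char *str; /* heap-allocated text, or NULL when the type says it all */
} owned_token;

/* Creates a token. Pass text == NULL for keywords, operators and
 * punctuation. Returns NULL if an allocation fails. */
static owned_token *owned_token_new(int type, const char *text) {
    owned_token *token = malloc(sizeof *token);
    if (token == NULL)
        return NULL;
    token->type = type;
    token->str = NULL;
    if (text != NULL) {
        size_t n = strlen(text) + 1;
        token->str = malloc(n);
        if (token->str == NULL) {
            free(token);
            return NULL;
        }
        memcpy(token->str, text, n); /* the token owns its own copy */
    }
    return token;
}

/* free(NULL) is a no-op, so str needs no separate check. */
static void owned_token_free(owned_token *token) {
    if (token == NULL)
        return;
    free(token->str);
    free(token);
}
```

Because string literals never end up in `str`, the free function has no way to be handed memory it did not allocate.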

You should get a warning for a non-void function returning without a value:

static char lexer_look(lexer_t* lexer, size_t ahead) {
    if (lexer->len < lexer->ptr + ahead) 
        return;
    return lexer->src[lexer->ptr + ahead];
}
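A possible fix, sketched under the assumption that callers can treat '\0' as an end-of-input marker, is to return that value explicitly. The `look_lexer` struct and `look_ahead` name are hypothetical; only the field names `src`, `len` and `ptr` come from the snippet. The sketch also rejects an index equal to `len`, which the original comparison lets through; whether the original relied on reading a terminator there is not clear from the post.

```c
#include <stddef.h>

/* Hypothetical minimal lexer state; field names follow the snippet. */
typedef struct {
    const char *src; /* source text */
    size_t len;      /* number of characters in src */
    size_t ptr;      /* index of the current character */
} look_lexer;

/* Returns the character `ahead` positions past the current one, or '\0'
 * when that position lies at or beyond the end of the input. The check is
 * written so that neither ptr + ahead nor len - ptr can wrap around. */
static char look_ahead(const look_lexer *lexer, size_t ahead) {
    if (lexer->ptr >= lexer->len || ahead >= lexer->len - lexer->ptr)
        return '\0';
    return lexer->src[lexer->ptr + ahead];
}
```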

Code Snippets

    case '%':
        token = token_new(lexer, _mod);
        token->str = "%";
        lexer_adv(lexer, 1);
        break;
    case '^':
        token = token_new(lexer, _mod);
        token->str = "^";
        lexer_adv(lexer, 1);
        break;

    case '>':
        if (lexer_look(lexer, 1) == '<') {

static char lexer_look(lexer_t* lexer, size_t ahead) {
    if (lexer->len < lexer->ptr + ahead) 
        return;
    return lexer->src[lexer->ptr + ahead];
}

Context

StackExchange Code Review Q#138106, answer score: 5
