Turn characters into sensible things (tokenizing/lexing)

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

codereview lexical-analysis c stackoverflow

Problem

I'm writing a programming language. It's something I've been working on and off on for the last year or so, but recently, I've decided to buckle down and do it.

As part of the language, I wrote a tokenizer. Its job is to turn a stream of human-readable characters into data the language can do something with. That allows for a couple of neat abstractions, as well as separating my concerns: the parsing code doesn't have to worry about the actual in-text layout of the code, just the tokens, and the tokenizer doesn't parse, it just tokenizes.

I realize I could have let yacc do the work, but half the point of this project is to be able to say "I made this from scratch" (and, admittedly, to fill out my lackluster resume).

Quick note beforehand (as opposed to the previous paragraphs): This uses StringBuilder. I've already put it up for review, and actually incorporated a few suggestions from that into this.

tokenizer.h

```
#ifndef CONCATEN_TOKENIZER_H
#define CONCATEN_TOKENIZER_H

#include <stdbool.h>  /* bool */
#include <stddef.h>   /* size_t */
#include <stdio.h>    /* FILE */

#ifndef TKNR_FILE_BUF_SIZE
# define TKNR_FILE_BUF_SIZE 256
#endif
#if TKNR_FILE_BUF_SIZE <= 0
# error TKNR_FILE_BUF_SIZE cannot be <= 0
#endif

struct FileSource {
    FILE *fptr;
    unsigned char next_chars[TKNR_FILE_BUF_SIZE];
    size_t next_chars_pos;
    size_t eof; // if we're at EOF, this marks where in next_chars it is
};
struct StringSource {
    char *begin;
    char *end;
    char *cur_pos;
};
struct Tokenizer {
    union {
        struct StringSource string;
        struct FileSource file;
    } source;
    bool is_from_file;
    char next_char;
    char *origin;
    size_t origin_len;
    size_t line, index;
    int error;

    bool just_started;
};

enum TokenType {
    TKN_UNKNOWN, TKN_WORD, TKN_STRING, TKN_REGEX,
    TKN_INTEGER, TKN_REAL, TKN_IDENTIFIER
};
struct Token {
    char *raw;
    size_t raw_len;
    char *origin;
    size_t origin_len;
    size_t line, index;
    enum TokenType type;
};

struct Token tkn_empty(size_t line, size_t index);
char *tkn_type_na
```

Solution

Bug in copy

I didn't look at all your code. I just scrolled to the bottom and took a look at the first big function, which happened to be tknr_next(). Here is some code from that function:

ret.origin = malloc(from->origin_len * sizeof(char));
if (!ret.origin) {
    from->error = NT_MALLOC_FAIL;
    goto error;
}
strcpy(ret.origin, from->origin);

There are a couple of problems here:

ret.origin is allocated a buffer that is too small by 1, because from->origin_len doesn't include the null terminating character (I checked the other code to make sure). The strcpy() therefore writes one byte past the end of the allocation.

After copying the origin buffer from from to ret, you never set ret.origin_len = from->origin_len. So ret.origin_len still has the value 0. I'm not sure if this will cause problems later on, but I'm guessing that it might.
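
Both points could be addressed together. The following is only a minimal sketch: the helper name copy_origin is hypothetical and does not appear in the original code, and it assumes, as found above, that origin_len does not count the terminator.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the original code): duplicates an
 * origin string of src_len characters (terminator not counted) into
 * *dst and records the length in *dst_len.
 * Returns false if the allocation fails. */
static bool copy_origin(char **dst, size_t *dst_len,
                        const char *src, size_t src_len) {
    char *buf = malloc(src_len + 1);   /* +1 for the '\0' terminator */
    if (!buf) {
        return false;
    }
    memcpy(buf, src, src_len);
    buf[src_len] = '\0';
    *dst = buf;
    *dst_len = src_len;                /* keep the length in sync with the copy */
    return true;
}
```

In tknr_next() a call such as copy_origin(&ret.origin, &ret.origin_len, from->origin, from->origin_len) would stand in for the malloc/strcpy pair, with the failure branch still setting from->error = NT_MALLOC_FAIL and jumping to error.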

origin_len unset

Actually, I just took another look at where origin_len comes from. I found that neither tknr_from_string() nor tknr_from_filepath() sets ret->origin_len. So later, in tknr_next(), I believe that from->origin_len will always be 0, which seems like a problem unless I'm missing something.
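
A possible fix is to record the length at the moment the origin is stored. This is a sketch under assumptions: struct OriginFields is a reduced stand-in for the two origin members of struct Tokenizer, and set_origin is a hypothetical helper that the constructors could call; neither name comes from the original code.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Reduced stand-in for the origin members of struct Tokenizer. */
struct OriginFields {
    char *origin;
    size_t origin_len;
};

/* Hypothetical helper: stores a private copy of name together with
 * its length (terminator not counted). Returns false on malloc failure. */
static bool set_origin(struct OriginFields *t, const char *name) {
    size_t len = strlen(name);
    char *copy = malloc(len + 1);
    if (!copy) {
        return false;
    }
    memcpy(copy, name, len + 1);   /* len + 1 also copies the terminator */
    t->origin = copy;
    t->origin_len = len;
    return true;
}
```

With the length set once at construction time, every later use of from->origin_len in tknr_next() sees the real value instead of 0.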

*out = ret

You asked whether *out = ret is correct. This performs a struct copy, so even though ret is destroyed when you return from the function, all of its bytes have already been copied to *out. It should be fine unless one of the fields of ret is a pointer to something local to the function (and there aren't any fields like that).
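
To illustrate the copy semantics, here is a small sketch. struct MiniToken is a reduced stand-in for struct Token, and fill_token is a hypothetical function that mirrors the *out = ret pattern; neither is taken from the original code.

```c
#include <stddef.h>

/* Reduced stand-in for struct Token: one pointer member, one value member. */
struct MiniToken {
    char *raw;
    size_t line;
};

/* Mirrors the pattern in question: ret is a local, *out receives a copy. */
static void fill_token(struct MiniToken *out, char *text, size_t line) {
    struct MiniToken ret;
    ret.raw = text;    /* only the pointer value is stored, not the text */
    ret.line = line;
    *out = ret;        /* member-wise copy; ret itself may now go away */
}
```

After the call, out->raw holds the same address that was passed in. The copy is shallow, which is why it is only safe as long as that address outlives the function, as heap memory from malloc does.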

Might be hard to extend

I'm not sure what your programming language's syntax looks like, but I will imagine that it looks like C for the moment. In the same function, tknr_next(), you have this code:

if (next_char == '"') { // single-line string
    return get_string(from, &next_char, ret, out);
} else if ('0' <= next_char && next_char <= '9') {
    return get_number(from, &next_char, ret, out);
}
...

This is OK as a start, but I wonder about:

What if instead of a digit, you have a negative number like -5?

What if instead of a digit, you have an expression starting with a parenthesis, like (5 + 5)?

I don't see any support for operators, like +, -, *, etc.

Will your language support comments? I don't see anything in your parser that would allow you to be in a "comment mode" where you parse malformed strings and such without generating an error.
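
To make these questions concrete, here is one way the dispatch could grow, sketched for a C-like syntax. Everything in it is an assumption for illustration: the StartKind enum, the classify_start function, the // comment syntax, and the choice to treat a minus sign followed by a digit as the start of a number are not taken from the original code or language.

```c
/* Hypothetical categories for what the next token starts with. */
enum StartKind {
    START_STRING, START_NUMBER, START_COMMENT, START_OPERATOR, START_OTHER
};

static int is_digit(int c) {
    return '0' <= c && c <= '9';
}

/* c is the current character, next is one character of lookahead
 * (or a negative value such as EOF when there is none). */
static enum StartKind classify_start(int c, int next) {
    if (c == '"') {
        return START_STRING;
    }
    if (is_digit(c) || (c == '-' && is_digit(next))) {
        return START_NUMBER;          /* covers 5 and -5 */
    }
    if (c == '/' && next == '/') {
        return START_COMMENT;         /* assumed C-style line comment */
    }
    switch (c) {
    case '+': case '-': case '*': case '/':
    case '(': case ')':
        return START_OPERATOR;
    default:
        return START_OTHER;
    }
}
```

Deciding on -5 inside the tokenizer is only one option. Many lexers emit the minus sign as an operator token and leave unary minus to the parser, which avoids misreading input such as a -5 as two operands with nothing between them.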

In other words, I feel as if you are just getting started, and as you add additional features to your parser the code is going to get a lot more complicated. You might want to define a grammar for your language and then work off of the grammar specification. Even if you don't use yacc, having a formal grammar specification might force you to organize your code in ways you haven't thought of yet.

Honestly, if I had to write a language parser from scratch I wouldn't know how to do it either. Good luck, because this may turn into a bigger project than you have anticipated.

Context

StackExchange Code Review Q#157166, answer score: 2
