HTML/XHTML/XML tokenization

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

codereview c xml stackoverflow parsing html

tokenization xhtml html xml

Problem

I want to write a library that parses any HTML-like input, valid or invalid. As a first step, I am building a lexer. Here is what I have so far:

```
/* A html lexer */
#include
#include
#include

void *memcpy(void *, const void *, size_t);

#define TOUPPER ('a' - 'A')

/* This lexer is UTF unaware on purpose, because characters that we
are interested in are already in
    TAGSELFCLOSE, // />
    CLOSETAGOPEN, // </
    CDATAOPEN, // <![CDATA[
    STRING, // anything else
    WHITESPACE, // any combination of " ", "\r", "\n", "\t"
    EQUAL // =
} token_t;

/* Lexer will produce these. */
typedef struct {
    /* begin is included but end is excluded. For example:
         Memory locations: 1 | 2 | 3 | 4 | 5 | 6 |
         Chars:            t | o | k | e | n |

       Begin will be 1, end will be 6.

       CAUTION! This is not a null-terminated string.
    */
    token_t type;
    const char *begin;
    const char *end;
} Token;

/* This is the data used by our lexer. */
typedef struct lexer {
    char *SOF; /* Start of file */
    char *END; /* End of file */
    char *SOT; /* Start of current token */
    char *pos; /* Current position of lexer */

    /* The lexer works as a state machine; this function denotes the
       current state. */
    void (*state_func)(struct lexer *);

    /* This function gets a pointer to the current token.

       This is where the lexer communicates with possible parsers. This
       way, multiple parsers can be built upon this lexer.

       Each time a new token is found, this function will be called.
    */
    void (*token_eater)(const Token *);

    /* Tokens that we emit are also kept here, so we don't need to
       malloc/free each token. Parsers can do that if they require. */
    Token token;
} Lexer;

typedef void (*state_func)(struct lexer *);
typedef void (*token_eater)(const Token *);

#define NULLSTATE (state_func)0
#define QUITLEX(LEXER
```

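Because a token is a [begin, end) range rather than a null-terminated string, any consumer has to work with an explicit length. The following is only a minimal sketch of that idea: the Token here is a cut-down stand-in for the one in the question, and format_token is a hypothetical helper that a token_eater could call, not part of the original code.

```c
#include <assert.h>
#include <stdio.h>

/* Cut-down stand-ins for the question's types; only what this sketch
   needs is reproduced. */
typedef enum { STRING, WHITESPACE, EQUAL } token_t;

typedef struct {
    token_t type;
    const char *begin; /* first character of the token */
    const char *end;   /* one past the last character */
} Token;

/* Hypothetical helper: writes "<type>:[<text>]" into out.
   The %.*s conversion takes the length as an int argument, so the
   token text is never read past tok->end. Returns what snprintf
   returns. */
static int format_token(char *out, size_t cap, const Token *tok)
{
    assert(tok->end >= tok->begin);
    int len = (int)(tok->end - tok->begin);
    return snprintf(out, cap, "%d:[%.*s]", (int)tok->type, len, tok->begin);
}
```

For a token covering the first five characters of "class=x" with type STRING, this writes 0:[class] into the buffer, without needing a copy or a terminator inside the input.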
Solution

I don't write C, but I can tell this looks pretty neat.

Just a few notes:

The STRING token would probably be more accurately named LITERAL.

The token_print procedure is meant for output, and closetagopen would be more readable as CloseTagOpen, so I'd prefer to see the names printed in PascalCase. That would also let you drop the inconsistent whitespace in the output values: comment open, for example, would become CommentOpen.

The lexer, [mock-up] parser, and the main method should probably be in separate files.

The 62 in data + 62 in the main method is a magic number; it is not clear what it stands for.
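The naming and magic-number notes above could look roughly like this. It is a sketch under assumptions: the enumerator list is a made-up subset with STRING renamed to LITERAL, token_name stands in for whatever lookup token_print uses, and input_end assumes that 62 was meant to be the length of the input, which the question does not confirm.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of the question's token types, with STRING
   renamed to LITERAL as suggested above. */
typedef enum {
    TAGSELFCLOSE,
    CLOSETAGOPEN,
    CDATAOPEN,
    LITERAL,
    WHITESPACE,
    EQUAL,
    TOKEN_TYPE_COUNT /* not a token; number of entries above */
} token_t;

/* PascalCase display names, indexed by token type. */
static const char *const token_names[TOKEN_TYPE_COUNT] = {
    [TAGSELFCLOSE] = "TagSelfClose",
    [CLOSETAGOPEN] = "CloseTagOpen",
    [CDATAOPEN]    = "CDataOpen",
    [LITERAL]      = "Literal",
    [WHITESPACE]   = "Whitespace",
    [EQUAL]        = "Equal",
};

/* Returns the display name, or "Unknown" for out-of-range values. */
static const char *token_name(token_t type)
{
    if ((int)type < 0 || (int)type >= (int)TOKEN_TYPE_COUNT)
        return "Unknown";
    return token_names[type];
}

/* Derives the end pointer from the input itself instead of
   hard-coding an offset such as 62. Assumes data is null terminated. */
static const char *input_end(const char *data)
{
    return data + strlen(data);
}
```

With something like this, the printing code can call token_name(CLOSETAGOPEN) to get CloseTagOpen, and main can pass input_end(data) where it currently passes data + 62.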

Context

StackExchange Code Review Q#82496, answer score: 3
