HTML/XHTML/XML tokenization
Problem
I want to write a library to parse any valid or invalid HTML-like input. First of all, I am trying to build a lexer. Here is what I have so far:
```
/* An HTML lexer */
#include
#include
#include
void *memcpy(void *, const void *, size_t);
#define TOUPPER ('a' - 'A')
/* This lexer is UTF unaware on purpose, because characters that we
are interested in are already in
TAGSELFCLOSE, // />
CLOSETAGOPEN, // </
CDATAOPEN, // <![CDATA[
STRING, // anything else
WHITESPACE, // any combination of " ", "\r", "\n", "\t"
EQUAL // =
} token_t;
/* Lexer will produce these. */
typedef struct {
/* begin is included but end is excluded. For example:
Memory locations: 1 | 2 | 3 | 4 | 5 | 6 |
Chars:            t | o | k | e | n |
Begin will be 1, end will be 6.
CAUTION! This is not a null terminated string.
*/
token_t type;
const char *begin;
const char *end;
} Token;
/* This is the data used by our lexer. */
typedef struct lexer {
char *SOF; /* Start of file */
char *END; /* End of file */
char *SOT; /* Start of current token */
char *pos; /* Current position of lexer */
/* The lexer works as a state machine; this function pointer denotes the
current state. */
void (*state_func)(struct lexer *);
/* This function gets a pointer to current token.
This is where lexer communicates with possible parsers. This
way, multiple parsers can be built upon this lexer.
Each time a new token is found, this function will be called.
*/
void (*token_eater)(const Token *);
/* Tokens that we emit are also kept here, so we don't need to malloc - free
each token. Parsers can do that if they require. */
Token token;
} Lexer;
typedef void (*state_func)(struct lexer *);
typedef void (*token_eater)(const Token *);
#define NULLSTATE (state_func)0
#define QUITLEX(LEXER
```
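To make the half-open `begin`/`end` convention from the `Token` comment concrete, here is a minimal sketch of how a consumer might read a token that is not null terminated. The `Token` type below is a cut-down stand-in for the one above, and `token_length`, `token_copy`, and `token_print_text` are hypothetical helper names, not part of the original code.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-ins for the types above; only what this sketch needs. */
typedef enum { STRING, WHITESPACE, EQUAL } token_t;

typedef struct {
    token_t type;
    const char *begin; /* first byte of the token (included) */
    const char *end;   /* one past the last byte (excluded)  */
} Token;

/* Length in bytes. Because end is exclusive, no +1 is needed. */
size_t token_length(const Token *t)
{
    return (size_t)(t->end - t->begin);
}

/* Copy the token text into buf as a NUL-terminated string, truncating
   if buf is too small. Returns the bytes copied, excluding the NUL. */
size_t token_copy(const Token *t, char *buf, size_t bufsize)
{
    size_t n;

    if (bufsize == 0)
        return 0;
    n = token_length(t);
    if (n > bufsize - 1)
        n = bufsize - 1;
    memcpy(buf, t->begin, n);
    buf[n] = '\0';
    return n;
}

/* Print a token without copying it: the precision passed through
   "%.*s" bounds how many bytes printf reads from begin. */
void token_print_text(const Token *t)
{
    printf("%.*s\n", (int)token_length(t), t->begin);
}
```

A `token_eater` callback could call `token_print_text` directly, or use `token_copy` when it needs to keep the text after the lexer has moved on.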
Solution
I don't write C, but I can tell this looks pretty neat.
Just a few notes:
- The `STRING` token would probably be more accurately named `LITERAL`.
- The `token_print` procedure is meant for output; as `closetagopen` would be more readable as `CloseTagOpen`, I'd probably prefer to see them output in PascalCase; then you could remove the inconsistent whitespace in the output values, such as in `comment open`, which would become `CommentOpen`.
- The lexer, [mock-up] parser, and the `main` method should probably be in separate files.
- The `62` in `data + 62` in the `main` method is a magic number. Not clear what it stands for.
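As a rough sketch of the naming and magic-number points above (written as a non-C programmer's guess, so take it with a grain of salt): the enum below is a reduced, partly assumed set of token kinds, with `COMMENTOPEN`, `LITERAL`, and `TOKEN_KIND_COUNT` being names I made up, and `token_name` and `input_end` being hypothetical helpers. It also assumes that `62` was meant to be the length of the input string, which the original code does not confirm.

```c
#include <stddef.h>
#include <string.h>

/* A reduced set of token kinds. COMMENTOPEN and LITERAL are assumed
   names; LITERAL stands in for the original STRING. */
typedef enum {
    TAGSELFCLOSE,
    CLOSETAGOPEN,
    COMMENTOPEN,
    LITERAL,
    WHITESPACE,
    EQUAL,
    TOKEN_KIND_COUNT /* not a token: the number of kinds above */
} token_t;

/* One PascalCase display name per kind, indexed by the enum value.
   Designated initializers keep the table in sync with the enum order. */
static const char *const token_names[TOKEN_KIND_COUNT] = {
    [TAGSELFCLOSE] = "TagSelfClose",
    [CLOSETAGOPEN] = "CloseTagOpen",
    [COMMENTOPEN]  = "CommentOpen",
    [LITERAL]      = "Literal",
    [WHITESPACE]   = "Whitespace",
    [EQUAL]        = "Equal"
};

/* Look up the display name, guarding against out-of-range values. */
const char *token_name(token_t type)
{
    if ((int)type < 0 || (int)type >= TOKEN_KIND_COUNT)
        return "Unknown";
    return token_names[type];
}

/* Instead of writing `data + 62`, derive the end pointer from the
   input itself (assuming 62 was the input length). */
const char *input_end(const char *data)
{
    return data + strlen(data);
}
```

With a table like this, `token_print` would only need to print `token_name(token->type)`, and the end of the buffer no longer depends on counting characters by hand.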
Context
StackExchange Code Review Q#82496, answer score: 3