
Why do more and more handcrafted lexers no longer use regular expressions?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Problem

Recently I read the source code of Clang and found that it does its lexing by reading characters one by one and matching them manually. But both my college teacher and the book "Compilers: Principles, Techniques, and Tools" spend a lot of time explaining how to build a state machine from regular expressions, and I remember that the tool Flex also builds a state machine (correct me if I'm wrong). I think it's easy to get lost in the manual matching code, and if something changes in the language, it is somewhat difficult to update the matching code.

So why do people nowadays prefer to do the matching manually rather than use regular expressions? Is it because the table generated by the regex tool becomes too big when the token set is complicated? Or is this an efficiency trade-off?
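
To make the comparison concrete, here is a minimal Python sketch of the two styles for a hypothetical two-token language (identifiers and integers). The names `scan_table`, `scan_manual`, `TRANSITIONS`, and `ACCEPTING` are made up for illustration. This is neither Flex's actual output nor Clang's code, only the general shape of each approach.

```python
# Shared character tests (ASCII only, so both styles agree exactly).
def is_letter(c):
    return ("a" <= c <= "z") or ("A" <= c <= "Z") or c == "_"

def is_digit(c):
    return "0" <= c <= "9"

# Style 1: table-driven DFA, roughly the shape a lexer generator produces.
def char_class(c):
    if is_letter(c):
        return "letter"
    if is_digit(c):
        return "digit"
    return "other"

# TRANSITIONS[state][char_class] -> next state; a missing entry means "stop".
TRANSITIONS = {
    "start": {"letter": "ident", "digit": "int"},
    "ident": {"letter": "ident", "digit": "ident"},
    "int": {"digit": "int"},
}
ACCEPTING = {"ident": "IDENT", "int": "INT"}

def scan_table(text, pos):
    """Return (kind, lexeme) for the longest match starting at pos, or None."""
    state, i, last = "start", pos, None
    while i < len(text):
        nxt = TRANSITIONS[state].get(char_class(text[i]))
        if nxt is None:
            break
        state, i = nxt, i + 1
        if state in ACCEPTING:
            last = (ACCEPTING[state], text[pos:i])
    return last

# Style 2: hand-written matching, one character at a time.
def scan_manual(text, pos):
    """Same contract as scan_table, written as explicit branches and loops."""
    if pos >= len(text):
        return None
    i = pos
    if is_letter(text[pos]):
        while i < len(text) and (is_letter(text[i]) or is_digit(text[i])):
            i += 1
        return ("IDENT", text[pos:i])
    if is_digit(text[pos]):
        while i < len(text) and is_digit(text[i]):
            i += 1
        return ("INT", text[pos:i])
    return None
```

Both functions recognize the same tokens. In the first, the language lives in a data table that a tool could generate; in the second, it lives in the control flow, which is the style I saw in Clang.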

P.S. Evidence for "more and more":

- Clang
- GCC (since they abandoned flex/bison)
- OpenJDK
- Python
- Protobuf

What's interesting is that the author of Protobuf wanted to use regular expressions but gave up, because an open-source project is better off staying independent of extra dependencies. I think this counts as a reason.

Solution

While I cannot give you a definitive answer, I think there are a few reasons for handcrafting lexers and parsers:

- Good error reporting and recovery: most tools such as Lex, which provide a declarative way of defining tokens through regular expressions, are hard to customize for error reporting or recovery. When you handcraft your lexer or parser, you have more control over what happens when you encounter an error.

- Performance: Lex-generated lexers use finite-state automata, which are efficient (assuming the automaton fits in memory), but it may still pay off to handcraft your lexer to keep tight control over both runtime and memory usage. Note also that most regular expression engines built into programming languages use backtracking and may have exponential worst-case runtime.

- Infrequent changes: the token definitions (and even the syntax) of programming languages don't change that often. So I think it makes sense to use a tool to automate the process in the prototyping phase, but once the syntax has become stable, it pays off to handcraft the lexer and parser, both in terms of error reporting and performance.
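
As a rough illustration of the error-reporting point, here is a minimal hand-written lexer sketch in Python. The token set, the `Token` class, and the `tokenize` function are hypothetical and not taken from any of the projects mentioned in the question. The sketch only shows how a handcrafted loop can attach a position to every diagnostic and choose its own recovery strategy.

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str
    text: str
    line: int
    col: int

def tokenize(source):
    """Hand-written lexer sketch: returns (tokens, errors)."""
    tokens, errors = [], []
    i, line, col = 0, 1, 1
    n = len(source)
    while i < n:
        c = source[i]
        if c == "\n":
            i, line, col = i + 1, line + 1, 1
        elif c in " \t":
            i, col = i + 1, col + 1
        elif c.isdigit():
            start = i
            while i < n and source[i].isdigit():
                i += 1
            tokens.append(Token("INT", source[start:i], line, col))
            col += i - start
        elif c.isalpha() or c == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(Token("IDENT", source[start:i], line, col))
            col += i - start
        elif c == '"':
            start = i
            i += 1
            while i < n and source[i] not in '"\n':
                i += 1
            if i < n and source[i] == '"':
                i += 1
                tokens.append(Token("STRING", source[start:i], line, col))
            else:
                # Tailored diagnostic; recover by resuming at the end of the line.
                errors.append(f"{line}:{col}: unterminated string literal")
            col += i - start
        elif c in "+-*/=()":
            tokens.append(Token("OP", c, line, col))
            i, col = i + 1, col + 1
        else:
            # Report the bad character, skip it, and keep lexing.
            errors.append(f"{line}:{col}: unexpected character {c!r}")
            i, col = i + 1, col + 1
    return tokens, errors
```

Calling `tokenize('x = 42 $\ny = "abc')` keeps lexing past both problems and returns two positioned messages, one for the stray `$` and one for the unterminated string. Getting the same behavior out of a generated lexer is possible, but it usually means working around the tool rather than with it.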

Context

StackExchange Computer Science Q#104978, answer score: 4
