
Generate ANTLR fragments for Unicode character classes

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·



Problem

I've been working on an ANTLR grammar that defines some tokens in terms of Unicode character categories (fileformat.info page). The Unicode Consortium makes the full category data available as a tab-separated-values .txt file at http://www.unicode.org/notes/tn36/Categories.txt.

The scripts below work together to turn Categories.txt into ANTLR lexer fragments. extract_fragment.py takes a character category code and turns it into a fragment, and gen_fragments.py uses the former script to generate a full ANTLR .g4 file for all character categories.

extract_fragment.py

# Categories.txt from http://www.unicode.org/notes/tn36/Categories.txt

from sys import argv

ANTLR_UNICODE_TOKEN_FORMAT = "'\\u{:04X}'"
TEN_BITS = 2 ** 10 - 1

def is_bmp(point):
    """
    Is a character on the Basic Multilingual Plane?
    (Can it be represented as one UTF-16 code unit.)
    :param point: The character to check
    :return: truthy if the character lies on the BMP, falsy otherwise
    """
    return ord(point) < 0x10000

def low_surrogate(point):
    if is_bmp(point):
        return None
    return 0xDC00 + (TEN_BITS & (ord(point) - 0x10000))

def codepoints_from_for(file, code):
    """
    Extract the characters of a given category from the source tsv.
    :param file: The tab separated values file-like-object with category information, closed on finish
    :param code: The character code to look for
    :return: Generator of characters with the given category
    """
    for line in file:
        info = line.split('\t')
        if info[1] == code:
            yield chr(int(info[0], base=16))
    file.close()

def codepoints_to_tuples(points):
    """
    Collapse individual characters into range tuples.
    Will never construct a range across changing first byte of surrogate pairs.
    :param points: the characters to collapse
    :return: Generator of (chr, chr) ranges from the param
    """
    pair = [points[0], points[0]]
    for point in points[1:]:
        if ord(pair[1]) + 1 == or
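
The surrogate arithmetic used by low_surrogate above can be checked with a small standalone sketch. It follows the standard UTF-16 encoding of supplementary code points; the function name utf16_surrogates is hypothetical and not part of the original script, and only the format string is borrowed from it.

```python
# Same escape format as ANTLR_UNICODE_TOKEN_FORMAT in the script above.
ANTLR_UNICODE_TOKEN_FORMAT = "'\\u{:04X}'"


def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogates of a supplementary code point.

    Hypothetical helper for illustration; takes an int, not a character.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = utf16_surrogates(0x1F600)
print(ANTLR_UNICODE_TOKEN_FORMAT.format(high),
      ANTLR_UNICODE_TOKEN_FORMAT.format(low))
# prints: '\uD83D' '\uDE00'
```

The low half matches the expression in low_surrogate, since TEN_BITS is 0x3FF.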

Solution

print('fragment N  : Nd | Nl | No ;')

It probably doesn't matter, but there's an extra space before the colon. The space before the semicolon isn't needed either, so you could output this instead:

print('fragment N : Nd | Nl | No;')

Now, your script is generating fragment rules for a lexer grammar: I would expect to see a function whose role it is to output the fragment part.

I don't do any python, so consider this pseudo-code:

def printFragment(token, rule, comment):
    print('// ' + comment)
    print('fragment ' + token + ' : ' + rule + ';')
    print()

So instead of this:

print('// [C] Other')
print('fragment C  : Cc | Cf | /* Cn | Co | Cs */ ;')
print()

You could have this:

printFragment('C', 'Cc | Cf | /* Cn | Co | Cs */', '[C] Other')

Similarly, there could be a function dedicated to printing the extract output:

printExtracted('Cc', '[Cc] Other, Control')
printExtracted('Cf', '[Cf] Other, Format')
printExtracted('Cn', '[Cn] Other, Not Assigned')
printExtracted('Co', '[Co] Other, Private Use')
printExtracted('Cs', '[Cs] Other, Surrogate')

Looking at what extract does, it seems it could reuse that printFragment function... the idea is to try to find a way to write a function that prints a single lexer token's definition, in a standard way.
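
One way to get that single, standard definition printer is to separate formatting from printing and drive it from a list of entries. This is a minimal sketch: format_fragment, render_fragments and the FRAGMENTS list are hypothetical names, and the entries are only illustrative.

```python
def format_fragment(token, rule, comment):
    """Return the text of one commented lexer fragment (hypothetical helper)."""
    return '// {}\nfragment {} : {};\n'.format(comment, token, rule)


def render_fragments(entries):
    """Join the formatted fragments, one blank line between definitions."""
    return '\n'.join(format_fragment(*entry) for entry in entries)


# Illustrative entries only; a real generator would build this list itself.
FRAGMENTS = [
    ('C', 'Cc | Cf | /* Cn | Co | Cs */', '[C] Other'),
    ('N', 'Nd | Nl | No', '[N] Number'),
]

print(render_fragments(FRAGMENTS))
```

Returning strings instead of printing directly keeps the formatting in one place and makes it easy to write the result to a .g4 file or to check it in a test.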

Consider putting the comments on the same line - this is a legal fragment definition:

fragment C : Cc | Cf /*| Cn | Co | Cs */;           // [C] Other

Note that I moved the /* marker to just before the | here, so as to avoid the empty alternative in fragment C : Cc | Cf | ???. I'm not sure whether that makes ANTLR warn or complain in any way, but it's always best to avoid grammar problems that can be avoided ;-)

If you can make all the comments aligned, the resulting lexer grammar will look pretty neat, with one fragment per line.
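
The alignment can be done by padding every definition to the width of the longest one. A minimal sketch, assuming the fragments are available as (token, rule, comment) tuples; format_aligned is a hypothetical name and the entries are illustrative:

```python
def format_aligned(entries):
    """Return one 'fragment ...;  // comment' line per entry, comments aligned."""
    definitions = ['fragment {} : {};'.format(token, rule)
                   for token, rule, _ in entries]
    # Pad every definition to the longest one so the comments start in one column.
    width = max((len(d) for d in definitions), default=0)
    return ['{}  // {}'.format(d.ljust(width), comment)
            for d, (_, _, comment) in zip(definitions, entries)]


# Illustrative entries only.
entries = [
    ('C', 'Cc | Cf /*| Cn | Co | Cs */', '[C] Other'),
    ('N', 'Nd | Nl | No', '[N] Number'),
]
for line in format_aligned(entries):
    print(line)
```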

Context

StackExchange Code Review Q#139386, answer score: 2
