Generate ANTLR fragments for Unicode character classes
Problem
I've been working on an ANTLR grammar that defines some tokens in terms of Unicode character categories (see the fileformat.info page). The Unicode Consortium publishes the full category data as a tab-separated-values .txt file at http://www.unicode.org/notes/tn36/Categories.txt.
The scripts below work together to take Categories.txt and turn it into ANTLR lexer fragments. extract_fragment.py takes a character class code and turns it into a fragment, and gen_fragments.py uses the former script to generate a full ANTLR .g4 for all character categories.
extract_fragment.py
# Categories.txt from http://www.unicode.org/notes/tn36/Categories.txt
from sys import argv

ANTLR_UNICODE_TOKEN_FORMAT = "'\\u{:04X}'"
TEN_BITS = 2 ** 10 - 1


def is_bmp(point):
    """
    Is a character on the Basic Multilingual Plane?
    (Can it be represented as one UTF-16 code unit.)
    :param point: The character to check
    :return: truthy if the character lies on the BMP, falsy otherwise
    """
    return ord(point) < 0x10000


# Assumed counterpart of low_surrogate (the original definition is not legible in the source).
def high_surrogate(point):
    if is_bmp(point):
        return None
    return 0xD800 + (TEN_BITS & ((ord(point) - 0x10000) >> 10))


def low_surrogate(point):
    if is_bmp(point):
        return None
    return 0xDC00 + (TEN_BITS & (ord(point) - 0x10000))


def codepoints_from_for(file, code):
    """
    Extract the characters of a given category from the source tsv.
    :param file: The tab separated values file-like-object with category information, closed on finish
    :param code: The character code to look for
    :return: Generator of characters with the given category
    """
    for line in file:
        info = line.split('\t')
        if info[1] == code:
            yield chr(int(info[0], base=16))
    file.close()


def codepoints_to_tuples(points):
    """
    Collapse individual characters into range tuples.
    Will never construct a range across changing first byte of surrogate pairs.
    :param points: the characters to collapse
    :return: Generator of (chr, chr) ranges from the param
    """
    pair = [points[0], points[0]]
    for point in points[1:]:
        if ord(pair[1]) + 1 == or
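For reference, here is a minimal sketch of the collapsing step that the codepoints_to_tuples docstring describes. It is not the author's code: the names lead_key and collapse_ranges are hypothetical, it works on integer code points rather than characters, and it assumes the input is already sorted in ascending order.

```python
def lead_key(cp):
    """Return None for BMP code points, else the 10-bit high-surrogate index."""
    return None if cp < 0x10000 else (cp - 0x10000) >> 10


def collapse_ranges(code_points):
    """Collapse sorted integer code points into inclusive (start, end) tuples.

    A range is extended only while the next code point is adjacent AND would
    use the same UTF-16 high surrogate, so no range spans a surrogate change.
    """
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1 and lead_key(cp) == lead_key(ranges[-1][1]):
            ranges[-1][1] = cp          # extend the current range
        else:
            ranges.append([cp, cp])     # start a new range
    return [tuple(r) for r in ranges]


if __name__ == '__main__':
    # Digits 0-2 collapse into one range; 'A' stands alone.
    print(collapse_ranges([0x30, 0x31, 0x32, 0x41]))
```

Comparing lead_key values is what keeps U+103FF and U+10400 in separate ranges even though they are adjacent, because their high surrogates differ.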
Solution
print('fragment N : Nd | Nl | No ;')
It probably doesn't matter, but there's an extraneous whitespace there, before the colon. And the whitespace before the semicolon isn't needed either, so you could output like this:
print('fragment N : Nd | Nl | No;')
Now, your script is generating fragment rules for a lexer grammar: I would expect to see a function whose role it is to output the fragment part.
I don't do any python, so consider this pseudo-code:
def printFragment(token, rule, comment):
    print('// ' + comment)
    print('fragment ' + token + ' : ' + rule + ';')
    print()
So instead of this:
print('// [C] Other')
print('fragment C : Cc | Cf | /* Cn | Co | Cs */ ;')
print()
You could have that:
printFragment('C', 'Cc | Cf | /* Cn | Co | Cs */', '[C] Other')
Similarly, there could be a function dedicated to printing the extract output:
printExtracted('Cc', '[Cc] Other, Control')
printExtracted('Cf', '[Cf] Other, Format')
printExtracted('Cn', '[Cn] Other, Not Assigned')
printExtracted('Co', '[Co] Other, Private Use')
printExtracted('Cs', '[Cs] Other, Surrogate')
Looking at what extract does, it seems it could reuse that printFragment function... the idea is to try to find a way to write a function that prints a single lexer token's definition, in a standard way.
Consider putting the comments on the same line - this is a legal fragment definition:
fragment C : Cc | Cf /*| Cn | Co | Cs */; // [C] Other
Note I moved the /* marker up one character here, so as to avoid that empty alternative fragment C : Cc | Cf | ??? - not sure if that makes ANTLR warn or complain in any way, but it's always best to avoid grammar problems that can be avoided ;-)
If you can make all the comments aligned, the resulting lexer grammar will look pretty neat, with one fragment per line.
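The one-fragment-per-line layout with aligned comments could be sketched in Python roughly as follows. This is only an illustration: format_fragments is a hypothetical name, and the '[N] Number' entry is sample data, not something taken from the original script.

```python
def format_fragments(fragments):
    """Format (token, rule, comment) triples as one-line ANTLR fragments.

    Each rule body is padded to the width of the longest one, so the
    trailing // comments all start in the same column.
    """
    bodies = ['fragment {} : {};'.format(token, rule) for token, rule, _ in fragments]
    width = max((len(body) for body in bodies), default=0)
    return ['{} // {}'.format(body.ljust(width), comment)
            for body, (_, _, comment) in zip(bodies, fragments)]


if __name__ == '__main__':
    sample = [
        ('C', 'Cc | Cf /*| Cn | Co | Cs */', '[C] Other'),
        ('N', 'Nd | Nl | No', '[N] Number'),
    ]
    print('\n'.join(format_fragments(sample)))
```

Returning the lines instead of printing them directly keeps the formatting easy to check and leaves it up to the caller whether to print them or write them to a .g4 file.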
Code Snippets
print('fragment N : Nd | Nl | No ;')

print('fragment N : Nd | Nl | No;')

def printFragment(token, rule, comment):
    print('// ' + comment)
    print('fragment ' + token + ' : ' + rule + ';')
    print()

print('// [C] Other')
print('fragment C : Cc | Cf | /* Cn | Co | Cs */ ;')
print()

printFragment('C', 'Cc | Cf | /* Cn | Co | Cs */', '[C] Other')
Context
StackExchange Code Review Q#139386, answer score: 2