
What is the point of delimiters and whitespace handling

Submitted by: @import:stackexchange-cs

Problem

I see that the language specifies reserved words, delimiters and whitespace in the lexer section. Are delimiters just like reserved identifiers in the punctuation space, or do they have an additional function with regard to skipping whitespace?

Namely, spec says


Each lexical element is either a delimiter, an identifier (which may
be a reserved word), an abstract literal, a character literal, a
string literal, a bit string literal or comment. In some cases an
explicit separator is required to separate adjacent lexical elements
(namely when, without separation, interpretation as a single lexical
element is possible). A separator is either a space character (SPACE
or NBSP), a format effector, or the end of a line.

I also see a definition

relative_pathname ::= { ^ . } partial_pathname


where . is a delimiter but ^ is not. I do not understand the reason for the difference. Moreover, ^ is a special character that can only be part of a "string literal", a 'char literal' or an /extended identifier/, and I don't understand how to deliver this character to the path parser.

Anyway, I wonder what to do with pieces of text run together, like 11'c' or "this is string literal"with_some_identifier. My current lexer produces string_literal followed by identifier. However, I feel that others don't do that. What is the common practice for lexing and whitespace skipping: when is whitespace mandatory and when is it optional? How do you specify that to the parser/lexer?
I ask because I do not see parsers/lexers specifying much whitespace or many separators in their production rules. Although this must be ubiquitous, in practice I do not see it at all. In JavaCC, for instance, you just specify whitespace chars in SKIP and it does the rest itself. What is the convention? I see that parser combinators support lexical.reserved words and lexical.delimiters. What is the purpose?

I guess that I can supplement every definition of identifier, delimiter and litera

Solution

Now, is it right that only identifiers and literals have to be separated by delimiters or whitespace? How do I ensure that?

If by "right" you mean it is the case in every programming language, then no, it is not right, and probably no non-trivial lexical statement would be either.

In many languages, integer literals do not have to be separated from a following identifier; in other languages, they do. In most languages, <= is different from < =, so identifiers and numbers are not the only classes of tokens which require explicit separation. (If you don't consider < and = to be "delimiters", then you cannot say that identifiers need to be separated by whitespace or delimiters, either, since `a<b` is valid without any whitespace.)

Many languages tokenise according to the maximal-munch rule, under which whitespace is required precisely in those contexts where maximal-munch would produce a different tokenisation. In such languages, whitespace is always discarded (after tokenisation) and there is no need to ensure it is supplied.
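To make the maximal-munch idea concrete, here is a minimal sketch of such a tokeniser. The token classes (`NUMBER`, `IDENT`, `LE`, `LT`, `EQ`, `WS`) and the function name `tokenize` are assumptions chosen for illustration and are not taken from any particular language specification.

```python
import re

# Hypothetical token classes, for illustration only.
TOKEN_SPEC = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("LE", re.compile(r"<=")),
    ("LT", re.compile(r"<")),
    ("EQ", re.compile(r"=")),
    ("WS", re.compile(r"\s+")),
]

def tokenize(text):
    """Tokenise by maximal munch: at each position keep the longest match."""
    pos, tokens = 0, []
    while pos < len(text):
        best_kind, best_lexeme = None, ""
        for kind, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m and len(m.group()) > len(best_lexeme):
                best_kind, best_lexeme = kind, m.group()
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if best_kind != "WS":  # whitespace is discarded once it has split tokens
            tokens.append((best_kind, best_lexeme))
        pos += len(best_lexeme)
    return tokens
```

With these assumed classes, `tokenize("a<=b")` yields a single `LE` token between the identifiers, whereas `tokenize("a< =b")` yields `LT` followed by `EQ`. The space matters there only because it changes what the longest match is; in `a<b` no whitespace is needed at all.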

There is no global institution which regulates computer language designers. Each language community develops according to its own philosophy, customs, eccentricities and equivocations, and there is no higher truth to which one can appeal.

Particular languages may be based on a different tokenisation model, so you cannot make general language-independent statements based on the behaviour of a single language. In the comment thread, it has been suggested that this question actually applies to VHDL and that VHDL does not conform to the maximal-munch model. The text cited in the OP apparently comes from section 13.2 of the VHDL Reference Manual. However, it seems clear (even from the text quoted in the OP) that the design is not much different from maximal-munch. I repeat, with emphasis added:

The text of each design unit is a sequence of separate lexical elements. Each lexical element is either a delimiter, an identifier (which may be a reserved word), an abstract literal, a character literal, a string literal, a bit string literal, or a comment.

In some cases an explicit separator is required to separate adjacent lexical elements (namely when, without separation, interpretation as a single lexical element is possible). A separator is either a space character (SPACE or NBSP), a format effector, or the end of a line. A space character (SPACE or NBSP) is a separator except within a comment, a string literal, or a space character literal.

So that makes it clear that the only case in which separation is obligatory is where the concatenated tokens could be tokenised as a (longer) combined token, which is essentially the maximal-munch rule.
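Read this way, the rule can be tested mechanically: a separator is needed between two lexical elements exactly when lexing their concatenation does not give those two elements back. The sketch below assumes a deliberately simplified set of token patterns; it is not VHDL's actual lexical grammar, and the names `lex` and `needs_separator` are hypothetical.

```python
import re

# Simplified, hypothetical token classes (NOT VHDL's real lexical grammar).
PATTERNS = [re.compile(p) for p in (
    r'"[^"]*"',                # string literal
    r"'.'",                    # character literal
    r"\d+",                    # integer literal
    r"[A-Za-z][A-Za-z_0-9]*",  # identifier
    r"<=", r"<", r"=", r"\.",  # delimiters
)]

def lex(text):
    """Split text into lexemes, always taking the longest match."""
    pos, lexemes = 0, []
    while pos < len(text):
        if text[pos].isspace():  # separators are skipped
            pos += 1
            continue
        matches = [m.group() for p in PATTERNS if (m := p.match(text, pos))]
        if not matches:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        longest = max(matches, key=len)
        lexemes.append(longest)
        pos += len(longest)
    return lexemes

def needs_separator(left, right):
    """True if writing the two lexemes back to back would change the lexing."""
    return lex(left + right) != [left, right]
```

Under these assumptions, `needs_separator("a", "b")` and `needs_separator("<", "=")` are true, while `needs_separator('"this is string literal"', "with_some_identifier")` and `needs_separator("11", "'c'")` are false, which matches the asker's lexer producing a string literal followed by an identifier.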

Context

StackExchange Computer Science Q#51487, answer score: 6
