HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Lexing and parsing a language with juxtaposition as an operator

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
operatorwithjuxtapositionlexinglanguageparsingand

Problem

Normal human math notation treats juxtaposition as implied multiplication, e.g., $2x$ means $2$ multiplied by $x$. This does not seem to be a common feature of computer languages, although it was, for example, supposed to be included in the language Fortress (now a dead project).

Lexing and parsing such a language seems to me like it would be difficult within the usual lexer-parser framework. In a computer language, unlike human math notation, we want to have variable names that are more than one letter. So if we need to lex the string xyz, we have a problem, because this could, e.g., mean xyz if x and yz are variables, but it could also mean xyz if xy and z are variables. The lexer normally wouldn't have this type of information available, so it would have no way of resolving the ambiguity.

Are there ways of reducing this problem to a more standard form that can be handled by a traditional lexer-parser pair, or does it require qualitatively different techniques?

Solution

Juxtaposition is used in mathematics because most mathematical texts
use single character variables (possibly with diacritic signs,
subscripts, superscripts, etc...) so that there is no ambiguity.

A mathematician might write $2x$ to multiply $2$ and $x$. It would be
quite unusual to write $x2$ to multiply $x$ by $2$. Actually, as much
as I can see (but I am open to contrary examples) a numerical factor
always comes first in the product. This unstated rule implies that one
never has two numerical factors in succession, which would make it
difficult to interpret $22$, which could be $2\times 2$ or the single
number $22$. Of course, if two numerical factors in a row were to be
acceptable, they would be separated by a space to avoid ambiguity.

Regarding variables, the same has to be true. Several variables can be
factors of a product (well, a sequence of products). Either you have a
rule that all variables are 1 letter (a constraint that cannot be
imposed on numbers), or you maust have a way to separate the
variables, either by a space, or by a $\times$ sign.

First I would point out that this is a pure lexer problem: how do I
determine what variables are juxtaposed as the factors of a product.
The fact that the operator are omitted is not really an issue for the
parser (though it may call for some appropriate techniques depending
on other syntactic aspects of the language, and the way they interact
with the feature we are discussing).

Of course, you can start playing games, like the following which you seem to be actually suggesting. Since all
variables are supposed to be declared (that is also true in
mathematics), you could juxtapose variables arbitrarily, without
spaces, to form a product, and rely on the lexer to determine how to
cut the string so that each cut out substring is a declared variable.
Of course, depending on what variables you have declared, this can
have an ambiguous result (as you suggest in your question), and produce a lattice of variables instead of a
sequence of variables.

I created the expression "lattice of variables" after the expression "word
lattice" used in speech recognition for exactly the same problem in a
different setting. For example, given that you have declared the seven
variables: a b c d ab bc cd. Then the string abcd can be cut into
several sequences of variables, that can be represented by a loopless
finite state automaton like the following (which is what is called a
word lattice in speech processing).

/--- ab ---\
    /            \
    --- a --- b --\--cd --------\
-- /     \                       > -----
   \      \-- bc ------/--- d --/
    \                 /
     \-- ab --- c ---/


Creating this lattice is not difficult, using finite state techniques,
when the available variables are known. It would however probably
require some changes in standard lexers, as they usually identify
lexical element with simpler techniques, simpler use of context.

One way to do it is for the lexer to first identify the string as an
expression. It could then call a routine that would access the current
symbol table to find relevant identifiers and organise them as a trie
to cut the string into existing identifiers. Assuming no ambiguity,
the lexer could then give the successive identifiers to the parser,
interspersed with $\times$ symbols. So it is rather easy to integrate
into existing architectures.

To limit the ambiguity, one can first eliminates by type checking all
variables with a type incompatible with use as factor in a product.
Or possibly, if there are different kinds of products requiring
compatible types, this can be used to eliminate inconsistant sequences
of variables.

Remaining ambiguities would probably no be a major problem for the
parser (possibly depending on parsing techniques, typing ...). But of
course, the ambiguity has to be resolved at some point, if the
language is supposed to convey only one meaning.

One way to do it could be to notify an error whenever such an
ambiguity is found, with some details about the ambiguity, thus
requesting the user to disambiguate by adding a space or a $\times$
symbol.

Though this can be handled technically, I do not think I would
recommend it in a language. Even if unambiguous, this is not extremely
readable, notwithstanding the fact that the ancient languages such as
Latin and Greek used to be written that way.

An alternative could be to accept this kind of notation only with the
usual constraints (in mathematics) that I stated above: number must be
first and variables must have a single letter for implicit product. But ambiguity could
remain between a single variable $xx$ and the product $x\times x$
written without the $\times$ symbol. I guess it would not be very
hard, with standard technology.

I think I have exhausted my imagination on this topic. It is not
technically difficult, for what I can see.
But why would all this be worth the trouble?

Code Snippets

/--- ab ---\
    /            \
    --- a --- b --\--cd --------\
-- /     \                       > -----
   \      \-- bc ------/--- d --/
    \                 /
     \-- ab --- c ---/

Context

StackExchange Computer Science Q#28408, answer score: 3

Revisions (0)

No revisions yet.