When students learn about parser generators, they are often told that associativity and precedence issues can be solved in one of two ways.
The “stratified grammar” approach
In the stratified grammar approach, one tediously stratifies the grammar by creating one non-terminal per precedence level.
The stratification handles precedence by forcing operators that bind less tightly to recognize the outermost structure of the input before tighter operators get to split the smaller parts of the expression. For instance, an expression is first split at the outermost plusminus level, according to its PLUS and MINUS tokens. Then, each of the resulting operands is split into products and divisions by the non-terminal at the next level down.
Associativity is handled by imbalancing each rule: a left-associative binary operator would try to parse something at its level for its left operand, while it would try to parse something at the next level for its right operand. Symmetrically for a right-associative binary operator. A non-associative binary operator would parse something at the next level for both its operands.
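To make this concrete, here is a hypothetical stratified grammar for arithmetic (my own illustration, not the original snippet); note how each left-associative rule keeps its left operand at its own level and demotes its right operand to the next level:

```
plusminus ::= plusminus PLUS timesdiv
            | plusminus MINUS timesdiv
            | timesdiv

timesdiv  ::= timesdiv TIMES atom
            | timesdiv DIVIDE atom
            | atom

atom      ::= INT
            | LPAREN plusminus RPAREN
```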
The “parser directives” approach
%left indicates that the binary operators are left-associative. Tokens mentioned on the same line have the same precedence (for instance, PLUS and MINUS), while tokens on different lines have different precedences (for instance, TIMES has higher precedence than PLUS).
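A hypothetical yacc-style illustration of the directives approach (again my own sketch): the grammar stays flat, and the directives carry all the precedence and associativity information:

```
%left PLUS MINUS        /* lower precedence  */
%left TIMES DIVIDE      /* higher precedence */

expr ::= expr PLUS expr
       | expr MINUS expr
       | expr TIMES expr
       | expr DIVIDE expr
       | INT
       | LPAREN expr RPAREN
```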
I have always found the stratified grammar approach inelegant:
- the actual grammar of the language is scattered around,
- switching the precedence or associativity of two operators forces multiple non-local rewrites,
- and so does introducing a new operator at a different level.
Contrast this with the directives approach, where:
- the actual grammar is as readable as it gets,
- switching associativity is a keyword away, and switching precedences is a line swap away,
- introducing a new operator is as simple as finding the proper precedence line in which to declare it, and adding a line for it in the grammar wherever we see fit.
Unfortunately, the “parser directives” approach comes with a couple of caveats as soon as one introduces token-less function application: coming up with a solution requires reading the parser generator's documentation, and understanding that solution essentially requires understanding the algorithm running behind the scenes (my next post will be exactly on this subject).
Therefore, when we teach parser generators to our students, many of them end up choosing the “stratified approach”, to my great dismay, while the ones who choose my preferred approach struggle for hours trying to get function application to associate correctly.
On to parser combinators
Unfortunately, parser generators don’t seem very helpful when one wants to write extensible parsers. By “extensible”, I mean the kind of parser that languages like Haskell, Coq, and Agda use, wherein a user may define new operators that the compiler is subsequently expected to parse appropriately.
For this, one often turns to parser combinator libraries, which provide ways to build parsers as functions of the host language (here we will use Haskell). People then google “parsec tutorial” or “parsec lambda calculus” or some such query, and find a trove of simple examples. Unfortunately, most examples are too simple to actually demonstrate the intricacies of parsing a complex language. They will often:
- explain that one can deal with function application by creating a parser for “terms that are not applications”:
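That trick can be sketched as follows; this reconstruction is mine, written against ReadP from base so that it is self-contained (the tutorials in question use parsec-style libraries, but the shape is the same):

```haskell
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP

data Term = Var String | App Term Term
  deriving (Eq, Show)

ident :: ReadP String
ident = skipSpaces *> munch1 isAlpha

-- “Terms that are not applications”: variables and parenthesized terms.
nonAppP :: ReadP Term
nonAppP = (Var <$> ident)
      <++ (skipSpaces *> char '(' *> termP <* skipSpaces <* char ')')

-- An application is one or more non-application terms, folded to the left,
-- so that `f x y` parses as `App (App f x) y`.
termP :: ReadP Term
termP = foldl1 App <$> many1 nonAppP

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```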
- explain that one can deal with binary operators using a library function and an operator table (for instance, Text.Megaparsec.Expr):
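Such an operator table boils down to one parsing level per row; megaparsec’s makeExprParser builds this for you. Here is a self-contained ReadP sketch of the same idea (names mine), with chainl1 playing the role of a left-associative row:

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

data Expr = Lit Int | Add Expr Expr | Sub Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

tok :: String -> ReadP String
tok s = skipSpaces *> string s

atomP :: ReadP Expr
atomP = (Lit . read <$> (skipSpaces *> munch1 isDigit))
    <++ (tok "(" *> exprP <* tok ")")

-- One chained level per table row: multiplication binds tighter than
-- addition and subtraction, and every row is left-associative.
exprP :: ReadP Expr
exprP = mulLevel `chainl1` addOp
  where
    mulLevel = atomP `chainl1` mulOp
    addOp    = (Add <$ tok "+") +++ (Sub <$ tok "-")
    mulOp    = Mul <$ tok "*"

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```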
Now, I was trying to build the following dependently-typed language:
and I wanted to introduce the notation τ1 → τ2, standing for Pi "" τ1 τ2, where the empty string indicates the absence of a binder. I considered putting it in an operator table with right-associativity, and I ended up with something like:
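The entry amounted to a right-associative infix arrow. A self-contained sketch of that behavior (my reconstruction, with ReadP’s chainr1 standing in for the operator-table machinery, and → written as ->):

```haskell
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP

data Tau = TVar String | Pi String Tau Tau
  deriving (Eq, Show)

tok :: String -> ReadP String
tok s = skipSpaces *> string s

atomP :: ReadP Tau
atomP = (TVar <$> (skipSpaces *> munch1 isAlpha))
    <++ (tok "(" *> typeP <* tok ")")

-- `Pi ""` marks the non-binding arrow; chainr1 makes it right-associative.
typeP :: ReadP Tau
typeP = atomP `chainr1` (Pi "" <$ tok "->")

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```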
But I should also be able to parse the binding arrow (x : τ1) → τ2, at the same precedence as the non-binding variant. Yet it’s not quite a binary operator (here, my syntax almost makes it look like one, but if you wanted to use ∀ it would become problematic). So you’d want to remove → from your operator table, but then you’d need to either remove everything before it or after it, and to manually order the parsers for those operators somewhere else.
Somewhere else, the parts of your parser dealing with let might look like:
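The original snippet is not reproduced here, but the shape of such code can be sketched as follows (my ReadP-based reconstruction): note the explicit extra level parser lambdaletP that both letP and lambdaP must name.

```haskell
import Data.Char (isAlpha, isDigit)
import Text.ParserCombinators.ReadP

data Term = Lit Int | Var String | Lam String Term | Let String Term Term
  deriving (Eq, Show)

tok :: String -> ReadP String
tok s = skipSpaces *> string s

ident :: ReadP String
ident = skipSpaces *> munch1 isAlpha

atomP :: ReadP Term
atomP = (Lit . read <$> (skipSpaces *> munch1 isDigit))
    <++ (Var <$> ident)
    <++ (tok "(" *> lambdaletP <* tok ")")

letP :: ReadP Term
letP = do
  _  <- tok "let"
  x  <- ident
  _  <- tok "="
  e1 <- lambdaletP          -- both sides refer to the level parser
  _  <- tok "in"
  Let x e1 <$> lambdaletP

lambdaP :: ReadP Term
lambdaP = do
  _ <- tok "\\"             -- `\` written for λ
  x <- ident
  _ <- tok "."
  Lam x <$> lambdaletP

-- The extra "level parser": lets and lambdas live at the same level.
lambdaletP :: ReadP Term
lambdaletP = letP <++ lambdaP <++ atomP

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```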
(Note: we try to follow the discipline demonstrated in “try a <|> b” considered harmful.)
We need to introduce lambdaletP to be able to recognize let x = 1 in λ y . 2 or λ x . let y = 1 in 2 without adding parentheses. Can you see how similar this looks to our “stratified grammar” approach? We have to create extra parsers per precedence level, and inner parsers refer to their “level parser” explicitly. This code suffers from the very same issues as the stratified parser generator grammar: switching precedences or introducing new levels requires tedious editing and reordering, and the structure of the parser is lost in the code.
A more modular approach
If we take a step back, we can see that our issues arise from our parsers needing to call:
- the parser at their own precedence level, and
- the parser at the next precedence level.
Abstracting away the latter is easy: we can take it as input, and our caller will have to tell us who’s next.
Abstracting away the former might seem hard: if our parser explicitly calls itself recursively, it prevents other parsers from existing at its own level. So we would like to receive a parser for the things at our level as input, but we would like to be part of this parser too!
There is a classic trick to achieve this, called explicit open recursion: our parser will pass itself as input to its components! Let’s see how this works on a simple example, by taking lambdaletP as an input.
Now we can define:
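A sketch of the resulting code (my ReadP-based reconstruction): letP receives the parser for its own level instead of naming it, and fix ties the knot.

```haskell
import Data.Char (isAlpha, isDigit)
import Data.Function (fix)
import Text.ParserCombinators.ReadP

data Term = Lit Int | Let String Term Term
  deriving (Eq, Show)

tok :: String -> ReadP String
tok s = skipSpaces *> string s

ident :: ReadP String
ident = skipSpaces *> munch1 isAlpha

atomP :: ReadP Term
atomP = Lit . read <$> (skipSpaces *> munch1 isDigit)

-- Open recursion: `letP` no longer calls its level parser by name; it
-- calls whatever parser it is given, so others can later join that level.
letP :: ReadP Term -> ReadP Term
letP selfP = do
  _  <- tok "let"
  x  <- ident
  _  <- tok "="
  e1 <- selfP
  _  <- tok "in"
  Let x e1 <$> selfP

-- Tying the knot: the level parser passes itself to its components.
lambdaletP :: ReadP Term
lambdaletP = fix (\selfP -> letP selfP <++ atomP)

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```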
Therefore, to define parsers that are modular in what exists at their precedence level and what exists at their next precedence level, we adopt the discipline of parameterizing our parsers by two parsers:
From now on, a parser should (in general) not call itself recursively, but rather use the parser received as its first argument. Similarly, it should not call the parser for the next precedence level directly, but rather use the parser received as its second argument. This is best demonstrated by the parser for the syntactic construct τ1 → τ2:
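A sketch of such a modular arrowP (my ReadP-based reconstruction, with → written as ->, tok standing in for symbol, and a minimal demo wiring):

```haskell
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP

data Tau = TVar String | Pi String Tau Tau
  deriving (Eq, Show)

-- A modular parser receives the parser for its own level (selfP) and the
-- parser for the next, tighter level (nextP).
type ModularParser a = ReadP a -> ReadP a -> ReadP a

tok :: String -> ReadP String
tok s = skipSpaces *> string s

arrowP :: ModularParser Tau
arrowP selfP nextP = do
  l <- nextP          -- no naked arrow on the left: next level only
  _ <- tok "->"
  Pi "" l <$> selfP   -- naked arrows allowed on the right: own level

-- Demo wiring: atoms below, arrows above.
atomP :: ReadP Tau
atomP = (TVar <$> (skipSpaces *> munch1 isAlpha))
    <++ (tok "(" *> arrowLevel <* tok ")")

arrowLevel :: ReadP Tau
arrowLevel = let selfP = arrowP selfP atomP <++ atomP in selfP

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```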
The → operator is right-associative. This means τ1 → τ2 → τ3 is to be parsed as τ1 → (τ2 → τ3), and not as (τ1 → τ2) → τ3. To achieve this, one must make sure not to allow parsing “naked” arrows on the left, by only parsing things at the next precedence level; this is why we call nextP before the call to symbol. However, we do want to allow parsing “naked” arrows on the right side, for instance τ2 → τ3 in our example, so we call selfP there. We don’t want to call arrowP recursively here, because we might later want to have other parsers at the same level. That is in fact the case here, since the parser for the binding arrow will be at this level, allowing us to parse τ1 → (x : τ2) → τ3 → τ4.
Note that modular parsers can call both parsers received as input, but may also call only one of them, or neither, as illustrated in the following examples. For instance, a non-associative binary operator will only call nextP.
Constructs that are allowed to appear without parentheses within themselves will only call selfP, for instance let x = 1 in let y = 2 in 3:
And parsers for atoms will call neither:
Now, for our tour de force, we will combine all of those parsers into a parser for the full language. A language will be defined by a list of lists of modular parsers, where rows are precedence levels:
Each row should be turned into a parser for the given precedence level, which tries all the options on the row, or falls back to the next row:
Given a row ps and a handle on the next precedence parser nextP, we want to choose among all the ModularParsers. But remember that they need to receive the parser for the current precedence level as their first input, that is, the very parser we are currently in the process of building. Thanks to the open recursion trick and the fixpoint operator, we can obtain a handle to this parser as selfP, and pass it to each modular parser p as its first argument. nextP is simply passed along too. Note that we also append nextP to our list of choices, thus defaulting to the next row when all else fails.
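A sketch of choiceOrNextP in this style (my ReadP-based reconstruction; the left-biased <++ stands in for the usual <|>), with a tiny prefix-operator demo:

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

type ModularParser a = ReadP a -> ReadP a -> ReadP a

choiceOrNextP :: [ModularParser a] -> ReadP a -> ReadP a
choiceOrNextP ps nextP = selfP
  where
    -- Every parser on the row receives the row's own parser (selfP) and
    -- the next level's parser (nextP); nextP is the final fallback.
    selfP = foldr (\p acc -> p selfP nextP <++ acc) nextP ps

-- Demo: a row with one prefix operator, over a row of literals.
data E = Lit Int | Neg E
  deriving (Eq, Show)

negP :: ModularParser E
negP selfP _nextP = skipSpaces *> char '!' *> (Neg <$> selfP)

litP :: ReadP E
litP = Lit . read <$> (skipSpaces *> munch1 isDigit)

levelP :: ReadP E
levelP = choiceOrNextP [negP] litP

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```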
We should now be able to fold the entire table by:
- mapping choiceOrNextP on each row, transforming our table into a list [\nextP -> firstRow, \nextP -> secondRow, \nextP -> thirdRow, ...],
- then chaining all these parsers together,
- and making sure that the final parser is allowed to call back into the first parser, as long as it does so between parentheses.
We write a function foldP to do so:
If we unfold the foldr, we can see the expression we are building:
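Schematically, for a three-row table, the fold builds (a sketch, with parens being the parenthesizing fall-through and selfP the whole parser):

```
choiceOrNextP row1
  (choiceOrNextP row2
    (choiceOrNextP row3
      (parens selfP)))
```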
And that’s it, we can now build our parser:
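Here is a complete, runnable sketch of the whole pipeline on a toy language with let, a right-associative arrow, literals and variables (my reconstruction; the post's actual language is richer):

```haskell
import Data.Char (isAlpha, isDigit)
import Text.ParserCombinators.ReadP

data Term = Var String | Lit Int | Let String Term Term | Arrow Term Term
  deriving (Eq, Show)

type ModularParser a = ReadP a -> ReadP a -> ReadP a

tok :: String -> ReadP String
tok s = skipSpaces *> string s

ident :: ReadP String
ident = skipSpaces *> munch1 isAlpha

-- `let` may recur at its own level on both sides of `in`.
letP :: ModularParser Term
letP selfP _nextP = do
  _  <- tok "let"
  x  <- ident
  _  <- tok "="
  e1 <- selfP
  _  <- tok "in"
  Let x e1 <$> selfP

-- Right-associative arrow: next level on the left, own level on the right.
arrowP :: ModularParser Term
arrowP selfP nextP = do
  l <- nextP
  _ <- tok "->"
  Arrow l <$> selfP

-- Atoms ignore both handles.
litP, varP :: ModularParser Term
litP _ _ = Lit . read <$> (skipSpaces *> munch1 isDigit)
varP _ _ = Var <$> ident

-- One parser per row: try each option, then fall back to the next row.
choiceOrNextP :: [ModularParser a] -> ReadP a -> ReadP a
choiceOrNextP ps nextP = selfP
  where selfP = foldr (\p acc -> p selfP nextP <++ acc) nextP ps

-- Fold the rows; the final fallback re-enters the whole parser in parens.
foldP :: [[ModularParser a]] -> ReadP a
foldP table = selfP
  where selfP = foldr choiceOrNextP (tok "(" *> selfP <* tok ")") table

termP :: ReadP Term
termP = foldP
  [ [letP]         -- loosest
  , [arrowP]
  , [litP, varP]   -- tightest
  ]

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```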
All you need to do to change the precedence of operators is reorder lines in parserTable. All you need to do to introduce a new precedence level is create a new line in the table. The overall structure of the language is immediately readable from the table.
One step further
It’s still not simple to recognize which productions are binary operators, and to change their associativity: we had to make this decision within arrowP, by plugging in nextP according to our intent. We can lift this decision into the parser table by creating three combinators, for left-, right-, and non-associative binary operators:
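These three combinators can be sketched as follows (my ReadP-based reconstruction; binOpLP spells out the left-recursion workaround, and the demo levels are wired by hand):

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

data Expr = Lit Int | Op String Expr Expr
  deriving (Eq, Show)

type ModularParser a = ReadP a -> ReadP a -> ReadP a

tok :: String -> ReadP String
tok s = skipSpaces *> string s

-- Right-associative: left operand at the next level, right at our own.
binOpRP :: String -> (a -> a -> a) -> ModularParser a
binOpRP s c selfP nextP = do
  l <- nextP
  _ <- tok s
  c l <$> selfP

-- Non-associative: reuse binOpRP, plugging in nextP for both arguments.
binOpNP :: String -> (a -> a -> a) -> ModularParser a
binOpNP s c _selfP nextP = binOpRP s c nextP nextP

-- Left-associative: one operand at the next level, then fold over a
-- non-empty sequence of further operands (avoiding a selfP infinite loop).
binOpLP :: String -> (a -> a -> a) -> ModularParser a
binOpLP s c _selfP nextP = do
  l  <- nextP
  rs <- many1 (tok s *> nextP)
  pure (foldl c l rs)

litP :: ReadP Expr
litP = Lit . read <$> (skipSpaces *> munch1 isDigit)

-- Hand-wired demo levels: one operator over literals in each case.
leftL, rightL, nonL :: ReadP Expr
leftL  = let p = binOpLP "-"  (Op "-")  p litP <++ litP in p
rightL = let p = binOpRP "->" (Op "->") p litP <++ litP in p
nonL   = let p = binOpNP "<"  (Op "<")  p litP <++ litP in p

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```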
binOpRP is the easiest to start with: it attempts to parse something at the next level for the left operand, then to parse the infix symbol, after which it can commit and parse the right operand at its own level. It then calls the output constructor with l and r (for instance, Pi "" l r).
One might hope that binOpLP would look like binOpRP, with the l line reading l <- selfP and the r line reading r <- nextP. However, this would result in an infinite loop of calls to selfP. The usual trick is to instead parse the first operand at the next level, then a non-empty sequence of symbol-separated operands, and to fold that sequence left-associatively (this is what combinators like chainl1 do).
binOpNP would look like binOpRP, with the l line reading l <- nextP and the r line unchanged. I cheat here by reusing the existing code of binOpRP, plugging in nextP for both arguments.
We can finally rewrite our parser table as:
We can even package this all up into a neat data type:
It is then easy to define a function interpreting entries of this data type as modular parsers, and to update the fold accordingly, so that we can finally write our parser table as:
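Assembled end to end, the packaged version can be sketched like this (my reconstruction: SelfNextParser and Atom are the constructors named in this post, while the binary-operator constructor names are my guesses):

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

data Expr = Lit Int | Op String Expr Expr
  deriving (Eq, Show)

type ModularParser a = ReadP a -> ReadP a -> ReadP a

tok :: String -> ReadP String
tok s = skipSpaces *> string s

-- The packaged table entries.
data TableEntry a
  = BinOpL String (a -> a -> a)
  | BinOpR String (a -> a -> a)
  | BinOpN String (a -> a -> a)
  | Atom (ReadP a)
  | SelfNextParser (ModularParser a)

-- Interpret an entry as a modular parser.
unEntry :: TableEntry a -> ModularParser a
unEntry (BinOpL s c) = \_selfP nextP -> do
  l  <- nextP
  rs <- many1 (tok s *> nextP)
  pure (foldl c l rs)
unEntry (BinOpR s c) = \selfP nextP -> do
  l <- nextP
  _ <- tok s
  c l <$> selfP
unEntry (BinOpN s c) = \_selfP nextP -> do
  l <- nextP
  _ <- tok s
  c l <$> nextP
unEntry (Atom p)           = \_ _ -> p
unEntry (SelfNextParser p) = p

choiceOrNextP :: [TableEntry a] -> ReadP a -> ReadP a
choiceOrNextP ps nextP = selfP
  where selfP = foldr (\p acc -> unEntry p selfP nextP <++ acc) nextP ps

foldP :: [[TableEntry a]] -> ReadP a
foldP table = selfP
  where selfP = foldr choiceOrNextP (tok "(" *> selfP <* tok ")") table

exprP :: ReadP Expr
exprP = foldP
  [ [BinOpR "->" (Op "->")]
  , [BinOpL "+" (Op "+"), BinOpL "-" (Op "-")]
  , [BinOpL "*" (Op "*")]
  , [Atom (Lit . read <$> (skipSpaces *> munch1 isDigit))]
  ]

-- Keep only parses that consume the whole input.
runP :: ReadP a -> String -> Maybe a
runP p s = case [ a | (a, "") <- readP_to_S p s ] of
  (a : _) -> Just a
  _       -> Nothing
```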
We are almost back to what was provided by Text.Megaparsec.Expr, except that we have additional constructors for arbitrary Atom parsers, and we can extend the type with new constructors to fit our future needs.
About this approach
In this post, I assumed that all the parsers in the table fail without consuming input. You might need to sprinkle some carefully-chosen trys if that’s not the case for your parsers.
You might wonder why I put the parens case in foldP, rather than adding a line SelfNextParser $ \ _selfP nextP -> parens nextP at the end of the table. The issue is that for this final line of the parser, we don’t want to run choiceOrNextP on it, because it should not default to nextP! One could deal with this, though, for instance by mapping choiceOrNextP on all but the last entry of the table. The two approaches are equivalent.
I have not seen this technique mentioned in the literature, in code, or in blogs in the wild. If it has already been documented elsewhere, please leave me a pointer in the comments.
I am also interested in pushing this further and having one declarative structure used for generating both the parser and the pretty-printer. I am currently exploring a language which uses this idea here (a more complex version of the running example is in lib/Parsing.hs, and you can see that the generated parser is tested against the pretty-printer with HUnit, SmallCheck and QuickCheck in test/Main.hs).
I’m not sure this is worthy of a publication, but I might write this all up nicely in an article format and upload it to arXiv in the future.