Parser generators and function application
TLDR: If you want to parse function application f a b correctly, you need to, counter-intuitively, give some precedence to tokens such as LPAREN, VAR, NUM, LAMBDA, etc.
I promised this post two posts ago, but it's that time of the year when our students need it, so here it comes!
The rest of this post is mostly agnostic to which parser generator in the yacc family you use, for instance OCaml's menhir or Haskell's happy.
As I mentioned in the earlier post, there are two ways to deal with precedence when using parser generators. I called them the "stratified grammar" and the "parser directives" approaches.
When showing toy programs, both methods work pretty well. But in my opinion, the "parser directives" approach is much better, as it keeps the grammar easy to read and separates the concerns of precedence and associativity. But when the time comes to implement a language with token-less function application, things become weird:
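For concreteness, here is roughly what such a grammar might look like in menhir-style syntax (a minimal sketch; the Ast module, token names, and actions are all illustrative):

```
%{ open Ast  (* assumed module defining:
                type expr = Var of string | Lam of string * expr | App of expr * expr *) %}

%token <string> VAR
%token LPAREN RPAREN LAMBDA DOT EOF

%start <Ast.expr> program

%%

program:
| e = expr; EOF                    { e }

expr:
| x = VAR                          { Var x }
| LPAREN; e = expr; RPAREN         { e }
| LAMBDA; x = VAR; DOT; e = expr   { Lam (x, e) }
| f = expr; a = expr               { App (f, a) }   (* token-less application *)
```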
This will parse a b c as a (b c), contrary to the intended (a b) c.
Now if you try to find a solution to this online, you might find people discouraging you from using the "parser directives" approach, or wrong answers given to beginners, or non-solutions.
Most answers will tell you to add a ghost %token APP, and to add %prec APP to the production expr expr. Often, they will suggest adding a %left APP line to the precedence list, which is useless by itself, and also misleading. This is all on the right track toward the correct solution, but not enough.
The correct solution
Unfortunately, to find the correct solution, I had to read the documentation carefully, and understand some low-level details of the implementation.
If you do so, you will learn that shift-reduce conflicts are resolved by comparing the declared precedence of the token to be shifted with the declared precedence of the production rule to be reduced. It is not really mentioned in the documentation, but tokens whose precedence you never declare behave as if they had maximum precedence (the conflict is then resolved in favor of shifting). You will also learn that a rule's precedence is, by default, that of its rightmost terminal.
Therefore, when presented with the following shift-reduce conflict:
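Schematically (the exact report format varies between tools):

```
expr → expr expr •    (* reduce: close the application now ... *)
expr → expr • expr    (* ... or shift the incoming VAR as another argument? *)
```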
The parser will compare the declared precedence of the token VAR with the declared precedence of the application rule, which the %prec APP phrase lets you override to be that of the ghost token APP.
So if you never declared a precedence for VAR, or made it higher than that of APP, the parser will not do what you want.
So the correct solution is the following:
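In menhir-style syntax, it looks something like this (a sketch showing only the relevant parts; the token set is illustrative, the key point being that every token that can start an expr is declared at a lower precedence than the ghost token APP):

```
%token APP   (* ghost token: the lexer never produces it *)

(* Declared from lowest to highest precedence. *)
%nonassoc LPAREN VAR LAMBDA   (* every token that can START an expr *)
%nonassoc APP                 (* higher than all of them *)

%%

expr:
| x = VAR                          { Var x }
| LPAREN; e = expr; RPAREN         { e }
| LAMBDA; x = VAR; DOT; e = expr   { Lam (x, e) }
| f = expr; a = expr %prec APP     { App (f, a) }
```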
It is pointless to write %left APP (instead of %nonassoc APP), as the left is only used to resolve a conflict between a token (or set of tokens at the same precedence level) and itself (themselves), as in:
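Schematically, with a standard binary-operator rule:

```
expr → expr PLUS expr •    (* reduce: (a + b) + c, left-associative ... *)
expr → expr • PLUS expr    (* ... or shift: a + (b + c), right-associative? *)
```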
where the PLUS token is in shift-reduce conflict with itself, in which case %left means "reduce" and %right means "shift". But we know there will NEVER be a conflict like:
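Namely, a conflict of this shape (purely hypothetical, for illustration):

```
expr → expr expr •       (* reduce the application ... *)
expr → expr • APP expr   (* ... or shift an incoming APP token? *)
```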
since APP is a ghost token! So %nonassoc APP is enough, and will not mislead people into thinking that this annotation decides the associativity of the function application rule.
Now, the reason why you have to list all the other tokens is that there will be one shift-reduce conflict per token that can appear at the start of an expr. For instance, if you only have the partial solution:
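That is, something like this, with only VAR placed below APP:

```
%nonassoc VAR
%nonassoc APP
```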
Then a b c will correctly parse to (a b) c, but a b (c d) will incorrectly parse to a (b (c d)), because in the following shift-reduce conflict:
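Schematically:

```
expr → expr expr •    (* reduce: (a b) is the function being applied ... *)
expr → expr • expr    (* ... or shift the incoming LPAREN? *)
```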
the parser had to choose between the precedence of the application rule (that of APP), and the precedence of the incoming token LPAREN. Since you never declared the latter, it effectively had higher precedence.
That’s it for parsing function application!
EDIT: To be clear, you might also have tokens whose precedence relative to APP matters even though they never start an expression. For instance, if you want f a -> b to parse as (f a) -> b rather than f (a -> b), the parser must reduce the application rule rather than shift the incoming ARROW, so ARROW needs a lower precedence than APP:
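A sketch (the %right on ARROW is an assumption, since arrows are usually right-associative):

```
(* Lowest to highest precedence. *)
%right ARROW                    (* below APP: application binds tighter *)
%nonassoc LPAREN VAR NUM LAMBDA
%nonassoc APP
```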
Those tokens should not be valid start tokens for an expression.
One caveat of the “parser directives” approach
Even though I advocate strongly for the “parser directives” approach, it still has one issue that prevents us from being as modular as we’d hope. Consider the following:
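For example, an arithmetic grammar written in the usual style (a sketch, with the usual precedence declarations):

```
%left PLUS MINUS     (* lower precedence *)
%left TIMES DIV      (* higher precedence *)

%%

expr:
| e1 = expr; PLUS;  e2 = expr { Add (e1, e2) }
| e1 = expr; MINUS; e2 = expr { Sub (e1, e2) }
| e1 = expr; TIMES; e2 = expr { Mul (e1, e2) }
| e1 = expr; DIV;   e2 = expr { Div (e1, e2) }
| n = NUM                     { Num n }
```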
A functional programmer would see this repetition, and want to abstract this away into a new non-terminal:
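Perhaps like this (a sketch, returning a function from each operator):

```
%%

binop:
| PLUS  { fun e1 e2 -> Add (e1, e2) }
| MINUS { fun e1 e2 -> Sub (e1, e2) }
| TIMES { fun e1 e2 -> Mul (e1, e2) }
| DIV   { fun e1 e2 -> Div (e1, e2) }

expr:
| e1 = expr; op = binop; e2 = expr { op e1 e2 }
| n = NUM                          { Num n }
```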
This looks really cool! Unfortunately, this breaks our parser. The reason is that, earlier, each rule had its own precedence, determined by its rightmost terminal (PLUS for the plus rule, etc.). But now, our general binop production only contains non-terminals. So the parser has no way to distinguish PLUS from TIMES at this level of abstraction, and will therefore do the same thing on such shift-reduce conflicts as:
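Schematically:

```
expr → expr binop expr •    (* reduce ... *)
expr → expr • binop expr    (* ... or shift the incoming PLUS? *)

expr → expr binop expr •    (* reduce ... *)
expr → expr • binop expr    (* ... or shift the incoming TIMES? *)
```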
But it should decide differently depending on what the binop was!
The only solution I can think of to mitigate this is to stop halfway through this refactoring:
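Something along these lines (a sketch, with ghost tokens PLUSMINUS and TIMESDIV; I verified that the placements match the conflict resolutions described below):

```
(* Lowest to highest precedence. *)
%left PLUS MINUS
%nonassoc PLUSMINUS   (* ghost: just above PLUS/MINUS *)
%left TIMES DIV
%nonassoc TIMESDIV    (* ghost: just above TIMES/DIV *)

%%

plusminus:
| PLUS  { fun e1 e2 -> Add (e1, e2) }
| MINUS { fun e1 e2 -> Sub (e1, e2) }

timesdiv:
| TIMES { fun e1 e2 -> Mul (e1, e2) }
| DIV   { fun e1 e2 -> Div (e1, e2) }

expr:
| e1 = expr; op = plusminus; e2 = expr %prec PLUSMINUS { op e1 e2 }
| e1 = expr; op = timesdiv;  e2 = expr %prec TIMESDIV  { op e1 e2 }
| n = NUM                                              { Num n }
```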
Now our shift-reduce conflicts look like:
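Schematically:

```
expr → expr plusminus expr •    (* reduce (rule precedence PLUSMINUS) ... *)
expr → expr • plusminus expr    (* ... or shift the incoming PLUS? *)

expr → expr plusminus expr •    (* reduce (rule precedence PLUSMINUS) ... *)
expr → expr • timesdiv expr     (* ... or shift the incoming TIMES? *)
```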
In the first case, the plusminus rule has precedence PLUSMINUS, which is higher than that of PLUS, so the parser correctly reduces (ensuring left-associativity of addition and subtraction).
In the second case, the plusminus rule (still) has precedence PLUSMINUS, but it is lower than that of TIMES, ensuring the precedence of multiplication over addition and subtraction.
But honestly, this halfway refactoring makes the grammar look more complicated than the redundant original one, so I don't think it's really worth it.
EDIT: Reddit user /u/Drupyog mentions the existence of a keyword %inline to handle this in Menhir. Nice!
That’s all for today!