
Discarding white-space (or any other matched rule) by not creating a node. #66

Open
BertramRaven opened this issue Jul 20, 2022 · 3 comments


@BertramRaven

BertramRaven commented Jul 20, 2022

Consider a simple grammar for a language which expects white-space as a delimiter.
e.g. let @x = 5 let @q= 10

# for illustration only!
grammar SimpleLang
   script          <- kw+
   kw              <- ws* let ws*
   let             <- "let" ws+ var ws* "=" ws* value
   var             <- "@" [A-Za-z] [A-Za-z0-9]* 
   value           <-  string / number / var / objname
   string          <- "\"" ("\\" . / [^"])* "\""
   number          <- [1-9] [0-9]*
   objname         <- [A-Za-z] [A-Za-z0-9]*
   ws              <- [\s]

In order to reduce the number of white-space TreeNodes to one per white-space sequence we can create rules for white-space as:

ws        <- [\s]*
wsp       <- [\s]+

However, the white-space itself is of no importance to the parser; it merely consumes memory, and wastes time when its nodes are skipped over during execution.
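To make the memory cost concrete, here is a minimal sketch (not Canopy's internals; the names are illustrative) of how a repetition over a single-character rule like ws <- [\s] produces one tree node per character, while a collapsed run rule produces a single node:

```python
# Minimal PEG-repetition sketch, for illustration only -- not Canopy code.

class Node:
    def __init__(self, text, elements=()):
        self.text = text
        self.elements = list(elements)

def match_ws_char(s, i):
    """ws <- [\\s] : matches one whitespace character, one node each."""
    if i < len(s) and s[i].isspace():
        return Node(s[i]), i + 1
    return None, i

def match_repeat(rule, s, i):
    """expr* : collects one child node per match of `rule`."""
    children = []
    while True:
        node, j = rule(s, i)
        if node is None:
            break
        children.append(node)
        i = j
    return Node("".join(c.text for c in children), children), i

def match_ws_run(s, i):
    """wsp <- [\\s]+ : one flat node covering the whole run."""
    j = i
    while j < len(s) and s[j].isspace():
        j += 1
    return (Node(s[i:j]), j) if j > i else (None, i)

tree, _ = match_repeat(match_ws_char, "   ", 0)
print(len(tree.elements))   # 3 -- one child node per character

node, _ = match_ws_run("   ", 0)
print(len(node.elements))   # 0 -- a single flat node for the run
```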

Would it be wise to add a symbol on a parser rule which means "consume, but do not add a TreeNode for this match"?
e.g.

# "@" means consume but do not add a parser node.
@ws        <- [\s]*
@wsp       <- [\s]+

Another use may be for language specific annotation.

# an action which adds an annotation in a separate programmer defined structure but does not add a parser node.
@annotate  <- "@@" comment %make_annotation

Note: this is different from using the current muting '@', as that mutes parts of an expression rather than an entire rule. Yes, it would be possible to add muting to every use of ws or wsp, but that makes the grammar appear messy.

Thoughts?

@jcoglan
Owner

jcoglan commented Jul 20, 2022

Placing the @ operator on rule names would not make sense in general. It is currently supported as an annotation to expressions appearing within a sequence, where it causes the expression to be dropped from that sequence's elements array in the resulting tree node, or from the array passed to any user-defined actions using the % operator.

In contrast, saying that a rule should never generate a tree node, wherever it is used, would break any use of the rule in a non-sequence position, where a tree node is required. For example, consider this grammar:

a <- b
@b <- "b"

If rule b does not generate a tree node, then rule a cannot either, and we cannot return anything as a result from this parser. It is only safe to drop tree nodes when they are part of a sequence, so that's where the @ is supported.

I also think this would produce confusing behaviour where the effect of using a rule isn't clear without reading the rule itself, and makes it unclear how many child elements a sequence will actually have when parsed. This is currently clear from reading just the sequence rule itself, whereas this addition would require reading any rules the sequence uses, and so on recursively.

I'm not sure what you mean by this example:

@annotate  <- "@@" comment %make_annotation

The effect of the %make_annotation suffix is already that "@@" comment will not generate a tree node by itself, but make_annotation() will be invoked to produce whatever value the user wants as a result of parsing this expression. Invoking make_annotation() to construct a parse node and then throwing the result away because the rule name is prefixed with @ would be wasteful, so I'm not clear on what you're actually asking for here.
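To see why the @ prefix would be redundant here, a rough Python sketch of the %action mechanism (not Canopy's actual implementation; the rule and helper names are assumptions, and the comment clause is simplified to "the rest of the line") shows that the action's return value already replaces the default tree node:

```python
# Sketch of how a %action replaces default node construction,
# loosely modelled on an action signature of the form
# action(input, start, end, elements). Illustration only.

FAILURE = object()  # internal "did not match" marker

def match_literal(text, s, i):
    if s.startswith(text, i):
        return s[i:i + len(text)], i + len(text)
    return FAILURE, i

def match_annotate(s, i, actions):
    """@annotate <- "@@" comment %make_annotation
    (comment simplified here to 'the rest of the input')."""
    marker, j = match_literal("@@", s, i)
    if marker is FAILURE:
        return FAILURE, i
    end = len(s)
    comment = s[j:end]
    # Instead of building a TreeNode, hand the pieces to the action
    # and use whatever it returns as the result of this rule.
    return actions["make_annotation"](s, i, end, [marker, comment]), end

annotations = []

def make_annotation(input, start, end, elements):
    annotations.append(elements[1])   # record in a separate structure
    return None                       # a perfectly valid result value

result, _ = match_annotate("@@ TODO: fix this", 0,
                           {"make_annotation": make_annotation})
print(result)       # None -- yet the match succeeded
print(annotations)  # [' TODO: fix this']
```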

@BertramRaven
Author

BertramRaven commented Jul 21, 2022

Ignoring the last example and looking just at the first, it is unclear why @ on rule names would "not make sense in general."
Consider the case where white-space is required (+ or *) for a language to be parseable, but should not create a TreeNode because it has meaning only for the parsing of a script, not for the AST. The canopy website and the source code on GitHub provide several examples of muting clauses which are unimportant to the AST; they even state this to make the purpose clear.
Such muting is especially important for saving memory (one TreeNode per white-space character in the early examples!) and CPU time when traversing the tree.

The example you provide against the use of @ on rule names is a simple error in the intent of the person crafting the definition, not a proof for exclusion. (Think back to the predicate-analysis rule generation you will have encountered at university: the exclusion of a rule which leads to a null functional response is an error.)

a <- b # result is a null node which must not be null as 'b' is not optional and no other clause is specified. Output a suitably irritating error message.
@b <- "b"

Saying that this would require recursive scanning of the grammar by the developer is not true: running the generator tells us exactly where the error is located, and it can only be of the type specified above. Further, such issues are already a part of canopy.

a <- b?
b <- "b"

In the case I present using @ on the rule name would be equivalent to placing the @ on every usage of that rule.
Compare:

grammar verboseExample
    script <- "this" @ws "is" @ws "just" @ws "making" @ws "a" @ws "list" @ws "of" @ws "words"
    ws    <- [\s]*

with

grammar terseExample
    script <- "this" ws "is" ws "just" ws "making" ws "a" ws "list" ws "of" ws "words"
    @ws    <- [\s]*

Now extrapolate that to a full language grammar with dozens of keywords, clauses, and sub clauses.
There are many examples of questions asked in places such as Stack Exchange about canopy where I see what can only be described as an "@" explosion.

As it is, I have implemented this in my version of canopy - it is saving me hours of debugging of the definition of a query and analysis language which I am developing in parallel with my own peg parser generator which creates rust code (code complete as of a week ago). It is also saving my not so young eyes from going @'y when I try to find a needling bug in an @ stack. ;-)

As before, I ask for an academically complete reason why using @ on rule names will lead to side-effects which are not present when using @ muting in the rule definition before I make a complete fool of myself.

@jcoglan
Owner

jcoglan commented Jul 21, 2022

In general, Canopy assumes that all parsing expressions return a value if they match the input. An expression that doesn't match produces the special internal value FAILURE, and any other value (including null and undefined) is assumed to indicate a successful match. Since expressions may produce any possible user-defined value via an %action, all normal values are assumed to indicate success.

So to begin with, we would need another special internal value to signal that an expression matched, but produced no value. Let's call this value NONE. This value must never become accessible outside the parser -- it should never end up forming part of a parse tree and be seen by the calling program.
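The three-way distinction being described here -- FAILURE (no match), a hypothetical NONE (matched, but no value), and ordinary values including null -- can be sketched with sentinel objects (a sketch of the idea, not Canopy's source):

```python
# Sketch of the three-way distinction: FAILURE means "did not match",
# a hypothetical NONE means "matched but produced no value", and any
# other value -- even None -- indicates a successful match.

FAILURE = object()   # existing internal marker: expression did not match
NONE = object()      # hypothetical marker: matched, produce no tree node

def is_match(value):
    # Everything except FAILURE, including None, counts as success,
    # because user-defined %actions may return any value at all.
    return value is not FAILURE

print(is_match(None))     # True: an action may legitimately return None
print(is_match(NONE))     # True: matched, but should yield no node
print(is_match(FAILURE))  # False: the expression did not match
```

Because sentinels are compared by identity, NONE can never be confused with a user value, but it must still be filtered out before any result escapes the parser.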

In Canopy grammars, a reference to a named rule, for example your ws rule, can appear in the following expression forms:

  • As an expression by itself: ws
  • As part of a quantifying expression: ws*, ws+, ws?
  • As the subject of a look-ahead: &ws, !ws
  • As part of a sequence: "a" ws "b" (this creates a named element on the sequence node)
  • As a branch of a choice: "a" / ws / "b"
  • As the root expression of a grammar

The basic intention of rules is just to give a name to a parsing expression, so it can be referenced in other expressions. In general it should be safe to replace any reference to a rule with its definition, except for rules that are self-recursive for which this would be impossible. You should get exactly the same parsing result by using a rule by name vs using its definition directly. This makes grammars act like normal programs and is important for debugging and refactoring.

So for this proposal to make sense, all the above possible uses of rule references would need to be able to accommodate an expression that produces no value. For some of these uses, this causes a problem:

  • If the rule is used as the root of the grammar, then the parser cannot return anything, not even null which is a normal value we allow user-defined actions to return. There is no possible return value for the parser.
  • If it is used with a quantifier (ws* or ws+) then the resulting node's elements array would need to be empty; we cannot allow nodes with elements like [NONE, NONE, NONE] as this leaks an internal value. We would have to return an empty elements array, which loses information about the number of times the rule matched. Maybe you're fine opting into that, but it's not clear from the use of the rule itself that that's what you'll get; you have to know how ws is defined, which may be difficult in a large grammar.
  • Lookaheads aren't a problem; we can teach those that NONE is a matching value, and lookaheads already do not produce meaningful nodes.
  • Using them in sequences isn't necessarily a problem as again we can teach the sequence parser that NONE is a matching value that should not halt the sequence matching. Any NONE values are excluded from the elements array just like how @ currently works. However it does mean you cannot tell how many child elements a sequence expression produces, without knowing how all the rules it uses are defined, recursively. It makes local reasoning about the grammar harder.
  • If a muted rule is used in one branch of a choice, it completely breaks the ability for anything that uses that choice rule to predict what it will produce.

Consider this grammar:

a <- "x" b "y"
b <- c / d
c <- "c"
@d <- "d"

This produces a situation where rule b might produce a node, or it might produce NONE. This in turn means we cannot predict how many child elements the sequence expression "x" b "y" will have -- the node matching "y" might be the second element, or it might be the third. All existing Canopy parsers assume that all the child nodes of a sequence match occur with consistent indexes, and this breaks that assumption and makes it much harder to write code that reliably handles such situations.
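The index instability described above can be demonstrated with a small sketch (rule and function names are illustrative, not Canopy code): when rule d is muted, the elements of the sequence "x" b "y" change shape depending on which branch of b matched.

```python
# Sketch of the index-instability problem: if rule d is muted, the
# sequence "x" b "y" yields differently shaped element arrays
# depending on which branch of b <- c / d matched.

NONE = object()  # hypothetical "matched, no value" marker

def match_b(s, i):
    if s.startswith("c", i):
        return "c", i + 1    # branch c: produces a node
    if s.startswith("d", i):
        return NONE, i + 1   # branch @d: matches, produces nothing
    return None, i

def match_a(s):
    # sequence "x" b "y", dropping NONE values as @ does for sequences
    assert s[0] == "x"
    b_node, i = match_b(s, 1)
    assert s[i] == "y"
    return [e for e in ("x", b_node, "y") if e is not NONE]

print(match_a("xcy"))  # ['x', 'c', 'y'] -- "y" is elements[2]
print(match_a("xdy"))  # ['x', 'y']      -- "y" is now elements[1]
```

Any consumer of rule a that indexes into elements now has to know which branch of b matched, which is exactly the broken assumption described above.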

This is why I've emphasised that @ is really an operation on sequence construction, not on expressions in general. It instructs the parser to omit certain nodes from sequence expressions, which is the only place where it really makes sense. Just like how you're only allowed to add labels to expressions when they're part of a sequence.

In theory we could add logic to the grammar parser that checks that muted rules are only ever used in sequence position, either directly or indirectly. However, I expect this would be complicated to implement, especially considering recursive and cyclic rules, and it's likely to produce confusing error messages. In general I'm not keen to add these sorts of non-local effects to the system, and I'd rather not make rule definitions any more special than they already are.

The main alternative idea I have is to add an operator that indicates which nodes you want to keep from a sequence, rather than those you want to discard. I don't have a lot of usability feedback to go on here. Do you have examples of some of the Stack Exchange threads you mentioned where people are having trouble with this feature?
