Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting buffers that can ignore mark requests #81

Open
ifazk opened this issue Jul 20, 2019 · 11 comments
Open

Supporting buffers that can ignore mark requests #81

ifazk opened this issue Jul 20, 2019 · 11 comments

Comments

@ifazk
Copy link

ifazk commented Jul 20, 2019

It can sometimes be useful for buffers ignore a mark action. For example, a lexbuffer can decide that it would be inappropriate to mark at the current position because it is not a grapheme cluster boundary.

Is this a use case that the maintainers of this project would be willing to support?

@ifazk
Copy link
Author

ifazk commented Jul 20, 2019

On a more technical note, as far as I understand, we would only need to make a small change to call_state in ppx_sedlex.ml. There, instead of returning a best final state, we mark the best final state and immediately backtrack. I'm trying this out in a personal fork, but I need to do a bit more testing before I can send a pull request.

@pmetzger
Copy link
Member

Let us know. That said, it seems peculiar that a regular expression would mark somewhere that you're not allowed to mark. Marking is below the level of the UI after all.

@ifazk
Copy link
Author

ifazk commented Jul 24, 2019

Here's an example of a situation where a regular expression would mark where you don't want it to mark. Suppose you have the regular expression leg and the utf8 encoded string for "leg̈" (in ocaml syntax that would be "leg\u{0308}"), a specially designed buffer can ignore the mark after the g because it knows that g is not at a character boundary.

@pmetzger
Copy link
Member

I don't understand your example. Is the issue that you don't want to match leg when you have leg̈?

@ifazk
Copy link
Author

ifazk commented Jul 26, 2019

Yes, that my issue. I don't want a match to be successful in the middle of a user-perceived character. Most people would consider to be a single character instead of two characters consisting of g and \u{0308}.

@pmetzger
Copy link
Member

Okay, so that's not "supporting buffers that can ignore mark requests", that's noting a possible bug in the way the regexp engine works. Have you read the Unicode Consortium TR 18 document? I haven't looked at it recently, but what does it claim the correct behavior is supposed to be?

@pmetzger
Copy link
Member

(A quick scan of TR18 seems to indicate that what you want is an implementation of 2.2.1 Grapheme Cluster Mode but I'm not sure that's precisely what you're asking for.)

@ifazk
Copy link
Author

ifazk commented Jul 27, 2019

What I'm precisely asking for: I want backtracking to be the only way to take a semantic action.

I want sedlex to always mark when it is in a final DFA state, and always backtrack when it can no longer transition to new DFA states. The only case where we don't currently do this is when we are at a final DFA state and there are no transitions to new DFA states. In this case, we directly take a semantic action, instead of marking and then backtracking (see the code that #83 changes).

Always using backtrack to perform semantic actions, allows custom Sedlexing buffers to control the positions at which semantic actions are allowed to take place. In the case where we are at a final DFA state and there are no more transitions out of this state, a custom Sedlexing buffer can decide whether a semantic action should take place at this position, and if not, it backtrack to the last position it would have allowed a semantic action to take place.

@ifazk
Copy link
Author

ifazk commented Jul 27, 2019

My personal use case for this: I have a custom Sedlexing buffer which is aware of Grapheme Cluster boundaries, and only stores marked state at code point positions which are at boundaries. If we try to mark at a position which is not a boundary, it just ignores the Sedlexing.mark.

This does not give us Grapheme Cluster Mode. According to the last paragraph of 2.2.1, in Grapheme Cluster Mode, we should treat /(x\u{308})/ and /(x)(\u{308})/ differently, and /(x)(\u{308})/ should never match. I'm just using sedlex as the regex engine, and it makes no distinction between /(x\u{308})/ and /(x)(\u{308})/.


Supporting any of RL2.2 Extended Grapheme Clusters, RL2.3 Default Word Boundaries or Grapheme Cluster Mode is beyond the scope of sedlex. All of these require the regex engine itself to understand grapheme clusters, words or boundaries, and I don't think it is worth the engineering effort to teach these to sedlex.

I think always using backtracking for semantic actions is a middle ground where using custom Sedlexing buffers we can only allow semantic actions to happen at boundaries.

@pmetzger
Copy link
Member

pmetzger commented Jul 28, 2019

Okay, given that what you wan't isn't Grapheme Cluster Mode, I have to confess, and perhaps it's just that I'm thick, that I don't understand what it is that you want to do, even having read the above a few times. Perhaps you could give an example of what the resulting syntax expressing your improvement would be in a real sedlex specification?

@ifazk
Copy link
Author

ifazk commented Jul 29, 2019

Thanks for being so patient with me! I'm going to try to explain myself again.

Suppose I have the following code:

match%sedlex buf with
| "let" -> Token.LET
| "=" -> Token.EQ
| eof -> Token.EOF
| _ -> raise (Token.error @@ Sedlexing.positions buf)

and the input stream "let\u{308}".

For the above, sedlex will not call mark or backtrack at t and instead just directly take a semantic action and return Token.LET.

Here, instead of directly taking a semantic action at t, I want sedlex to call mark and then backtrack and then take the appropriate semantic action according to backtrack.

In my custom buffer, if we try to call mark at the t, my custom buffer won't actually store anything, since t does not occur before a grapheme cluster boundary. This way, when sedlex calls backtrack, I raise an error instead of returning Token.LET.

If we had the input stream "let=" instead, since t occurs at a boundary, my custom buffer will store the state, and when backtrack is called, will return Token.LET.

Instead of having sedlex worry about boundaries, I'm trying to move the burden to buffers. Currently, while a custom buffer can control what mark and backtrack do, it cannot prevent semantic actions from happening at all non-boundary codepoints, because if there are no transitions out of a DFA state, sedlex directly takes a semantic action instead of going through the buffer first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants