[squeak-dev] The Inbox: Regex-Core-ct.70.mcz

christoph.thiede at student.hpi.uni-potsdam.de christoph.thiede at student.hpi.uni-potsdam.de
Wed Oct 20 22:39:08 UTC 2021


For an extensive list of patterns that can be parsed now, please see Regex-Tests-Core-ct.26.

Some notes for later:

- I was struggling to produce formal evidence that there can be no matching chain of links where the first occurence of a link would not advance the cursor but the second occurence would actually constitute a match. The only example of which I could think would be some side effect like ^a(\1?)*$ to match 'aa', but apparently no other popular regex engine is able to handle this, too. Also, support for backreferences is yet on my to-do list. :-) Can you think of any other violation?
- I already added the word "greedy" to the comment of RxMatcher >> #matchAgainstBranch:nextLink:alternativeLink:. Yes, adding support for non-greedy operators such as ?? or *+ would be a funny follow-up project, too. :D

Further reading:

[1] https://pypi.org/project/regex/
[2] https://262.ecma-international.org/5.1/#sec-15.10
[3] http://regexstorm.net/tester
[4] https://regex101.com/
[5] https://www.regular-expressions.info/ 

Best,
Christoph

---
Sent from Squeak Inbox Talk

On 2021-10-20T22:26:41+00:00, commits at source.squeak.org wrote:

> A new version of Regex-Core was added to project The Inbox:
> http://source.squeak.org/inbox/Regex-Core-ct.70.mcz
> 
> ==================== Summary ====================
> 
> Name: Regex-Core-ct.70
> Author: ct
> Time: 21 October 2021, 12:26:40.595482 am
> UUID: 806ed75d-4a4b-de49-80d4-ff40093c91f9
> Ancestors: Regex-Core-ct.63
> 
> Adds support for nullable closures and eliminates the eponymous error message. Expressions like these are now parsed and matched correctly:
> 
>     (a|b?)*
>     (a*)*
>     (a+){,3}
>     ()*|
>     (?<=a)?
> 
> This design decision is inspired by the behavior of several existing regex implementations (such as Python's regex module [1], the ECMAScript standard [2], .NET [3], or Java [4]). While some of the examples from above might be pretty useless in practice, especially the first example - conditional nullability - was a serious limitation in many cases before. See regex101.com and https://www.regular-expressions.info/ for further information.
> 
> Implementation summary:
> Matching nullable closures is non-trivial as, considering the second example from above, the matcher might try to match the inner piece again and again while capturing empty strings for every repetition of the outer piece. We avoid such infinite loops by remembering all open branch positions temporarily while matching a certain path of links in the matcher. If an open branch is reached again at the same position, a match will not possible and is refused immediately. In the parser, #signalNullableClosureParserError and all sends to it are eliminated or replaced by finer-granular validity checks.
> 
> Depends indeed on Regex-Core-ct.63 to avoid another merge conflict in the parser.
> 
> =============== Diff against Regex-Core-ct.63 ===============
> 
> Item was changed:
> Object subclass: #RxMatcher
> +     instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions previousMarkerPositions markerCount branchPositions lastResult firstTryMatch'
> -     instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions previousMarkerPositions markerCount lastResult firstTryMatch'
>     classVariableNames: 'Cr Lf NullCharacter'
>     poolDictionaries: ''
>     category: 'Regex-Core'!
> 
> + !RxMatcher commentStamp: 'ct 10/20/2021 23:47' prior: 0!
> - !RxMatcher commentStamp: 'ul 8/28/2015 14:18' prior: 0!
> -- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
> --
> This is a recursive regex matcher. Not strikingly efficient, but simple. Also, keeps track of matched subexpressions. The life cycle goes as follows:
> 
> 1. Initialization. Accepts a syntax tree (presumably produced by RxParser) and compiles it into a matcher built of other classes in this category.
> 
> 2. Matching. Accepts a stream or a string and returns a boolean indicating whether the whole stream or its prefix -- depending on the message sent -- matches the regex.
> 
> 3. Subexpression query. After a successful match, and before any other match, the matcher may be queried about the range of specific stream (string) positions that matched to certain parenthesized subexpressions of the original expression.
> 
> Any number of queries may follow a successful match, and any number or matches may follow a successful initialization.
> 
> Note that `matcher' is actually a sort of a misnomer. The actual matcher is a web of Rxm* instances built by RxMatcher during initialization. RxMatcher is just the interface facade of this network. It is also a builder of it, and also provides a stream-like protocol to easily access the stream being matched.
> 
> Instance variables:
>     matcher                    <RxmLink> The entry point into the actual matcher.
>     igoreCase                    <Boolean> Whether the matching algorithm should be case sensitive or not.
>     startOptimizer                <RxMatchOptimizer> An object which can quickly decide whether the next character can be the prefix of a match or not.
>     stream                        <Stream> The stream currently being matched against.
>     markerPositions            <Array of: nil | Integer | OrderedCollection> Positions of markers' matches.
>     previousMarkerPositions    <Array of: nil | Integer | OrderedCollection> Positions of markers from the previous #tryMatch send.
>     markerCount                <Integer> Number of markers.
> +     branchPositions            <Array of: OrderedCollection of: Integer> Temporary positions of looped branches.
>     lastResult                     <Boolean> Whether the latest match attempt succeeded or not.
>     firtTryMatch                <Boolean> True if there hasn't been any send of #tryMatch during the current matching.!
> 
> Item was changed:
> ----- Method: RxMatcher>>initializeMarkerPositions (in category 'initialize-release') -----
> initializeMarkerPositions
> 
> +     self flag: #refactor. "When all merge conflicts from the inbox have been resolved, consider renaming this to #initializeMatcherPositions."
> +     
>     markerPositions := Array new: markerCount.
>     previousMarkerPositions := Array new: markerCount..
>     3 to: markerCount do: [ :index |
>         markerPositions at: index put: (OrderedCollection new: 1).
> +         previousMarkerPositions at: index put: (OrderedCollection new: 1) ].
> +     
> +     branchPositions := Dictionary new.!
> -         previousMarkerPositions at: index put: (OrderedCollection new: 1) ].!
> 
> Item was added:
> + ----- Method: RxMatcher>>matchAgainstBranch:nextLink:alternativeLink: (in category 'matching') -----
> + matchAgainstBranch: anRxmBranch nextLink: nextLink alternativeLink: alternativeLink
> +     "Greedy matching."
> + 
> +     | position result |
> +     "Detect infinite empty backloops from nested links."
> +     position := stream position.
> +     branchPositions
> +         at: anRxmBranch
> +         ifPresent: [:positions | (positions includes: position) ifTrue: [
> +             "We already had this situation earlier in the current stack, don't retry again"
> +             ^ false]].
> +     
> +     "Remember the current position for this branch to enable detection of infinite empty backloops."
> +     (branchPositions at: anRxmBranch ifAbsentPut: [OrderedCollection new]) add: position.
> +     
> +     result := (nextLink matchAgainst: self)
> +         or: [alternativeLink notNil and: [alternativeLink matchAgainst: self]].
> +     
> +     (branchPositions at: anRxmBranch) remove: position.
> +     
> +     ^ result!
> 
> Item was changed:
> ----- Method: RxMatcher>>resetMarkerPositions (in category 'private') -----
> resetMarkerPositions
>     "Reset the marker positions. This method should only be sent from #tryMatch. When this is after the first #tryMatch send, then the marker positions must be swapped."
> 
> +     self flag: #refactor. "When all merge conflicts from the inbox have been resolved, consider renaming this to #resetMatcherPositions."
> +     
>     firstTryMatch
>         ifTrue: [ firstTryMatch := false ]
>         ifFalse: [
>             | temp |
>             temp := previousMarkerPositions.
>             previousMarkerPositions := markerPositions.
>             markerPositions := temp ].
>     markerPositions
>         at: 1 put: nil;
>         at: 2 put: nil.
>     3 to: markerCount do: [ :index | 
> +         (markerPositions at: index) resetTo: 1 ].
> +     
> +     branchPositions removeAll.!
> -         (markerPositions at: index) resetTo: 1 ]!
> 
> Item was changed:
> ----- Method: RxParser>>atom (in category 'recursive descent') -----
> atom
>     "An atom is one of a lot of possibilities, see below."
> 
>     | atom |
>     (lookahead == nil 
> +     or: [ lookahead == $| ] 
> +     or: [ lookahead == $) ])
> -     or: [ lookahead == $| 
> -     or: [ lookahead == $)
> -     or: [ lookahead == $*
> -     or: [ lookahead == $+ 
> -     or: [ lookahead == $? ]]]]])
>         ifTrue: [ ^RxsEpsilon new ].
> +     
> +     (lookahead == $*
> +     or: [ lookahead == $+ ]
> +     or: [ lookahead == $? ])
> +         ifTrue: [ ^self signalParseError: 'nothing to repeat' ].
> +     
> -         
>     lookahead == $( 
>         ifTrue: [
>             ^ self group ].
>     
>     lookahead == $[
>         ifTrue: [
>             "<atom> ::= '[' <characterSet> ']' "
>             self match: $[.
>             atom := self characterSet.
>             self match: $].
>             ^atom ].
>     
>     lookahead == $: 
>         ifTrue: [
>             "<atom> ::= ':' <messagePredicate> ':' "
>             self match: $:.
>             atom := self messagePredicate.
>             self match: $:.
>             ^atom ].
>     
>     lookahead == $. 
>         ifTrue: [
>             "any non-whitespace character"
>             self next.
>             ^RxsContextCondition new beAny].
>     
>     lookahead == $^ 
>         ifTrue: [
>             "beginning of line condition"
>             self next.
>             ^RxsContextCondition new beBeginningOfLine].
>     
>     lookahead == $$ 
>         ifTrue: [
>             "end of line condition"
>             self next.
>             ^RxsContextCondition new beEndOfLine].
>         
>     lookahead == $\ 
>         ifTrue: [
>             "<atom> ::= '\' <character>"
>             self next ifNil: [ self signalParseError: 'bad quotation' ].
>             (BackslashConstants includesKey: lookahead) ifTrue: [
>                 atom := RxsCharacter with: (BackslashConstants at: lookahead).
>                 self next.
>                 ^ atom].
>             self
>                 ifSpecial: lookahead
>                 then: [:node | self next. ^ node]].
>         
>     "If passed through the above, the following is a regular character."
>     atom := RxsCharacter with: lookahead.
>     self next.
>     ^atom!
> 
> Item was changed:
> ----- Method: RxParser>>piece (in category 'recursive descent') -----
> piece
>     "<piece> ::= <atom> | <atom>* | <atom>+ | <atom>? | <atom>{<number>,<number>}"
> 
>     | atom |
>     atom := self atom.
>     
>     lookahead == $*
>         ifTrue: [ 
>             self next.
> -             atom isNullable
> -                 ifTrue: [ self signalNullableClosureParserError ].
>             ^ RxsPiece new initializeStarAtom: atom ].
> 
>     lookahead == $+
>         ifTrue: [ 
>             self next.
> -             atom isNullable
> -                 ifTrue: [ self signalNullableClosureParserError ].
>             ^ RxsPiece new initializePlusAtom: atom ].
> 
>     lookahead == $?
>         ifTrue: [ 
>             self next.
> -             atom isNullable
> -                 ifTrue: [ self signalNullableClosureParserError ].
>             ^ RxsPiece new initializeOptionalAtom: atom ].
>     
>     lookahead == ${
>         ifTrue: [
>             ^ self quantifiedAtom: atom ].
>         
>     ^ RxsPiece new initializeAtom: atom!
> 
> Item was changed:
> ----- Method: RxParser>>quantifiedAtom: (in category 'recursive descent') -----
> quantifiedAtom: atom
>     "Parse a quanitifer expression which can have one of the following forms
>         {<min>,<max>} match <min> to <max> occurences
>         {<minmax>} which is the same as with repeated limits: {<number>,<number>}
>         {<min>,} match at least <min> occurences
>         {,<max>} match maximally <max> occurences, which is the same as {0,<max>}"
>     | min max |
>     self next.
>     lookahead == $,
>         ifTrue: [ min := 0 ]
>         ifFalse: [
>             max := min := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].
>     lookahead == $,
>         ifTrue: [
>             self next.
>             max := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].    
>     self match: $}.
> -     atom isNullable
> -         ifTrue: [ self signalNullableClosureParserError ].
>     (max notNil and: [ max < min ])
>         ifTrue: [ self signalParseError: ('wrong quantifier, expected ', min asString, ' <= ', max asString) ].
>     ^ RxsPiece new 
>         initializeAtom: atom
>         min: min
>         max: max!
> 
> Item was removed:
> - ----- Method: RxParser>>signalNullableClosureParserError (in category 'private') -----
> - signalNullableClosureParserError
> -     self signalParseError: ' nullable closure'.!
> 
> Item was changed:
> ----- Method: RxmBranch>>matchAgainst: (in category 'matching') -----
> matchAgainst: aMatcher
>     "Match either `next' or `alternative'. Fail if the alternative is nil."
> 
> +     ^ aMatcher matchAgainstBranch: self nextLink: next alternativeLink: alternative!
> -     (next matchAgainst: aMatcher) ifTrue: [ ^true ].
> -     ^(alternative ifNil: [ ^false ]) matchAgainst: aMatcher!
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20211021/ce83320c/attachment.html>


More information about the Squeak-dev mailing list