[squeak-dev] The Inbox: Regex-Core-ct.56.mcz

Thiede, Christoph Christoph.Thiede at student.hpi.uni-potsdam.de
Fri Apr 17 11:51:33 UTC 2020


Hi Levente,


thanks for the review! Yeah, you're right, there must be some (maybe historic) bug when parsing lookaround expressions. I will have a look at this as soon as possible.

However, I'm quite busy at the moment, so please forgive me if I won't be able to fix this until next week ...


> For bonus points, it would be great to support non-capturing groups too

> (?:). :)

Yeah, this is a great idea! I already had implemented this in my image some time ago, but I did not yet commit it to the inbox (tests are not yet complete).
Regex is a really exciting domain. I'm also having named captures in my pipeline. I'm looking forward to completing work on these features! :-)

Best,
Christoph

________________________________
Von: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> im Auftrag von Levente Uzonyi <leves at caesar.elte.hu>
Gesendet: Mittwoch, 15. April 2020 10:30:49
An: squeak-dev at lists.squeakfoundation.org
Betreff: Re: [squeak-dev] The Inbox: Regex-Core-ct.56.mcz

Hi Christoph,

I finally had a look at these changes. I didn't look at the
implementation in depth, but I found two issues:

1) the lookarounds behave as capturing groups. They even introduce
spurious subexpressions. Here's an example:

'(?<=a)b' asRegex search: 'ab'; in: [ :regex |
         (1 to: regex subexpressionCount)
                 collect: [ :index | regex subexpressions: index ]
                 as: Array ].

which gives #(#('b') #('') #('a')) instead of just #(#('b')).

A regex with a regular capturing group

'(a)b' asRegex search: 'ab'; in: [ :regex |
         (1 to: regex subexpressionCount)
                 collect: [ :index | regex subexpressions: index ]
                 as: Array ]

gives the expected #(#('ab') #('a')).

2) lookarounds can be quantified, and when they are, they behave
unexpectedly. For example, the following raises an error during #search:

'(?<=a)?b' asRegex search: 'ab'

IMO quantification doesn't make sense here. There should be an error
raised by #asRegex when a lookaround group is quantified. If anything
needs to be quantified, it can be done inside the lookaround group.

When the lookarounds are used as expected, they seem to work as expected.
But I think these issues need to be addressed before these changes are
integrated.
For bonus points, it would be great to support non-capturing groups too
(?:). :)


Levente

On Fri, 6 Mar 2020, commits at source.squeak.org wrote:

> Christoph Thiede uploaded a new version of Regex-Core to project The Inbox:
> http://source.squeak.org/inbox/Regex-Core-ct.56.mcz
>
> ==================== Summary ====================
>
> Name: Regex-Core-ct.56
> Author: ct
> Time: 6 March 2020, 8:32:34.316886 pm
> UUID: 22955049-895d-ca43-8919-1d9f44b851f9
> Ancestors: Regex-Core-ct.55
>
> Implements lookbehinds (both positive and negative) in Regular Expressions for Squeak.
>
> '(?<=(?<!n''t\s+)l[o]ve\s+)\w+' asRegex matchesIn: 'I love Squeak. I don''t love C++. I love Smalltalk.'.
>
> - Honor lookbehind syntax in RxParser (see #lookAround)
> - Add #forward argument to nodes/links messages. Extend RxsLookaround state by #forward boolean.
> - Create RxmLookahead as universal class for both lookahead and lookbehind links. Remove RxmLookahead.
> - Implement actual lookbehind logic in RxMatcher >> #matchAgainstLookbehind:positive:nextLink:.
>
> Again, I decided not to give further support for the #forward-less versions of the relevant messages, nor for the class RxmLookahead. First, I'm not sure whether "lookahead" is a universally useful default assumption for "lookaround". Also, these selectors have been introduced just a few hours ago in Regex-Core-ct.55, so there should not be a high demand for backward compatibility. Second, all these things are hidden behind the RxMatcher facade, so foreign clients should not really depend on them. (Bad pun: Don't look-behind the facade ...) Opinions on this topic are highly appreciated.
>
> This commit depends indeed on Regex-Core-ct.55. Please review carefully! The lookbehind matching implementation uses an intuitive algorithm. I give absolutely no guarantee that this is an efficient implementation, but for the beginning, even a superpolynomial complexity should be better than no implementation at all, shouldn't it? :-) Further information about lookbehinds can be found here: https://www.regular-expressions.info/lookaround.html
>
> =============== Diff against Regex-Core-ct.55 ===============
>
> Item was added:
> + ----- Method: RxMatchOptimizer>>syntaxLookaround:forward:positive: (in category 'double dispatch') -----
> + syntaxLookaround: lookaroundNode forward: forward positive: positive
> +      "Do nothing."!
>
> Item was removed:
> - ----- Method: RxMatchOptimizer>>syntaxLookaround:positive: (in category 'double dispatch') -----
> - syntaxLookaround: lookaroundNode positive: positive
> -      "Do nothing."!
>
> Item was added:
> + ----- Method: RxMatcher>>matchAgainstLookbehind:positive:nextLink: (in category 'matching') -----
> + matchAgainstLookbehind: lookbehind positive: positive nextLink: anRmxLink
> +
> +      | position matchesLookbehind |
> +      position := stream position.
> +      matchesLookbehind := (position to: 0 by: -1)
> +              anySatisfy: [:index |
> +                      stream position: index.
> +                      (lookbehind matchAgainst: self)
> +                              and: [stream position = position]].
> +      matchesLookbehind = positive
> +              ifFalse: [^ false].
> +      stream position: position.
> +      ^ anRmxLink matchAgainst: self!
>
> Item was added:
> + ----- Method: RxMatcher>>syntaxLookaround:forward:positive: (in category 'double dispatch') -----
> + syntaxLookaround: lookaroundNode forward: forwardBoolean positive: positiveBoolean
> +      "Double dispatch from the syntax tree.
> +      Special link can handle lookarounds (look ahead and look behind, positive and negative)."
> +      | piece |
> +      piece := lookaroundNode piece dispatchTo: self.
> +      ^ RxmLookaround with: piece forward: forwardBoolean positive: positiveBoolean!
>
> Item was removed:
> - ----- Method: RxMatcher>>syntaxLookaround:positive: (in category 'double dispatch') -----
> - syntaxLookaround: lookaroundNode positive: positiveBoolean
> -      "Double dispatch from the syntax tree.
> -      Special link can handle lookarounds (look ahead, positive and negative)."
> -      | piece |
> -      piece := lookaroundNode piece dispatchTo: self.
> -      ^ RxmLookahead with: piece positive: positiveBoolean!
>
> Item was changed:
>  ----- Method: RxParser>>lookAround (in category 'recursive descent') -----
>  lookAround
>        "Parse a lookaround expression after: (?<lookaround>)
>        <lookaround> ::= !!<regex> | =<regex>"
> +      | lookbehind positive |
> +      ('!!=<' includes: lookahead) ifFalse: [
> -      | positive |
> -      ('!!=' includes: lookahead) ifFalse: [
>                ^ self signalParseError: 'Invalid lookaround expression ?', lookahead asString].
> +      lookbehind := lookahead == $<
> +              ifTrue: [self next];
> +              yourself.
>        positive := lookahead == $=.
>        self next.
>        ^ RxsLookaround
>                with: self regex
> +              forward: lookbehind not
>                positive: positive!
>
> Item was removed:
> - RxmLink subclass: #RxmLookahead
> -      instanceVariableNames: 'lookahead positive'
> -      classVariableNames: ''
> -      poolDictionaries: ''
> -      category: 'Regex-Core'!
> -
> - !RxmLookahead commentStamp: 'ct 3/6/2020 18:29' prior: 0!
> - Instance holds onto a lookahead which matches but does not consume anything.
> -
> - Instance Variables
> -      lookahead:              <RxmLink>
> -      positive:               <Boolean>
> - !
>
> Item was removed:
> - ----- Method: RxmLookahead class>>with:positive: (in category 'instance creation') -----
> - with: aPiece positive: aBoolean
> -
> -      ^self new lookahead: aPiece positive: aBoolean!
>
> Item was removed:
> - ----- Method: RxmLookahead>>lookahead:positive: (in category 'accessing') -----
> - lookahead: anRxmLink positive: aBoolean
> -      lookahead := anRxmLink.
> -      positive := aBoolean.!
>
> Item was removed:
> - ----- Method: RxmLookahead>>matchAgainst: (in category 'matching') -----
> - matchAgainst: aMatcher
> -      "Match if the predicate block evaluates to true when given the
> -      current stream character as the argument."
> -
> -      ^aMatcher matchAgainstLookahead: lookahead positive: positive nextLink: next!
>
> Item was removed:
> - ----- Method: RxmLookahead>>postCopy (in category 'copying') -----
> - postCopy
> -
> -      super postCopy.
> -      lookahead := lookahead copy!
>
> Item was removed:
> - ----- Method: RxmLookahead>>postCopyUsing: (in category 'copying') -----
> - postCopyUsing: anIdentityDictionary
> -
> -      super postCopyUsing: anIdentityDictionary.
> -      lookahead := lookahead copyUsing: anIdentityDictionary!
>
> Item was removed:
> - ----- Method: RxmLookahead>>terminateWith: (in category 'building') -----
> - terminateWith: aNode
> -      lookahead terminateWith: aNode.
> -      super terminateWith: aNode.!
>
> Item was added:
> + RxmLink subclass: #RxmLookaround
> +      instanceVariableNames: 'forward positive lookaround'
> +      classVariableNames: ''
> +      poolDictionaries: ''
> +      category: 'Regex-Core'!
> +
> + !RxmLookaround commentStamp: 'ct 3/6/2020 19:45' prior: 0!
> + Instance holds onto a lookaround which matches but does not consume anything.
> +
> + Instance Variables
> +      lookbehind:             <RxmLink>
> +      forward:                <Boolean>
> +      positive:               <Boolean>!
>
> Item was added:
> + ----- Method: RxmLookaround class>>with:forward:positive: (in category 'instance creation') -----
> + with: aPiece forward: forwardBoolean positive: positiveBoolean
> +
> +      ^self new lookaround: aPiece forward: forwardBoolean positive: positiveBoolean!
>
> Item was added:
> + ----- Method: RxmLookaround>>lookaround:forward:positive: (in category 'accessing') -----
> + lookaround: anRxmLink forward: forwardBoolean positive: positiveBoolean
> +      lookaround := anRxmLink.
> +      forward := forwardBoolean.
> +      positive := positiveBoolean.!
>
> Item was added:
> + ----- Method: RxmLookaround>>matchAgainst: (in category 'matching') -----
> + matchAgainst: aMatcher
> +      "Match if the predicate block evaluates to true when given the current stream character as the argument."
> +
> +      ^ forward
> +              ifTrue: [aMatcher matchAgainstLookahead: lookaround positive: positive nextLink: next]
> +              ifFalse: [aMatcher matchAgainstLookbehind: lookaround positive: positive nextLink: next]!
>
> Item was added:
> + ----- Method: RxmLookaround>>postCopy (in category 'copying') -----
> + postCopy
> +
> +      super postCopy.
> +      lookaround := lookaround copy!
>
> Item was added:
> + ----- Method: RxmLookaround>>postCopyUsing: (in category 'copying') -----
> + postCopyUsing: anIdentityDictionary
> +
> +      super postCopyUsing: anIdentityDictionary.
> +      lookaround := lookaround copyUsing: anIdentityDictionary!
>
> Item was added:
> + ----- Method: RxmLookaround>>terminateWith: (in category 'building') -----
> + terminateWith: aNode
> +      lookaround terminateWith: aNode.
> +      super terminateWith: aNode.!
>
> Item was changed:
>  RxsNode subclass: #RxsLookaround
> +      instanceVariableNames: 'piece forward positive'
> -      instanceVariableNames: 'piece positive'
>        classVariableNames: ''
>        poolDictionaries: ''
>        category: 'Regex-Core'!
>
> + !RxsLookaround commentStamp: 'ct 3/6/2020 18:31' prior: 0!
> + Lookaround is used for lookaheads and lookbehinds. They are used to check if the input matches a certain subexpression without consuming any characters (e.g. not advancing the match position).
> - !RxsLookaround commentStamp: '<historical>' prior: 0!
> - I lookaround is used for lookaheads and lookbehinds. They are used to check if the input matches a certain subexpression without consuming any characters (e.g. not advancing the match position).
>
>  Lookarounds can be positive or negative. If they are positive the condition fails if the subexpression fails, if they are negative it is inverse.!
>
> Item was added:
> + ----- Method: RxsLookaround class>>with:forward:positive: (in category 'instance creation') -----
> + with: aRxsRegex forward: forwardBoolean positive: positiveBoolean
> +      ^ self new
> +              initializePiece: aRxsRegex
> +              forward: forwardBoolean
> +              positive: positiveBoolean!
>
> Item was removed:
> - ----- Method: RxsLookaround class>>with:positive: (in category 'instance creation') -----
> - with: aRxsRegex positive: positiveBoolean
> -      ^ self new
> -              initializePiece: aRxsRegex
> -              positive: positiveBoolean!
>
> Item was added:
> + ----- Method: RxsLookaround>>beLookahead (in category 'initialize-release') -----
> + beLookahead
> +      forward := true!
>
> Item was added:
> + ----- Method: RxsLookaround>>beLookbehind (in category 'initialize-release') -----
> + beLookbehind
> +      forward := false!
>
> Item was changed:
>  ----- Method: RxsLookaround>>dispatchTo: (in category 'accessing') -----
>  dispatchTo: aBuilder
>        "Inform the matcher of the kind of the node, and it will do whatever it has to."
> +      ^aBuilder syntaxLookaround: self forward: self forward positive: self positive!
> -      ^aBuilder syntaxLookaround: self positive: self positive!
>
> Item was added:
> + ----- Method: RxsLookaround>>forward (in category 'accessing') -----
> + forward
> +
> +      ^ forward!
>
> Item was added:
> + ----- Method: RxsLookaround>>initializePiece:forward:positive: (in category 'initialize-release') -----
> + initializePiece: anRsxPiece forward: forwardBoolean positive: positiveBoolean
> +
> +      piece := anRsxPiece.
> +      forward := forwardBoolean.
> +      positive := positiveBoolean.!
>
> Item was removed:
> - ----- Method: RxsLookaround>>initializePiece:positive: (in category 'initialize-release') -----
> - initializePiece: anRsxPiece positive: positiveBoolean
> -
> -      piece := anRsxPiece.
> -      positive := positiveBoolean.!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20200417/8a81a80b/attachment-0001.html>


More information about the Squeak-dev mailing list