Christoph Thiede uploaded a new version of Regex-Core to project The Trunk: http://source.squeak.org/trunk/Regex-Core-ct.63.mcz
==================== Summary ====================
Name: Regex-Core-ct.63
Author: ct
Time: 23 August 2021, 6:35:34.901768 pm
UUID: d8bde553-3c08-514b-af2e-6817361b0737
Ancestors: Regex-Core-mt.61

Adds support for non-capturing groups. Also fixes a bug while parsing lookaround-like regexes (see #testLookaroundParser, Regex-Tests-Core-ct.18).
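
For illustration, a minimal workspace sketch of what the summary describes. It assumes an image with this version loaded and uses only the existing String>>asRegex and RxMatcher protocol (matches:, subexpressionCount, subexpression:); the pattern and input are made up for the example and are not taken from the commit or its tests.

```smalltalk
"Workspace sketch; assumes Regex-Core-ct.63 or later is loaded.
The pattern and the input string are hypothetical examples."
| matcher |
matcher := '(?:ab)+(c)' asRegex.

"The non-capturing group (?:ab) still takes part in matching."
matcher matches: 'ababc'.

"Subexpression 1 is the whole match. Because (?:ab) allocates no markers,
the next index should refer to the capturing group (c)."
matcher subexpressionCount.
matcher subexpression: 2.
```

If the change behaves as described, subexpression: 2 should answer the text matched by (c) rather than by (?:ab). Before this version, the same pattern would presumably have been rejected by the parser.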
=============== Diff against Regex-Core-mt.61 ===============
Item was changed:
  ----- Method: RxMatcher>>syntaxRegex: (in category 'double dispatch') -----
  syntaxRegex: regexNode
+ 	"Double dispatch from the syntax tree.
+ 	Regex node is a chain of branches to be tried. Should compile this into a bundle of parallel branches, between two marker nodes."
+ 	
+ 	| startNode endNode alternatives |
+ 	regexNode isCapturing ifFalse: [
+ 		^ regexNode branch dispatchTo: self].
- 	"Double dispatch from the syntax tree.
- 	Regex node is a chain of branches to be tried. Should compile this
- 	into a bundle of parallel branches, between two marker nodes."
+ 	startNode := RxmMarker new index: self allocateMarker.
+ 	endNode := RxmMarker new index: self allocateMarker.
- 	| startIndex endIndex endNode alternatives |
- 	startIndex := self allocateMarker.
- 	endIndex := self allocateMarker.
- 	endNode := RxmMarker new index: endIndex.
  	alternatives := self hookBranchOf: regexNode onto: endNode.
+ 	^ startNode
- 	^(RxmMarker new index: startIndex)
  		pointTailTo: alternatives;
  		yourself!

Item was changed:
  ----- Method: RxParser>>atom (in category 'recursive descent') -----
  atom
  	"An atom is one of a lot of possibilities, see below."
  
  	| atom |
  	(lookahead == nil or: [ lookahead == $| or: [ lookahead == $) or: [ lookahead == $* or: [ lookahead == $+ or: [ lookahead == $? ]]]]])
  		ifTrue: [ ^RxsEpsilon new ].
  	lookahead == $(
  		ifTrue: [
+ 			^ self group ].
- 			"<atom> ::= '(' <regex> ')' "
- 			self match: $(.
- 			atom := self regex.
- 			self match: $).
- 			^atom ].
  	lookahead == $[
  		ifTrue: [
  			"<atom> ::= '[' <characterSet> ']' "
  			self match: $[.
  			atom := self characterSet.
  			self match: $].
  			^atom ].
  	lookahead == $:
  		ifTrue: [
  			"<atom> ::= ':' <messagePredicate> ':' "
  			self match: $:.
  			atom := self messagePredicate.
  			self match: $:.
  			^atom ].
  	lookahead == $.
  		ifTrue: [
  			"any non-whitespace character"
  			self next.
  			^RxsContextCondition new beAny].
  	lookahead == $^
  		ifTrue: [
  			"beginning of line condition"
  			self next.
  			^RxsContextCondition new beBeginningOfLine].
  	lookahead == $$
  		ifTrue: [
  			"end of line condition"
  			self next.
  			^RxsContextCondition new beEndOfLine].
  	lookahead == $\
  		ifTrue: [
  			"<atom> ::= '\' <character>"
  			self next ifNil: [ self signalParseError: 'bad quotation' ].
  			(BackslashConstants includesKey: lookahead) ifTrue: [
  				atom := RxsCharacter with: (BackslashConstants at: lookahead).
  				self next.
+ 				^ atom].
+ 			self
+ 				ifSpecial: lookahead
+ 				then: [:node | self next. ^ node]].
- 				^atom].
- 			self ifSpecial: lookahead
- 				then: [:node | self next. ^node]].
  	"If passed through the above, the following is a regular character."
  	atom := RxsCharacter with: lookahead.
  	self next.
  	^atom!

Item was added:
+ ----- Method: RxParser>>group (in category 'recursive descent') -----
+ group
+ 
+ 	self match: $(.
+ 	lookahead == $?
+ 		ifFalse: [
+ 			| group |
+ 			"<group> ::= '(' <regex> ')' "
+ 			group := self regex.
+ 			self match: $).
+ 			^ group].
+ 	
+ 	self next.
+ 	
+ 	lookahead == $:
+ 		ifTrue: [
+ 			"non-capturing group"
+ 			"<group> ::= '(?:' <regex> ')' "
+ 			| group |
+ 			self next.
+ 			group := self regex.
+ 			group beNonCapturing.
+ 			self match: $).
+ 			^ group].
+ 	
+ 	('<=!!' includes: lookahead)
+ 		ifTrue: [
+ 			| lookaround |
+ 			lookaround := self lookAround.
+ 			self match: $).
+ 			^ lookaround ].
+ 	
+ 	^ self signalParseError!

Item was changed:
  ----- Method: RxParser>>lookAround (in category 'recursive descent') -----
  lookAround
  	"Parse a lookaround expression after: (?<lookaround>)
+ 	<lookaround> ::= !!<regex> | =<regex>
+ 	Positive lookahead: ?=
+ 	Negative lookahead: ?!!
+ 	Positive lookbehind: ?<=
+ 	Negative lookbehind: ?<!!"
+ 	
- 	<lookaround> ::= !!<regex> | =<regex>"
  	| lookbehind positive |
+ 	('<!!=' includes: lookahead) ifFalse: [
- 	('!!=<' includes: lookahead) ifFalse: [
  		^ self signalParseError: 'Invalid lookaround expression ?', lookahead asString].
+ 	lookbehind := lookahead == $< ifTrue: [self next]; yourself.
+ 	
+ 	('!!=' includes: lookahead) ifFalse: [
+ 		^ self signalParseError: 'Invalid lookaround expression'].
+ 	
  	positive := lookahead == $=.
  	self next.
  	^ RxsLookaround
  		with: self regex
  		forward: lookbehind not
  		positive: positive!

Item was changed:
  ----- Method: RxParser>>piece (in category 'recursive descent') -----
  piece
  	"<piece> ::= <atom> | <atom>* | <atom>+ | <atom>? | <atom>{<number>,<number>}"
  
  	| atom |
  	atom := self atom.
  	lookahead == $* ifTrue: [
  		self next.
  		atom isNullable ifTrue: [ self signalNullableClosureParserError ].
  		^ RxsPiece new initializeStarAtom: atom ].
  
  	lookahead == $+ ifTrue: [
  		self next.
  		atom isNullable ifTrue: [ self signalNullableClosureParserError ].
  		^ RxsPiece new initializePlusAtom: atom ].
  
  	lookahead == $? ifTrue: [
  		self next.
  		atom isNullable
+ 			ifTrue: [ self signalNullableClosureParserError ].
- 			ifTrue: [
- 				^ self lookAround ].
  		^ RxsPiece new initializeOptionalAtom: atom ].
  	lookahead == ${ ifTrue: [
  		^ self quantifiedAtom: atom ].
  	^ RxsPiece new initializeAtom: atom!

Item was changed:
  RxsNode subclass: #RxsRegex
+ 	instanceVariableNames: 'branch regex isCapturing'
- 	instanceVariableNames: 'branch regex'
  	classVariableNames: ''
  	poolDictionaries: ''
  	category: 'Regex-Core'!

  !RxsRegex commentStamp: 'Tbn 11/12/2010 23:15' prior: 0!
  -- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
  --
  The body of a parenthesized thing, or a top-level expression, also an atom.
  
  Instance variables:
  	branch		<RxsBranch>
  	regex		<RxsRegex | RxsEpsilon>!

Item was added:
+ ----- Method: RxsRegex>>beNonCapturing (in category 'accessing') -----
+ beNonCapturing
+ 
+ 	isCapturing := false.!

Item was added:
+ ----- Method: RxsRegex>>initialize (in category 'initialize-release') -----
+ initialize
+ 
+ 	super initialize.
+ 	isCapturing := true.!

Item was added:
+ ----- Method: RxsRegex>>isCapturing (in category 'accessing') -----
+ isCapturing
+ 
+ 	^ isCapturing!

squeak-dev@lists.squeakfoundation.org