Christoph Thiede uploaded a new version of Regex-Tests-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Tests-Core-ct.21.mcz
==================== Summary ====================
Name: Regex-Tests-Core-ct.21
Author: ct
Time: 23 August 2021, 6:58:40.775768 pm
UUID: 0a406c4c-99bb-d845-a7d9-29ccad2a8717
Ancestors: Regex-Tests-Core-mt.16
Adds regression tests for quantifier sequences.
Note that in theory, it could be possible to support things like a++, but this would require a redesign of the parser, so for the moment, let's be at least consistent with this limitation. (In the current trunk, 'a+{2}' asRegex matches: 'a{2}'.)
=============== Diff against Regex-Tests-Core-mt.16 ===============
Item was added:
+ ----- Method: RxParserTest>>testQuantifierSequence (in category 'tests') -----
+ testQuantifierSequence
+
+ self
+ should: ['a??' asRegex] raise: RegexSyntaxError;
+ should: ['a++' asRegex] raise: RegexSyntaxError;
+ should: ['a**' asRegex] raise: RegexSyntaxError;
+ should: ['a{1,2}{3,4}' asRegex] raise: RegexSyntaxError.
+
+ self
+ should: ['a?{1,2}' asRegex] raise: RegexSyntaxError.!
Christoph Thiede uploaded a new version of Regex-Tests-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Tests-Core-ct.15.mcz
==================== Summary ====================
Name: Regex-Tests-Core-ct.15
Author: ct
Time: 8 July 2021, 1:31:08.27536 am
UUID: 5f4cb449-e92d-5142-ac1f-bc593d05a1e9
Ancestors: Regex-Tests-Core-ct.12
Tests #escapeRegex from Regex-Core-ct.61.
=============== Diff against Regex-Tests-Core-ct.12 ===============
Item was added:
+ ----- Method: RxParserTest>>testEscapeString (in category 'tests') -----
+ testEscapeString
+
+ | string |
+ string := 'Hello world, how are you? (This is a test - special characters *very much* intended \-.-/ )'.
+ self assert: (string matchesRegex: string escapeRegex).
+ self assert: (string includesSubstring: 'Hello world, how are you?')
+ "no all-out escaping"!
Christoph Thiede uploaded a new version of Regex-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Core-ct.73.mcz
==================== Summary ====================
Name: Regex-Core-ct.73
Author: ct
Time: 28 October 2021, 6:12:10.889233 am
UUID: d2e4b99c-af69-ef4f-afaa-4a052eb6ea07
Ancestors: Regex-Core-mt.61
Removes longly obsolete constant on RxsNode. #printOn:withIndent: is not even in the symbol table of the default image ...
=============== Diff against Regex-Core-mt.61 ===============
Item was removed:
- ----- Method: RxsNode>>indentCharacter (in category 'constants') -----
- indentCharacter
- "Normally, #printOn:withIndent: method in subclasses
- print several characters returned by this method to indicate
- the tree structure."
-
- ^$+!
Christoph Thiede uploaded a new version of Regex-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Core-ct.72.mcz
==================== Summary ====================
Name: Regex-Core-ct.72
Author: ct
Time: 28 October 2021, 6:08:37.958233 am
UUID: 56167c47-7586-744e-84e1-a0bd38e6a6c9
Ancestors: Regex-Core-mt.61
Makes RxMatcher polymorphic with itself, so you can do {'foo' matchesRegex: '\w+' asRegex} without getting tedious MNUs
=============== Diff against Regex-Core-mt.61 ===============
Item was added:
+ ----- Method: RxMatcher>>asRegex (in category 'converting') -----
+ asRegex
+
+ ^ self!
Christoph Thiede uploaded a new version of Regex-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Core-ct.70.mcz
==================== Summary ====================
Name: Regex-Core-ct.70
Author: ct
Time: 21 October 2021, 12:26:40.595482 am
UUID: 806ed75d-4a4b-de49-80d4-ff40093c91f9
Ancestors: Regex-Core-ct.63
Adds support for nullable closures and eliminates the eponymous error message. Expressions like these are now parsed and matched correctly:
(a|b?)*
(a*)*
(a+){,3}
()*|
(?<=a)?
This design decision is inspired by the behavior of several existing regex implementations (such as Python's regex module, the ECMAScript standard, .NET, or Java). While some of the examples from above might be pretty useless in practice, especially the first example - conditional nullability - was a serious limitation in many cases before. See regex101.com and https://www.regular-expressions.info/ for further information.
Implementation summary:
Matching nullable closures is non-trivial as, considering the second example from above, the matcher might try to match the inner piece again and again while capturing empty strings for every repetition of the outer piece. We avoid such infinite loops by remembering all open branch positions temporarily while matching a certain path of links in the matcher. If an open branch is reached again at the same position, a match will not possible and is refused immediately. In the parser, #signalNullableClosureParserError and all sends to it are eliminated or replaced by finer-granular validity checks.
Depends indeed on Regex-Core-ct.63 to avoid another merge conflict in the parser.
=============== Diff against Regex-Core-mt.61 ===============
Item was changed:
Object subclass: #RxMatcher
+ instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions previousMarkerPositions markerCount branchPositions lastResult firstTryMatch'
- instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions previousMarkerPositions markerCount lastResult firstTryMatch'
classVariableNames: 'Cr Lf NullCharacter'
poolDictionaries: ''
category: 'Regex-Core'!
+ !RxMatcher commentStamp: 'ct 10/20/2021 23:47' prior: 0!
- !RxMatcher commentStamp: 'ul 8/28/2015 14:18' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
This is a recursive regex matcher. Not strikingly efficient, but simple. Also, keeps track of matched subexpressions. The life cycle goes as follows:
1. Initialization. Accepts a syntax tree (presumably produced by RxParser) and compiles it into a matcher built of other classes in this category.
2. Matching. Accepts a stream or a string and returns a boolean indicating whether the whole stream or its prefix -- depending on the message sent -- matches the regex.
3. Subexpression query. After a successful match, and before any other match, the matcher may be queried about the range of specific stream (string) positions that matched to certain parenthesized subexpressions of the original expression.
Any number of queries may follow a successful match, and any number or matches may follow a successful initialization.
Note that `matcher' is actually a sort of a misnomer. The actual matcher is a web of Rxm* instances built by RxMatcher during initialization. RxMatcher is just the interface facade of this network. It is also a builder of it, and also provides a stream-like protocol to easily access the stream being matched.
Instance variables:
matcher <RxmLink> The entry point into the actual matcher.
igoreCase <Boolean> Whether the matching algorithm should be case sensitive or not.
startOptimizer <RxMatchOptimizer> An object which can quickly decide whether the next character can be the prefix of a match or not.
stream <Stream> The stream currently being matched against.
markerPositions <Array of: nil | Integer | OrderedCollection> Positions of markers' matches.
previousMarkerPositions <Array of: nil | Integer | OrderedCollection> Positions of markers from the previous #tryMatch send.
markerCount <Integer> Number of markers.
+ branchPositions <Array of: OrderedCollection of: Integer> Temporary positions of looped branches.
lastResult <Boolean> Whether the latest match attempt succeeded or not.
firtTryMatch <Boolean> True if there hasn't been any send of #tryMatch during the current matching.!
Item was changed:
----- Method: RxMatcher>>initializeMarkerPositions (in category 'initialize-release') -----
initializeMarkerPositions
+ self flag: #refactor. "When all merge conflicts from the inbox have been resolved, consider renaming this to #initializeMatcherPositions."
+
markerPositions := Array new: markerCount.
previousMarkerPositions := Array new: markerCount..
3 to: markerCount do: [ :index |
markerPositions at: index put: (OrderedCollection new: 1).
+ previousMarkerPositions at: index put: (OrderedCollection new: 1) ].
+
+ branchPositions := Dictionary new.!
- previousMarkerPositions at: index put: (OrderedCollection new: 1) ].!
Item was added:
+ ----- Method: RxMatcher>>matchAgainstBranch:nextLink:alternativeLink: (in category 'matching') -----
+ matchAgainstBranch: anRxmBranch nextLink: nextLink alternativeLink: alternativeLink
+ "Greedy matching."
+
+ | position result |
+ "Detect infinite empty backloops from nested links."
+ position := stream position.
+ branchPositions
+ at: anRxmBranch
+ ifPresent: [:positions | (positions includes: position) ifTrue: [
+ "We already had this situation earlier in the current stack, don't retry again"
+ ^ false]].
+
+ "Remember the current position for this branch to enable detection of infinite empty backloops."
+ (branchPositions at: anRxmBranch ifAbsentPut: [OrderedCollection new]) add: position.
+
+ result := (nextLink matchAgainst: self)
+ or: [alternativeLink notNil and: [alternativeLink matchAgainst: self]].
+
+ (branchPositions at: anRxmBranch) remove: position.
+
+ ^ result!
Item was changed:
----- Method: RxMatcher>>resetMarkerPositions (in category 'private') -----
resetMarkerPositions
"Reset the marker positions. This method should only be sent from #tryMatch. When this is after the first #tryMatch send, then the marker positions must be swapped."
+ self flag: #refactor. "When all merge conflicts from the inbox have been resolved, consider renaming this to #resetMatcherPositions."
+
firstTryMatch
ifTrue: [ firstTryMatch := false ]
ifFalse: [
| temp |
temp := previousMarkerPositions.
previousMarkerPositions := markerPositions.
markerPositions := temp ].
markerPositions
at: 1 put: nil;
at: 2 put: nil.
3 to: markerCount do: [ :index |
+ (markerPositions at: index) resetTo: 1 ].
+
+ branchPositions removeAll.!
- (markerPositions at: index) resetTo: 1 ]!
Item was changed:
----- Method: RxMatcher>>syntaxRegex: (in category 'double dispatch') -----
syntaxRegex: regexNode
+ "Double dispatch from the syntax tree.
+ Regex node is a chain of branches to be tried. Should compile this into a bundle of parallel branches, between two marker nodes."
+
+ | startNode endNode alternatives |
+ regexNode isCapturing ifFalse: [
+ ^ regexNode branch dispatchTo: self].
- "Double dispatch from the syntax tree.
- Regex node is a chain of branches to be tried. Should compile this
- into a bundle of parallel branches, between two marker nodes."
+ startNode := RxmMarker new index: self allocateMarker.
+ endNode := RxmMarker new index: self allocateMarker.
- | startIndex endIndex endNode alternatives |
- startIndex := self allocateMarker.
- endIndex := self allocateMarker.
- endNode := RxmMarker new index: endIndex.
alternatives := self hookBranchOf: regexNode onto: endNode.
+ ^ startNode
- ^(RxmMarker new index: startIndex)
pointTailTo: alternatives;
yourself!
Item was changed:
----- Method: RxParser>>atom (in category 'recursive descent') -----
atom
"An atom is one of a lot of possibilities, see below."
| atom |
(lookahead == nil
+ or: [ lookahead == $| ]
+ or: [ lookahead == $) ])
- or: [ lookahead == $|
- or: [ lookahead == $)
- or: [ lookahead == $*
- or: [ lookahead == $+
- or: [ lookahead == $? ]]]]])
ifTrue: [ ^RxsEpsilon new ].
+
+ (lookahead == $*
+ or: [ lookahead == $+ ]
+ or: [ lookahead == $? ])
+ ifTrue: [ ^self signalParseError: 'nothing to repeat' ].
+
-
lookahead == $(
ifTrue: [
+ ^ self group ].
- "<atom> ::= '(' <regex> ')' "
- self match: $(.
- atom := self regex.
- self match: $).
- ^atom ].
lookahead == $[
ifTrue: [
"<atom> ::= '[' <characterSet> ']' "
self match: $[.
atom := self characterSet.
self match: $].
^atom ].
lookahead == $:
ifTrue: [
"<atom> ::= ':' <messagePredicate> ':' "
self match: $:.
atom := self messagePredicate.
self match: $:.
^atom ].
lookahead == $.
ifTrue: [
"any non-whitespace character"
self next.
^RxsContextCondition new beAny].
lookahead == $^
ifTrue: [
"beginning of line condition"
self next.
^RxsContextCondition new beBeginningOfLine].
lookahead == $$
ifTrue: [
"end of line condition"
self next.
^RxsContextCondition new beEndOfLine].
lookahead == $\
ifTrue: [
"<atom> ::= '\' <character>"
self next ifNil: [ self signalParseError: 'bad quotation' ].
(BackslashConstants includesKey: lookahead) ifTrue: [
atom := RxsCharacter with: (BackslashConstants at: lookahead).
self next.
+ ^ atom].
+ self
+ ifSpecial: lookahead
+ then: [:node | self next. ^ node]].
- ^atom].
- self ifSpecial: lookahead
- then: [:node | self next. ^node]].
"If passed through the above, the following is a regular character."
atom := RxsCharacter with: lookahead.
self next.
^atom!
Item was added:
+ ----- Method: RxParser>>group (in category 'recursive descent') -----
+ group
+
+ self match: $(.
+ lookahead == $?
+ ifFalse: [
+ | group |
+ "<group> ::= '(' <regex> ')' "
+ group := self regex.
+ self match: $).
+ ^ group].
+
+ self next.
+
+ lookahead == $:
+ ifTrue: [
+ "non-capturing group"
+ "<group> ::= '(?:' <regex> ')' "
+ | group |
+ self next.
+ group := self regex.
+ group beNonCapturing.
+ self match: $).
+ ^ group].
+
+ ('<=!!' includes: lookahead)
+ ifTrue: [
+ | lookaround |
+ lookaround := self lookAround.
+ self match: $).
+ ^ lookaround ].
+
+ ^ self signalParseError!
Item was changed:
----- Method: RxParser>>lookAround (in category 'recursive descent') -----
lookAround
"Parse a lookaround expression after: (?<lookaround>)
+ <lookaround> ::= !!<regex> | =<regex>
+ Positive lookahead: ?=
+ Negative lookahead: ?!!
+ Positive lookbehind: ?<=
+ Negative lookbehind: ?<!!"
+
- <lookaround> ::= !!<regex> | =<regex>"
| lookbehind positive |
+ ('<!!=' includes: lookahead) ifFalse: [
- ('!!=<' includes: lookahead) ifFalse: [
^ self signalParseError: 'Invalid lookaround expression ?', lookahead asString].
+
lookbehind := lookahead == $<
ifTrue: [self next];
yourself.
+
+ ('!!=' includes: lookahead) ifFalse: [
+ ^ self signalParseError: 'Invalid lookaround expression'].
+
positive := lookahead == $=.
self next.
^ RxsLookaround
with: self regex
forward: lookbehind not
positive: positive!
Item was changed:
----- Method: RxParser>>piece (in category 'recursive descent') -----
piece
"<piece> ::= <atom> | <atom>* | <atom>+ | <atom>? | <atom>{<number>,<number>}"
| atom |
atom := self atom.
lookahead == $*
ifTrue: [
self next.
- atom isNullable
- ifTrue: [ self signalNullableClosureParserError ].
^ RxsPiece new initializeStarAtom: atom ].
lookahead == $+
ifTrue: [
self next.
- atom isNullable
- ifTrue: [ self signalNullableClosureParserError ].
^ RxsPiece new initializePlusAtom: atom ].
lookahead == $?
ifTrue: [
self next.
- atom isNullable
- ifTrue: [
- ^ self lookAround ].
^ RxsPiece new initializeOptionalAtom: atom ].
lookahead == ${
ifTrue: [
^ self quantifiedAtom: atom ].
^ RxsPiece new initializeAtom: atom!
Item was changed:
----- Method: RxParser>>quantifiedAtom: (in category 'recursive descent') -----
quantifiedAtom: atom
"Parse a quanitifer expression which can have one of the following forms
{<min>,<max>} match <min> to <max> occurences
{<minmax>} which is the same as with repeated limits: {<number>,<number>}
{<min>,} match at least <min> occurences
{,<max>} match maximally <max> occurences, which is the same as {0,<max>}"
| min max |
self next.
lookahead == $,
ifTrue: [ min := 0 ]
ifFalse: [
max := min := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].
lookahead == $,
ifTrue: [
self next.
max := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].
self match: $}.
- atom isNullable
- ifTrue: [ self signalNullableClosureParserError ].
(max notNil and: [ max < min ])
ifTrue: [ self signalParseError: ('wrong quantifier, expected ', min asString, ' <= ', max asString) ].
^ RxsPiece new
initializeAtom: atom
min: min
max: max!
Item was removed:
- ----- Method: RxParser>>signalNullableClosureParserError (in category 'private') -----
- signalNullableClosureParserError
- self signalParseError: ' nullable closure'.!
Item was changed:
----- Method: RxmBranch>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
"Match either `next' or `alternative'. Fail if the alternative is nil."
+ ^ aMatcher matchAgainstBranch: self nextLink: next alternativeLink: alternative!
- (next matchAgainst: aMatcher) ifTrue: [ ^true ].
- ^(alternative ifNil: [ ^false ]) matchAgainst: aMatcher!
Item was changed:
RxsNode subclass: #RxsRegex
+ instanceVariableNames: 'branch regex isCapturing'
- instanceVariableNames: 'branch regex'
classVariableNames: ''
poolDictionaries: ''
category: 'Regex-Core'!
!RxsRegex commentStamp: 'Tbn 11/12/2010 23:15' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
The body of a parenthesized thing, or a top-level expression, also an atom.
Instance variables:
branch <RxsBranch>
regex <RxsRegex | RxsEpsilon>!
Item was added:
+ ----- Method: RxsRegex>>beNonCapturing (in category 'accessing') -----
+ beNonCapturing
+
+ isCapturing := false.!
Item was added:
+ ----- Method: RxsRegex>>initialize (in category 'initialize-release') -----
+ initialize
+
+ super initialize.
+ isCapturing := true.!
Item was added:
+ ----- Method: RxsRegex>>isCapturing (in category 'accessing') -----
+ isCapturing
+
+ ^ isCapturing!