[squeak-dev] The Trunk: Regex-Core-ct.70.mcz

commits at source.squeak.org commits at source.squeak.org
Wed Oct 5 18:06:44 UTC 2022


Christoph Thiede uploaded a new version of Regex-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Core-ct.70.mcz

==================== Summary ====================

Name: Regex-Core-ct.70
Author: ct
Time: 21 October 2021, 12:26:40.595482 am
UUID: 806ed75d-4a4b-de49-80d4-ff40093c91f9
Ancestors: Regex-Core-ct.63

Adds support for nullable closures and eliminates the eponymous error message. Expressions like these are now parsed and matched correctly:

	(a|b?)*
	(a*)*
	(a+){,3}
	()*|
	(?<=a)?

This design decision is inspired by the behavior of several existing regex implementations (such as Python's regex module, the ECMAScript standard, .NET, or Java). While some of the examples from above might be pretty useless in practice, especially the first example - conditional nullability - was a serious limitation in many cases before. See regex101.com and https://www.regular-expressions.info/ for further information.

Implementation summary:
Matching nullable closures is non-trivial as, considering the second example from above, the matcher might try to match the inner piece again and again while capturing empty strings for every repetition of the outer piece. We avoid such infinite loops by remembering all open branch positions temporarily while matching a certain path of links in the matcher. If an open branch is reached again at the same position, a match will not possible and is refused immediately. In the parser, #signalNullableClosureParserError and all sends to it are eliminated or replaced by finer-granular validity checks.

Depends indeed on Regex-Core-ct.63 to avoid another merge conflict in the parser.

=============== Diff against Regex-Core-mt.61 ===============

Item was changed:
  Object subclass: #RxMatcher
+ 	instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions previousMarkerPositions markerCount branchPositions lastResult firstTryMatch'
- 	instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions previousMarkerPositions markerCount lastResult firstTryMatch'
  	classVariableNames: 'Cr Lf NullCharacter'
  	poolDictionaries: ''
  	category: 'Regex-Core'!
  
+ !RxMatcher commentStamp: 'ct 10/20/2021 23:47' prior: 0!
- !RxMatcher commentStamp: 'ul 8/28/2015 14:18' prior: 0!
  -- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
  --
  This is a recursive regex matcher. Not strikingly efficient, but simple. Also, keeps track of matched subexpressions.  The life cycle goes as follows:
  
  1. Initialization. Accepts a syntax tree (presumably produced by RxParser) and compiles it into a matcher built of other classes in this category.
  
  2. Matching. Accepts a stream or a string and returns a boolean indicating whether the whole stream or its prefix -- depending on the message sent -- matches the regex.
  
  3. Subexpression query. After a successful match, and before any other match, the matcher may be queried about the range of specific stream (string) positions that matched to certain parenthesized subexpressions of the original expression.
  
  Any number of queries may follow a successful match, and any number or matches may follow a successful initialization.
  
  Note that `matcher' is actually a sort of a misnomer. The actual matcher is a web of Rxm* instances built by RxMatcher during initialization. RxMatcher is just the interface facade of this network.  It is also a builder of it, and also provides a stream-like protocol to easily access the stream being matched.
  
  Instance variables:
  	matcher					<RxmLink> The entry point into the actual matcher.
  	igoreCase					<Boolean> Whether the matching algorithm should be case sensitive or not.
  	startOptimizer				<RxMatchOptimizer> An object which can quickly decide whether the next character can be the prefix of a match or not.
  	stream						<Stream> The stream currently being matched against.
  	markerPositions			<Array of: nil | Integer | OrderedCollection> Positions of markers' matches.
  	previousMarkerPositions	<Array of: nil |  Integer | OrderedCollection> Positions of markers from the previous #tryMatch send.
  	markerCount				<Integer> Number of markers.
+ 	branchPositions			<Array of: OrderedCollection of: Integer> Temporary positions of looped branches.
  	lastResult 					<Boolean> Whether the latest match attempt succeeded or not.
  	firtTryMatch				<Boolean> True if there hasn't been any send of #tryMatch during the current matching.!

Item was changed:
  ----- Method: RxMatcher>>initializeMarkerPositions (in category 'initialize-release') -----
  initializeMarkerPositions
  
+ 	self flag: #refactor. "When all merge conflicts from the inbox have been resolved, consider renaming this to #initializeMatcherPositions."
+ 	
  	markerPositions := Array new: markerCount.
  	previousMarkerPositions := Array new: markerCount..
  	3 to: markerCount do: [ :index |
  		markerPositions at: index put: (OrderedCollection new: 1).
+ 		previousMarkerPositions at: index put: (OrderedCollection new: 1) ].
+ 	
+ 	branchPositions := Dictionary new.!
- 		previousMarkerPositions at: index put: (OrderedCollection new: 1) ].!

Item was added:
+ ----- Method: RxMatcher>>matchAgainstBranch:nextLink:alternativeLink: (in category 'matching') -----
+ matchAgainstBranch: anRxmBranch nextLink: nextLink alternativeLink: alternativeLink
+ 	"Greedy matching."
+ 
+ 	| position result |
+ 	"Detect infinite empty backloops from nested links."
+ 	position := stream position.
+ 	branchPositions
+ 		at: anRxmBranch
+ 		ifPresent: [:positions | (positions includes: position) ifTrue: [
+ 			"We already had this situation earlier in the current stack, don't retry again"
+ 			^ false]].
+ 	
+ 	"Remember the current position for this branch to enable detection of infinite empty backloops."
+ 	(branchPositions at: anRxmBranch ifAbsentPut: [OrderedCollection new]) add: position.
+ 	
+ 	result := (nextLink matchAgainst: self)
+ 		or: [alternativeLink notNil and: [alternativeLink matchAgainst: self]].
+ 	
+ 	(branchPositions at: anRxmBranch) remove: position.
+ 	
+ 	^ result!

Item was changed:
  ----- Method: RxMatcher>>resetMarkerPositions (in category 'private') -----
  resetMarkerPositions
  	"Reset the marker positions. This method should only be sent from #tryMatch. When this is after the first #tryMatch send, then the marker positions must be swapped."
  
+ 	self flag: #refactor. "When all merge conflicts from the inbox have been resolved, consider renaming this to #resetMatcherPositions."
+ 	
  	firstTryMatch
  		ifTrue: [ firstTryMatch := false ]
  		ifFalse: [
  			| temp |
  			temp := previousMarkerPositions.
  			previousMarkerPositions := markerPositions.
  			markerPositions := temp ].
  	markerPositions
  		at: 1 put: nil;
  		at: 2 put: nil.
  	3 to: markerCount do: [ :index | 
+ 		(markerPositions at: index) resetTo: 1 ].
+ 	
+ 	branchPositions removeAll.!
- 		(markerPositions at: index) resetTo: 1 ]!

Item was changed:
  ----- Method: RxMatcher>>syntaxRegex: (in category 'double dispatch') -----
  syntaxRegex: regexNode
+ 	"Double dispatch from the syntax tree.
+ 	Regex node is a chain of branches to be tried. Should compile this into a bundle of parallel branches, between two marker nodes."
+ 
+ 	| startNode endNode alternatives |
+ 	regexNode isCapturing ifFalse: [
+ 		^ regexNode branch dispatchTo: self].
- 	"Double dispatch from the syntax tree. 
- 	Regex node is a chain of branches to be tried. Should compile this 
- 	into a bundle of parallel branches, between two marker nodes." 
  	
+ 	startNode := RxmMarker new index: self allocateMarker.
+ 	endNode := RxmMarker new index: self allocateMarker.
- 	| startIndex endIndex endNode alternatives |
- 	startIndex := self allocateMarker.
- 	endIndex := self allocateMarker.
- 	endNode := RxmMarker new index: endIndex.
  	alternatives := self hookBranchOf: regexNode onto: endNode.
+ 	^ startNode
- 	^(RxmMarker new index: startIndex)
  		pointTailTo: alternatives;
  		yourself!

Item was changed:
  ----- Method: RxParser>>atom (in category 'recursive descent') -----
  atom
  	"An atom is one of a lot of possibilities, see below."
  
  	| atom |
  	(lookahead == nil 
+ 	or: [ lookahead == $| ] 
+ 	or: [ lookahead == $) ])
- 	or: [ lookahead == $| 
- 	or: [ lookahead == $)
- 	or: [ lookahead == $*
- 	or: [ lookahead == $+ 
- 	or: [ lookahead == $? ]]]]])
  		ifTrue: [ ^RxsEpsilon new ].
+ 	
+ 	(lookahead == $*
+ 	or: [ lookahead == $+ ]
+ 	or: [ lookahead == $? ])
+ 		ifTrue: [ ^self signalParseError: 'nothing to repeat' ].
+ 	
- 		
  	lookahead == $( 
  		ifTrue: [
+ 			^ self group ].
- 			"<atom> ::= '(' <regex> ')' "
- 			self match: $(.
- 			atom := self regex.
- 			self match: $).
- 			^atom ].
  	
  	lookahead == $[
  		ifTrue: [
  			"<atom> ::= '[' <characterSet> ']' "
  			self match: $[.
  			atom := self characterSet.
  			self match: $].
  			^atom ].
  	
  	lookahead == $: 
  		ifTrue: [
  			"<atom> ::= ':' <messagePredicate> ':' "
  			self match: $:.
  			atom := self messagePredicate.
  			self match: $:.
  			^atom ].
  	
  	lookahead == $. 
  		ifTrue: [
  			"any non-whitespace character"
  			self next.
  			^RxsContextCondition new beAny].
  	
  	lookahead == $^ 
  		ifTrue: [
  			"beginning of line condition"
  			self next.
  			^RxsContextCondition new beBeginningOfLine].
  	
  	lookahead == $$ 
  		ifTrue: [
  			"end of line condition"
  			self next.
  			^RxsContextCondition new beEndOfLine].
  		
  	lookahead == $\ 
  		ifTrue: [
  			"<atom> ::= '\' <character>"
  			self next ifNil: [ self signalParseError: 'bad quotation' ].
  			(BackslashConstants includesKey: lookahead) ifTrue: [
  				atom := RxsCharacter with: (BackslashConstants at: lookahead).
  				self next.
+ 				^ atom].
+ 			self
+ 				ifSpecial: lookahead
+ 				then: [:node | self next. ^ node]].
- 				^atom].
- 			self ifSpecial: lookahead
- 				then: [:node | self next. ^node]].
  		
  	"If passed through the above, the following is a regular character."
  	atom := RxsCharacter with: lookahead.
  	self next.
  	^atom!

Item was added:
+ ----- Method: RxParser>>group (in category 'recursive descent') -----
+ group
+ 
+ 	self match: $(.
+ 	lookahead == $?
+ 		ifFalse: [
+ 			| group |
+ 			"<group> ::= '(' <regex> ')' "
+ 			group := self regex.
+ 			self match: $).
+ 			^ group].
+ 	
+ 	self next.
+ 	
+ 	lookahead == $:
+ 		ifTrue: [
+ 			"non-capturing group"
+ 			"<group> ::= '(?:' <regex> ')' "
+ 			| group |
+ 			self next.
+ 			group := self regex.
+ 			group beNonCapturing.
+ 			self match: $).
+ 			^ group].
+ 	
+ 	('<=!!' includes: lookahead)
+ 		ifTrue: [
+ 			| lookaround |
+ 			lookaround := self lookAround.
+ 			self match: $).
+ 			^ lookaround ].
+ 	
+ 	^ self signalParseError!

Item was changed:
  ----- Method: RxParser>>lookAround (in category 'recursive descent') -----
  lookAround
  	"Parse a lookaround expression after: (?<lookaround>) 
+ 	<lookaround> ::= !!<regex> | =<regex>
+ 	Positive lookahead: ?=
+ 	Negative lookahead: ?!!
+ 	Positive lookbehind: ?<=
+ 	Negative lookbehind: ?<!!"
+ 
- 	<lookaround> ::= !!<regex> | =<regex>"
  	| lookbehind positive |
+ 	('<!!=' includes: lookahead) ifFalse: [
- 	('!!=<' includes: lookahead) ifFalse: [
  		^ self signalParseError: 'Invalid lookaround expression ?', lookahead asString].
+ 	
  	lookbehind := lookahead == $<
  		ifTrue: [self next];
  		yourself.
+ 	
+ 	('!!=' includes: lookahead) ifFalse: [
+ 		^ self signalParseError: 'Invalid lookaround expression'].
+ 	
  	positive := lookahead == $=.
  	self next.
  	^ RxsLookaround
  		with: self regex
  		forward: lookbehind not
  		positive: positive!

Item was changed:
  ----- Method: RxParser>>piece (in category 'recursive descent') -----
  piece
  	"<piece> ::= <atom> | <atom>* | <atom>+ | <atom>? | <atom>{<number>,<number>}"
  
  	| atom |
  	atom := self atom.
  	
  	lookahead == $*
  		ifTrue: [ 
  			self next.
- 			atom isNullable
- 				ifTrue: [ self signalNullableClosureParserError ].
  			^ RxsPiece new initializeStarAtom: atom ].
  
  	lookahead == $+
  		ifTrue: [ 
  			self next.
- 			atom isNullable
- 				ifTrue: [ self signalNullableClosureParserError ].
  			^ RxsPiece new initializePlusAtom: atom ].
  
  	lookahead == $?
  		ifTrue: [ 
  			self next.
- 			atom isNullable
- 				ifTrue: [ 
- 					^ self lookAround ].
  			^ RxsPiece new initializeOptionalAtom: atom ].
  	
  	lookahead == ${
  		ifTrue: [
  			^ self quantifiedAtom: atom ].
  		
  	^ RxsPiece new initializeAtom: atom!

Item was changed:
  ----- Method: RxParser>>quantifiedAtom: (in category 'recursive descent') -----
  quantifiedAtom: atom
  	"Parse a quanitifer expression which can have one of the following forms
  		{<min>,<max>}    match <min> to <max> occurences
  		{<minmax>}       which is the same as with repeated limits: {<number>,<number>}
  		{<min>,}         match at least <min> occurences
  		{,<max>}         match maximally <max> occurences, which is the same as {0,<max>}"
  	| min max |
  	self next.
  	lookahead == $,
  		ifTrue: [ min := 0 ]
  		ifFalse: [
  			max := min := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].
  	lookahead == $,
  		ifTrue: [
  			self next.
  			max := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].	
  	self match: $}.
- 	atom isNullable
- 		ifTrue: [ self signalNullableClosureParserError ].
  	(max notNil and: [ max < min ])
  		ifTrue: [ self signalParseError: ('wrong quantifier, expected ', min asString, ' <= ', max asString) ].
  	^ RxsPiece new 
  		initializeAtom: atom
  		min: min
  		max: max!

Item was removed:
- ----- Method: RxParser>>signalNullableClosureParserError (in category 'private') -----
- signalNullableClosureParserError
- 	self signalParseError: ' nullable closure'.!

Item was changed:
  ----- Method: RxmBranch>>matchAgainst: (in category 'matching') -----
  matchAgainst: aMatcher
  	"Match either `next' or `alternative'. Fail if the alternative is nil."
  
+ 	^ aMatcher matchAgainstBranch: self nextLink: next alternativeLink: alternative!
- 	(next matchAgainst: aMatcher) ifTrue: [ ^true ].
- 	^(alternative ifNil: [ ^false ]) matchAgainst: aMatcher!

Item was changed:
  RxsNode subclass: #RxsRegex
+ 	instanceVariableNames: 'branch regex isCapturing'
- 	instanceVariableNames: 'branch regex'
  	classVariableNames: ''
  	poolDictionaries: ''
  	category: 'Regex-Core'!
  
  !RxsRegex commentStamp: 'Tbn 11/12/2010 23:15' prior: 0!
  -- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
  --
  The body of a parenthesized thing, or a top-level expression, also an atom.  
  
  Instance variables:
  	branch		<RxsBranch>
  	regex		<RxsRegex | RxsEpsilon>!

Item was added:
+ ----- Method: RxsRegex>>beNonCapturing (in category 'accessing') -----
+ beNonCapturing
+ 
+ 	isCapturing := false.!

Item was added:
+ ----- Method: RxsRegex>>initialize (in category 'initialize-release') -----
+ initialize
+ 
+ 	super initialize.
+ 	isCapturing := true.!

Item was added:
+ ----- Method: RxsRegex>>isCapturing (in category 'accessing') -----
+ isCapturing
+ 
+ 	^ isCapturing!



More information about the Squeak-dev mailing list