[squeak-dev] The Trunk: Regex-Core-ct.67.mcz

commits at source.squeak.org commits at source.squeak.org
Wed Oct 5 18:06:32 UTC 2022


Christoph Thiede uploaded a new version of Regex-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Core-ct.67.mcz

==================== Summary ====================

Name: Regex-Core-ct.67
Author: ct
Time: 23 August 2021, 7:47:23.2455 pm
UUID: afde3e88-ef4b-f34a-97be-35c2dd290a35
Ancestors: Regex-Core-mt.61

Adds support for named capturing groups.

Usage:

	'Hello (?<name>\w+)' asRegex
		matches: 'Hello Squeak';
		keyedSubexpression: 'name'. "--> 'Squeak'"
	'Hello (?''x''\w)+' asRegex
		matches: 'Hello Squeak';
		keyedSubexpressions: 'x'. "--> #('S' 'q' 'u' 'e' 'a' 'k')"

=============== Diff against Regex-Core-mt.61 ===============

Item was changed:
  Object subclass: #RxMatcher
+ 	instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions keyedMarkerPositions previousMarkerPositions markerCount lastResult firstTryMatch'
- 	instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions previousMarkerPositions markerCount lastResult firstTryMatch'
  	classVariableNames: 'Cr Lf NullCharacter'
  	poolDictionaries: ''
  	category: 'Regex-Core'!
  
  !RxMatcher commentStamp: 'ul 8/28/2015 14:18' prior: 0!
  -- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
  --
  This is a recursive regex matcher. Not strikingly efficient, but simple. Also, keeps track of matched subexpressions.  The life cycle goes as follows:
  
  1. Initialization. Accepts a syntax tree (presumably produced by RxParser) and compiles it into a matcher built of other classes in this category.
  
  2. Matching. Accepts a stream or a string and returns a boolean indicating whether the whole stream or its prefix -- depending on the message sent -- matches the regex.
  
  3. Subexpression query. After a successful match, and before any other match, the matcher may be queried about the range of specific stream (string) positions that matched to certain parenthesized subexpressions of the original expression.
  
  Any number of queries may follow a successful match, and any number or matches may follow a successful initialization.
  
  Note that `matcher' is actually a sort of a misnomer. The actual matcher is a web of Rxm* instances built by RxMatcher during initialization. RxMatcher is just the interface facade of this network.  It is also a builder of it, and also provides a stream-like protocol to easily access the stream being matched.
  
  Instance variables:
  	matcher					<RxmLink> The entry point into the actual matcher.
  	igoreCase					<Boolean> Whether the matching algorithm should be case sensitive or not.
  	startOptimizer				<RxMatchOptimizer> An object which can quickly decide whether the next character can be the prefix of a match or not.
  	stream						<Stream> The stream currently being matched against.
  	markerPositions			<Array of: nil | Integer | OrderedCollection> Positions of markers' matches.
  	previousMarkerPositions	<Array of: nil |  Integer | OrderedCollection> Positions of markers from the previous #tryMatch send.
  	markerCount				<Integer> Number of markers.
  	lastResult 					<Boolean> Whether the latest match attempt succeeded or not.
  	firtTryMatch				<Boolean> True if there hasn't been any send of #tryMatch during the current matching.!

Item was added:
+ ----- Method: RxMatcher>>allKeyedSubexpressions (in category 'accessing') -----
+ allKeyedSubexpressions
+ 
+ 	^ self keyedMarkers
+ 		collect: [:key | key -> (self keyedSubexpressions: key)]
+ 		as: Dictionary!

Item was changed:
  ----- Method: RxMatcher>>buildFrom: (in category 'accessing') -----
  buildFrom: aSyntaxTreeRoot
  	"Private - Entry point of matcher build process."
  
  	markerCount := 0.  "must go before #dispatchTo: !!"
+ 	keyedMarkerPositions := Dictionary new.
  	matcher := aSyntaxTreeRoot dispatchTo: self.
  	matcher terminateWith: RxmTerminator new!

Item was added:
+ ----- Method: RxMatcher>>keyedMarkers (in category 'accessing') -----
+ keyedMarkers
+ 
+ 	^ keyedMarkerPositions keys!

Item was added:
+ ----- Method: RxMatcher>>keyedSubexpression: (in category 'accessing') -----
+ keyedSubexpression: key
+ 
+ 	^ (self keyedSubexpressions: key)
+ 		ifNotEmpty: [:matches | matches last]
+ 		ifEmpty: [nil]!

Item was added:
+ ----- Method: RxMatcher>>keyedSubexpressions: (in category 'accessing') -----
+ keyedSubexpressions: key
+ 
+ 	^ (self keyedSubexpressionsRanges: key) collect: [:range |
+ 		stream
+ 			position: range start;
+ 			next: range size]!

Item was added:
+ ----- Method: RxMatcher>>keyedSubexpressionsRanges: (in category 'accessing') -----
+ keyedSubexpressionsRanges: key
+ 
+ 	^ ((keyedMarkerPositions at: key) gather: [:pair |
+ 		(markerPositions at: pair first)
+ 			with: (markerPositions at: pair second)
+ 			collect: [:start :stop | start to: stop - 1]]) reversed!

Item was added:
+ ----- Method: RxMatcher>>registerMarkerKey:from:to: (in category 'private') -----
+ registerMarkerKey: key from: beginIndex to: endIndex
+ 
+ 	(keyedMarkerPositions at: key ifAbsentPut: [OrderedCollection new])
+ 		addLast: {beginIndex. endIndex}.!

Item was changed:
  ----- Method: RxMatcher>>resetMarkerPositions (in category 'private') -----
  resetMarkerPositions
  	"Reset the marker positions. This method should only be sent from #tryMatch. When this is after the first #tryMatch send, then the marker positions must be swapped."
  
  	firstTryMatch
  		ifTrue: [ firstTryMatch := false ]
  		ifFalse: [
  			| temp |
  			temp := previousMarkerPositions.
  			previousMarkerPositions := markerPositions.
  			markerPositions := temp ].
+ 	
  	markerPositions
  		at: 1 put: nil;
  		at: 2 put: nil.
+ 	3 to: markerCount do: [:index | 
+ 		(markerPositions at: index) reset].!
- 	3 to: markerCount do: [ :index | 
- 		(markerPositions at: index) resetTo: 1 ]!

Item was changed:
  ----- Method: RxMatcher>>syntaxRegex: (in category 'double dispatch') -----
  syntaxRegex: regexNode
+ 	"Double dispatch from the syntax tree.
+ 	Regex node is a chain of branches to be tried. Should compile this into a bundle of parallel branches, between two marker nodes."
+ 
+ 	| startNode endNode alternatives |
+ 	startNode := RxmMarker new index: self allocateMarker.
+ 	endNode := RxmMarker new index: self allocateMarker.
- 	"Double dispatch from the syntax tree. 
- 	Regex node is a chain of branches to be tried. Should compile this 
- 	into a bundle of parallel branches, between two marker nodes." 
  	
+ 	regexNode key ifNotNil: [:key |
+ 		self registerMarkerKey: key from: startNode index to: endNode index].
+ 	
- 	| startIndex endIndex endNode alternatives |
- 	startIndex := self allocateMarker.
- 	endIndex := self allocateMarker.
- 	endNode := RxmMarker new index: endIndex.
  	alternatives := self hookBranchOf: regexNode onto: endNode.
+ 	^ startNode
- 	^(RxmMarker new index: startIndex)
  		pointTailTo: alternatives;
  		yourself!

Item was changed:
  ----- Method: RxMatcher>>tryMatch (in category 'private') -----
  tryMatch
  	"Match thyself against the current stream."
  
  	| wasFirstTryMatch |
  	wasFirstTryMatch := firstTryMatch.
  	self resetMarkerPositions.
  	lastResult := startOptimizer
  		ifNil: [ matcher matchAgainst: self ]
  		ifNotNil: [ (startOptimizer canStartMatch: stream peek in: self) and: [ matcher matchAgainst: self ] ].
  	"check for duplicates"
  	lastResult ifFalse: [ ^false ].
  	wasFirstTryMatch ifTrue: [ ^true ].
+ 	(previousMarkerPositions hasEqualElements: markerPositions)
+ 		flag: #suspicious; "ct: do we need to check uncaptured groups here, too?"
+ 		ifFalse: [ ^true ].
- 	(previousMarkerPositions hasEqualElements: markerPositions) ifFalse: [ ^true ].
  	"this is a duplicate match"
  	^ lastResult := false!

Item was changed:
  ----- Method: RxParser>>atom (in category 'recursive descent') -----
  atom
  	"An atom is one of a lot of possibilities, see below."
  
  	| atom |
  	(lookahead == nil 
  	or: [ lookahead == $| 
  	or: [ lookahead == $)
  	or: [ lookahead == $*
  	or: [ lookahead == $+ 
  	or: [ lookahead == $? ]]]]])
  		ifTrue: [ ^RxsEpsilon new ].
  		
  	lookahead == $( 
  		ifTrue: [
+ 			^ self group ].
- 			"<atom> ::= '(' <regex> ')' "
- 			self match: $(.
- 			atom := self regex.
- 			self match: $).
- 			^atom ].
  	
  	lookahead == $[
  		ifTrue: [
  			"<atom> ::= '[' <characterSet> ']' "
  			self match: $[.
  			atom := self characterSet.
  			self match: $].
  			^atom ].
  	
  	lookahead == $: 
  		ifTrue: [
  			"<atom> ::= ':' <messagePredicate> ':' "
  			self match: $:.
  			atom := self messagePredicate.
  			self match: $:.
  			^atom ].
  	
  	lookahead == $. 
  		ifTrue: [
  			"any non-whitespace character"
  			self next.
  			^RxsContextCondition new beAny].
  	
  	lookahead == $^ 
  		ifTrue: [
  			"beginning of line condition"
  			self next.
  			^RxsContextCondition new beBeginningOfLine].
  	
  	lookahead == $$ 
  		ifTrue: [
  			"end of line condition"
  			self next.
  			^RxsContextCondition new beEndOfLine].
  		
  	lookahead == $\ 
  		ifTrue: [
  			"<atom> ::= '\' <character>"
  			self next ifNil: [ self signalParseError: 'bad quotation' ].
  			(BackslashConstants includesKey: lookahead) ifTrue: [
  				atom := RxsCharacter with: (BackslashConstants at: lookahead).
  				self next.
  				^atom].
  			self ifSpecial: lookahead
  				then: [:node | self next. ^node]].
  		
  	"If passed through the above, the following is a regular character."
  	atom := RxsCharacter with: lookahead.
  	self next.
  	^atom!

Item was added:
+ ----- Method: RxParser>>group (in category 'recursive descent') -----
+ group
+ 
+ 	self match: $(.
+ 	lookahead == $?
+ 		ifFalse: [
+ 			| group |
+ 			"<group> ::= '(' <regex> ')' "
+ 			group := self regex.
+ 			self match: $).
+ 			^ group].
+ 	
+ 	self next.
+ 	
+ 	(lookahead == $' or: [lookahead == $<
+ 		and: [input atEnd not] and: [input peek isAlphaNumeric]])
+ 		ifTrue: [
+ 			"named capture group"
+ 			"<group ::= '(?<' <key> '>' <regex> ') | '(?''' <key> '''' <regex> ')' "
+ 			| closing key group |
+ 			closing := #($'$>) at: (#($'$<) indexOf: lookahead).
+ 			self next.
+ 			key := self
+ 				inputWhile: #isAlphaNumeric
+ 				errorMessage: 'Invalid capture group name' translated.
+ 			self match: closing.
+ 			group := self regex.
+ 			group key: key.
+ 			self match: $).
+ 			^ group].
+ 	
+ 	('<=!!' includes: lookahead)
+ 		ifTrue: [
+ 			| lookaround |
+ 			lookaround := self lookAround.
+ 			self match: $).
+ 			^ lookaround ].
+ 	
+ 	^ self signalParseError!

Item was added:
+ ----- Method: RxParser>>inputWhile:errorMessage: (in category 'private') -----
+ inputWhile: aBlock errorMessage: aString
+ 	"Accumulate input stream until <aBlock> returns false on the next character and answer the accumulated chars as String, not including the last character. Signal error if end of stream is encountered, passing <aString> as the error description."
+ 
+ 	| accumulator |
+ 	accumulator := WriteStream on: (String new: 20).
+ 	lookahead ifNil: [self signalParseError: aString].
+ 	[lookahead notNil and: [aBlock cull: lookahead]] whileTrue: [
+ 		accumulator nextPut: lookahead.
+ 		self next].
+ 	^ accumulator contents!

Item was changed:
  ----- Method: RxParser>>lookAround (in category 'recursive descent') -----
  lookAround
  	"Parse a lookaround expression after: (?<lookaround>) 
+ 	<lookaround> ::= !!<regex> | =<regex>
+ 	Positive lookahead: ?=
+ 	Negative lookahead: ?!!
+ 	Positive lookbehind: ?<=
+ 	Negative lookbehind: ?<!!"
+ 
- 	<lookaround> ::= !!<regex> | =<regex>"
  	| lookbehind positive |
+ 	('<!!=' includes: lookahead) ifFalse: [
- 	('!!=<' includes: lookahead) ifFalse: [
  		^ self signalParseError: 'Invalid lookaround expression ?', lookahead asString].
+ 	
  	lookbehind := lookahead == $<
  		ifTrue: [self next];
  		yourself.
+ 	
+ 	('!!=' includes: lookahead) ifFalse: [
+ 		^ self signalParseError: 'Invalid lookaround expression'].
+ 	
  	positive := lookahead == $=.
  	self next.
  	^ RxsLookaround
  		with: self regex
  		forward: lookbehind not
  		positive: positive!

Item was changed:
  ----- Method: RxParser>>piece (in category 'recursive descent') -----
  piece
  	"<piece> ::= <atom> | <atom>* | <atom>+ | <atom>? | <atom>{<number>,<number>}"
  
  	| atom |
  	atom := self atom.
  	
  	lookahead == $*
  		ifTrue: [ 
  			self next.
  			atom isNullable
  				ifTrue: [ self signalNullableClosureParserError ].
  			^ RxsPiece new initializeStarAtom: atom ].
  
  	lookahead == $+
  		ifTrue: [ 
  			self next.
  			atom isNullable
  				ifTrue: [ self signalNullableClosureParserError ].
  			^ RxsPiece new initializePlusAtom: atom ].
  
  	lookahead == $?
  		ifTrue: [ 
  			self next.
  			atom isNullable
+ 				ifTrue: [ self signalNullableClosureParserError ].
- 				ifTrue: [ 
- 					^ self lookAround ].
  			^ RxsPiece new initializeOptionalAtom: atom ].
  	
  	lookahead == ${
  		ifTrue: [
  			^ self quantifiedAtom: atom ].
  		
  	^ RxsPiece new initializeAtom: atom!

Item was added:
+ ----- Method: RxmMarker>>index (in category 'accessing') -----
+ index
+ 
+ 	^ index!

Item was changed:
  RxsNode subclass: #RxsRegex
+ 	instanceVariableNames: 'branch regex key'
- 	instanceVariableNames: 'branch regex'
  	classVariableNames: ''
  	poolDictionaries: ''
  	category: 'Regex-Core'!
  
  !RxsRegex commentStamp: 'Tbn 11/12/2010 23:15' prior: 0!
  -- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
  --
  The body of a parenthesized thing, or a top-level expression, also an atom.  
  
  Instance variables:
  	branch		<RxsBranch>
  	regex		<RxsRegex | RxsEpsilon>!

Item was added:
+ ----- Method: RxsRegex>>key (in category 'accessing') -----
+ key
+ 
+ 	^ key!

Item was added:
+ ----- Method: RxsRegex>>key: (in category 'accessing') -----
+ key: anObject
+ 
+ 	key := anObject!



More information about the Squeak-dev mailing list