[Pkg] The Trunk: Regex-Core-ul.37.mcz

Fri Aug 14 20:25:35 UTC 2015

Levente Uzonyi uploaded a new version of Regex-Core to project The Trunk:
http://source.squeak.org/trunk/Regex-Core-ul.37.mcz

==================== Summary ====================

Name: Regex-Core-ul.37
Author: ul
Time: 24 May 2015, 10:05:56.876 pm
UUID: c3993ed6-4c3a-4ccc-8efa-8ac5f58acfce
Ancestors: Regex-Core-ul.36

Accept \r as an escape for the carriage return character (cr).
Further optimizations in RxCharSetParser, RxMatchOptimizer, and RxmSubstring.
Renamed extension method category to *Regex-Core.
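
A minimal workspace sketch of the new escape, assuming Regex-Core-ul.37 or later is loaded. Smalltalk string literals do not process backslashes, so '\r' below reaches the parser as the two characters \ and r. The expected results follow from the summary above; they are not copied from the package's tests.

```smalltalk
"\r should match a single carriage return character."
(String with: Character cr) matchesRegex: '\r'.		"expected: true"

"A line feed is a different character, so it should not match."
(String with: Character lf) matchesRegex: '\r'.		"expected: false"
```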

==================== Snapshot ====================

SystemOrganization addCategory: #'Regex-Core'!
SystemOrganization addCategory: #'Regex-Core-Exceptions'!

----- Method: String>>allRangesOfRegexMatches: (in category '*Regex-Core') -----
allRangesOfRegexMatches: rxString

	^rxString asRegex matchingRangesIn: self!

----- Method: String>>allRegexMatches: (in category '*Regex-Core') -----
allRegexMatches: rxString

	^rxString asRegex matchesIn: self!

----- Method: String>>asRegex (in category '*Regex-Core') -----
asRegex
	"Compile the receiver as a regex matcher. May raise RxParser>>syntaxErrorSignal
	or RxParser>>compilationErrorSignal.
	This is a part of the Regular Expression Matcher package, (c) 1996, 1999 Vassili Bykov.
	Refer to `documentation' protocol of RxParser class for details."

	^RxParser preferredMatcherClass for: (RxParser new parse: self)!

----- Method: String>>asRegexIgnoringCase (in category '*Regex-Core') -----
asRegexIgnoringCase
	"Compile the receiver as a regex matcher. May raise RxParser>>syntaxErrorSignal
	or RxParser>>compilationErrorSignal.
	This is a part of the Regular Expression Matcher package, (c) 1996, 1999 Vassili Bykov.
	Refer to `documentation' protocol of RxParser class for details."

	^RxParser preferredMatcherClass
		for: (RxParser new parse: self)
		ignoreCase: true!

----- Method: String>>copyWithRegex:matchesReplacedWith: (in category '*Regex-Core') -----
copyWithRegex: rxString matchesReplacedWith: aString

	^rxString asRegex
		copy: self replacingMatchesWith: aString!

----- Method: String>>copyWithRegex:matchesTranslatedUsing: (in category '*Regex-Core') -----
copyWithRegex: rxString matchesTranslatedUsing: aBlock

	^rxString asRegex
		copy: self translatingMatchesUsing: aBlock!

----- Method: String>>matchesRegex: (in category '*Regex-Core') -----
matchesRegex: regexString
	"Test if the receiver matches a regex.  May raise RxParser>>regexErrorSignal or
	child signals.
	This is a part of the Regular Expression Matcher package, (c) 1996, 1999 Vassili Bykov.
	Refer to `documentation' protocol of RxParser class for details."

	^regexString asRegex matches: self!

----- Method: String>>matchesRegexIgnoringCase: (in category '*Regex-Core') -----
matchesRegexIgnoringCase: regexString
	"Test if the receiver matches a regex.  May raise RxParser>>regexErrorSignal or
	child signals.
	This is a part of the Regular Expression Matcher package, (c) 1996, 1999 Vassili Bykov.
	Refer to `documentation' protocol of RxParser class for details."

	^regexString asRegexIgnoringCase matches: self!

----- Method: String>>occurrencesOfRegex: (in category '*Regex-Core') -----
occurrencesOfRegex: rxString

	| count |
	count := 0.
	self regex: rxString matchesDo: [ :each | count := count + 1 ].
	^count!

----- Method: String>>prefixMatchesRegex: (in category '*Regex-Core') -----
prefixMatchesRegex: regexString
	"Test if the receiver's prefix matches a regex.	
	May raise RxParser class>>regexErrorSignal or child signals.
	This is a part of the Regular Expression Matcher package, (c) 1996, 1999 Vassili Bykov.
	Refer to `documentation' protocol of RxParser class for details."

	^regexString asRegex matchesPrefix: self!

----- Method: String>>prefixMatchesRegexIgnoringCase: (in category '*Regex-Core') -----
prefixMatchesRegexIgnoringCase: regexString
	"Test if the receiver's prefix matches a regex.	
	May raise RxParser class>>regexErrorSignal or child signals.
	This is a part of the Regular Expression Matcher package, (c) 1996, 1999 Vassili Bykov.
	Refer to `documentation' protocol of RxParser class for details."

	^regexString asRegexIgnoringCase matchesPrefix: self!

----- Method: String>>regex:matchesCollect: (in category '*Regex-Core') -----
regex: rxString matchesCollect: aBlock

	^rxString asRegex matchesIn: self collect: aBlock!

----- Method: String>>regex:matchesDo: (in category '*Regex-Core') -----
regex: rxString matchesDo: aBlock

	^rxString asRegex matchesIn: self do: aBlock!

----- Method: String>>search: (in category '*Regex-Core') -----
search: aString
	"Compatibility method to make regexes and strings work polymorphically."
	^ aString includesSubstring: self!
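
The String extensions above can be tried from a workspace. This is a small sketch, assuming the default matcher class supports the usual \d and \w escapes; the sample strings are made up for illustration.

```smalltalk
"Whole-string and prefix tests."
'abc123' matchesRegex: '[a-z]+\d+'.			"true"
'abc123' matchesRegex: '[a-z]+'.			"false, the digits are left over"
'abc123' prefixMatchesRegex: '[a-z]+'.		"true"
'ABC' matchesRegexIgnoringCase: 'abc'.		"true"

"Enumerating and counting matches."
'Now is the Time' allRegexMatches: '\w+'.	"an OrderedCollection('Now' 'is' 'the' 'Time')"
'Now is the Time' occurrencesOfRegex: '\w+'.	"4"

"Copying with replacement."
'ab12cd' copyWithRegex: '\d+' matchesReplacedWith: '#'.	"'ab#cd'"

"search: lets a plain String stand in where a regex is expected."
'the' search: 'Now is the Time'.			"true"
```

Each call compiles the regex again through asRegex, so code that reuses one pattern many times may prefer to keep the compiled matcher around.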

Error subclass: #RegexError
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core-Exceptions'!

!RegexError commentStamp: 'Tbn 11/12/2010 22:37' prior: 0!
This is a common superclass for errors in regular expressions.!

RegexError subclass: #RegexCompilationError
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core-Exceptions'!

!RegexCompilationError commentStamp: 'Tbn 11/12/2010 22:38' prior: 0!
This class represents compilation errors in regular expressions.!

RegexError subclass: #RegexMatchingError
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core-Exceptions'!

!RegexMatchingError commentStamp: 'Tbn 11/12/2010 22:38' prior: 0!
This class represents matching errors in regular expressions.!

RegexError subclass: #RegexSyntaxError
	instanceVariableNames: 'position'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core-Exceptions'!

!RegexSyntaxError commentStamp: 'Tbn 11/12/2010 22:38' prior: 0!
This class represents syntax errors in regular expressions.!

----- Method: RegexSyntaxError class>>signal:at: (in category 'signaling') -----
signal: anErrorMessage at: errorPosition
	^ (self new)
		position: errorPosition;
		signal: anErrorMessage!

----- Method: RegexSyntaxError>>position (in category 'accessing') -----
position
	"Answer the position at which parsing failed."
	^ position!

----- Method: RegexSyntaxError>>position: (in category 'accessing') -----
position: anInteger
	position := anInteger.!
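
A short sketch of how the exception classes above can be used. The error is signaled by hand here so the example relies only on the definitions in this snapshot; the message texts are invented.

```smalltalk
| reported |
"signal:at: records the position before signaling."
reported := [ RegexSyntaxError signal: 'unexpected character' at: 3 ]
	on: RegexSyntaxError
	do: [ :error | error messageText -> error position ].
reported.	"'unexpected character' -> 3"

"RegexError is the common superclass, so one handler covers every subclass."
[ RegexCompilationError new signal: 'demo' ]
	on: RegexError
	do: [ :error | error class name ].	"#RegexCompilationError"
```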

Object subclass: #RxCharSetParser
	instanceVariableNames: 'source lookahead elements'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxCharSetParser commentStamp: 'Tbn 11/12/2010 23:13' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
I am a parser created to parse the insides of a character set ([...]) construct. I create and answer a collection of "elements", each being an instance of one of: RxsCharacter, RxsRange, or RxsPredicate.

Instance Variables:

	source	<Stream>	open on whatever is inside the square brackets we have to parse.
	lookahead	<Character>	The current lookahead character
	elements	<Collection of: <RxsCharacter|RxsRange|RxsPredicate>> Parsing result!

----- Method: RxCharSetParser class>>on: (in category 'instance creation') -----
on: aStream

	^self new initialize: aStream!

----- Method: RxCharSetParser>>addChar: (in category 'parsing') -----
addChar: aChar

	elements add: (RxsCharacter with: aChar)!

----- Method: RxCharSetParser>>addRangeFrom:to: (in category 'parsing') -----
addRangeFrom: firstChar to: lastChar

	firstChar asInteger > lastChar asInteger ifTrue:
		[RxParser signalSyntaxException: ' bad character range' at: source position].
	elements add: (RxsRange from: firstChar to: lastChar)!

----- Method: RxCharSetParser>>initialize: (in category 'initialize-release') -----
initialize: aStream

	source := aStream.
	lookahead := aStream next.
	elements := OrderedCollection new!

----- Method: RxCharSetParser>>match: (in category 'parsing') -----
match: aCharacter

	aCharacter = lookahead ifTrue: [ ^self next ].
	RxParser 
		signalSyntaxException: 'unexpected character: ', (String with: lookahead)
		at: source position!

----- Method: RxCharSetParser>>next (in category 'parsing') -----
next

	^lookahead := source next!

----- Method: RxCharSetParser>>parse (in category 'accessing') -----
parse

	lookahead == $- ifTrue: [
		self addChar: $-.
		self next ].
	[ lookahead == nil ] whileFalse: [ self parseStep ].
	^elements!

----- Method: RxCharSetParser>>parseCharOrRange (in category 'parsing') -----
parseCharOrRange

	| firstChar |
	firstChar := lookahead.
	self next == $- ifFalse: [ ^self addChar: firstChar ].
	self next ifNil: [ ^self addChar: firstChar; addChar: $- ].
	self addRangeFrom: firstChar to: lookahead.
	self next!

----- Method: RxCharSetParser>>parseEscapeChar (in category 'parsing') -----
parseEscapeChar

	self match: $\.
	$- == lookahead
		ifTrue: [elements add: (RxsCharacter with: $-)]
		ifFalse: [elements add: (RxsPredicate forEscapedLetter: lookahead)].
	self next!

----- Method: RxCharSetParser>>parseNamedSet (in category 'parsing') -----
parseNamedSet

	| name |
	self match: $[; match: $:.
	name := (String with: lookahead), (source upTo: $:).
	self next.
	self match: $].
	elements add: (RxsPredicate forNamedClass: name)!

----- Method: RxCharSetParser>>parseStep (in category 'parsing') -----
parseStep

	lookahead == $[ ifTrue:
		[source peek == $:
			ifTrue: [^self parseNamedSet]
			ifFalse: [^self parseCharOrRange]].
	lookahead == $\ ifTrue:
		[^self parseEscapeChar].
	lookahead == $- ifTrue:
		[RxParser signalSyntaxException: 'invalid range' at: source position].
	self parseCharOrRange!
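
The parsing methods above can be exercised directly. This sketch assumes that RxsCharacter with: and RxsRange from:to: answer instances of their own class; those classes are not shown in this part of the snapshot.

```smalltalk
| elements |
"The stream holds only what appears between the square brackets."
elements := (RxCharSetParser on: '-a-z_' readStream) parse.

"A leading - is a literal, a-z is a range, _ is a single character."
elements size.	"3"
elements collect: [ :each | each class name ].
	"expected: #RxsCharacter, #RxsRange, #RxsCharacter"
```

A reversed range such as 'z-a' should instead end in the syntax exception raised by addRangeFrom:to:.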

Object subclass: #RxMatchOptimizer
	instanceVariableNames: 'ignoreCase prefixes nonPrefixes conditions testBlock methodPredicates nonMethodPredicates predicates nonPredicates lookarounds'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxMatchOptimizer commentStamp: 'Tbn 11/12/2010 23:13' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A match start optimizer, handy for searching a string. It takes a regex syntax tree and sets itself up so that prefix characters or matcher states that cannot start a match are later recognized by the #canStartMatch:in: method.

Used by RxMatcher, but can be used by other matchers (if implemented) as well.!

----- Method: RxMatchOptimizer>>canStartMatch:in: (in category 'accessing') -----
canStartMatch: aCharacter in: aMatcher 
	"Answer whether a match could commence at the given lookahead
	character, or in the current state of <aMatcher>. True answered
	by this method does not mean a match will definitely occur, while false
	answered by this method *does* guarantee a match will never occur."

	aCharacter ifNil: [ ^true ].
	testBlock ifNil: [ ^true ].
	^testBlock value: aCharacter value: aMatcher!
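
A sketch of the optimizer used on its own. It assumes that RxParser new parse: answers a syntax tree whose literal characters end up in syntaxCharacter:, which this part of the snapshot does not show. Passing nil as the matcher works here only because the prefix test for this pattern never consults it.

```smalltalk
| tree optimizer |
tree := RxParser new parse: 'ab|cd'.
optimizer := RxMatchOptimizer new initialize: tree ignoreCase: false.

"Only the first character of each branch can start a match."
optimizer canStartMatch: $a in: nil.	"expected: true"
optimizer canStartMatch: $c in: nil.	"expected: true"
optimizer canStartMatch: $x in: nil.	"expected: false"

"A nil lookahead is never ruled out."
optimizer canStartMatch: nil in: nil.	"true"
```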

----- Method: RxMatchOptimizer>>conditionTester (in category 'accessing') -----
conditionTester
	"#any condition is filtered at the higher level;
	it cannot appear among the conditions here."

	| matchConditions |
	conditions isEmpty ifTrue: [^nil].
	conditions size = 1 ifTrue: [
		| matchCondition |
		matchCondition := conditions anyOne.
		"Special case all of the possible conditions."
		#atBeginningOfLine == matchCondition ifTrue: [^[:c :matcher | matcher atBeginningOfLine]].
		#atEndOfLine == matchCondition ifTrue: [^[:c :matcher | matcher atEndOfLine]].
		#atBeginningOfWord == matchCondition ifTrue: [^[:c :matcher | matcher atBeginningOfWord]].
		#atEndOfWord == matchCondition ifTrue: [^[:c :matcher | matcher atEndOfWord]].
		#atWordBoundary == matchCondition ifTrue: [^[:c :matcher | matcher atWordBoundary]].
		#notAtWordBoundary == matchCondition ifTrue: [^[:c :matcher | matcher notAtWordBoundary]].
		RxParser signalCompilationException: 'invalid match condition'].
	"More than one condition. Capture them as an array in scope."
	matchConditions := conditions asArray.
	^[ :c :matcher |
		matchConditions anySatisfy: [ :conditionSelector |
			matcher perform: conditionSelector ] ]!

----- Method: RxMatchOptimizer>>determineTestMethod (in category 'private') -----
determineTestMethod
	"Answer a block closure that will work as a can-match predicate.
	Answer nil if no viable optimization is possible (too many chars would
	be able to start a match)."

	| testers size |
	(conditions includes: #any) ifTrue: [^nil].
	testers := {
		self prefixTester.
		self nonPrefixTester.
		self conditionTester.
		self methodPredicateTester.
		self nonMethodPredicateTester.
		self predicateTester.
		self nonPredicateTester } reject: [ :each | each isNil ].
	(size := testers size) = 0 ifTrue: [ ^nil ].
	size = 1 ifTrue: [ ^testers first ].
	^[ :char :matcher | testers anySatisfy: [ :t | t value: char value: matcher ] ]!

----- Method: RxMatchOptimizer>>initialize:ignoreCase: (in category 'initialize-release') -----
initialize: aRegex ignoreCase: aBoolean 
	"Set the `testBlock' variable to a can-match predicate block:
	two-argument block which accepts a lookahead character
	and a matcher (presumably built from aRegex) and answers 
	a boolean indicating whether a match could start at the given
	lookahead. "

	ignoreCase := aBoolean.
	prefixes := Set new: 10.
	nonPrefixes := Set new: 10.
	conditions := Set new: 3.
	methodPredicates := Set new: 3.
	nonMethodPredicates := Set new: 3.
	predicates := Set new: 3.
	nonPredicates := Set new: 3.
	lookarounds := Set new: 3.
	aRegex dispatchTo: self.	"If the whole expression is nullable, 
		end-of-line is an implicit can-match condition!!"
	aRegex isNullable ifTrue: [conditions add: #atEndOfLine].
	testBlock := self determineTestMethod!

----- Method: RxMatchOptimizer>>methodPredicateTester (in category 'accessing') -----
methodPredicateTester

	| p size |
	(size := methodPredicates size) = 0 ifTrue: [ ^nil ].
	size = 1 ifTrue: [
		|  selector |
		"might be a pretty common case"
		selector := methodPredicates anyOne.
		^[ :char :matcher | 
			RxParser doHandlingMessageNotUnderstood: [
				char perform: selector ] ] ].
	p := methodPredicates asArray.
	^[ :char :matcher | 
		RxParser doHandlingMessageNotUnderstood: [
			p anySatisfy: [ :sel | char perform: sel ] ] ]!

----- Method: RxMatchOptimizer>>nonMethodPredicateTester (in category 'accessing') -----
nonMethodPredicateTester

	| p size |
	(size := nonMethodPredicates size) = 0 ifTrue: [ ^nil ].
	size = 1 ifTrue: [
		| selector |
		selector := nonMethodPredicates anyOne.
		^[ :char :matcher | 
			RxParser doHandlingMessageNotUnderstood: [
				(char perform: selector) not ] ] ].
	p := nonMethodPredicates asArray.
	^[:char :m | 
		RxParser doHandlingMessageNotUnderstood: [
			(p allSatisfy: [:sel | char perform: sel ]) not ] ]!

----- Method: RxMatchOptimizer>>nonPredicateTester (in category 'private') -----
nonPredicateTester

	| p size |
	(size := nonPredicates size) = 0 ifTrue: [ ^nil ].
	size = 1 ifTrue:  [
		| predicate |
		predicate := nonPredicates anyOne.
		^[ :char :matcher | (predicate value: char) not] ].
	p := nonPredicates asArray.
	^[ :char :m | (p allSatisfy: [:some | some value: char ]) not ]!

----- Method: RxMatchOptimizer>>nonPrefixTester (in category 'private') -----
nonPrefixTester

	| size |
	(size := nonPrefixes size) = 0 ifTrue: [ ^nil ].
	size = 1 ifTrue: [
		| nonPrefixChar |
		nonPrefixChar := nonPrefixes anyOne.
		^[ :char :matcher | char ~= nonPrefixChar ] ].
	^[ :char :matcher | (nonPrefixes includes: char) not ]!

----- Method: RxMatchOptimizer>>optimizeSet: (in category 'private') -----
optimizeSet: aSet
	"If a set is small, convert it to array to speed up lookup
	(Array has no hashing overhead, beats Set on small number
	of elements)."

	^aSet size < 10 ifTrue: [aSet asArray] ifFalse: [aSet]!

----- Method: RxMatchOptimizer>>predicateTester (in category 'private') -----
predicateTester

	| p size |
	(size := predicates size) = 0 ifTrue: [ ^nil ].
	size = 1 ifTrue: [
		| pred |
		pred := predicates anyOne.
		^[ :char :matcher | pred value: char ] ].
	p := predicates asArray. 
	^[ :char :matcher | p anySatisfy: [:some | some value: char ] ]!

----- Method: RxMatchOptimizer>>prefixTester (in category 'private') -----
prefixTester

	| p size |
	(size := prefixes size) = 0 ifTrue: [ ^nil ].
	size = 1 ifTrue: [
		| prefixChar |
		prefixChar := prefixes anyOne.
		ignoreCase ifTrue: [ ^[ :char :matcher | char sameAs: prefixChar ] ].
		^[ :char :matcher | char = prefixChar ] ].
	ignoreCase ifFalse: [ ^[ :char :matcher | prefixes includes: char ] ].
	p := prefixes collect: [ :each | each asUppercase ].
	^[ :char :matcher | p includes: char asUppercase ]!

----- Method: RxMatchOptimizer>>syntaxAny (in category 'double dispatch') -----
syntaxAny
	"Any special char is among the prefixes."

	conditions add: #any!

----- Method: RxMatchOptimizer>>syntaxBeginningOfLine (in category 'double dispatch') -----
syntaxBeginningOfLine
	"Beginning of line is among the prefixes."

	conditions add: #atBeginningOfLine!

----- Method: RxMatchOptimizer>>syntaxBeginningOfWord (in category 'double dispatch') -----
syntaxBeginningOfWord
	"Beginning of word is among the prefixes."

	conditions add: #atBeginningOfWord!

----- Method: RxMatchOptimizer>>syntaxBranch: (in category 'double dispatch') -----
syntaxBranch: branchNode
	"If the head piece of the branch is transparent (allows 0 matches),
	we must recurse down the branch. Otherwise, just the head atom
	is important."

	(branchNode piece isNullable and: [branchNode branch notNil])
		ifTrue: [branchNode branch dispatchTo: self].
	branchNode piece dispatchTo: self!

----- Method: RxMatchOptimizer>>syntaxCharSet: (in category 'double dispatch') -----
syntaxCharSet: charSetNode 
	"All of these (or none of these) characters are the prefix."

	(charSetNode enumerableSetIgnoringCase: ignoreCase) ifNotNil: [ :enumerableSet |
		charSetNode isNegated
			ifTrue: [ nonPrefixes addAll: enumerableSet ]
			ifFalse: [ prefixes addAll: enumerableSet ] ].
	charSetNode predicates ifNotNil: [ :charsetPredicates |
		charSetNode isNegated
			ifTrue: [ nonPredicates addAll: charsetPredicates ]
			ifFalse: [ predicates addAll: charsetPredicates ] ]!

----- Method: RxMatchOptimizer>>syntaxCharacter: (in category 'double dispatch') -----
syntaxCharacter: charNode
	"This character is the prefix, or one of them."

	prefixes add: charNode character!

----- Method: RxMatchOptimizer>>syntaxEndOfLine (in category 'double dispatch') -----
syntaxEndOfLine
	"End of line is among the prefixes."

	conditions add: #atEndOfLine!

----- Method: RxMatchOptimizer>>syntaxEndOfWord (in category 'double dispatch') -----
syntaxEndOfWord

	conditions add: #atEndOfWord!

----- Method: RxMatchOptimizer>>syntaxEpsilon (in category 'double dispatch') -----
syntaxEpsilon
	"Empty string, terminate the recursion (do nothing)."!

----- Method: RxMatchOptimizer>>syntaxLookaround: (in category 'double dispatch') -----
syntaxLookaround: lookaroundNode 

	lookarounds add: lookaroundNode!

----- Method: RxMatchOptimizer>>syntaxMessagePredicate: (in category 'double dispatch') -----
syntaxMessagePredicate: messagePredicateNode 
	messagePredicateNode negated
		ifTrue: [nonMethodPredicates add: messagePredicateNode selector]
		ifFalse: [methodPredicates add: messagePredicateNode selector]!

----- Method: RxMatchOptimizer>>syntaxNonWordBoundary (in category 'double dispatch') -----
syntaxNonWordBoundary

	conditions add: #notAtWordBoundary!

----- Method: RxMatchOptimizer>>syntaxPiece: (in category 'double dispatch') -----
syntaxPiece: pieceNode
	"Pass on to the atom."

	pieceNode atom dispatchTo: self!

----- Method: RxMatchOptimizer>>syntaxPredicate: (in category 'double dispatch') -----
syntaxPredicate: predicateNode 

	predicates add: predicateNode predicate!

----- Method: RxMatchOptimizer>>syntaxRegex: (in category 'double dispatch') -----
syntaxRegex: regexNode
	"All prefixes of the regex's branches should be combined.
	Therefore, just recurse."

	regexNode branch dispatchTo: self.
	regexNode regex notNil
		ifTrue: [regexNode regex dispatchTo: self]!

----- Method: RxMatchOptimizer>>syntaxWordBoundary (in category 'double dispatch') -----
syntaxWordBoundary

	conditions add: #atWordBoundary!

Object subclass: #RxMatcher
	instanceVariableNames: 'matcher ignoreCase startOptimizer stream markerPositions markerCount lastResult'
	classVariableNames: 'Cr Lf'
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxMatcher commentStamp: 'Tbn 11/12/2010 23:13' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
This is a recursive regex matcher. Not strikingly efficient, but simple. Also, keeps track of matched subexpressions.  The life cycle goes as follows:

1. Initialization. Accepts a syntax tree (presumably produced by RxParser) and compiles it into a matcher built of other classes in this category.

2. Matching. Accepts a stream or a string and returns a boolean indicating whether the whole stream or its prefix -- depending on the message sent -- matches the regex.

3. Subexpression query. After a successful match, and before any other match, the matcher may be queried about the range of specific stream (string) positions that matched to certain parenthesized subexpressions of the original expression.

Any number of queries may follow a successful match, and any number of matches may follow a successful initialization.

Note that `matcher' is actually a sort of a misnomer. The actual matcher is a web of Rxm* instances built by RxMatcher during initialization. RxMatcher is just the interface facade of this network.  It is also a builder of it, and also provides a stream-like protocol to easily access the stream being matched.

Instance variables:
	matcher				<RxmLink> The entry point into the actual matcher.
	ignoreCase			<Boolean> Whether matching ignores case.
	startOptimizer		<RxMatchOptimizer> Rules out positions where a match cannot start.
	stream				<Stream> The stream currently being matched against.
	markerPositions		<Array of: Integer> Positions of markers' matches.
	markerCount		<Integer> Number of markers.
	lastResult 			<Boolean> Whether the latest match attempt succeeded or not.

The character last seen in the matcher stream is not stored; #lastChar computes it from the stream position.!

----- Method: RxMatcher class>>for: (in category 'instance creation') -----
for: aRegex
	"Create and answer a matcher that will match a regular expression
	specified by the syntax tree of which `aRegex' is a root."

	^self for: aRegex ignoreCase: false!

----- Method: RxMatcher class>>for:ignoreCase: (in category 'instance creation') -----
for: aRegex ignoreCase: aBoolean
	"Create and answer a matcher that will match a regular expression
	specified by the syntax tree of which `aRegex' is a root."

	^self new
		initialize: aRegex
		ignoreCase: aBoolean!

----- Method: RxMatcher class>>forString: (in category 'instance creation') -----
forString: aString
	"Create and answer a matcher that will match the regular expression
	`aString'."

	^self for: (RxParser new parse: aString)!

----- Method: RxMatcher class>>forString:ignoreCase: (in category 'instance creation') -----
forString: aString ignoreCase: aBoolean
	"Create and answer a matcher that will match the regular expression
	`aString'."

	^self for: (RxParser new parse: aString) ignoreCase: aBoolean!
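
A brief sketch of the instance creation methods above. It assumes #matchesPrefix: is implemented later in RxMatcher, as String>>prefixMatchesRegex: suggests; the pattern and inputs are illustrative.

```smalltalk
| matcher |
matcher := RxMatcher forString: 'h.llo' ignoreCase: true.

matcher matches: 'HELLO'.				"true"
matcher matches: 'hello world'.			"false, #matches: needs the whole string to match"
matcher matchesPrefix: 'hello world'.	"true"
```

The matcher is compiled once and reused for all three calls, which is the main reason to build it explicitly instead of going through the String helpers.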

----- Method: RxMatcher class>>initialize (in category 'class initialization') -----
initialize
	"RxMatcher initialize"
	Cr := Character cr.
	Lf := Character lf.!

----- Method: RxMatcher>>allocateMarker (in category 'private') -----
allocateMarker
	"Answer an integer to use as an index of the next marker."

	markerCount := markerCount + 1.
	^markerCount!

----- Method: RxMatcher>>atBeginningOfLine (in category 'testing') -----
atBeginningOfLine

	^self position = 0 or: [self lastChar = Cr]!

----- Method: RxMatcher>>atBeginningOfWord (in category 'testing') -----
atBeginningOfWord

	^(self isWordChar: self lastChar) not
		and: [self isWordChar: stream peek]!

----- Method: RxMatcher>>atEnd (in category 'streaming') -----
atEnd

	^stream atEnd!

----- Method: RxMatcher>>atEndOfLine (in category 'testing') -----
atEndOfLine

	^self atEnd or: [stream peek = Cr]!

----- Method: RxMatcher>>atEndOfWord (in category 'testing') -----
atEndOfWord

	^(self isWordChar: self lastChar)
		and: [(self isWordChar: stream peek) not]!

----- Method: RxMatcher>>atWordBoundary (in category 'testing') -----
atWordBoundary

	^(self isWordChar: self lastChar)
		xor: (self isWordChar: stream peek)!

----- Method: RxMatcher>>buildFrom: (in category 'accessing') -----
buildFrom: aSyntaxTreeRoot
	"Private - Entry point of matcher build process."

	markerCount := 0.  "must go before #dispatchTo: !!"
	matcher := aSyntaxTreeRoot dispatchTo: self.
	matcher terminateWith: RxmTerminator new!

----- Method: RxMatcher>>copy:replacingMatchesWith: (in category 'match enumeration') -----
copy: aString replacingMatchesWith: replacementString
	"Copy <aString>, except for the matches. Replace each match with <replacementString>."

	| answer |
	answer := (String new: 40) writeStream.
	self
		copyStream: aString readStream
		to: answer
		replacingMatchesWith: replacementString.
	^answer contents!

----- Method: RxMatcher>>copy:translatingMatchesUsing: (in category 'match enumeration') -----
copy: aString translatingMatchesUsing: aBlock
	"Copy <aString>, except for the matches. For each match, evaluate <aBlock> passing the matched substring as the argument.  Expect the block to answer a String, and replace the match with the answer."

	| answer |
	answer := (String new: 40) writeStream.
	self copyStream: aString readStream to: answer translatingMatchesUsing: aBlock.
	^answer contents!
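
The two copy methods above can be compared side by side. This is a workspace sketch with made-up input strings, using patterns that never match the empty string so the zero-length match handling is not involved.

```smalltalk
| digits words |
"Every match is replaced by the same string."
digits := '\d+' asRegex.
digits copy: 'ab12cd345' replacingMatchesWith: '#'.	"'ab#cd#'"

"Every match is passed to the block, and the block's answer is written instead."
words := '\w+' asRegex.
words copy: 'now is the time' translatingMatchesUsing: [ :each | each asUppercase ].
	"'NOW IS THE TIME'"
```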

----- Method: RxMatcher>>copyStream:to:replacingMatchesWith: (in category 'match enumeration') -----
copyStream: aStream to: writeStream replacingMatchesWith: aString
	"Copy the contents of <aStream> on the <writeStream>, except for the matches. Replace each match with <aString>."

	| searchStart matchStart matchEnd |
	stream := aStream.
	markerPositions := nil.
	[searchStart := aStream position.
	self proceedSearchingStream: aStream] whileTrue:
		[matchStart := (self subBeginning: 1) first.
		matchEnd := (self subEnd: 1) first.
		aStream position: searchStart.
		searchStart to: matchStart - 1 do:
			[:ignoredPos | writeStream nextPut: aStream next].
		writeStream nextPutAll: aString.
		aStream position: matchEnd.
		"Be extra careful about successful matches which consume no input.
		After those, make sure to advance or finish if already at end."
		matchEnd = searchStart ifTrue: 
			[aStream atEnd
				ifTrue:	[^self "rest after end of whileTrue: block is a no-op if atEnd"]
				ifFalse:	[writeStream nextPut: aStream next]]].
	aStream position: searchStart.
	[aStream atEnd] whileFalse: [writeStream nextPut: aStream next]!

----- Method: RxMatcher>>copyStream:to:translatingMatchesUsing: (in category 'match enumeration') -----
copyStream: aStream to: writeStream translatingMatchesUsing: aBlock
	"Copy the contents of <aStream> on the <writeStream>, except for the matches. For each match, evaluate <aBlock> passing the matched substring as the argument.  Expect the block to answer a String, and write the answer to <writeStream> in place of the match."

	| searchStart matchStart matchEnd match |
	stream := aStream.	
	markerPositions := nil.
	[searchStart := aStream position.
	self proceedSearchingStream: aStream] whileTrue:
		[matchStart := (self subBeginning: 1) first.
		matchEnd := (self subEnd: 1) first.
		aStream position: searchStart.
		searchStart to: matchStart - 1 do:
			[:ignoredPos | writeStream nextPut: aStream next].
		match := (String new: matchEnd - matchStart + 1) writeStream.
		matchStart to: matchEnd - 1 do:
			[:ignoredPos | match nextPut: aStream next].
		writeStream nextPutAll: (aBlock value: match contents).
		"Be extra careful about successful matches which consume no input.
		After those, make sure to advance or finish if already at end."
		matchEnd = searchStart ifTrue: 
			[aStream atEnd
				ifTrue:	[^self "rest after end of whileTrue: block is a no-op if atEnd"]
				ifFalse:	[writeStream nextPut: aStream next]]].
	aStream position: searchStart.
	[aStream atEnd] whileFalse: [writeStream nextPut: aStream next]!

----- Method: RxMatcher>>currentState (in category 'privileged') -----
currentState
	"Answer an opaque object that can later be used to restore the matcher's state (for backtracking)."

	^stream position!

----- Method: RxMatcher>>hookBranchOf:onto: (in category 'private') -----
hookBranchOf: regexNode onto: endMarker
	"Private - Recurse down the chain of regexes starting at
	regexNode, compiling their branches and hooking their tails 
	to the endMarker node."

	| rest |
	rest := regexNode regex ifNotNil: [ :regex |
		self hookBranchOf: regex onto: endMarker ].
	^RxmBranch new
		next: ((regexNode branch dispatchTo: self)
					pointTailTo: endMarker; 
					yourself);
		alternative: rest;
		yourself!

----- Method: RxMatcher>>initialize:ignoreCase: (in category 'initialize-release') -----
initialize: syntaxTreeRoot ignoreCase: aBoolean
	"Compile thyself for the regex with the specified syntax tree.
	See comment and `building' protocol in this class and 
	#dispatchTo: methods in syntax tree components for details 
	on double-dispatch building. 
	The argument is supposedly a RxsRegex."

	ignoreCase := aBoolean.
	self buildFrom: syntaxTreeRoot.
	startOptimizer := RxMatchOptimizer new initialize: syntaxTreeRoot ignoreCase: aBoolean!

----- Method: RxMatcher>>isWordChar: (in category 'private') -----
isWordChar: aCharacterOrNil
	"Answer whether the argument is a word constituent character:
	alphanumeric or _."

	^aCharacterOrNil ~~ nil
		and: [aCharacterOrNil isAlphaNumeric]!

----- Method: RxMatcher>>lastChar (in category 'accessing') -----
lastChar
	^ stream position = 0
		ifFalse: [ stream skip: -1; next ]!

----- Method: RxMatcher>>lastResult (in category 'accessing') -----
lastResult

	^lastResult!

----- Method: RxMatcher>>makeOptional: (in category 'private') -----
makeOptional: aMatcher
	"Private - Wrap this matcher so that the result would match 0 or 1
	occurrences of the matcher."

	| dummy branch |
	dummy := RxmLink new.
	branch := (RxmBranch new beLoopback)
		next: aMatcher;
		alternative: dummy.
	aMatcher pointTailTo: dummy.
	^branch!

----- Method: RxMatcher>>makePlus: (in category 'private') -----
makePlus: aMatcher
	"Private - Wrap this matcher so that the result would match 1 or more
	occurrences of the matcher."

	| loopback |
	loopback := (RxmBranch new beLoopback)
		next: aMatcher.
	aMatcher pointTailTo: loopback.
	^aMatcher!

----- Method: RxMatcher>>makeQuantified:min:max: (in category 'private') -----
makeQuantified: anRxmLink min: min max: max 
	"Perform recursive poor-man's transformation of the {<min>,<max>} quantifiers."
	| aMatcher |

	"<atom>{,<max>}       ==>  (<atom>{1,<max>})?"
	min = 0 ifTrue: [ 
		^ self makeOptional: (self makeQuantified: anRxmLink min: 1 max: max) ].

	"<atom>{<min>,}       ==>  <atom>{<min>-1, <min>-1}<atom>+"
	max ifNil: [
		^ (self makeQuantified: anRxmLink min: 1 max: min-1) pointTailTo: (self makePlus: anRxmLink copy) ].

	"<atom>{<max>,<max>}  ==>  <atom><atom> ... <atom>"
	min = max 
		ifTrue: [ 
			aMatcher := anRxmLink copy.
			(min-1) timesRepeat: [ aMatcher pointTailTo: anRxmLink copy ].
			^ aMatcher ].

	"<atom>{<min>,<max>}  ==>  <atom>{<min>,<min>}(<atom>{1,<max>-1})?"
	aMatcher := self makeOptional: anRxmLink copy.
	(max - min - 1) timesRepeat: [ 
		 aMatcher := self makeOptional: (anRxmLink copy pointTailTo: aMatcher) ].
	^ (self makeQuantified: anRxmLink min: min max: min) pointTailTo: aMatcher!
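
The effect of the bounded rewrite rules can be observed from the outside. This sketch assumes the parser in this package accepts the {min,max} syntax and hands it to makeQuantified:min:max:, which is not shown in this part of the snapshot. Only bounded quantifiers are used.

```smalltalk
"With min = 2 and max = 4 the last rule applies: two mandatory copies,
followed by up to two nested optional ones."
'a' matchesRegex: 'a{2,4}'.		"expected: false"
'aa' matchesRegex: 'a{2,4}'.		"expected: true"
'aaaa' matchesRegex: 'a{2,4}'.		"expected: true"
'aaaaa' matchesRegex: 'a{2,4}'.	"expected: false"

"With min = max the atom is simply copied min times."
'aaa' matchesRegex: 'a{3,3}'.		"expected: true"
```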

----- Method: RxMatcher>>makeStar: (in category 'private') -----
makeStar: aMatcher
	"Private - Wrap this matcher so that the result would match 0 or more
	occurrences of the matcher."

	| dummy detour loopback |
	dummy := RxmLink new.
	detour := RxmBranch new
		next: aMatcher;
		alternative: dummy.
	loopback := (RxmBranch new beLoopback)
		next: aMatcher;
		alternative: dummy.
	aMatcher pointTailTo: loopback.
	^detour!

----- Method: RxMatcher>>markerPositionAt:add: (in category 'privileged') -----
markerPositionAt: anIndex add: position
	"Remember position of another instance of the given marker."

	(markerPositions at: anIndex) addFirst: position!

----- Method: RxMatcher>>matches: (in category 'accessing') -----
matches: aString
	"Match against a string. Return true if the complete String matches.
	If you want to search for occurrences anywhere in the String see #search:"

	^self matchesStream: aString readStream!

----- Method: RxMatcher>>matchesIn: (in category 'match enumeration') -----
matchesIn: aString
	"Search aString repeatedly for the matches of the receiver.  Answer an OrderedCollection of all matches (substrings)."

	| result |
	result := OrderedCollection new.
	self
		matchesOnStream: aString readStream
		do: [:match | result add: match].
	^result!

----- Method: RxMatcher>>matchesIn:collect: (in category 'match enumeration') -----
matchesIn: aString collect: aBlock
	"Search aString repeatedly for the matches of the receiver.  Evaluate aBlock for each match passing the matched substring as the argument, collect evaluation results in an OrderedCollection, and return it. The following example collects the words of a string, converted to lowercase."
	"'\w+' asRegex matchesIn: 'Now is the Time' collect: [:each | each asLowercase]"

	| result |
	result := OrderedCollection new.
	self
		matchesOnStream: aString readStream
		do: [:match | result add: (aBlock value: match)].
	^result!

----- Method: RxMatcher>>matchesIn:do: (in category 'match enumeration') -----
matchesIn: aString do: aBlock
	"Search aString repeatedly for the matches of the receiver.
	Evaluate aBlock for each match passing the matched substring
	as the argument."

	self
		matchesOnStream: aString readStream
		do: aBlock!

----- Method: RxMatcher>>matchesOnStream: (in category 'match enumeration') -----
matchesOnStream: aStream

	| result |
	result := OrderedCollection new.
	self
		matchesOnStream: aStream
		do: [:match | result add: match].
	^result!

----- Method: RxMatcher>>matchesOnStream:collect: (in category 'match enumeration') -----
matchesOnStream: aStream collect: aBlock

	| result |
	result := OrderedCollection new.
	self
		matchesOnStream: aStream
		do: [:match | result add: (aBlock value: match)].
	^result!

----- Method: RxMatcher>>matchesOnStream:do: (in category 'match enumeration') -----
matchesOnStream: aStream do: aBlock
	"Be extra careful about successful matches which consume no input.
	After those, make sure to advance or finish if already at end."

	| position subexpression |
	[
		position := aStream position.
		self searchStream: aStream
	] whileTrue: [
		subexpression := self subexpression: 1.
		aBlock value: subexpression.
		subexpression size = 0 ifTrue: [
			aStream atEnd
				ifTrue: [^self]
				ifFalse: [aStream next]]]!

----- Method: RxMatcher>>matchesPrefix: (in category 'accessing') -----
matchesPrefix: aString
	"Match against a string. Return true if a prefix matches.
	If you want to match 
		- the full string use #matches:
		- anywhere in the string use #search:"

	^self matchesStreamPrefix: aString readStream!

----- Method: RxMatcher>>matchesStream: (in category 'accessing') -----
matchesStream: theStream
	"Match thyself against a positionable stream."

	^(self matchesStreamPrefix: theStream)
		and: [stream atEnd]!

----- Method: RxMatcher>>matchesStreamPrefix: (in category 'accessing') -----
matchesStreamPrefix: theStream
	"Match thyself against a positionable stream."

	stream := theStream.
	markerPositions := nil.
	^self tryMatch!

----- Method: RxMatcher>>matchingRangesIn: (in category 'match enumeration') -----
matchingRangesIn: aString
	"Search aString repeatedly for the matches of the receiver.  Answer an OrderedCollection of ranges of each match (index of first character to: index of last character)."

	| result |
	result := OrderedCollection new.
	self
		matchesIn: aString 
		do: [:match | result add: (self position - match size + 1 to: self position)].
	^result!

----- Method: RxMatcher>>next (in category 'streaming') -----
next
	^ stream next!

----- Method: RxMatcher>>notAtWordBoundary (in category 'testing') -----
notAtWordBoundary

	^self atWordBoundary not!

----- Method: RxMatcher>>position (in category 'streaming') -----
position

	^stream position!

----- Method: RxMatcher>>proceedSearchingStream: (in category 'private') -----
proceedSearchingStream: aStream

	| position |
	position := aStream position.
	[aStream atEnd] whileFalse:
		[self tryMatch ifTrue: [^true].
		aStream position: position; next.
		position := aStream position].
	"Try match at the very stream end too!!"
	self tryMatch ifTrue: [^true]. 
	^false!

----- Method: RxMatcher>>restoreState: (in category 'privileged') -----
restoreState: streamPosition

	stream position: streamPosition!

----- Method: RxMatcher>>search: (in category 'accessing') -----
search: aString
	"Search anywhere in the String for occurrence of something matching myself.
	If you want to match the full String see #matches:
	Answer a Boolean indicating success."

	^self searchStream: aString readStream!

----- Method: RxMatcher>>searchStream: (in category 'accessing') -----
searchStream: aStream
	"Search the stream for occurrence of something matching myself.
	After the search has occurred, stop positioned after the end of the
	matched substring. Answer a Boolean indicating success."

	| position |
	stream := aStream.
	position := aStream position.
	markerPositions := nil.
	[aStream atEnd] whileFalse:
		[self tryMatch ifTrue: [^true].
		aStream position: position; next.
		position := aStream position].
	"Try match at the very stream end too!!"
	self tryMatch ifTrue: [^true]. 
	^false!

----- Method: RxMatcher>>subBeginning: (in category 'accessing') -----
subBeginning: subIndex

	^markerPositions at: subIndex * 2 - 1!

----- Method: RxMatcher>>subEnd: (in category 'accessing') -----
subEnd: subIndex

	^markerPositions at: subIndex * 2!

----- Method: RxMatcher>>subexpression: (in category 'accessing') -----
subexpression: subIndex
	"Answer a string that matched the subexpression at the given index.
	If there are multiple matches, answer the last one.
	If there are no matches, answer nil. 
	(NB: it used to answer an empty string but I think nil makes more sense)."

	| matches |
	matches := self subexpressions: subIndex.
	^matches isEmpty ifTrue: [nil] ifFalse: [matches last]!

----- Method: RxMatcher>>subexpressionCount (in category 'accessing') -----
subexpressionCount

	^markerCount // 2!

----- Method: RxMatcher>>subexpressions: (in category 'accessing') -----
subexpressions: subIndex
	"Answer an array of all matches of the subexpression at the given index.
	The answer is always an array; it is empty if there are no matches."

	| originalPosition startPositions stopPositions reply |
	originalPosition := stream position.
	startPositions := self subBeginning: subIndex.
	stopPositions := self subEnd: subIndex.
	(startPositions isEmpty or: [stopPositions isEmpty]) ifTrue: [^Array new].
	reply := Array new: startPositions size.
	1 to: reply size do: [ :index |
		| start stop |
		start := startPositions at: index.
		stop := stopPositions at: index.
		stream position: start.
		reply at: index put: (stream next: stop - start) ].
	stream position: originalPosition.
	^reply!

----- Method: RxMatcher>>supportsSubexpressions (in category 'testing') -----
supportsSubexpressions

	^true!

----- Method: RxMatcher>>syntaxAny (in category 'double dispatch') -----
syntaxAny
	"Double dispatch from the syntax tree. 
	Create a matcher for any non-null character."

	^RxmPredicate new
		predicate: [:char | char asInteger ~= 0]!

----- Method: RxMatcher>>syntaxBeginningOfLine (in category 'double dispatch') -----
syntaxBeginningOfLine
	"Double dispatch from the syntax tree. 
	Create a matcher for beginning-of-line condition."

	^RxmSpecial new beBeginningOfLine!

----- Method: RxMatcher>>syntaxBeginningOfWord (in category 'double dispatch') -----
syntaxBeginningOfWord
	"Double dispatch from the syntax tree. 
	Create a matcher for beginning-of-word condition."

	^RxmSpecial new beBeginningOfWord!

----- Method: RxMatcher>>syntaxBranch: (in category 'double dispatch') -----
syntaxBranch: branchNode
	"Double dispatch from the syntax tree. 
	Branch node is a link in a chain of concatenated pieces.
	First build the matcher for the rest of the chain, then make 
	it for the current piece and hook the rest to it."

	| piece branch |
	piece := branchNode piece.
	branch := branchNode branch ifNil: [ ^piece dispatchTo: self ].
	"Optimization: glue a sequence of individual characters into a single string to match."
	piece isAtomic ifTrue: [
		| result next stream |
		stream := (String new: 40) writeStream.
		next := branchNode tryMergingInto: stream.
		result := stream contents.
		result size > 1 ifTrue: [
			"worth merging"
			^(RxmSubstring new substring: result ignoreCase: ignoreCase)
				pointTailTo: (next ifNotNil: [ next dispatchTo: self ]);
				yourself ] ].
	"No optimization possible or worth it, just concatenate all. "
	^(piece dispatchTo: self)
		pointTailTo: (branch dispatchTo: self);
		yourself!

----- Method: RxMatcher>>syntaxCharSet: (in category 'double dispatch') -----
syntaxCharSet: charSetNode
	"Double dispatch from the syntax tree. 
	A character set is a few characters, and we either match any of them,
	or match any that is not one of them."

	^RxmPredicate with: (charSetNode predicateIgnoringCase: ignoreCase)!

----- Method: RxMatcher>>syntaxCharacter: (in category 'double dispatch') -----
syntaxCharacter: charNode
	"Double dispatch from the syntax tree. 
	We get here when merging characters into a string was not possible."

	| wanted |
	wanted := charNode character.
	^RxmPredicate new predicate: 
		(ignoreCase
			ifTrue: [[:char | char sameAs: wanted]]
			ifFalse: [[:char | char = wanted]])!

----- Method: RxMatcher>>syntaxEndOfLine (in category 'double dispatch') -----
syntaxEndOfLine
	"Double dispatch from the syntax tree. 
	Create a matcher for end-of-line condition."

	^RxmSpecial new beEndOfLine!

----- Method: RxMatcher>>syntaxEndOfWord (in category 'double dispatch') -----
syntaxEndOfWord
	"Double dispatch from the syntax tree. 
	Create a matcher for end-of-word condition."

	^RxmSpecial new beEndOfWord!

----- Method: RxMatcher>>syntaxEpsilon (in category 'double dispatch') -----
syntaxEpsilon
	"Double dispatch from the syntax tree. Match empty string. This is unlikely
	to happen in sane expressions, so we'll live without special epsilon-nodes."

	^RxmSubstring new
		substring: String new
		ignoreCase: ignoreCase!

----- Method: RxMatcher>>syntaxLookaround: (in category 'double dispatch') -----
syntaxLookaround: lookaroundNode
	"Double dispatch from the syntax tree. 
	Special link can handle lookarounds (look ahead, positive and negative)."
	| piece |
	piece := lookaroundNode piece dispatchTo: self.
	^ RxmLookahaed with: piece!

----- Method: RxMatcher>>syntaxMessagePredicate: (in category 'double dispatch') -----
syntaxMessagePredicate: messagePredicateNode
	"Double dispatch from the syntax tree. 
	Special link can handle predicates."

	^messagePredicateNode negated
		ifTrue: [RxmPredicate new bePerformNot: messagePredicateNode selector]
		ifFalse: [RxmPredicate new bePerform: messagePredicateNode selector]!

----- Method: RxMatcher>>syntaxNonWordBoundary (in category 'double dispatch') -----
syntaxNonWordBoundary
	"Double dispatch from the syntax tree. 
	Create a matcher for the non-word-boundary condition."

	^RxmSpecial new beNotWordBoundary!

----- Method: RxMatcher>>syntaxPiece: (in category 'double dispatch') -----
syntaxPiece: pieceNode
	"Double dispatch from the syntax tree. 
	Piece is an atom repeated a few times. Take care of a special
	case when the atom is repeated just once."

	| atom |
	atom := pieceNode atom dispatchTo: self.
	pieceNode isSingular ifTrue: [ ^atom ].
	pieceNode isStar ifTrue: [ ^self makeStar: atom ].
	pieceNode isPlus ifTrue: [ ^self makePlus: atom ].
	pieceNode isOptional ifTrue: [ ^self makeOptional: atom ].
	^self makeQuantified: atom min: pieceNode min max: pieceNode max!

----- Method: RxMatcher>>syntaxPredicate: (in category 'double dispatch') -----
syntaxPredicate: predicateNode
	"Double dispatch from the syntax tree. 
	Create a matcher that tests each character with the predicate
	carried by the node."

	^RxmPredicate with: predicateNode predicate!

----- Method: RxMatcher>>syntaxRegex: (in category 'double dispatch') -----
syntaxRegex: regexNode
	"Double dispatch from the syntax tree. 
	Regex node is a chain of branches to be tried. Should compile this 
	into a bundle of parallel branches, between two marker nodes." 

	| startIndex endIndex endNode alternatives |
	startIndex := self allocateMarker.
	endIndex := self allocateMarker.
	endNode := RxmMarker new index: endIndex.
	alternatives := self hookBranchOf: regexNode onto: endNode.
	^(RxmMarker new index: startIndex)
		pointTailTo: alternatives;
		yourself!

----- Method: RxMatcher>>syntaxWordBoundary (in category 'double dispatch') -----
syntaxWordBoundary
	"Double dispatch from the syntax tree. 
	Create a matcher for the word boundary condition."

	^RxmSpecial new beWordBoundary!

----- Method: RxMatcher>>tryMatch (in category 'private') -----
tryMatch
	"Match thyself against the current stream."

	| oldMarkerPositions |
	oldMarkerPositions := markerPositions.
	markerPositions := Array new: markerCount.
	1 to: markerCount do: [ :i |
		| collection |
		collection := OrderedCollection new.
		collection resetTo: collection capacity + 1. "We'll add element to the beginning, so make room there."
		markerPositions at: i put: collection ].
	lastResult := startOptimizer isNil
		ifTrue: [ matcher matchAgainst: self]
		ifFalse: [ (startOptimizer canStartMatch: stream peek in: self) and: [ matcher matchAgainst: self ] ].
	"check for duplicates"
	(lastResult not or: [ oldMarkerPositions isNil or: [ oldMarkerPositions size ~= markerPositions size ] ])
		ifTrue: [ ^ lastResult ].
	oldMarkerPositions with: markerPositions do: [ :oldPos :newPos |
		oldPos size = newPos size 
			ifFalse: [ ^ lastResult ].
		oldPos with: newPos do: [ :old :new |
			old = new
				ifFalse: [ ^ lastResult ] ] ].
	"this is a duplicate"
	^ lastResult := false!

Object subclass: #RxParser
	instanceVariableNames: 'input lookahead'
	classVariableNames: 'BackslashConstants BackslashSpecials'
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxParser commentStamp: 'Tbn 11/12/2010 23:13' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
The regular expression parser. Translates a regular expression read from a stream into a parse tree. ('accessing' protocol). The tree can later be passed to a matcher initialization method.  All other classes in this category implement the tree. Refer to their comments for any details.

Instance variables:
	input		<Stream> A stream with the regular expression being parsed.
	lookahead	<Character>!

----- Method: RxParser class>>a:introduction: (in category 'DOCUMENTATION') -----
a: x introduction: xx 
" 
A regular expression is a template specifying a class of strings. A
regular expression matcher is a tool that determines whether a string
belongs to a class specified by a regular expression.  This is a
common task of a user input validation code, and the use of regular
expressions can GREATLY simplify and speed up development of such
code.  As an example, here is how to verify that a string is a valid
hexadecimal number in Smalltalk notation, using this matcher package:

	aString matchesRegex: '16r[[:xdigit:]]+'

(Coding the same ``the hard way'' is an exercise to a curious reader).

This matcher is offered to the Smalltalk community in hope it will be
useful. It is free in terms of money, and to a large extent--in terms
of rights of use. Refer to `Boring Stuff' section for legalese.

The 'What's new in this release' section describes the functionality
introduced in 1.1 release.

The `Syntax' section explains the recognized syntax of regular
expressions.

The `Usage' section explains matcher capabilities that go beyond what
String>>matchesRegex: method offers.

The `Implementation notes' section says a few words about what is
under the hood.

Happy hacking,

--Vassili Bykov 
<vassili at objectpeople.com> <vassili at magma.ca>

August 6, 1996
April 4, 1999
"

	self error: 'comment only'!

----- Method: RxParser class>>b:whatsNewInThisRelease: (in category 'DOCUMENTATION') -----
b: x whatsNewInThisRelease: xx
"
VERSION 1.3.1 (September 2008)
1. Updated documentation of character classes, making clear the problems of locale - an area for future improvement

VERSION 1.3 (September 2008)
1. \w now matches underscore as well as alphanumerics, in line with most other regex libraries (and our documentation!!).  
2. \W rejects underscore as well as alphanumerics
3. added tests for this at end of testSuite
4. updated documentation and added note to old incorrect comments in version 1.1 below

VERSION 1.2.3 (November 2007)

1. Regexs with ^ or $ applied to copy empty strings caused infinite loops, e.g. ('' copyWithRegex: '^.*$' matchesReplacedWith: 'foo'). Applied a similar correction to that from version 1.1c, to #copyStream:to:(replacingMatchesWith:|translatingMatchesUsing:).
2. Extended RxParser testing to run each test for #copy:translatingMatchesUsing: as well as #search:.
3. Corrected #testSuite test that a dot does not match a null, which was passing by luck with Smalltalk code in a literal array.
4. Added test to end of test suite for fix 1 above.

VERSION 1.2.2 (November 2006)

There was no way to specify a backslash in a character set. Now [\\] is accepted.

VERSION 1.2.1	(August 2006)

1. Support for returning all ranges (startIndex to: stopIndex) matching a regex - #allRangesOfRegexMatches:, #matchingRangesIn:
2. Added hint to usage documentation on how to get more information about matches when enumerating
3. Syntax description of dot corrected: matches anything but NUL since 1.1a

VERSION 1.2	(May 2006)

Fixed case-insensitive search for character sets.

VERSION 1.1c	(December 2004)

Fixed the issue with #matchesOnStream:do: which caused infinite loops for matches 
that matched empty strings.

VERSION 1.1b	(November 2001)

Changes valueNowOrOnUnwindDo: to ensure:, plus incorporates some earlier fixes.

VERSION 1.1a	(May 2001)

1. Support for keeping track of multiple subexpressions.
2. Dot (.) matches anything but NUL character, as it should per POSIX spec.
3. Some bug fixes.

VERSION 1.1	(October 1999)

Regular expression syntax corrections and enhancements:

1. Backslash escapes similar to those in Perl are allowed in patterns:

	\w	any word constituent character (equivalent to [a-zA-Z0-9_]) *** underscore only since 1.3 ***
	\W	any character but a word constituent (equivalent to [^a-zA-Z0-9_]) *** underscore only since 1.3 ***
	\d	a digit (same as [0-9])
	\D	anything but a digit
	\s 	a whitespace character
	\S	anything but a whitespace character
	\b	an empty string at a word boundary
	\B	an empty string not at a word boundary
	\<	an empty string at the beginning of a word
	\>	an empty string at the end of a word

For example, '\w+' is now a valid expression matching any word.

2. The following backslash escapes are also allowed in character sets
(between square brackets):

	\w, \W, \d, \D, \s, and \S.

3. The following grep(1)-compatible named character classes are
recognized in character sets as well:

	[:alnum:]
	[:alpha:]
	[:cntrl:]
	[:digit:]
	[:graph:]
	[:lower:]
	[:print:]
	[:punct:]
	[:space:]
	[:upper:]
	[:xdigit:]

For example, the following patterns are equivalent:

	'[[:alnum:]_]+' '\w+'  '[\w]+' '[a-zA-Z0-9_]+' *** underscore only since 1.3 ***

4. Some non-printable characters can be represented in regular
expressions using a common backslash notation:

	\t	tab (Character tab)
	\n	newline (Character lf)
	\r	carriage return (Character cr)
	\f	form feed (Character newPage)
	\e	escape (Character esc)

5. A dot is correctly interpreted as 'any character but a newline'
instead of 'anything but whitespace'.

6. Case-insensitive matching.  The easiest way to use it is through the
new messages CharacterArray understands: #asRegexIgnoringCase,
#matchesRegexIgnoringCase:, #prefixMatchesRegexIgnoringCase:.

7. The matcher (an instance of RxMatcher, the result of
String>>asRegex) now provides a collection-like interface to matches
in a particular string or on a particular stream, as well as
substitution protocol. The interface includes the following messages:

	matchesIn: aString
	matchesIn: aString collect: aBlock
	matchesIn: aString do: aBlock

	matchesOnStream: aStream
	matchesOnStream: aStream collect: aBlock
	matchesOnStream: aStream do: aBlock

	copy: aString translatingMatchesUsing: aBlock
	copy: aString replacingMatchesWith: replacementString

	copyStream: aStream to: writeStream translatingMatchesUsing: aBlock
	copyStream: aStream to: writeStream replacingMatchesWith: aString

Examples:

	'\w+' asRegex matchesIn: 'now is the time'

returns an OrderedCollection containing four strings: 'now', 'is',
'the', and 'time'.

	'\<t\w+' asRegexIgnoringCase
		copy: 'now is the Time'
		translatingMatchesUsing: [:match | match asUppercase]

returns 'now is THE TIME' (the regular expression matches words
beginning with either an uppercase or a lowercase T).

ACKNOWLEDGEMENTS

Since the first release of the matcher, thanks to the input from
several fellow Smalltalkers, I became convinced a native Smalltalk
regular expression matcher was worth the effort to keep it alive. For
the contributions, suggestions, and bug reports that made this release 
possible, I want to thank:

	Felix Hack
	Peter Hatch
	Alan Knight
	Eliot Miranda
	Thomas Muhr
	Robb Shecter
	David N. Smith
	Francis Wolinski

and anyone whom I haven't yet met or heard from, but who agrees this
has not been a complete waste of time.

--Vassili Bykov
October 3, 1999
"

	self error: 'comment only'!

----- Method: RxParser class>>c:syntax: (in category 'DOCUMENTATION') -----
c: x syntax: xx
" 

[You can select and `print it' examples in this method. Just don't
forget to cancel the changes.]

The simplest regular expression is a single character.  It matches
exactly that character. A sequence of characters matches a string with
exactly the same sequence of characters:

	'a' matchesRegex: 'a'				-- true
	'foobar' matchesRegex: 'foobar'		-- true
	'blorple' matchesRegex: 'foobar'		-- false

The above paragraph introduced a primitive regular expression (a
character), and an operator (sequencing). Operators are applied to
regular expressions to produce more complex regular expressions.
Sequencing (placing expressions one after another) as an operator is,
in a certain sense, `invisible'--yet it is arguably the most common.

A more `visible' operator is Kleene closure, more often simply
referred to as `a star'.  A regular expression followed by an asterisk
matches any number (including 0) of matches of the original
expression. For example:

	'ab' matchesRegex: 'a*b'		 		-- true
	'aaaaab' matchesRegex: 'a*b'	 	-- true
	'b' matchesRegex: 'a*b'		 		-- true
	'aac' matchesRegex: 'a*b'	 		-- false: b does not match

A star's precedence is higher than that of sequencing. A star applies
to the shortest possible subexpression that precedes it. For example,
'ab*' means `a followed by zero or more occurrences of b', not `zero
or more occurrences of ab':

	'abbb' matchesRegex: 'ab*'	 		-- true
	'abab' matchesRegex: 'ab*'		 	-- false

To actually make a regex matching `zero or more occurrences of ab',
`ab' is enclosed in parentheses:

	'abab' matchesRegex: '(ab)*'		 	-- true
	'abcab' matchesRegex: '(ab)*'	 	-- false: c spoils the fun

Two other operators similar to `*' are `+' and `?'. `+' (positive
closure, or simply `plus') matches one or more occurrences of the
original expression. `?' (`optional') matches zero or one, but never
more, occurrences.

	'ac' matchesRegex: 'ab*c'	 		-- true
	'ac' matchesRegex: 'ab+c'	 		-- false: need at least one b
	'abbc' matchesRegex: 'ab+c'		 	-- true
	'abbc' matchesRegex: 'ab?c'		 	-- false: too many b's

As we have seen, characters `*', `+', `?', `(', and `)' have special
meaning in regular expressions. If one of them is to be used
literally, it should be quoted: preceded with a backslash. (Thus,
backslash is also a special character, and needs to be quoted for a
literal match--as well as any other special character described
further).

	'ab*' matchesRegex: 'ab*'		 	-- false: star in the right string is special
	'ab*' matchesRegex: 'ab\*'	 		-- true
	'a\c' matchesRegex: 'a\\c'		 	-- true

The last operator is `|' meaning `or'. It is placed between two
regular expressions, and the resulting expression matches if one of
the expressions matches. It has the lowest possible precedence (lower
than sequencing). For example, `ab*|ba*' means `a followed by any
number of b's, or b followed by any number of a's':

	'abb' matchesRegex: 'ab*|ba*'	 	-- true
	'baa' matchesRegex: 'ab*|ba*'	 	-- true
	'baab' matchesRegex: 'ab*|ba*'	 	-- false

A bit more complex example is the following expression, matching the
name of any of the Lisp-style `car', `cdr', `caar', `cadr',
... functions:

	c(a|d)+r
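
As a sketch of how this expression behaves (the sample strings below
are chosen for illustration and are not taken from the test suite):

	'cadr' matchesRegex: 'c(a|d)+r'		-- true
	'cdddr' matchesRegex: 'c(a|d)+r'		-- true
	'cr' matchesRegex: 'c(a|d)+r'		-- false: need at least one a or d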

It is possible to write an expression matching an empty string, for
example: `a|'.  However, it is an error to apply `*', `+', or `?' to
such expression: `(a|)*' is an invalid expression.

So far, we have used only characters as the 'smallest' components of
regular expressions. There are other, more `interesting', components.

A character set is a string of characters enclosed in square
brackets. It matches any single character that appears between the
brackets. For example, `[01]' matches either `0' or `1':

	'0' matchesRegex: '[01]'		 		-- true
	'3' matchesRegex: '[01]'		 		-- false
	'11' matchesRegex: '[01]'		 		-- false: a set matches only one character

Using plus operator, we can build the following binary number
recognizer:

	'10010100' matchesRegex: '[01]+'	 	-- true
	'10001210' matchesRegex: '[01]+'	 	-- false

If the first character after the opening bracket is `^', the set is
inverted: it matches any single character *not* appearing between the
brackets:

	'0' matchesRegex: '[^01]'		  		-- false
	'3' matchesRegex: '[^01]'		 		-- true

For convenience, a set may include ranges: pairs of characters
separated with `-'. This is equivalent to listing all characters
between them: `[0-9]' is the same as `[0123456789]'.
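
A few illustrative checks (the sample strings are assumptions made for
this sketch; several ranges may be combined in one set):

	'7' matchesRegex: '[0-9]'		 		-- true
	'a' matchesRegex: '[0-9]'		 		-- false
	'c0ffee' matchesRegex: '[0-9a-f]+'	 	-- true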

Special characters within a set are `^', `-', and `]' that closes the
set. Below are the examples of how to literally use them in a set:

	[01^]		-- put the caret anywhere except the beginning
	[01-]		-- put the dash as the last character
	[]01]		-- put the closing bracket as the first character 
	[^]01]			(thus, empty and universal sets cannot be specified)

Regular expressions can also include the following backslash escapes
to refer to popular classes of characters:

	\w	any word constituent character (same as [a-zA-Z0-9_])
	\W	any character but a word constituent
	\d	a digit (same as [0-9])
	\D	anything but a digit
	\s 	a whitespace character (same as [:space:] below)
	\S	anything but a whitespace character

These escapes are also allowed in character classes: '[\w+-]' means
'any character that is either a word constituent, or a plus, or a
minus'.
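
For instance (an illustrative sketch; the sample characters are chosen
here, and the dash is placed last so that it is taken literally):

	'x' matchesRegex: '[\w+-]'		 		-- true
	'+' matchesRegex: '[\w+-]'		 		-- true
	'*' matchesRegex: '[\w+-]'		 		-- false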

Character classes can also include the following grep(1)-compatible
elements to refer to:

	[:alnum:]		any alphanumeric character (same as [a-zA-Z0-9])
	[:alpha:]		any alphabetic character (same as [a-zA-Z])
	[:cntrl:]		any control character. (any character with code < 32)
	[:digit:]		any decimal digit (same as [0-9])
	[:graph:]		any graphical character. (any character with code >= 32).
	[:lower:]		any lowercase character (including non-ASCII lowercase characters)
	[:print:]		any printable character. In this version, this is the same as [:graph:]
	[:punct:]		any punctuation character:  . , !! ? ; : ' - ( ) ` and double quotes
	[:space:]		any whitespace character (space, tab, CR, LF, null, form feed, Ctrl-Z, 16r2000-16r200B, 16r3000)
	[:upper:]		any uppercase character (including non-ASCII uppercase characters)
	[:xdigit:]		any hexadecimal character (same as [a-fA-F0-9]).

Note that many of these are only as consistent or inconsistent on issues
of locale as the underlying Smalltalk implementation. Values shown here
are for VisualWorks 7.6.

Note that these elements are components of the character classes,
i.e. they have to be enclosed in an extra set of square brackets to
form a valid regular expression.  For example, a non-empty string of
digits would be represented as '[[:digit:]]+'.

The above primitive expressions and operators are common to many
implementations of regular expressions. The next primitive expression
is unique to this Smalltalk implementation.

A sequence of characters between colons is treated as a unary selector
which is supposed to be understood by Characters. A character matches
such an expression if it answers true to a message with that
selector. This allows a more readable and efficient way of specifying
character classes. For example, `[0-9]' is equivalent to `:isDigit:',
but the latter is more efficient. Analogously to character sets,
character classes can be negated: `:^isDigit:' matches a Character
that answers false to #isDigit, and is therefore equivalent to
`[^0-9]'.
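
As a small sketch of the negated form (the sample strings are
assumptions made for illustration):

	'abc' matchesRegex: ':^isDigit:+'	 	-- true
	'ab1' matchesRegex: ':^isDigit:+'	 	-- false: 1 answers true to #isDigit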

As an example, so far we have seen the following equivalent ways to
write a regular expression that matches a non-empty string of digits:

	'[0-9]+'
	'\d+'
	'[\d]+'
	'[[:digit:]]+'
	':isDigit:+'

The last group of special primitive expressions includes: 

	.	matching any character except a NULL; 
	^	matching an empty string at the beginning of a line; 
	$	matching an empty string at the end of a line.
	\b	an empty string at a word boundary
	\B	an empty string not at a word boundary
	\<	an empty string at the beginning of a word
	\>	an empty string at the end of a word

	'axyzb' matchesRegex: 'a.+b'		-- true
	'ax zb' matchesRegex: 'a.+b'			-- true (space is matched by `.')
	'ax
zb' matchesRegex: 'a.+b'				-- true (carriage return is matched by `.')

Again, the dot ., caret ^ and dollar $ characters are special and should be quoted
to be matched literally.
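
For example, quoting the dot (an illustrative sketch with sample
strings chosen here):

	'a.b' matchesRegex: 'a\.b'			-- true
	'axb' matchesRegex: 'a\.b'			-- false: a quoted dot matches only a literal dot
	'axb' matchesRegex: 'a.b'			-- true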

	EXAMPLES

As the introduction said, a great use for regular expressions is user
input validation. Following are a few examples of regular expressions
that might be handy in checking input entered by the user in an input
field. Try them out by entering something between the quotes and
print-iting. (Also, try to imagine Smalltalk code that each validation
would require if coded by hand).  Most example expressions could have
been written in alternative ways.

Checking if aString may represent a nonnegative integer number:

	'' matchesRegex: ':isDigit:+'
or
	'' matchesRegex: '[0-9]+'
or
	'' matchesRegex: '\d+'

Checking if aString may represent an integer number with an optional
sign in front:

	'' matchesRegex: '(\+|-)?\d+'

Checking if aString is a fixed-point number, with at least one digit
required after the dot:

	'' matchesRegex: '(\+|-)?\d+(\.\d+)?'

The same, but allow notation like `123.':

	'' matchesRegex: '(\+|-)?\d+(\.\d*)?'

Recognizer for a string that might be a name: one word with first
capital letter, no blanks, no digits.  More traditional:

	'' matchesRegex: '[A-Z][A-Za-z]*'

more Smalltalkish:

	'' matchesRegex: ':isUppercase::isAlphabetic:*'

A date in format MMM DD, YYYY with any number of spaces in between, in
XX century:

	'' matchesRegex: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)'

Note parentheses around some components of the expression above. As
`Usage' section shows, they will allow us to obtain the actual strings
that have matched them (i.e. month name, day number, and year number).

For dessert, coming back to numbers: here is a recognizer for a
general number format: anything like 999, or 999.999, or -999.999e+21.

	'' matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'

"

	self error: 'comment only'!

----- Method: RxParser class>>d:usage: (in category 'DOCUMENTATION') -----
d: x usage: xx
" 
The preceding section covered the syntax of regular expressions. It
used the simplest possible interface to the matcher: sending
#matchesRegex: message to the sample string, with regular expression
string as the argument.  This section explains hairier ways of using
the matcher.

	PREFIX MATCHING AND CASE-INSENSITIVE MATCHING

A CharacterArray (an EsString in VA) also understands these messages:

	#prefixMatchesRegex: regexString
	#matchesRegexIgnoringCase: regexString
	#prefixMatchesRegexIgnoringCase: regexString

#prefixMatchesRegex: is just like #matchesRegex:, except that the whole
receiver is not expected to match the regular expression passed as the
argument; matching just a prefix of it is enough.  For example:

	'abcde' matchesRegex: '(a|b)+'		-- false
	'abcde' prefixMatchesRegex: '(a|b)+'	-- true

The last two messages are case-insensitive versions of matching.
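
A short illustration of the case-insensitive variants, reusing the
pattern from the example above with an uppercase sample string (text
after -- is an annotation, not code):

	'ABCDE' prefixMatchesRegex: '(a|b)+'			-- false
	'ABCDE' prefixMatchesRegexIgnoringCase: '(a|b)+'	-- true
	'ABCDE' matchesRegexIgnoringCase: 'abcde'		-- true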

	ENUMERATION INTERFACE

An application can be interested in all matches of a certain regular
expression within a String.  The matches are accessible using a
protocol modelled after the familiar Collection-like enumeration
protocol:

	#regex: regexString matchesDo: aBlock

Evaluates a one-argument <aBlock> for every match of the regular
expression within the receiver string.

	#regex: regexString matchesCollect: aBlock

Evaluates a one-argument <aBlock> for every match of the regular
expression within the receiver string. Collects results of evaluations
and answers them as a SequenceableCollection.

	#allRegexMatches: regexString

Returns a collection of all matches (substrings of the receiver
string) of the regular expression.  It is an equivalent of <aString
regex: regexString matchesCollect: [:each | each]>.

	#allRangesOfRegexMatches: regexString

Returns a collection of all character ranges (startIndex to: stopIndex)
that match the regular expression.
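
As a sketch of how these four messages relate, the following
expressions can be print-ited one at a time. The sample string and
the count variable are only illustrations:

	'ab cd ab' allRegexMatches: '(a|b)+'		-- a collection holding 'ab' and 'ab'
	'ab cd ab' allRangesOfRegexMatches: '(a|b)+'	-- a collection holding (1 to: 2) and (7 to: 8)
	'ab cd ab' regex: '(a|b)+' matchesCollect: [:each | each size]	-- a collection holding 2 and 2

	| count |
	count := 0.
	'ab cd ab' regex: '(a|b)+' matchesDo: [:each | count := count + 1].
	count						-- 2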

	REPLACEMENT AND TRANSLATION

It is possible to replace all matches of a regular expression with a
certain string using the message:

	#copyWithRegex: regexString matchesReplacedWith: aString

For example:

	'ab cd ab' copyWithRegex: '(a|b)+' matchesReplacedWith: 'foo'		-- 'foo cd foo'

A more general substitution is match translation:

	#copyWithRegex: regexString matchesTranslatedUsing: aBlock

This message evaluates a block passing it each match of the regular
expression in the receiver string and answers a copy of the receiver
with the block results spliced into it in place of the respective
matches.  For example:

	'ab cd ab' copyWithRegex: '(a|b)+' matchesTranslatedUsing: [:each | each asUppercase]		-- 'AB cd AB'

All messages of enumeration and replacement protocols perform a
case-sensitive match.  Case-insensitive versions are not provided as
part of a CharacterArray protocol.  Instead, they are accessible using
the lower-level matching interface.
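
For example, a case-insensitive replacement could be written by
building the matcher explicitly. This is only a sketch;
#asRegexIgnoringCase and #copy:replacingMatchesWith: are described in
the sections that follow:

	'(a|b)+' asRegexIgnoringCase copy: 'AB cd ab' replacingMatchesWith: 'foo'	-- 'foo cd foo'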

	LOWER-LEVEL INTERFACE

Internally, #matchesRegex: works as follows:

1. A fresh instance of RxParser is created, and the regular expression
string is passed to it, yielding the expression's syntax tree.

2. The syntax tree is passed as an initialization parameter to an
instance of RxMatcher. The instance sets up some data structure that
will work as a recognizer for the regular expression described by the
tree.

3. The original string is passed to the matcher, and the matcher
checks for a match.
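
The three steps above can be sketched by hand as follows. This assumes
the class-side #for: constructor that String>>asRegex uses in this
version of the package; the variable names are illustrative:

	| tree matcher |
	tree := RxParser new parse: 'a.+b'.		-- step 1: regex string to syntax tree
	matcher := RxMatcher for: tree.			-- step 2: syntax tree to matcher
	matcher matches: 'axxb'				-- step 3: answers true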

	THE MATCHER

If you repeatedly match a number of strings against the same regular
expression using one of the messages defined in CharacterArray, the
regular expression string is parsed and a matcher is created anew for
every match.  You can avoid this overhead by building a matcher for
the regular expression, and then reusing the matcher over and over
again. You can, for example, create a matcher at a class or instance
initialization stage, and store it in a variable for future use.

You can create a matcher using one of the following methods:

	- Sending #forString:ignoreCase: message to RxMatcher class, with
the regular expression string and a Boolean indicating whether case is
ignored as arguments.

	- Sending #forString: message.  It is equivalent to <... forString:
regexString ignoreCase: false>.

A more convenient way is using one of the two matcher-creating messages
understood by CharacterArray.

	- <regexString asRegex> is equivalent to <RxMatcher forString:
regexString>.

	- <regexString asRegexIgnoringCase> is equivalent to <RxMatcher
forString: regexString ignoreCase: true>.

Here are four examples of creating a matcher:

	hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+'
	hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+' ignoreCase: false
	hexRecognizer := '16r[0-9A-Fa-f]+' asRegex
	hexRecognizer := '16r[0-9A-F]+' asRegexIgnoringCase

	MATCHING

The matcher understands these messages (all of them return true to
indicate successful match or search, and false otherwise):

matches: aString

	True if the whole target string (aString) matches.

matchesPrefix: aString

	True if some prefix of the string (not necessarily the whole
	string) matches.

search: aString

	Search the string for the first occurrence of a matching
	substring. (Note that the first two methods only try matching from
	the very beginning of the string). For example, with a matcher
	for `a+', this method would answer true given the string
	`baaa', while the previous two would answer false.

matchesStream: aStream
matchesStreamPrefix: aStream
searchStream: aStream

	Respective analogs of the first three methods, taking input from a
	stream instead of a string. The stream must be positionable and
	peekable.

All these methods answer a boolean indicating success. The matcher
also stores the outcome of the last match attempt and can report it:

lastResult

	Answers a Boolean -- the outcome of the most recent match
	attempt. If no matches were attempted, the answer is unspecified.
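
A brief sketch of these messages, using a matcher for `a+' (the
variable name is illustrative; evaluate the lines one at a time):

	| matcher |
	matcher := 'a+' asRegex.
	matcher matches: 'baaa'.				-- false
	matcher matchesPrefix: 'baaa'.			-- false
	matcher search: 'baaa'.				-- true
	matcher searchStream: (ReadStream on: 'baaa').	-- true
	matcher lastResult					-- true, the outcome of the last attempt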

	SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which
part of the original string has matched which part of the whole
expression.

A subexpression is a parenthesized part of a regular expression, or
the whole expression. When a regular expression is compiled, its
subexpressions are assigned indices starting from 1, depth-first,
left-to-right. For example, `((ab)+(c|d))?ef' includes the following
subexpressions with these indices:

	1:	((ab)+(c|d))?ef
	2:	(ab)+(c|d)
	3:	ab
	4:	c|d

After a successful match, the matcher can report what part of the
original string matched what subexpression. It understands these
messages:

subexpressionCount

	Answers the total number of subexpressions: the highest value that
	can be used as a subexpression index with this matcher. This value
	is available immediately after initialization and never changes.

subexpression: anIndex

	An index must be a valid subexpression index, and this message
	must be sent only after a successful match attempt. The method
	answers a substring of the original string the corresponding
	subexpression has matched to.

subBeginning: anIndex
subEnd: anIndex

	Answer positions within the original string or stream where the
	match of a subexpression with the given index has started and
	ended, respectively.

This facility provides a convenient way of extracting parts of input
strings of complex format. For example, the following piece of code
uses the 'MMM DD, YYYY' date format recognizer example from the
`Syntax' section to convert a date to a three-element array with year,
month, and day strings (you can select and evaluate it right here):

	| matcher |
	matcher := RxMatcher forString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*(19|20)(:isDigit::isDigit:)'.
	(matcher matches: 'Aug 6, 1996')
		ifTrue: 
			[Array 
				with: (matcher subexpression: 5)
				with: (matcher subexpression: 2)
				with: (matcher subexpression: 3)]
		ifFalse: ['no match']

(should answer ` #('96' 'Aug' '6')').

	ENUMERATION AND REPLACEMENT

The enumeration and replacement protocols exposed in CharacterArray
are actually implemented by the matcher.  The following messages are
understood:

	#matchesIn: aString
	#matchesIn: aString do: aBlock
	#matchesIn: aString collect: aBlock
	#copy: aString replacingMatchesWith: replacementString
	#copy: aString translatingMatchesUsing: aBlock
	#matchingRangesIn: aString

	#matchesOnStream: aStream
	#matchesOnStream: aStream do: aBlock
	#matchesOnStream: aStream collect: aBlock
	#copyStream: sourceStream to: targetStream replacingMatchesWith: replacementString
	#copyStream: sourceStream to: targetStream translatingMatchesUsing: aBlock

Note that in those methods that take a block, the block may refer to the rxMatcher itself, 
e.g. to collect information about the position the match occurred at, or the
subexpressions of the match. An example can be seen in #matchingRangesIn:
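
As an illustration (a sketch; the pattern and the variable name are
made up for this example), the block below asks the matcher for the
second subexpression of each match in order to collect the first
letter of every word:

	| matcher |
	matcher := '(\w)\w*' asRegex.
	matcher
		matchesIn: 'alpha beta gamma'
		collect: [:each | matcher subexpression: 2]	-- a collection holding 'a', 'b' and 'g'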

	ERROR HANDLING

Exception signaling objects (Signals in VisualWorks, Exceptions in VisualAge) are
accessible through RxParser class protocol. To handle possible errors, use
the protocol described below to obtain the exception objects and use the
protocol of the native Smalltalk implementation to handle them.

If a syntax error is detected while parsing an expression,
RxParser>>syntaxErrorSignal is raised/signaled.

If an error is detected while building a matcher,
RxParser>>compilationErrorSignal is raised/signaled.

If an error is detected while matching (for example, if a bad selector
was specified using `:<selector>:' syntax, or because of the matcher's
internal error), RxParser>>matchErrorSignal is raised/signaled.

RxParser>>regexErrorSignal is the parent of all three.  Since any of
the three signals can be raised within a call to #matchesRegex:, it is
handy if you want to catch them all.  For example:

VisualWorks:

	RxParser regexErrorSignal
		handle: [:ex | ex returnWith: nil]
		do: ['abc' matchesRegex: '))garbage[']

VisualAge:

	['abc' matchesRegex: '))garbage[']
		when: RxParser regexErrorSignal
		do: [:signal | signal exitWith: nil]
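
Squeak (a sketch for this port, assuming the RegexSyntaxError class
that RxParser class>>safelyParse: handles; the expression answers nil):

	['abc' matchesRegex: '))garbage[']
		on: RegexSyntaxError
		do: [:ex | ex return: nil]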

"

	self error: 'comment only'!

----- Method: RxParser class>>doHandlingMessageNotUnderstood: (in category 'exception signaling') -----
doHandlingMessageNotUnderstood: aBlock
	"MNU should be trapped and resignaled as a match error in a few places in the matcher.
	This method factors out this dialect-dependent code to make porting easier."
	^ aBlock
		on: MessageNotUnderstood
		do: [:ex | RxParser signalMatchException: 'invalid predicate selector']!

----- Method: RxParser class>>e:implementationNotes: (in category 'DOCUMENTATION') -----
e: x implementationNotes: xx
"	
	Version:		1.1
	Released:		October 1999
	Mail to:		Vassili Bykov <vassili at parcplace.com>, <v_bykov at yahoo.com>
	Flames to:		/dev/null

	WHAT IS ADDED

The matcher includes classes in two categories:
	VB-Regex-Syntax
	VB-Regex-Matcher
and a few CharacterArray methods in `VB-regex' protocol.  No system
classes or methods are modified.

	WHAT TO LOOK AT FIRST

String>>matchesRegex: -- in 90% of cases this method is all you need to
access the package.

RxParser -- accepts a string or a stream of characters with a regular
expression, and produces a syntax tree corresponding to the
expression. The tree is made of instances of Rxs<whatever> classes.

RxMatcher -- accepts a syntax tree of a regular expression built by
the parser and compiles it into a matcher: a structure made of
instances of Rxm<whatever> classes. The RxMatcher instance can test
whether a string or a positionable stream of characters matches the
original regular expression, or search a string or a stream for
substrings matching the expression. After a match is found, the
matcher can report a specific string that matched the whole
expression, or any parenthesized subexpression of it.

All other classes support the above functionality and are used by
RxParser, RxMatcher, or both.

	CAVEATS

The matcher is similar in spirit, but NOT in the design--let alone the
code--to the original Henry Spencer's regular expression
implementation in C.  The focus is on simplicity, not on efficiency.
I didn't optimize or profile anything.  I may in future--or I may not:
I do this in my spare time and I don't promise anything.

The matcher passes H. Spencer's test suite (see 'test suite'
protocol), with quite a few extra tests added, so chances are good
there are not too many bugs.  But watch out anyway.

	EXTENSIONS, FUTURE, ETC.

With the existing separation between the parser, the syntax tree, and
the matcher, it is easy to extend the system with other matchers based
on other algorithms. In fact, I have a DFA-based matcher right now,
but I don't feel it is good enough to include it here.  I might add
automata-based matchers later, but again I don't promise anything.

	HOW TO REACH ME

As of today (December 20, 2000), you can contact me at
<vassili at parcplace.com>. If this doesn't work, look around
comp.lang.smalltalk or comp.lang.lisp.  
"

	self error: 'comment only'!

----- Method: RxParser class>>f:boringStuff: (in category 'DOCUMENTATION') -----
f: x boringStuff: xx
"
The Regular Expression Matcher (``The Software'') 
is Copyright (C) 1996, 1999 Vassili Bykov.  
It is provided to the Smalltalk community in hope it will be useful.

1. This license applies to the package as a whole, as well as to any
   component of it. By performing any of the activities described
   below, you accept the terms of this agreement.

2. The software is provided free of charge, and ``as is'', in hope
   that it will be useful, with ABSOLUTELY NO WARRANTY. The entire
   risk and all responsibility for the use of the software is with
   you.  Under no circumstances the author may be held responsible for
   loss of data, loss of profit, or any other damage resulting
   directly or indirectly from the use of the software, even if the
   damage is caused by defects in the software.

3. You may use this software in any applications you build.

4. You may distribute this software provided that the software
   documentation and copyright notices are included and intact.

5. You may create and distribute modified versions of the software,
   such as ports to other Smalltalk dialects or derived work, provided
   that: 

   a. any modified version is expressly marked as such and is not
   misrepresented as the original software; 

   b. credit is given to the original software in the source code and
   documentation of the derived work; 

   c. the copyright notice at the top of this document accompanies
   copyright notices of any modified version.  "

	self error: 'comment only'!

----- Method: RxParser class>>initialize (in category 'class initialization') -----
initialize
	"self initialize"
	self
		initializeBackslashConstants;
		initializeBackslashSpecials!

----- Method: RxParser class>>initializeBackslashConstants (in category 'class initialization') -----
initializeBackslashConstants
	"self initializeBackslashConstants"

	(BackslashConstants := Dictionary new)
		at: $e put: Character escape;
		at: $n put: Character lf;
		at: $r put: Character cr;
		at: $f put: Character newPage;
		at: $t put: Character tab!

----- Method: RxParser class>>initializeBackslashSpecials (in category 'class initialization') -----
initializeBackslashSpecials
	"Keys are characters that normally follow a \, the values are
	associations of classes and initialization selectors on the instance side
	of the classes."
	"self initializeBackslashSpecials"

	(BackslashSpecials := Dictionary new)
		at: $w put: (Association key: RxsPredicate value: #beWordConstituent);
		at: $W put: (Association key: RxsPredicate value: #beNotWordConstituent);
		at: $s put: (Association key: RxsPredicate value: #beSpace);
		at: $S put: (Association key: RxsPredicate value: #beNotSpace);
		at: $d put: (Association key: RxsPredicate value: #beDigit);
		at: $D put: (Association key: RxsPredicate value: #beNotDigit);
		at: $b put: (Association key: RxsContextCondition value: #beWordBoundary);
		at: $B put: (Association key: RxsContextCondition value: #beNonWordBoundary);
		at: $< put: (Association key: RxsContextCondition value: #beBeginningOfWord);
		at: $> put: (Association key: RxsContextCondition value: #beEndOfWord)!

----- Method: RxParser class>>parse: (in category 'utilities') -----
parse: aString
	"Parse the argument and return the result (the parse tree).
	In case of a syntax error, the corresponding exception is signaled."

	^self new parse: aString!

----- Method: RxParser class>>preferredMatcherClass (in category 'preferences') -----
preferredMatcherClass
	"The matcher to use. For now just one is available, but in
	principle this determines the matchers built implicitly,
	such as by String>>asRegex, or String>>matchesRegex:.
	This might seem a bit of a strange place for this preference, but
	Parser is still more or less the `central' thing in the whole package."

	^RxMatcher!

----- Method: RxParser class>>safelyParse: (in category 'utilities') -----
safelyParse: aString
	"Parse the argument and return the result (the parse tree).
	In case of a syntax error, return nil.
	Exception handling here is dialect-dependent."
	^ [self new parse: aString] on: RegexSyntaxError do: [:ex | nil]!

----- Method: RxParser class>>signalCompilationException: (in category 'exception signaling') -----
signalCompilationException: errorString
	RegexCompilationError new signal: errorString!

----- Method: RxParser class>>signalMatchException: (in category 'exception signaling') -----
signalMatchException: errorString
	RegexMatchingError new signal: errorString!

----- Method: RxParser class>>signalSyntaxException: (in category 'exception signaling') -----
signalSyntaxException: errorString
	RegexSyntaxError new signal: errorString!

----- Method: RxParser class>>signalSyntaxException:at: (in category 'exception signaling') -----
signalSyntaxException: errorString at: errorPosition
	RegexSyntaxError signal: errorString at: errorPosition!

----- Method: RxParser>>atom (in category 'recursive descent') -----
atom
	"An atom is one of a lot of possibilities, see below."

	| atom |
	(lookahead == #epsilon 
	or: [ lookahead == $| 
	or: [ lookahead == $)
	or: [ lookahead == $*
	or: [ lookahead == $+ 
	or: [ lookahead == $? ]]]]])
		ifTrue: [ ^RxsEpsilon new ].

	lookahead == $( 
		ifTrue: [
			"<atom> ::= '(' <regex> ')' "
			self match: $(.
			atom := self regex.
			self match: $).
			^atom ].

	lookahead == $[
		ifTrue: [
			"<atom> ::= '[' <characterSet> ']' "
			self match: $[.
			atom := self characterSet.
			self match: $].
			^atom ].

	lookahead == $: 
		ifTrue: [
			"<atom> ::= ':' <messagePredicate> ':' "
			self match: $:.
			atom := self messagePredicate.
			self match: $:.
			^atom ].

	lookahead == $. 
		ifTrue: [
			"any non-whitespace character"
			self next.
			^RxsContextCondition new beAny].

	lookahead == $^ 
		ifTrue: [
			"beginning of line condition"
			self next.
			^RxsContextCondition new beBeginningOfLine].

	lookahead == $$ 
		ifTrue: [
			"end of line condition"
			self next.
			^RxsContextCondition new beEndOfLine].

	lookahead == $\ 
		ifTrue: [
			"<atom> ::= '\' <character>"
			self next.
			lookahead == #epsilon 
				ifTrue: [ self signalParseError: 'bad quotation' ].
			(BackslashConstants includesKey: lookahead)
				ifTrue: [
					atom := RxsCharacter with: (BackslashConstants at: lookahead).
					self next.
					^atom].
			self ifSpecial: lookahead
				then: [:node | self next. ^node]].

	"If passed through the above, the following is a regular character."
	atom := RxsCharacter with: lookahead.
	self next.
	^atom!

----- Method: RxParser>>branch (in category 'recursive descent') -----
branch
	"<branch> ::= e | <piece> <branch>"

	| piece branch |
	piece := self piece.
	(lookahead == #epsilon 
	or: [ lookahead == $| 
	or: [ lookahead == $) ]])
		ifTrue: [ branch := nil ]
		ifFalse: [ branch := self branch ].
	^RxsBranch new 
		initializePiece: piece 
		branch: branch!

----- Method: RxParser>>characterSet (in category 'recursive descent') -----
characterSet
	"Match a range of characters: something between `[' and `]'.
	Opening bracket has already been consumed; the closing one must
	not be consumed here. Set spec is as usual for
	sets in regexes."

	| spec errorMessage |
	errorMessage := ' no terminating "]"'.
	spec := self inputUpTo: $] nestedOn: $[ errorMessage: errorMessage.
	(spec isEmpty 
	or: [spec = '^']) 
		ifTrue: [
			"This ']' was literal." 
			self next.
			spec := spec, ']', (self inputUpTo: $] nestedOn: $[ errorMessage: errorMessage)].
	^self characterSetFrom: spec!

----- Method: RxParser>>characterSetFrom: (in category 'private') -----
characterSetFrom: setSpec
	"<setSpec> is what goes between the brackets in a charset regex
	(a String). Make a string containing all characters the spec specifies.
	Spec is never empty."

	| negated spec |
	spec := ReadStream on: setSpec.
	spec peek = $^
		ifTrue: 	[negated := true.
				spec next]
		ifFalse:	[negated := false].
	^RxsCharSet new
		initializeElements: (RxCharSetParser on: spec) parse
		negated: negated!

----- Method: RxParser>>ifSpecial:then: (in category 'private') -----
ifSpecial: aCharacter then: aBlock
	"If the character is such that it defines a special node when it follows a $\,
	then create that node and evaluate aBlock with the node as the parameter.
	Otherwise just return."

	| classAndSelector |
	classAndSelector := BackslashSpecials at: aCharacter ifAbsent: [^self].
	^aBlock value: (classAndSelector key new perform: classAndSelector value)!

----- Method: RxParser>>inputUpTo:errorMessage: (in category 'private') -----
inputUpTo: aCharacter errorMessage: aString
	"Accumulate input stream until <aCharacter> is encountered
	and answer the accumulated chars as String, not including
	<aCharacter>. Signal error if end of stream is encountered,
	passing <aString> as the error description."

	| accumulator |
	accumulator := WriteStream on: (String new: 20).
	[ lookahead == aCharacter or: [lookahead == #epsilon] ]
		whileFalse: [
			accumulator nextPut: lookahead.
			self next].
	lookahead == #epsilon
		ifTrue: [ self signalParseError: aString ].
	^accumulator contents!

----- Method: RxParser>>inputUpTo:nestedOn:errorMessage: (in category 'private') -----
inputUpTo: aCharacter nestedOn: anotherCharacter errorMessage: aString 
	"Accumulate input stream until <aCharacter> is encountered
	and answer the accumulated chars as String, not including
	<aCharacter>. Signal error if end of stream is encountered,
	passing <aString> as the error description."

	| accumulator nestLevel |
	accumulator := WriteStream on: (String new: 20).
	nestLevel := 0.
	[lookahead == aCharacter and: [nestLevel = 0]] whileFalse: 
			[#epsilon == lookahead ifTrue: [self signalParseError: aString].
			accumulator nextPut: lookahead.
			lookahead == anotherCharacter ifTrue: [nestLevel := nestLevel + 1].
			lookahead == aCharacter ifTrue: [nestLevel := nestLevel - 1].
			self next].
	^accumulator contents!

----- Method: RxParser>>inputUpToAny:errorMessage: (in category 'private') -----
inputUpToAny: aDelimiterString errorMessage: aString
	"Accumulate input stream until any character from <aDelimiterString> is encountered
	and answer the accumulated chars as String, not including the matched characters from the
	<aDelimiterString>. Signal error if end of stream is encountered,
	passing <aString> as the error description."

	| accumulator |
	accumulator := WriteStream on: (String new: 20).
	[ lookahead == #epsilon or: [ aDelimiterString includes: lookahead ] ]
		whileFalse: [
			accumulator nextPut: lookahead.
			self next ].
	lookahead == #epsilon
		ifTrue: [ self signalParseError: aString ].
	^accumulator contents!

----- Method: RxParser>>lookAround (in category 'recursive descent') -----
lookAround
	"Parse a lookaround expression after: (?<lookaround>) 
	<lookaround> ::= !!<regex> | =<regex>"
	| lookaround |
	(lookahead == $!!
	or: [ lookahead == $=])
		ifFalse: [ ^ self signalParseError: 'Invalid lookaround expression ?', lookahead asString ].
	self next.
	lookaround := RxsLookaround with: self regex.
	lookahead == $!!
		ifTrue: [ lookaround beNegative ].
	^ lookaround
	!

----- Method: RxParser>>match: (in category 'private') -----
match: aCharacter
	"<aCharacter> MUST match the current lookahead.
	If this is the case, advance the input. Otherwise, blow up."

	aCharacter == lookahead ifFalse: [ ^self signalParseError ]. "does not return"
	self next!

----- Method: RxParser>>messagePredicate (in category 'recursive descent') -----
messagePredicate
	"Match a message predicate specification: a selector (presumably
	understood by a Character) enclosed in :'s ."

	| spec negated |
	spec := self inputUpTo: $: errorMessage: ' no terminating ":"'.
	negated := false.
	spec first = $^ 
		ifTrue: [
			negated := true.
			spec := spec copyFrom: 2 to: spec size].
	^RxsMessagePredicate new 
		initializeSelector: spec asSymbol
		negated: negated!

----- Method: RxParser>>next (in category 'private') -----
next
	"Advance the input storing the just read character
	as the lookahead."

	lookahead := input next ifNil: [ #epsilon ]!

----- Method: RxParser>>parse: (in category 'accessing') -----
parse: aString
	"Parse input from a string <aString>.
	On success, answers an RxsRegex -- parse tree root.
	On error, raises `RxParser syntaxErrorSignal' with the current
	input stream position as the parameter."

	^self parseStream: (ReadStream on: aString)!

----- Method: RxParser>>parseStream: (in category 'accessing') -----
parseStream: aStream
	"Parse an input from a character stream <aStream>.
	On success, answers an RxsRegex -- parse tree root.
	On error, raises `RxParser syntaxErrorSignal' with the current
	input stream position as the parameter."

	| tree |
	input := aStream.
	lookahead := nil.
	self match: nil.
	tree := self regex.
	self match: #epsilon.
	^tree!

----- Method: RxParser>>piece (in category 'recursive descent') -----
piece
	"<piece> ::= <atom> | <atom>* | <atom>+ | <atom>? | <atom>{<number>,<number>}"

	| atom |
	atom := self atom.

	lookahead == $*
		ifTrue: [ 
			self next.
			atom isNullable
				ifTrue: [ self signalNullableClosureParserError ].
			^ RxsPiece new initializeStarAtom: atom ].

	lookahead == $+
		ifTrue: [ 
			self next.
			atom isNullable
				ifTrue: [ self signalNullableClosureParserError ].
			^ RxsPiece new initializePlusAtom: atom ].

	lookahead == $?
		ifTrue: [ 
			self next.
			atom isNullable
				ifTrue: [ 
					^ self lookAround ].
			^ RxsPiece new initializeOptionalAtom: atom ].

	lookahead == ${
		ifTrue: [
			^ self quantifiedAtom: atom ].

	^ RxsPiece new initializeAtom: atom!

----- Method: RxParser>>quantifiedAtom: (in category 'recursive descent') -----
quantifiedAtom: atom
	"Parse a quantifier expression which can have one of the following forms
		{<min>,<max>}    match <min> to <max> occurrences
		{<minmax>}       which is the same as with repeated limits: {<number>,<number>}
		{<min>,}         match at least <min> occurrences
		{,<max>}         match at most <max> occurrences, which is the same as {0,<max>}"
	| min max |
	self next.
	lookahead == $,
		ifTrue: [ min := 0 ]
		ifFalse: [
			max := min := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].
	lookahead == $,
		ifTrue: [
			self next.
			max := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].	
	self match: $}.
	atom isNullable
		ifTrue: [ self signalNullableClosureParserError ].
	(max notNil and: [ max < min ])
		ifTrue: [ self signalParseError: ('wrong quantifier, expected ', min asString, ' <= ', max asString) ].
	^ RxsPiece new 
		initializeAtom: atom
		min: min
		max: max!

----- Method: RxParser>>regex (in category 'recursive descent') -----
regex
	"<regex> ::= e | <branch> `|' <regex>"

	| branch regex |
	branch := self branch.

	(lookahead == #epsilon 
	or: [ lookahead == $) ])
		ifTrue: [ regex := nil ]
		ifFalse: [
			self match: $|.
			regex := self regex ].

	^RxsRegex new initializeBranch: branch regex: regex!

----- Method: RxParser>>signalNullableClosureParserError (in category 'private') -----
signalNullableClosureParserError
	self signalParseError: ' nullable closure'.!

----- Method: RxParser>>signalParseError (in category 'private') -----
signalParseError

	self class 
		signalSyntaxException: 'Regex syntax error' at: input position!

----- Method: RxParser>>signalParseError: (in category 'private') -----
signalParseError: aString

	self class signalSyntaxException: aString at: input position!

Object subclass: #RxmLink
	instanceVariableNames: 'next'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmLink commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A matcher is built of a number of links interconnected into some intricate structure. Regardless of fancy stuff, any link (except for the terminator) has the next one. Any link can match against a stream of characters, recursively propagating the match to the next link. Any link supports a number of matcher-building messages. This superclass does all of the above. 

The class is not necessarily abstract. It may double as an empty string matcher: it recursively propagates the match to the next link, thus always matching nothing successfully.

Principal method:
	matchAgainst: aMatcher
		Any subclass will reimplement this to test the state of the matcher, most
		probably reading one or more characters from the matcher's stream, and
		either decide it has matched and answer true, leaving matcher stream
		positioned at the end of match, or answer false and restore the matcher
		stream position to whatever it was before the matching attempt.

Instance variables:
	next		<RxmLink | RxmTerminator> The next link in the structure.!

RxmLink subclass: #RxmBranch
	instanceVariableNames: 'loopback alternative'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmBranch commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
This is a branch of a matching process. Either `next' chain should match, or `alternative', if not nil, should match. Since this is also used to build loopbacks to match repetitions, `loopback' variable indicates whether the instance is a loopback: it affects the matcher-building operations (which of the paths through the branch is to consider as the primary when we have to find the "tail" of a matcher construct).

Instance variables
	alternative		<RxmLink> to match if `next' fails to match.
	loopback		<Boolean>!

----- Method: RxmBranch>>alternative: (in category 'initialize-release') -----
alternative: aBranch
	"See class comment for instance variable description."

	alternative := aBranch!

----- Method: RxmBranch>>beLoopback (in category 'initialize-release') -----
beLoopback
	"See class comment for instance variable description."

	loopback := true!

----- Method: RxmBranch>>initialize (in category 'initialization') -----
initialize
	"See class comment for instance variable description."

	super initialize.
	loopback := false!

----- Method: RxmBranch>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
	"Match either `next' or `alternative'. Fail if the alternative is nil."

	(next matchAgainst: aMatcher) ifTrue: [ ^true ].
	^(alternative ifNil: [ ^false ]) matchAgainst: aMatcher!

----- Method: RxmBranch>>pointTailTo: (in category 'building') -----
pointTailTo: aNode
	"See superclass for explanations."

	loopback
		ifTrue: [
			alternative == nil
				ifTrue: [alternative := aNode]
				ifFalse: [alternative pointTailTo: aNode]]
		ifFalse: [super pointTailTo: aNode]!

----- Method: RxmBranch>>terminateWith: (in category 'building') -----
terminateWith: aNode
	"See superclass for explanations."

	loopback
		ifTrue: [alternative == nil
			ifTrue: [alternative := aNode]
			ifFalse: [alternative terminateWith: aNode]]
		ifFalse: [super terminateWith: aNode]!

----- Method: RxmLink>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
	"If a link does not match the contents of the matcher's stream,
	answer false. Otherwise, let the next matcher in the chain match."

	^next matchAgainst: aMatcher!

----- Method: RxmLink>>next (in category 'accessing') -----
next

	^next!

----- Method: RxmLink>>next: (in category 'accessing') -----
next: aLink
	"Set the next link, either an RxmLink or an RxmTerminator."

	next := aLink!

----- Method: RxmLink>>pointTailTo: (in category 'building') -----
pointTailTo: anRxmLink
	"Propagate this message along the chain of links.
	Point `next' reference of the last link to <anRxmLink>.
	If the chain is already terminated, blow up."

	next == nil
		ifTrue: [next := anRxmLink]
		ifFalse: [next pointTailTo: anRxmLink]!

----- Method: RxmLink>>postCopy (in category 'copying') -----
postCopy
	super postCopy.
	next := next copy!

----- Method: RxmLink>>terminateWith: (in category 'building') -----
terminateWith: aTerminator
	"Propagate this message along the chain of links, and
	make aTerminator the `next' link of the last link in the chain.
	If the chain is already terminated with the same terminator,
	do not blow up."

	next == nil
		ifTrue: [next := aTerminator]
		ifFalse: [next terminateWith: aTerminator]!

RxmLink subclass: #RxmLookahaed
	instanceVariableNames: 'lookahead positive'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmLookahaed commentStamp: '<historical>' prior: 0!
Instance holds onto a lookahead which matches but does not consume anything.

Instance variables:
	lookahead		<RxmLink>
	positive		<Boolean>!

----- Method: RxmLookahaed class>>with: (in category 'instance creation') -----
with: aPiece

	^self new lookahead: aPiece!

----- Method: RxmLookahaed>>initialize (in category 'initialization') -----
initialize
	super initialize.
	positive := true.!

----- Method: RxmLookahaed>>lookahead (in category 'accessing') -----
lookahead
	^ lookahead!

----- Method: RxmLookahaed>>lookahead: (in category 'accessing') -----
lookahead: anRxmLink
	lookahead := anRxmLink!

----- Method: RxmLookahaed>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
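	"Match if the lookahead does NOT match at the current position and the
	rest of the chain matches. The matcher state is restored after trying the
	lookahead, so the lookahead itself consumes no input. Note that `positive'
	is not consulted here."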

	| original result |
	original := aMatcher currentState.
	result := lookahead matchAgainst: aMatcher.
	aMatcher restoreState: original.
	^ result not 
		and: [ next matchAgainst: aMatcher ]!

----- Method: RxmLookahaed>>terminateWith: (in category 'building') -----
terminateWith: aNode
	lookahead terminateWith: aNode.
	super terminateWith: aNode.!

RxmLink subclass: #RxmMarker
	instanceVariableNames: 'index'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmMarker commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A marker is used to remember the positions at which certain points of a regular expression matched. The marker receives an identifying key from the Matcher and uses that key to report positions of successful matches to the Matcher.

Instance variables:
	index	<Object> Something that makes sense for the Matcher. Received from the latter during initialization and later passed to it to identify the receiver.!

----- Method: RxmMarker>>index: (in category 'initialize-release') -----
index: anIndex
	"An index is a key that makes sense for the matcher.
	This key can be passed to marker position getters and
	setters to access position for this marker in the current
	matching session."

	index := anIndex!

----- Method: RxmMarker>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
	"If the rest of the link chain matches successfully, report the
	position of the stream *before* the match started to the matcher."

	| startPosition |
	startPosition := aMatcher position.
	(next matchAgainst: aMatcher) ifFalse: [ ^false ].
	aMatcher markerPositionAt: index add: startPosition.
	^true!

RxmLink subclass: #RxmPredicate
	instanceVariableNames: 'predicate'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmPredicate commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
Instance holds onto a one-argument block and matches exactly one character if the block evaluates to true when passed the character as the argument.

Instance variables:
	predicate		<BlockClosure>!

----- Method: RxmPredicate class>>with: (in category 'instance creation') -----
with: unaryBlock

	^self new predicate: unaryBlock!

----- Method: RxmPredicate>>bePerform: (in category 'initialize-release') -----
bePerform: aSelector
	"Match any single character that answers true to this message."

	self predicate: 
		[:char | 
		RxParser doHandlingMessageNotUnderstood: [char perform: aSelector]]!

----- Method: RxmPredicate>>bePerformNot: (in category 'initialize-release') -----
bePerformNot: aSelector
	"Match any single character that answers false to this message."

	self predicate: 
		[:char | 
		RxParser doHandlingMessageNotUnderstood: [(char perform: aSelector) not]]!

----- Method: RxmPredicate>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
	"Match if the predicate block evaluates to true when given the
	current stream character as the argument."

	| original |
	aMatcher atEnd ifTrue: [ ^false ].
	original := aMatcher currentState.
	(predicate value: aMatcher next) ifFalse: [
		aMatcher restoreState: original.
		^false ].
	(next matchAgainst: aMatcher) ifTrue: [ ^true ].
	aMatcher restoreState: original.
	^false
!

----- Method: RxmPredicate>>predicate: (in category 'initialize-release') -----
predicate: aBlock
	"This link will match any single character for which <aBlock>
	evaluates to true."

	aBlock numArgs ~= 1 ifTrue: [self error: 'bad predicate block'].
	predicate := aBlock.
	^self!

RxmLink subclass: #RxmSpecial
	instanceVariableNames: 'matchSelector'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmSpecial commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A special node that matches a specific matcher state rather than any input character.
The state is one of: at-beginning-of-line, at-end-of-line, at-beginning-of-word, at-end-of-word, at-word-boundary, or not-at-word-boundary.!

----- Method: RxmSpecial>>beBeginningOfLine (in category 'initialize-release') -----
beBeginningOfLine

	matchSelector := #atBeginningOfLine!

----- Method: RxmSpecial>>beBeginningOfWord (in category 'initialize-release') -----
beBeginningOfWord

	matchSelector := #atBeginningOfWord!

----- Method: RxmSpecial>>beEndOfLine (in category 'initialize-release') -----
beEndOfLine

	matchSelector := #atEndOfLine!

----- Method: RxmSpecial>>beEndOfWord (in category 'initialize-release') -----
beEndOfWord

	matchSelector := #atEndOfWord!

----- Method: RxmSpecial>>beNotWordBoundary (in category 'initialize-release') -----
beNotWordBoundary

	matchSelector := #notAtWordBoundary!

----- Method: RxmSpecial>>beWordBoundary (in category 'initialize-release') -----
beWordBoundary

	matchSelector := #atWordBoundary!

----- Method: RxmSpecial>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
	"Match without consuming any input, if the matcher is
	in appropriate state."

	^(aMatcher perform: matchSelector)
		and: [next matchAgainst: aMatcher]!

RxmLink subclass: #RxmSubstring
	instanceVariableNames: 'sample compare'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmSubstring commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
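Instance holds onto a string and matches exactly this string, and exactly once.

Instance variables:
	sample 	<String> the string to match
	compare 	<BlockClosure> a two-argument block that compares a sample character with a stream character!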

----- Method: RxmSubstring>>beCaseInsensitive (in category 'initialize-release') -----
beCaseInsensitive

	compare := [:char1 :char2 | char1 sameAs: char2]!

----- Method: RxmSubstring>>beCaseSensitive (in category 'initialize-release') -----
beCaseSensitive

	compare := [:char1 :char2 | char1 = char2]!

----- Method: RxmSubstring>>character:ignoreCase: (in category 'initialize-release') -----
character: aCharacter ignoreCase: aBoolean
	"Match exactly this character."

	sample := String with: aCharacter.
	aBoolean ifTrue: [self beCaseInsensitive]!

----- Method: RxmSubstring>>initialize (in category 'initialization') -----
initialize
	super initialize.
	self beCaseSensitive!

----- Method: RxmSubstring>>matchAgainst: (in category 'matching') -----
matchAgainst: aMatcher
	"Match if my sample stream is exactly the current prefix
	of the matcher stream's contents."

	| originalState sampleStream nextSample |
	originalState := aMatcher currentState.
	sampleStream := self sampleStream.
	[ (nextSample := sampleStream next) == nil or: [ aMatcher atEnd ] ] whileFalse: [
		(compare value: nextSample value: aMatcher next) ifFalse: [
			aMatcher restoreState: originalState.
			^false ] ].
	(nextSample == nil and: [ next matchAgainst: aMatcher ]) ifTrue: [ ^true ].
	aMatcher restoreState: originalState.
	^false!

----- Method: RxmSubstring>>sampleStream (in category 'private') -----
sampleStream

	^sample readStream!

----- Method: RxmSubstring>>substring:ignoreCase: (in category 'initialize-release') -----
substring: aString ignoreCase: aBoolean
	"Match exactly this string."

	sample := aString.
	aBoolean ifTrue: [self beCaseInsensitive]!

Object subclass: #RxmTerminator
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxmTerminator commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
Instances of this class are used to terminate matcher's chains. When a match reaches this (an instance receives #matchAgainst: message), the match is considered to succeed. Instances also support building protocol of RxmLinks, with some restrictions.!

----- Method: RxmTerminator>>matchAgainst: (in category 'matching') -----
matchAgainst: aStream
	"If got here, the match is successful."

	^true!

----- Method: RxmTerminator>>pointTailTo: (in category 'building') -----
pointTailTo: anRxmLink
	"Branch tails are never redirected by the build algorithm.
	Healthy terminators should never receive this."

	RxParser signalCompilationException:
		'internal matcher build error - redirecting terminator tail'!

----- Method: RxmTerminator>>terminateWith: (in category 'building') -----
terminateWith: aTerminator
	"Branch terminators are never supposed to change.
	Make sure this is the case."

	aTerminator ~~ self
		ifTrue: [RxParser signalCompilationException:
				'internal matcher build error - wrong terminator']!

Object subclass: #RxsNode
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsNode commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A generic syntax tree node, provides some common responses to the standard tests, as well as tree structure printing -- handy for debugging.!

RxsNode subclass: #RxsBranch
	instanceVariableNames: 'piece branch'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsBranch commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A Branch is a Piece followed by a Branch or an empty string.

Instance variables:
	piece		<RxsPiece>
	branch		<RxsBranch|RxsEpsilon>!

----- Method: RxsBranch>>branch (in category 'accessing') -----
branch

	^branch!

----- Method: RxsBranch>>dispatchTo: (in category 'accessing') -----
dispatchTo: aMatcher
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."

	^aMatcher syntaxBranch: self!

----- Method: RxsBranch>>initializePiece:branch: (in category 'initialize-release') -----
initializePiece: aPiece branch: aBranch
	"See class comment for instance variables description."

	piece := aPiece.
	branch := aBranch!

----- Method: RxsBranch>>isNullable (in category 'testing') -----
isNullable

	^piece isNullable and: [branch isNil or: [branch isNullable]]!

----- Method: RxsBranch>>piece (in category 'accessing') -----
piece

	^piece!

----- Method: RxsBranch>>tryMergingInto: (in category 'optimization') -----
tryMergingInto: aStream
	"Concatenation of a few simple characters can be optimized
	to be a plain substring match. Answer the node to resume
	syntax tree traversal at. Epsilon node used to terminate the branch
	will implement this to answer nil, thus indicating that the branch
	has ended."

	piece isAtomic ifFalse: [^self].
	aStream nextPut: piece character.
	^branch isNil
		ifTrue: [branch]
		ifFalse: [branch tryMergingInto: aStream]!

RxsNode subclass: #RxsCharSet
	instanceVariableNames: 'negated elements'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsCharSet commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A character set corresponds to a [...] construct in the regular expression.

Instance variables:
	elements	<OrderedCollection> An element can be one of: RxsCharacter, RxsRange, or RxsPredicate.
	negated		<Boolean>!

----- Method: RxsCharSet>>dispatchTo: (in category 'accessing') -----
dispatchTo: aMatcher
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."

	^aMatcher syntaxCharSet: self!

----- Method: RxsCharSet>>enumerablePartPredicateIgnoringCase: (in category 'privileged') -----
enumerablePartPredicateIgnoringCase: aBoolean

	| enumeration |
	enumeration := self enumerableSetIgnoringCase: aBoolean.
	enumeration ifNil: [ ^nil ].
	negated ifTrue: [ ^[ :char | (enumeration includes: char) not ] ].
	^[ :char | enumeration includes: char ]!

----- Method: RxsCharSet>>enumerableSetIgnoringCase: (in category 'privileged') -----
enumerableSetIgnoringCase: aBoolean
	"Answer a collection of characters that make up the portion of me that can be enumerated, or nil if there are no such characters."

	| size set |
	size := elements detectSum: [ :each |
		each enumerateSizeIgnoringCase: aBoolean ].
	size = 0 ifTrue: [ ^nil ].
	set := Set new: size.
	elements do: [ :each |
		each enumerateTo: set ignoringCase: aBoolean ].
	^set!

----- Method: RxsCharSet>>hasPredicates (in category 'accessing') -----
hasPredicates

	^(elements allSatisfy: [ :some | some isEnumerable ]) not!

----- Method: RxsCharSet>>initializeElements:negated: (in category 'initialize-release') -----
initializeElements: aCollection negated: aBoolean
	"See class comment for instance variables description."

	elements := aCollection.
	negated := aBoolean!

----- Method: RxsCharSet>>isEnumerable (in category 'testing') -----
isEnumerable

	^elements anySatisfy: [:some | some isEnumerable ]!

----- Method: RxsCharSet>>isNegated (in category 'testing') -----
isNegated

	^negated!

----- Method: RxsCharSet>>predicateIgnoringCase: (in category 'accessing') -----
predicateIgnoringCase: aBoolean
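
	"Answer a one-argument block that tests whether a character belongs to this set.
	A hedged workspace sketch, using only selectors that appear in this snapshot
	(the variable name set is illustrative):
		| set |
		set := RxsCharSet new
			initializeElements: (OrderedCollection with: (RxsCharacter with: $a))
			negated: false.
		(set predicateIgnoringCase: true) value: $A.
	The last expression should answer true, because both cases of $a are enumerated."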

	| enumerable predicate |
	enumerable := self enumerablePartPredicateIgnoringCase: aBoolean.
	predicate := self predicatePartPredicate ifNil: [ 
		"There are no predicates in this set."
		^enumerable ifNil: [ 
			"This set is empty."
			[ :char | negated ] ] ].
	enumerable ifNil: [ ^predicate ].
	negated ifTrue: [
		"enumerable and predicate already negate the result, that's why #not is not needed here."
		^[ :char | (enumerable value: char) and: [ predicate value: char ] ] ].
	^[ :char | (enumerable value: char) or: [ predicate value: char ] ]!

----- Method: RxsCharSet>>predicatePartPredicate (in category 'privileged') -----
predicatePartPredicate
	"Answer a predicate that tests all of my elements that cannot be enumerated, or nil if such elements don't exist."

	| predicates size |
	predicates := elements reject: [ :some | some isEnumerable ].
	(size := predicates size) = 0 ifTrue: [ 
		"We could return a real predicate block - like [ :char | negated ] - here, but it wouldn't be used anyway. This way we signal that this character set has no predicates."
		^nil ].
	size = 1 ifTrue: [
		negated ifTrue: [ ^predicates first predicateNegation ].
		^predicates first predicate ].
	predicates replace: [ :each | each predicate ].
	negated ifTrue: [ ^[ :char | predicates noneSatisfy: [ :some | some value: char ] ] ].
	^[ :char | predicates anySatisfy: [ :some | some value: char ] ]
	!

----- Method: RxsCharSet>>predicates (in category 'accessing') -----
predicates

	| predicates |
	predicates := elements reject: [ :some | some isEnumerable ].
	predicates isEmpty ifTrue: [ ^nil ].
	^predicates replace: [ :each | each predicate ]!

RxsNode subclass: #RxsCharacter
	instanceVariableNames: 'character'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsCharacter commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A character is a literal character that appears either in the expression itself or in a character set within an expression.

Instance variables:
	character		<Character>!

----- Method: RxsCharacter class>>with: (in category 'instance creation') -----
with: aCharacter

	^self new initializeCharacter: aCharacter!

----- Method: RxsCharacter>>character (in category 'accessing') -----
character

	^character!

----- Method: RxsCharacter>>dispatchTo: (in category 'accessing') -----
dispatchTo: aMatcher
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."

	^aMatcher syntaxCharacter: self!

----- Method: RxsCharacter>>enumerateSizeIgnoringCase: (in category 'accessing') -----
enumerateSizeIgnoringCase: aBoolean

	aBoolean ifFalse: [ ^1 ].
	character isLetter ifTrue: [ ^2 ].
	^1!

----- Method: RxsCharacter>>enumerateTo:ignoringCase: (in category 'accessing') -----
enumerateTo: aSet ignoringCase: aBoolean

	aBoolean ifFalse: [ ^aSet add: character ].
	aSet 
		add: character asUppercase;
		add: character asLowercase!

----- Method: RxsCharacter>>initializeCharacter: (in category 'initialize-release') -----
initializeCharacter: aCharacter
	"See class comment for instance variable description."

	character := aCharacter!

----- Method: RxsCharacter>>isAtomic (in category 'testing') -----
isAtomic
	"A character is always atomic."

	^true!

----- Method: RxsCharacter>>isEnumerable (in category 'testing') -----
isEnumerable

	^true!

RxsNode subclass: #RxsContextCondition
	instanceVariableNames: 'kind'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsContextCondition commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
One of a few special nodes more often representing special state of the match rather than a predicate on a character.  The ugly exception is the #any condition which *is* a predicate on a character.

Instance variables:
	kind		<Selector>!

----- Method: RxsContextCondition>>beAny (in category 'initialize-release') -----
beAny
	"Matches anything but a newline."

	kind := #syntaxAny!

----- Method: RxsContextCondition>>beBeginningOfLine (in category 'initialize-release') -----
beBeginningOfLine
	"Matches empty string at the beginning of a line."

	kind := #syntaxBeginningOfLine!

----- Method: RxsContextCondition>>beBeginningOfWord (in category 'initialize-release') -----
beBeginningOfWord
	"Matches empty string at the beginning of a word."

	kind := #syntaxBeginningOfWord!

----- Method: RxsContextCondition>>beEndOfLine (in category 'initialize-release') -----
beEndOfLine
	"Matches empty string at the end of a line."

	kind := #syntaxEndOfLine!

----- Method: RxsContextCondition>>beEndOfWord (in category 'initialize-release') -----
beEndOfWord
	"Matches empty string at the end of a word."

	kind := #syntaxEndOfWord!

----- Method: RxsContextCondition>>beNonWordBoundary (in category 'initialize-release') -----
beNonWordBoundary
	"Analog of \B."

	kind := #syntaxNonWordBoundary!

----- Method: RxsContextCondition>>beWordBoundary (in category 'initialize-release') -----
beWordBoundary
	"Analog of \b (matches the empty string at a word boundary)."

	kind := #syntaxWordBoundary!

----- Method: RxsContextCondition>>dispatchTo: (in category 'accessing') -----
dispatchTo: aBuilder

	^aBuilder perform: kind!

----- Method: RxsContextCondition>>isNullable (in category 'testing') -----
isNullable

	^#syntaxAny ~~ kind!

RxsNode subclass: #RxsEpsilon
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsEpsilon commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
This is an empty string.  It terminates some of the recursive constructs.!

----- Method: RxsEpsilon>>dispatchTo: (in category 'building') -----
dispatchTo: aBuilder
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."

	^aBuilder syntaxEpsilon!

----- Method: RxsEpsilon>>isNullable (in category 'testing') -----
isNullable
	"See comment in the superclass."

	^true!

RxsNode subclass: #RxsLookaround
	instanceVariableNames: 'piece positive'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsLookaround commentStamp: '<historical>' prior: 0!
A lookaround is used for lookaheads and lookbehinds. They are used to check if the input matches a certain subexpression without consuming any characters (i.e. without advancing the match position).

Lookarounds can be positive or negative. A positive lookaround fails if the subexpression fails to match; a negative lookaround fails if the subexpression matches.!

----- Method: RxsLookaround class>>with: (in category 'instance creation') -----
with: anRsxPiece
	^ self new
		initializePiece: anRsxPiece!

----- Method: RxsLookaround>>beNegative (in category 'initailize-release') -----
beNegative
	positive := false!

----- Method: RxsLookaround>>bePositive (in category 'initailize-release') -----
bePositive
	positive := true!

----- Method: RxsLookaround>>dispatchTo: (in category 'accessing') -----
dispatchTo: aBuilder
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."
	^aBuilder syntaxLookaround: self!

----- Method: RxsLookaround>>initializePiece: (in category 'initailize-release') -----
initializePiece: anRsxPiece
	super initialize.
	piece := anRsxPiece.!

----- Method: RxsLookaround>>piece (in category 'accessing') -----
piece
	^ piece!

RxsNode subclass: #RxsMessagePredicate
	instanceVariableNames: 'selector negated'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsMessagePredicate commentStamp: 'Tbn 11/12/2010 23:14' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A message predicate represents a condition on a character that is tested (at the match time) by sending a unary message to the character expecting a Boolean answer.

Instance variables:
	selector		<Symbol>
	negated		<Boolean>!

----- Method: RxsMessagePredicate>>dispatchTo: (in category 'accessing') -----
dispatchTo: aBuilder
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."

	^aBuilder syntaxMessagePredicate: self!

----- Method: RxsMessagePredicate>>initializeSelector: (in category 'initialize-release') -----
initializeSelector: aSelector
	"The selector must be a unary message understood by Character."

	selector := aSelector!

----- Method: RxsMessagePredicate>>initializeSelector:negated: (in category 'initialize-release') -----
initializeSelector: aSelector negated: aBoolean
	"The selector must be a unary message understood by Character."

	selector := aSelector.
	negated := aBoolean!

----- Method: RxsMessagePredicate>>negated (in category 'accessing') -----
negated

	^negated!

----- Method: RxsMessagePredicate>>selector (in category 'accessing') -----
selector

	^selector!

----- Method: RxsNode>>indentCharacter (in category 'constants') -----
indentCharacter
	"Normally, #printOn:withIndent: method in subclasses
	print several characters returned by this method to indicate
	the tree structure."

	^$+!

----- Method: RxsNode>>isAtomic (in category 'testing') -----
isAtomic
	"Answer whether the node is atomic, i.e. matches exactly one 
	constant predefined normal character.  A matcher may decide to 
	optimize matching of a sequence of atomic nodes by glueing them 
	together in a string."

	^false "tentatively"!

----- Method: RxsNode>>isNullable (in category 'testing') -----
isNullable
	"True if the node can match an empty sequence of characters."

	^false "for most nodes"!

RxsNode subclass: #RxsPiece
	instanceVariableNames: 'atom min max'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsPiece commentStamp: '<historical>' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
A piece is an atom, possibly optional or repeated a number of times.

Instance variables:
	atom	<RxsCharacter|RxsCharSet|RxsPredicate|RxsRegex|RxsSpecial>
	min		<Integer>
	max		<Integer|nil> nil means infinity!

----- Method: RxsPiece>>atom (in category 'accessing') -----
atom

	^atom!

----- Method: RxsPiece>>character (in category 'accessing') -----
character
	"If this node is atomic, answer the character it
	represents. It is the caller's responsibility to make sure this
	node is indeed atomic before using this."

	^atom character!

----- Method: RxsPiece>>dispatchTo: (in category 'accessing') -----
dispatchTo: aMatcher
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."

	^aMatcher syntaxPiece: self!

----- Method: RxsPiece>>initializeAtom: (in category 'initialize-release') -----
initializeAtom: anAtom
	"This piece is exactly one occurrence of the specified RxsAtom."

	self initializeAtom: anAtom min: 1 max: 1!

----- Method: RxsPiece>>initializeAtom:min:max: (in category 'initialize-release') -----
initializeAtom: anAtom min: minOccurrences max: maxOccurrences
	"This piece is from <minOccurrences> to <maxOccurrences> 
	occurrences of the specified RxsAtom."
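	"A hedged workspace sketch of how the quantifier tests relate to these arguments,
	using only selectors shown in this snapshot (the variable name piece is illustrative):
		| piece |
		piece := RxsPiece new initializeAtom: (RxsCharacter with: $a) min: 0 max: nil.
		piece isStar.
		piece isNullable.
		piece isAtomic.
	The first two should answer true (zero or more occurrences); the last should
	answer false, since the piece is not singular."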

	atom := anAtom.
	min := minOccurrences.
	max := maxOccurrences!

----- Method: RxsPiece>>initializeOptionalAtom: (in category 'initialize-release') -----
initializeOptionalAtom: anAtom
	"This piece is 0 or 1 occurrences of the specified RxsAtom."

	self initializeAtom: anAtom min: 0 max: 1!

----- Method: RxsPiece>>initializePlusAtom: (in category 'initialize-release') -----
initializePlusAtom: anAtom
	"This piece is one or more occurrences of the specified RxsAtom."

	self initializeAtom: anAtom min: 1 max: nil!

----- Method: RxsPiece>>initializeStarAtom: (in category 'initialize-release') -----
initializeStarAtom: anAtom
	"This piece is any number of occurrences of the atom."

	self initializeAtom: anAtom min: 0 max: nil!

----- Method: RxsPiece>>isAtomic (in category 'testing') -----
isAtomic
	"A piece is atomic only if it contains exactly one atom
	which is atomic (sic)."

	^self isSingular and: [atom isAtomic]!

----- Method: RxsPiece>>isNullable (in category 'testing') -----
isNullable
	"A piece is nullable if it allows 0 matches or if its atom is nullable.
	This is often handy to know for optimization."

	^min = 0 or: [atom isNullable]!

----- Method: RxsPiece>>isOptional (in category 'testing') -----
isOptional

	^min = 0 and: [max = 1]!

----- Method: RxsPiece>>isPlus (in category 'testing') -----
isPlus

	^min = 1 and: [max == nil]!

----- Method: RxsPiece>>isSingular (in category 'testing') -----
isSingular
	"A piece with a range of 1 to 1 can be compiled
	as a simple match."

	^min = 1 and: [max = 1]!

----- Method: RxsPiece>>isStar (in category 'testing') -----
isStar

	^min = 0 and: [max == nil]!

----- Method: RxsPiece>>max (in category 'accessing') -----
max
	"The value answered may be nil, indicating infinity."

	^max!

----- Method: RxsPiece>>min (in category 'accessing') -----
min

	^min!

RxsNode subclass: #RxsPredicate
	instanceVariableNames: 'predicate negation'
	classVariableNames: 'EscapedLetterSelectors NamedClassSelectors'
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsPredicate commentStamp: 'Tbn 11/12/2010 23:15' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
This represents a character that satisfies a certain predicate.

Instance Variables:

	predicate	<BlockClosure>	A one-argument block. If it evaluates to true when it is passed a character, the predicate is considered to match.
	negation	<BlockClosure>	A one-argument block that is a negation of <predicate>.!

----- Method: RxsPredicate class>>forEscapedLetter: (in category 'instance creation') -----
forEscapedLetter: aCharacter

	^self new perform:
		(EscapedLetterSelectors
			at: aCharacter
			ifAbsent: [RxParser signalSyntaxException: 'bad backslash escape'])!

----- Method: RxsPredicate class>>forNamedClass: (in category 'instance creation') -----
forNamedClass: aString

	^self new perform:
		(NamedClassSelectors
			at: aString
			ifAbsent: [RxParser signalSyntaxException: 'bad character class name'])!

----- Method: RxsPredicate class>>initialize (in category 'class initialization') -----
initialize
	"self initialize"

	self
		initializeNamedClassSelectors;
		initializeEscapedLetterSelectors!

----- Method: RxsPredicate class>>initializeEscapedLetterSelectors (in category 'class initialization') -----
initializeEscapedLetterSelectors
	"self initializeEscapedLetterSelectors"
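	"A hedged usage sketch, assuming RxMatcher>>matches: (not shown in this section)
	and that the class has been initialized:
		'\d+' asRegex matches: '2015'.
		'\W' asRegex matches: '-'.
	Both should answer true: $d maps to #beDigit and $W to #beNotWordConstituent."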

	| newEscapedLetterSelectors |
	newEscapedLetterSelectors := Dictionary new
		at: $w put: #beWordConstituent;
		at: $W put: #beNotWordConstituent;
		at: $d put: #beDigit;
		at: $D put: #beNotDigit;
		at: $s put: #beSpace;
		at: $S put: #beNotSpace;
		at: $\ put: #beBackslash;
		at: $r put: #beCarriageReturn;
		yourself.
	EscapedLetterSelectors := newEscapedLetterSelectors!

----- Method: RxsPredicate class>>initializeNamedClassSelectors (in category 'class initialization') -----
initializeNamedClassSelectors
	"self initializeNamedClassSelectors"
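	"A hedged usage sketch, assuming the [:name:] character class syntax is
	recognized inside brackets by the parser and RxMatcher>>matches: (neither is
	shown in this section):
		'[[:xdigit:]]+' asRegex matches: 'DEADbeef'.
		'[[:upper:]]+' asRegex matches: 'abc'.
	The first should answer true and the second false."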

	(NamedClassSelectors := Dictionary new)
		at: 'alnum' put: #beAlphaNumeric;
		at: 'alpha' put: #beAlphabetic;
		at: 'cntrl' put: #beControl;
		at: 'digit' put: #beDigit;
		at: 'graph' put: #beGraphics;
		at: 'lower' put: #beLowercase;
		at: 'print' put: #bePrintable;
		at: 'punct' put: #bePunctuation;
		at: 'space' put: #beSpace;
		at: 'upper' put: #beUppercase;
		at: 'xdigit' put: #beHexDigit!

----- Method: RxsPredicate>>beAlphaNumeric (in category 'initialize-release') -----
beAlphaNumeric

	predicate := [:char | char isAlphaNumeric].
	negation := [:char | char isAlphaNumeric not]!

----- Method: RxsPredicate>>beAlphabetic (in category 'initialize-release') -----
beAlphabetic

	predicate := [:char | char isLetter].
	negation := [:char | char isLetter not]!

----- Method: RxsPredicate>>beBackslash (in category 'initialize-release') -----
beBackslash

	predicate := [:char | char == $\].
	negation := [:char | char ~~ $\]!

----- Method: RxsPredicate>>beCarriageReturn (in category 'initialize-release') -----
beCarriageReturn

	| cr |
	cr := Character cr.
	predicate := [ :char | char == cr ].
	negation := [ :char | char ~~ cr  ]!

----- Method: RxsPredicate>>beControl (in category 'initialize-release') -----
beControl

	predicate := [:char | char asInteger < 32].
	negation := [:char | char asInteger >= 32]!

----- Method: RxsPredicate>>beDigit (in category 'initialize-release') -----
beDigit

	predicate := [:char | char isDigit].
	negation := [:char | char isDigit not]!

----- Method: RxsPredicate>>beGraphics (in category 'initialize-release') -----
beGraphics

	self
		beControl;
		negate!

----- Method: RxsPredicate>>beHexDigit (in category 'initialize-release') -----
beHexDigit

	| hexLetters |
	hexLetters := 'abcdefABCDEF'.
	predicate := [:char | char isDigit or: [hexLetters includes: char]].
	negation := [:char | char isDigit not and: [(hexLetters includes: char) not]]!

----- Method: RxsPredicate>>beLowercase (in category 'initialize-release') -----
beLowercase

	predicate := [:char | char isLowercase].
	negation := [:char | char isLowercase not]!

----- Method: RxsPredicate>>beNotDigit (in category 'initialize-release') -----
beNotDigit

	self
		beDigit;
		negate!

----- Method: RxsPredicate>>beNotSpace (in category 'initialize-release') -----
beNotSpace

	self
		beSpace;
		negate!

----- Method: RxsPredicate>>beNotWordConstituent (in category 'initialize-release') -----
beNotWordConstituent

	self
		beWordConstituent;
		negate!

----- Method: RxsPredicate>>bePrintable (in category 'initialize-release') -----
bePrintable

	self
		beControl;
		negate!

----- Method: RxsPredicate>>bePunctuation (in category 'initialize-release') -----
bePunctuation

	| punctuationChars |
	punctuationChars := #($. $, $!! $? $; $: $" $' $- $( $) $`).
	predicate := [:char | punctuationChars includes: char].
	negation := [:char | (punctuationChars includes: char) not]!

----- Method: RxsPredicate>>beSpace (in category 'initialize-release') -----
beSpace

	predicate := [:char | char isSeparator].
	negation := [:char | char isSeparator not]!

----- Method: RxsPredicate>>beUppercase (in category 'initialize-release') -----
beUppercase

	predicate := [:char | char isUppercase].
	negation := [:char | char isUppercase not]!

----- Method: RxsPredicate>>beWordConstituent (in category 'initialize-release') -----
beWordConstituent

	predicate := [:char | char isAlphaNumeric or: [char == $_]].
	negation := [:char | char isAlphaNumeric not and: [char ~~ $_]]!

----- Method: RxsPredicate>>dispatchTo: (in category 'accessing') -----
dispatchTo: anObject

	^anObject syntaxPredicate: self!

----- Method: RxsPredicate>>enumerateSizeIgnoringCase: (in category 'accessing') -----
enumerateSizeIgnoringCase: aBoolean

	^0 "Not enumerable"!

----- Method: RxsPredicate>>enumerateTo:ignoringCase: (in category 'accessing') -----
enumerateTo: aSet ignoringCase: aBoolean

	^self "Not enumerable"!

----- Method: RxsPredicate>>isEnumerable (in category 'testing') -----
isEnumerable

	^false!

----- Method: RxsPredicate>>negate (in category 'private') -----
negate

	| tmp |
	tmp := predicate.
	predicate := negation.
	negation := tmp!

----- Method: RxsPredicate>>negated (in category 'accessing') -----
negated

	^self copy negate!

----- Method: RxsPredicate>>predicate (in category 'accessing') -----
predicate

	^predicate!

----- Method: RxsPredicate>>predicateNegation (in category 'accessing') -----
predicateNegation

	^negation!

----- Method: RxsPredicate>>value: (in category 'accessing') -----
value: aCharacter

	^predicate value: aCharacter!

RxsNode subclass: #RxsRange
	instanceVariableNames: 'first last'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsRange commentStamp: 'Tbn 11/12/2010 23:15' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
I represent a range of characters as it appears in character classes such as

	[a-zA-Z0-9].

I appear in a syntax tree only as an element of RxsCharSet.

Instance Variables:

	first	<Character>
	last	<Character>!

----- Method: RxsRange class>>from:to: (in category 'instance creation') -----
from: aCharacter to: anotherCharacter

	^self new initializeFirst: aCharacter last: anotherCharacter!

----- Method: RxsRange>>enumerateSizeIgnoringCase: (in category 'accessing') -----
enumerateSizeIgnoringCase: aBoolean
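	"Answer how many characters #enumerateTo:ignoringCase: would add. When ignoring case this is only an estimate: the count is doubled if either endpoint is a letter."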

	| characterCount |
	characterCount := last asInteger - first asInteger + 1 max: 0.
	aBoolean ifFalse: [ ^characterCount ].
	(last isLetter or: [ first isLetter ]) ifTrue: [ ^characterCount * 2 "Assume many letters" ].
	^characterCount "Assume no letters"!

----- Method: RxsRange>>enumerateTo:ignoringCase: (in category 'accessing') -----
enumerateTo: aSet ignoringCase: aBoolean
	"Add all of the elements I represent to the collection."

	aBoolean ifFalse: [
		first asInteger to: last asInteger do: [ :charCode |
			aSet add: charCode asCharacter ].
		^self ].
	first asInteger to: last asInteger do: [ :charCode |
		| character |
		character := charCode asCharacter.
		aSet
			add: character asLowercase;
			add: character asUppercase ]!

----- Method: RxsRange>>initializeFirst:last: (in category 'initialize-release') -----
initializeFirst: aCharacter last: anotherCharacter

	first := aCharacter.
	last := anotherCharacter!

----- Method: RxsRange>>isEnumerable (in category 'testing') -----
isEnumerable

	^true!

RxsNode subclass: #RxsRegex
	instanceVariableNames: 'branch regex'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Regex-Core'!

!RxsRegex commentStamp: 'Tbn 11/12/2010 23:15' prior: 0!
-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
The body of a parenthesized expression or a top-level expression; also an atom.

Instance variables:
	branch		<RxsBranch>
	regex		<RxsRegex | RxsEpsilon>!

----- Method: RxsRegex>>branch (in category 'accessing') -----
branch

	^branch!

----- Method: RxsRegex>>dispatchTo: (in category 'accessing') -----
dispatchTo: aMatcher
	"Inform the matcher of the kind of the node, and it
	will do whatever it has to."

	^aMatcher syntaxRegex: self!

----- Method: RxsRegex>>initializeBranch:regex: (in category 'initialize-release') -----
initializeBranch: aBranch regex: aRegex
	"See class comment for instance variable description."

	branch := aBranch.
	regex := aRegex!

----- Method: RxsRegex>>isNullable (in category 'testing') -----
isNullable

	^branch isNullable or: [regex notNil and: [regex isNullable]]!

----- Method: RxsRegex>>regex (in category 'accessing') -----
regex
	^regex!

----- Method: UIManager>>request:regex: (in category '*Regex-Core') -----
request: aTitleString regex: initialRegexString
	"Prompt the user for a valid regex.
	Answer nil if the user cancels, otherwise a valid RxMatcher."
	| regex |
	regex := initialRegexString.
	"loop until we get a valid regex string back"
	[
		regex := UIManager default 
			multiLineRequest: aTitleString 
			initialAnswer: regex 
			answerHeight: 200.
		"cancelled dialog ==> nil"	
		regex ifNil: [ ^ nil ].

		[ ^ regex asRegex ] on: Error do: [ :regexParsingError |
			self defer: [ self inform: 'Bad Regex: ', regexParsingError asString ]].
	] repeat.!