[squeak-dev] The Inbox: Regex-Core-ct.71

christoph.thiede at student.hpi.uni-potsdam.de
Thu Oct 28 03:05:05 UTC 2021


As the mail notification failed again and https://source.squeak.org/inbox/Regex-Core-ct.71.diff is broken (probably because of a special character), I'm posting the version here again.

==================== Summary ====================

Name: Regex-Core-ct.71
Author: ct
Time: 28 October 2021, 4:56:37.955233 am
UUID: f0c42025-64f2-664e-8439-f228df1f2220
Ancestors: Regex-Core-mt.61

Adds support for Unicode backslash syntax in pieces and character sets.

Some examples:

	'Squeak is the perfect language' allRegexMatches: '\w*\u0061\w*'. "--> #('Squeak' 'language')"
	'Squeak is beautiful' allRegexMatches: '\w*\x75\w*'. "--> #('Squeak' 'beautiful')"
	(WebUtils jsonDecode: '"$1.00 = \u20AC0.86 = \u00A30.84"' readStream) allRegexMatches: '\p{Sc}\d+\.[\x31-\u{ar57}]+'. "--> #('€0.86' '£0.84')"
	'Carpe Squeak!' allRegexMatches: '\p{L}+'. "--> #('Carpe' 'Squeak')"
	(WebUtils jsonDecode: '" get rid of \u2007all these nonsense\nseparators"' readStream) allRegexMatches: '\P{Z}+'. "--> #('get' 'rid' 'of' 'all' 'these' 'nonsense
separators')"
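
A further sketch of the codepoint escapes, going by #codePoint: in the diff below: \x reads two hex digits, \u reads four, and either one accepts a braced number of arbitrary length instead. The expressions are illustrative only and not part of the commit; they assume this version and its prerequisites are loaded.

```smalltalk
"Workspace sketch; assumes Regex-Core-ct.71 and its prerequisites are loaded."

"\x takes exactly two hex digits: 16r41 is $A."
'\x41' asRegex matches: 'A'.

"\u takes exactly four hex digits."
'\u0041' asRegex matches: 'A'.

"With braces, the number of digits is not fixed."
'\u{41}' asRegex matches: 'A'.
'\x{0041}' asRegex matches: 'A'.
```

If the parser behaves as described, each expression should answer true.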

This replaces Regex-Core-ct.68 (which can be moved to the treated inbox) and adds support for the new syntax inside character sets. It is inspired by, and a counterproposal to, Regex-Core-tobe.62. See Regex-Tests-Core-ct.28. The following changes have been made since Regex-Core-ct.68:

- Factored out common parser logic from RxParser and RxCharSetParser into new common superclass RxAbstractParser. Apart from deduplication, this is crucial to use #[uni]codePoint and #unicodeCategory specials in both parsers. (I also considered invoking another RxParser from RxCharSetParser but eventually found this solution more elegant.)
- Split up BackslashSpecials into BackslashPredicates (on RxAbstractParser) and BackslashConditions (only available on RxParser). Escaped uppercase characters (such as '\D') now automatically map to the negation of the lowercase special (see #backslashSpecial:). Deprecated RxsPredicate class >> #forEscapedLetter.
- Cleaned up & deduplicated RxCharSetParser to match the functional style of RxParser. RxCharSetParser is now responsible by itself for handling BackslashConstants.
- Made sure to parse escape characters at the end of a char set range, i.e., allow '[2-\x38]' asRegex and reject '[2-\d]' asRegex (as most other parsers do, too).
- Correctly maintain the source position in a RegexSyntaxError that is signaled while parsing a char set. See #testRegexSyntaxErrorPosition.
- Enabled and fixed matching against composed RxCharSets, which can happen now in the case of a pattern like '[\P{L}a]' asRegex.
    * Honor case-(in)sensitive matching in nested char sets by appending an #IgnoringCase: argument to #predicate[s|Negation|PartPredicate] on RxsCharSet and RxsPredicate, respectively.
- Added support for Squeak-style codepoints such as '\x{2r100000}' asRegex matches: ' '.
- Removed superfluous spaces from error messages.
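
A few of the points above can be tried out in a workspace. This is only a sketch under the assumption that this version and its prerequisites are loaded; the test strings are made up for illustration and are not taken from Regex-Tests-Core-ct.28.

```smalltalk
"Workspace sketch; assumes Regex-Core-ct.71 and its prerequisites are loaded."

"An escape is accepted as the end of a range: \x38 is $8, so the set is [2-8]."
'[2-\x38]' asRegex matches: '5'.

"A predicate cannot end a range; the parser is expected to signal a RegexSyntaxError."
['[2-\d]' asRegex. false]
	on: RegexSyntaxError
	do: [:ex | ex return: true].

"An uppercase escape negates its lowercase counterpart, so \D accepts a non-digit."
'\D' asRegex matches: 'a'.

"Composed char set: the non-letter category together with the literal $a."
'[\P{L}a]' asRegex matches: 'a'.
'[\P{L}a]' asRegex matches: '1'.

"Squeak-style radix notation in braces: 2r100000 is 32, the space character."
'\x{2r100000}' asRegex matches: ' '.
```

If the changes behave as listed, every expression should answer true.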

Requires Kernel-ct.1419 (NumberParser >> #defaultBase:) and Multilingual-ct.259 (Unicode class >> #generalTagOf:).

=============== Diff against Regex-Core-mt.61 ===============

RxAbstractParser
+ Object subclass: #RxAbstractParser
+     instanceVariableNames: 'source lookahead'
+     classVariableNames: 'BackslashConstants BackslashPredicates'
+     poolDictionaries: ''
+     category: 'Regex-Core'
+ 
+ RxAbstractParser class 
+     instanceVariableNames: ''
+ 
+ "I provide general parsing facilities for all kinds of regex parsers.
+ 
+ Instance variables:
+     source        <Stream> A stream with the expression being parsed.
+     lookahead    <Character>    The current lookahead character."

RxAbstractParser class>>doShiftingSyntaxExceptionPositions:from: {exception signaling} · ct 10/28/2021 03:18
+ doShiftingSyntaxExceptionPositions: aBlock from: start
+     "When invoking a nested parser, make sure to update the positions of any syntax exception raised by this nested parser."
+     ^ aBlock
+         on: RegexSyntaxError
+         do: [:ex | ex resignalAs: (ex copy
+             position: ex position + start - 1;
+             yourself)]

RxAbstractParser class>>initialize {class initialization} · ct 10/27/2021 23:30
+ initialize
+     "self initialize"
+     self
+         initializeBackslashConstants;
+         initializeBackslashPredicates

RxAbstractParser class>>initializeBackslashConstants {class initialization} · ct 10/27/2021 07:50
+ initializeBackslashConstants
+     "self initializeBackslashConstants"
+ 
+     (BackslashConstants := Dictionary new)
+         at: $e put: Character escape;
+         at: $n put: Character lf;
+         at: $r put: Character cr;
+         at: $f put: Character newPage;
+         at: $t put: Character tab

RxAbstractParser class>>initializeBackslashPredicates {class initialization} · ct 10/27/2021 20:57
+ initializeBackslashPredicates
+     "The keys are characters that normally follow a $\, the values are either associations of classes and initialization selectors on their instance side, or evaluables that will be evaluated on the current parser instance."
+     "self initializeBackslashPredicates"
+ 
+     (BackslashPredicates := Dictionary new)
+         at: $d put: RxsPredicate -> #beDigit;
+         at: $p put: #unicodeCategory;
+         at: $s put: RxsPredicate -> #beSpace;
+         at: $u put: #unicodePoint;
+         at: $w put: RxsPredicate -> #beWordConstituent;
+         at: $x put: #codePoint.

RxAbstractParser class>>signalSyntaxException: {exception signaling} · avi 11/30/2003 13:25
+ signalSyntaxException: errorString
+     RegexSyntaxError new signal: errorString

RxAbstractParser class>>signalSyntaxException:at: {exception signaling} · CamilloBruni 10/7/2012 22:50
+ signalSyntaxException: errorString at: errorPosition
+     RegexSyntaxError signal: errorString at: errorPosition

RxAbstractParser>>backslashConstant {parsing} · ct 10/27/2021 07:48
+ backslashConstant
+ 
+     | character node |
+     character := BackslashConstants at: lookahead ifAbsent: [^ nil].
+     self next.
+     node := RxsCharacter with: character.
+     ^ node

RxAbstractParser>>backslashNode {parsing} · ct 10/28/2021 03:09
+ backslashNode
+ 
+     | char |
+     lookahead ifNil: [ self signalParseError: 'bad quotation' ].
+     
+     self basicBackslashNode ifNotNil: [:node | ^node].
+     
+     char := lookahead.
+     self next.
+     ^ RxsCharacter with: char

RxAbstractParser>>backslashPredicate {parsing} · ct 10/27/2021 07:49
+ backslashPredicate
+ 
+     ^ self backslashSpecial: BackslashPredicates

RxAbstractParser>>backslashSpecial: {private} · ct 10/28/2021 02:56
+ backslashSpecial: specials
+ 
+     | negate specialSelector node |
+     negate := false.
+     specialSelector := specials at: lookahead ifAbsent: [
+         (lookahead isLetter and: [lookahead isUppercase]) ifTrue: [
+             negate := true.
+             specialSelector := specials at: lookahead asLowercase ifAbsent: []].
+         specialSelector ifNil: [^ nil]].
+     self next.
+     
+     node := specialSelector isVariableBinding
+         ifTrue: [specialSelector key new perform: specialSelector value]
+         ifFalse: [specialSelector value: self].
+     negate ifTrue: [node := node negated].
+     ^ node

RxAbstractParser>>basicBackslashNode {parsing} · ct 10/28/2021 03:03
+ basicBackslashNode
+     
+     self backslashConstant ifNotNil: [:node | ^ node].
+     self backslashPredicate ifNotNil: [:node | ^ node].
+     ^ nil

RxAbstractParser>>codePoint {parsing} · ct 10/27/2021 20:48
+ codePoint
+ 
+     ^ self codePoint: 2

RxAbstractParser>>codePoint: {parsing} · ct 10/27/2021 22:47
+ codePoint: size
+     "Matches the character with the given codepoint, written as exactly <size> hex digits unless braced.
+     <codePoint> ::= \x ({<hex>} '|' <hex>[size])"
+ 
+     | braced codeString codePoint codeStream |
+     braced := self tryMatch: ${.
+     codeString := braced
+         ifFalse: [self
+             input: size
+             errorMessage: 'invalid codepoint']
+         ifTrue: [self
+             inputUpTo: $}
+             errorMessage: 'no terminating "}"'].
+     
+     codeStream := codeString readStream.
+     codePoint := ((ExtendedNumberParser on: codeStream)
+         defaultBase: 16;
+         nextInteger) ifNil: [
+             self signalParseError: 'invalid codepoint'].
+     codeStream atEnd ifFalse: [
+         self signalParseError: 'invalid codepoint'].
+     
+     braced ifTrue: [
+         self match: $}].
+     
+     ^ RxsCharacter with: (Character codePoint: codePoint)

RxAbstractParser>>initialize: {initialize-release} · ct 10/27/2021 07:24
+ initialize: aStream
+ 
+     source := aStream.
+     self next.

RxAbstractParser>>input:errorMessage: {private} · ct 10/27/2021 20:52
+ input: anInteger errorMessage: aString
+     "Accumulate anInteger characters from the input stream and answer them as a String. Raise an error with the specified message if there are not enough characters available."
+ 
+     | accumulator |
+     accumulator := WriteStream on: (String new: 20).
+     anInteger timesRepeat: [
+         lookahead ifNil: [self signalParseError: aString].
+         accumulator nextPut: lookahead.
+         self next].
+     ^ accumulator contents

RxAbstractParser>>inputUpTo:errorMessage: {private} · ul 9/24/2015 08:25
+ inputUpTo: aCharacter errorMessage: aString
+     "Accumulate input stream until <aCharacter> is encountered
+     and answer the accumulated chars as String, not including
+     <aCharacter>. Signal error if end of stream is encountered,
+     passing <aString> as the error description."
+ 
+     | accumulator |
+     accumulator := WriteStream on: (String new: 20).
+     [ lookahead == aCharacter or: [lookahead == nil ] ]
+         whileFalse: [
+             accumulator nextPut: lookahead.
+             self next].
+     lookahead ifNil: [ self signalParseError: aString ].
+     ^accumulator contents

RxAbstractParser>>inputUpToAny:errorMessage: {private} · ul 9/24/2015 08:24
+ inputUpToAny: aDelimiterString errorMessage: aString
+     "Accumulate input stream until any character from <aDelimiterString> is encountered
+     and answer the accumulated chars as String, not including the matched characters from the
+     <aDelimiterString>. Signal error if end of stream is encountered,
+     passing <aString> as the error description."
+ 
+     | accumulator |
+     accumulator := WriteStream on: (String new: 20).
+     [ lookahead == nil or: [ aDelimiterString includes: lookahead ] ]
+         whileFalse: [
+             accumulator nextPut: lookahead.
+             self next ].
+     lookahead ifNil: [ self signalParseError: aString ].
+     ^accumulator contents

RxAbstractParser>>match: {parsing} · ct 10/27/2021 22:37
+ match: aCharacter
+     "<aCharacter> MUST match the current lookahead. If this is the case, advance the input. Otherwise, blow up."
+ 
+     aCharacter = lookahead ifTrue: [ ^self next ].
+     self signalParseError: (lookahead
+         ifNil: ['unexpected end']
+         ifNotNil: ['unexpected character: ', lookahead asString])

RxAbstractParser>>next {private} · ct 10/27/2021 07:15
+ next
+ 
+     ^ lookahead := source next

RxAbstractParser>>signalParseError {private} · ct 10/27/2021 07:16
+ signalParseError
+ 
+     self class
+         signalSyntaxException: 'Regex syntax error'
+         at: source position

RxAbstractParser>>signalParseError: {private} · ct 10/27/2021 07:16
+ signalParseError: aString
+ 
+     self class signalSyntaxException: aString at: source position

RxAbstractParser>>tryMatch: {private} · ct 8/23/2021 21:01
+ tryMatch: aCharacter
+ 
+     ^ lookahead == aCharacter
+         ifTrue: [self next];
+         yourself

RxAbstractParser>>unicodeCategory {parsing} · ct 10/27/2021 22:01
+ unicodeCategory
+     "Matches a character that belongs to the given unicode category.
+     <unicodeCategory> ::= \p '{' <categoryName> '}'"
+ 
+     | category |
+     self match: ${.
+     category := self inputUpTo: $} errorMessage: 'no terminating "}"'.
+     self match: $}.
+     ^ RxsPredicate new beUnicodeCategory: category

RxAbstractParser>>unicodePoint {parsing} · ct 10/27/2021 20:49
+ unicodePoint
+ 
+     ^ self codePoint: 4

RxCharSetParser (changed)
- Object subclass: #RxCharSetParser
-     instanceVariableNames: 'source lookahead elements'
+ RxAbstractParser subclass: #RxCharSetParser
+     instanceVariableNames: 'elements'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Regex-Core'

RxCharSetParser class 
    instanceVariableNames: ''

"-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
I am a parser created to parse the insides of a character set ([...]) construct. I create and answer a collection of "elements", each being an instance of one of: RxsCharacter, RxsRange, or RxsPredicate.

Instance Variables:

-     source    <Stream>    open on whatever is inside the square brackets we have to parse.
-     lookahead    <Character>    The current lookahead character
    elements    <Collection of: <RxsCharacter|RxsRange|RxsPredicate>> Parsing result"

RxCharSetParser>>add: {parsing} · ct 10/28/2021 02:24
+ add: nodeOrNodes
+ 
+     nodeOrNodes isCollection
+         ifFalse: [elements add: nodeOrNodes]
+         ifTrue: [elements addAll: nodeOrNodes]

RxCharSetParser>>addChar: {parsing} · vb 4/11/09 21:56 (removed)
- addChar: aChar
- 
-     elements add: (RxsCharacter with: aChar)

RxCharSetParser>>addRangeFrom:to: {parsing} · CamilloBruni 10/7/2012 22:52 (removed)
- addRangeFrom: firstChar to: lastChar
- 
-     firstChar asInteger > lastChar asInteger ifTrue:
-         [RxParser signalSyntaxException: ' bad character range' at: source position].
-     elements add: (RxsRange from: firstChar to: lastChar)

RxCharSetParser>>char {parsing} · ct 10/28/2021 03:06
+ char
+ 
+     | char |
+     lookahead == $\ ifTrue:
+         [self match: $\.
+         ^self backslashNode
+             ifNil: [RxsCharacter with: lookahead]].
+     
+     char := RxsCharacter with: lookahead.
+     self next.
+     ^char

RxCharSetParser>>char: {parsing} · ct 10/28/2021 02:20
+ char: aCharacter
+ 
+     ^ RxsCharacter with: aCharacter

RxCharSetParser>>charOrRange {parsing} · ct 10/28/2021 02:50
+ charOrRange
+ 
+     | firstChar lastChar |
+     firstChar := self char.
+     lookahead == $- ifFalse:
+         [^firstChar].
+     
+     self next ifNil:
+         [^{firstChar. self char: $-}].
+     
+     lastChar := self char.
+     firstChar isRegexCharacter ifFalse:
+         [self signalParseError: 'range must start with a single character'].
+     lastChar isRegexCharacter ifFalse: 
+         [self signalParseError: 'range must end with a single character'].
+     ^self rangeFrom: firstChar character to: lastChar character

RxCharSetParser>>element {parsing} · ct 10/28/2021 02:48
+ element
+ 
+     (lookahead == $[ and: [source peek == $:]) ifTrue:
+         [^self namedSet].
+     ^self charOrRange

RxCharSetParser>>initialize: {initialize-release} · ct 10/27/2021 07:24 (changed)
initialize: aStream

-     source := aStream.
-     lookahead := aStream next.
+     super initialize: aStream.
    elements := OrderedCollection new

RxCharSetParser>>match: {parsing} · ul 5/24/2015 22:01 (removed)
- match: aCharacter
- 
-     aCharacter = lookahead ifTrue: [ ^self next ].
-     RxParser 
-         signalSyntaxException: 'unexpected character: ', (String with: lookahead)
-         at: source position

RxCharSetParser>>namedSet {parsing} · ct 10/28/2021 02:19
+ namedSet
+ 
+     | name |
+     self match: $[; match: $:.
+     name := (String with: lookahead), (source upTo: $:).
+     self next.
+     self match: $].
+     ^ RxsPredicate forNamedClass: name

RxCharSetParser>>next {parsing} · ul 5/24/2015 21:19 (removed)
- next
- 
-     ^lookahead := source next

RxCharSetParser>>parse {accessing} · ct 10/28/2021 02:49 (changed)
parse

-     lookahead == $- ifTrue: [
-         self addChar: $-.
-         self next ].
-     [ lookahead == nil ] whileFalse: [ self parseStep ].
+     [ lookahead == nil ] whileFalse: [ self add: self element ].
    ^elements

RxCharSetParser>>parseCharOrRange {parsing} · ul 5/24/2015 21:20 (removed)
- parseCharOrRange
- 
-     | firstChar |
-     firstChar := lookahead.
-     self next == $- ifFalse: [ ^self addChar: firstChar ].
-     self next ifNil: [ ^self addChar: firstChar; addChar: $- ].
-     self addRangeFrom: firstChar to: lookahead.
-     self next

RxCharSetParser>>parseEscapeChar {parsing} · tobe 8/12/2021 08:56 (removed)
- parseEscapeChar
- 
-     | first |
-     self match: $\.
-     first := (RxsPredicate forEscapedLetter: lookahead)
-         ifNil: [ RxsCharacter with: lookahead ].
-     self next == $- ifFalse: [^ elements add: first].
-     self next ifNil: [
-         elements add: first.
-         ^ self addChar: $-].
-     self addRangeFrom: first character to: lookahead.
-     self next

RxCharSetParser>>parseNamedSet {parsing} · ul 5/24/2015 22:00 (removed)
- parseNamedSet
- 
-     | name |
-     self match: $[; match: $:.
-     name := (String with: lookahead), (source upTo: $:).
-     self next.
-     self match: $].
-     elements add: (RxsPredicate forNamedClass: name)

RxCharSetParser>>parseStep {parsing} · ul 5/24/2015 21:14 (removed)
- parseStep
- 
-     lookahead == $[ ifTrue:
-         [source peek == $:
-             ifTrue: [^self parseNamedSet]
-             ifFalse: [^self parseCharOrRange]].
-     lookahead == $\ ifTrue:
-         [^self parseEscapeChar].
-     lookahead == $- ifTrue:
-         [RxParser signalSyntaxException: 'invalid range' at: source position].
-     self parseCharOrRange

RxCharSetParser>>rangeFrom:to: {parsing} · ct 10/28/2021 02:20
+ rangeFrom: firstChar to: lastChar
+ 
+     firstChar asInteger > lastChar asInteger ifTrue:
+         [self signalParseError: 'bad character range'].
+     ^ RxsRange from: firstChar to: lastChar

RxMatchOptimizer>>syntaxCharSet: {double dispatch} · ct 10/27/2021 08:55 (changed)
syntaxCharSet: charSetNode 
    "All these (or none of these) characters is the prefix."

    (charSetNode enumerableSetIgnoringCase: ignoreCase) ifNotNil: [ :enumerableSet |
        charSetNode isNegated
            ifTrue: [ self addNonPrefixes: enumerableSet ]
            ifFalse: [ self addPrefixes: enumerableSet ] ].

-     charSetNode predicates ifNotNil: [ :charsetPredicates |
+     (charSetNode predicatesIgnoringCase: ignoreCase) ifNotNil: [ :charsetPredicates |
        charSetNode isNegated
            ifTrue: [ 
                charsetPredicates do: [ :each | self addNonPredicate: each ] ]
            ifFalse: [ 
                charsetPredicates do: [ :each | self addPredicate: each ] ] ]

RxMatchOptimizer>>syntaxPredicate: {double dispatch} · ct 10/27/2021 08:54 (changed)
syntaxPredicate: predicateNode 

-     self addPredicate: predicateNode predicate
+     self addPredicate: (predicateNode predicateIgnoringCase: ignoreCase)

RxMatcher>>syntaxPredicate: {double dispatch} · ct 10/27/2021 08:54 (changed)
syntaxPredicate: predicateNode
    "Double dispatch from the syntax tree. 
    A character set is a few characters, and we either match any of them,
    or match any that is not one of them."

-     ^RxmPredicate with: predicateNode predicate
+     ^RxmPredicate with: (predicateNode predicateIgnoringCase: ignoreCase)

RxParser (changed)
- Object subclass: #RxParser
-     instanceVariableNames: 'input lookahead'
-     classVariableNames: 'BackslashConstants BackslashSpecials'
+ RxAbstractParser subclass: #RxParser
+     instanceVariableNames: ''
+     classVariableNames: 'BackslashConditions'
    poolDictionaries: ''
    category: 'Regex-Core'

RxParser class 
    instanceVariableNames: ''

"-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
- The regular expression parser. Translates a regular expression read from a stream into a parse tree. ('accessing' protocol). The tree can later be passed to a matcher initialization method. All other classes in this category implement the tree. Refer to their comments for any details.
- 
- Instance variables:
-     input        <Stream> A stream with the regular expression being parsed.
-     lookahead    <Character>"
+ The regular expression parser. Translates a regular expression read from a stream into a parse tree. ('accessing' protocol). The tree can later be passed to a matcher initialization method. All other classes in this category implement the tree. Refer to their comments for any details."

RxParser class>>initialize {class initialization} · ct 10/27/2021 07:50 (changed)
initialize
    "self initialize"
-     self
-         initializeBackslashConstants;
-         initializeBackslashSpecials
+     self initializeBackslashConditions

RxParser class>>initializeBackslashConditions {class initialization} · ct 10/27/2021 08:17
+ initializeBackslashConditions
+     "The keys are characters that normally follow a $\, the values are either associations of classes and initialization selectors on their instance side, or evaluables that will be evaluated on the current parser instance."
+     "self initializeBackslashConditions"
+ 
+     (BackslashConditions := Dictionary new)
+         at: $b put: RxsContextCondition -> #beWordBoundary;
+         at: $B put: RxsContextCondition -> #beNonWordBoundary;
+         at: $< put: RxsContextCondition -> #beBeginningOfWord;
+         at: $> put: RxsContextCondition -> #beEndOfWord.

RxParser class>>initializeBackslashConstants {class initialization} · lr 11/4/2009 22:14 (removed)
- initializeBackslashConstants
-     "self initializeBackslashConstants"
- 
-     (BackslashConstants := Dictionary new)
-         at: $e put: Character escape;
-         at: $n put: Character lf;
-         at: $r put: Character cr;
-         at: $f put: Character newPage;
-         at: $t put: Character tab

RxParser class>>initializeBackslashSpecials {class initialization} · vb 4/11/09 21:56 (removed)
- initializeBackslashSpecials
-     "Keys are characters that normally follow a \, the values are
-     associations of classes and initialization selectors on the instance side
-     of the classes."
-     "self initializeBackslashSpecials"
- 
-     (BackslashSpecials := Dictionary new)
-         at: $w put: (Association key: RxsPredicate value: #beWordConstituent);
-         at: $W put: (Association key: RxsPredicate value: #beNotWordConstituent);
-         at: $s put: (Association key: RxsPredicate value: #beSpace);
-         at: $S put: (Association key: RxsPredicate value: #beNotSpace);
-         at: $d put: (Association key: RxsPredicate value: #beDigit);
-         at: $D put: (Association key: RxsPredicate value: #beNotDigit);
-         at: $b put: (Association key: RxsContextCondition value: #beWordBoundary);
-         at: $B put: (Association key: RxsContextCondition value: #beNonWordBoundary);
-         at: $< put: (Association key: RxsContextCondition value: #beBeginningOfWord);
-         at: $> put: (Association key: RxsContextCondition value: #beEndOfWord)

RxParser class>>signalSyntaxException: {exception signaling} · avi 11/30/2003 13:25 (removed)
- signalSyntaxException: errorString
-     RegexSyntaxError new signal: errorString

RxParser class>>signalSyntaxException:at: {exception signaling} · CamilloBruni 10/7/2012 22:50 (removed)
- signalSyntaxException: errorString at: errorPosition
-     RegexSyntaxError signal: errorString at: errorPosition

RxParser>>atom {recursive descent} · ct 10/28/2021 03:03 (changed)
atom
    "An atom is one of a lot of possibilities, see below."

    | atom |
    (lookahead == nil 
    or: [ lookahead == $| 
    or: [ lookahead == $)
    or: [ lookahead == $*
    or: [ lookahead == $+ 
    or: [ lookahead == $? ]]]]])
        ifTrue: [ ^RxsEpsilon new ].
        
    lookahead == $( 
        ifTrue: [
            "<atom> ::= '(' <regex> ')' "
            self match: $(.
            atom := self regex.
            self match: $).
            ^atom ].
    
    lookahead == $[
        ifTrue: [
            "<atom> ::= '[' <characterSet> ']' "
            self match: $[.
            atom := self characterSet.
            self match: $].
            ^atom ].
    
    lookahead == $: 
        ifTrue: [
            "<atom> ::= ':' <messagePredicate> ':' "
            self match: $:.
            atom := self messagePredicate.
            self match: $:.
            ^atom ].
    
    lookahead == $. 
        ifTrue: [
            "any non-whitespace character"
            self next.
            ^RxsContextCondition new beAny].
    
    lookahead == $^ 
        ifTrue: [
            "beginning of line condition"
            self next.
            ^RxsContextCondition new beBeginningOfLine].
    
    lookahead == $$ 
        ifTrue: [
            "end of line condition"
            self next.
            ^RxsContextCondition new beEndOfLine].
        
    lookahead == $\ 
        ifTrue: [
-             "<atom> ::= '\' <character>"
-             self next ifNil: [ self signalParseError: 'bad quotation' ].
-             (BackslashConstants includesKey: lookahead) ifTrue: [
-                 atom := RxsCharacter with: (BackslashConstants at: lookahead).
-                 self next.
-                 ^atom].
-             self ifSpecial: lookahead
-                 then: [:node | self next. ^node]].
-         
+             "<atom> ::= '\' <node>"
+             self match: $\.
+             ^self backslashNode].
+     
    "If passed through the above, the following is a regular character."
    atom := RxsCharacter with: lookahead.
    self next.
    ^atom

RxParser>>backslashCondition {recursive descent} · ct 10/27/2021 07:38
+ backslashCondition
+ 
+     ^ self backslashSpecial: BackslashConditions

RxParser>>basicBackslashNode {recursive descent} · ct 10/28/2021 03:03
+ basicBackslashNode
+ 
+     ^ super basicBackslashNode ifNil: [self backslashCondition]

RxParser>>characterSet {recursive descent} · ct 10/28/2021 03:19 (changed)
characterSet
    "Match a range of characters: something between `[' and `]'.
    Opening bracket has already been seen, and closing should
    not be consumed as well. Set spec is as usual for
    sets in regexes."

-     | spec errorMessage |
-     errorMessage := ' no terminating "]"'.
+     | start spec errorMessage |
+     errorMessage := 'no terminating "]"'.
+     start := source position.
    spec := self inputUpTo: $] nestedOn: $[ errorMessage: errorMessage.
    (spec isEmpty 
    or: [spec = '^']) 
        ifTrue: [
            "This ']' was literal." 
            self next.
            spec := spec, ']', (self inputUpTo: $] nestedOn: $[ errorMessage: errorMessage)].
-     ^self characterSetFrom: spec
+     ^self class
+         doShiftingSyntaxExceptionPositions: [self characterSetFrom: spec]
+         from: start

RxParser>>ifSpecial:then: {private} · vb 4/11/09 21:56 (removed)
- ifSpecial: aCharacter then: aBlock
-     "If the character is such that it defines a special node when follows a $\,
-     then create that node and evaluate aBlock with the node as the parameter.
-     Otherwise just return."
- 
-     | classAndSelector |
-     classAndSelector := BackslashSpecials at: aCharacter ifAbsent: [^self].
-     ^aBlock value: (classAndSelector key new perform: classAndSelector value)

RxParser>>inputUpTo:errorMessage: {private} · ul 9/24/2015 08:25 (removed)
- inputUpTo: aCharacter errorMessage: aString
-     "Accumulate input stream until <aCharacter> is encountered
-     and answer the accumulated chars as String, not including
-     <aCharacter>. Signal error if end of stream is encountered,
-     passing <aString> as the error description."
- 
-     | accumulator |
-     accumulator := WriteStream on: (String new: 20).
-     [ lookahead == aCharacter or: [lookahead == nil ] ]
-         whileFalse: [
-             accumulator nextPut: lookahead.
-             self next].
-     lookahead ifNil: [ self signalParseError: aString ].
-     ^accumulator contents

RxParser>>inputUpTo:nestedOn:errorMessage: {private} · ct 10/27/2021 08:06 (changed)
inputUpTo: aCharacter nestedOn: anotherCharacter errorMessage: aString 
-     "Accumulate input stream until <aCharacter> is encountered
-     and answer the accumulated chars as String, not including
-     <aCharacter>. Signal error if end of stream is encountered,
-     passing <aString> as the error description."
+     "Accumulate input stream until <aCharacter> is encountered without escaping and answer the accumulated chars as String, not including <aCharacter>. Signal error if end of stream is encountered, passing <aString> as the error description."

    | accumulator nestLevel |
    accumulator := WriteStream on: (String new: 20).
    nestLevel := 0.
    [ lookahead == aCharacter and: [ nestLevel = 0 ] ] whileFalse: [
        lookahead ifNil: [ self signalParseError: aString ].
        lookahead == $\
            ifTrue: [ 
                self next ifNil: [ self signalParseError: aString ].
-                 BackslashConstants
-                     at: lookahead
-                     ifPresent: [ :unescapedCharacter | accumulator nextPut: unescapedCharacter ]
-                     ifAbsent: [
-                         accumulator
-                             nextPut: $\;
-                             nextPut: lookahead ] ]
+                 accumulator
+                     nextPut: $\;
+                     nextPut: lookahead ]
            ifFalse: [
                accumulator nextPut: lookahead.
                lookahead == anotherCharacter ifTrue: [ nestLevel := nestLevel + 1 ].
                lookahead == aCharacter ifTrue: [ nestLevel := nestLevel - 1 ] ].
        self next ].
    ^accumulator contents

RxParser>>inputUpToAny:errorMessage: {private} · ul 9/24/2015 08:24 (removed)
- inputUpToAny: aDelimiterString errorMessage: aString
-     "Accumulate input stream until any character from <aDelimiterString> is encountered
-     and answer the accumulated chars as String, not including the matched characters from the
-     <aDelimiterString>. Signal error if end of stream is encountered,
-     passing <aString> as the error description."
- 
-     | accumulator |
-     accumulator := WriteStream on: (String new: 20).
-     [ lookahead == nil or: [ aDelimiterString includes: lookahead ] ]
-         whileFalse: [
-             accumulator nextPut: lookahead.
-             self next ].
-     lookahead ifNil: [ self signalParseError: aString ].
-     ^accumulator contents

RxParser>>match: {private} · ul 5/16/2015 01:51 (removed)
- match: aCharacter
-     "<aCharacter> MUST match the current lookeahead.
-     If this is the case, advance the input. Otherwise, blow up."
- 
-     aCharacter == lookahead ifFalse: [ ^self signalParseError ]. "does not return"
-     self next

RxParser>>messagePredicate {recursive descent} · ct 10/27/2021 22:01 (changed)
messagePredicate
    "Match a message predicate specification: a selector (presumably
    understood by a Character) enclosed in :'s ."

    | spec negated |
-     spec := self inputUpTo: $: errorMessage: ' no terminating ":"'.
+     spec := self inputUpTo: $: errorMessage: 'no terminating ":"'.
+     spec ifEmpty: [self signalParseError ].
    negated := false.
    spec first = $^ 
        ifTrue: [
            negated := true.
            spec := spec copyFrom: 2 to: spec size].
    ^RxsMessagePredicate new 
        initializeSelector: spec asSymbol
        negated: negated

RxParser>>next {private} · ul 9/25/2015 10:02 (removed)
- next
-     "Advance the input storing the just read character
-     as the lookahead."
- 
-     ^lookahead := input next

RxParser>>parseStream: {accessing} · ct 10/27/2021 07:24 (changed)
parseStream: aStream
    "Parse an input from a character stream <aStream>.
    On success, answers an RxsRegex -- parse tree root.
    On error, raises `RxParser syntaxErrorSignal' with the current
    input stream position as the parameter."

    | tree |
-     input := aStream.
-     self next.
+     self initialize: aStream.
    tree := self regex.
    self match: nil.
    ^tree
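
A short sketch of how #parseStream: might be driven from a workspace. It assumes RxMatcher class >> #for: and RxMatcher >> #matches: as found in a stock Regex-Core; the variable names are only illustrative.

```smalltalk
| tree matcher |
"Parse a pattern from a character stream. Per the method comment,
the answer is the parse tree root (an RxsRegex)."
tree := RxParser new parseStream: 'a|b' readStream.

"Compile the syntax tree into a matcher and try it on some input."
matcher := RxMatcher for: tree.
matcher matches: 'b'.
```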

RxParser>>quantifiedAtom: {recursive descent} · ct 10/27/2021 22:01 (changed)
quantifiedAtom: atom
    "Parse a quantifier expression which can have one of the following forms
        {<min>,<max>} match <min> to <max> occurrences
        {<minmax>} which is the same as with repeated limits: {<number>,<number>}
        {<min>,} match at least <min> occurrences
        {,<max>} match maximally <max> occurrences, which is the same as {0,<max>}"
    | min max |
    self next.
    lookahead == $,
        ifTrue: [ min := 0 ]
        ifFalse: [
-             max := min := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].
+             max := min := (self inputUpToAny: ',}' errorMessage: 'no terminating "}"') asUnsignedInteger ].
    lookahead == $,
        ifTrue: [
            self next.
-             max := (self inputUpToAny: ',}' errorMessage: ' no terminating "}"') asUnsignedInteger ].    
+             max := (self inputUpToAny: ',}' errorMessage: 'no terminating "}"') asUnsignedInteger ].    
    self match: $}.
    atom isNullable
        ifTrue: [ self signalNullableClosureParserError ].
    (max notNil and: [ max < min ])
        ifTrue: [ self signalParseError: ('wrong quantifier, expected ', min asString, ' <= ', max asString) ].
    ^ RxsPiece new 
        initializeAtom: atom
        min: min
        max: max
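
The four quantifier forms from the method comment can be tried in a workspace as sketched below. This assumes String >> #matchesRegex: (which requires the whole receiver to match) and #asRegex. The last expression triggers the `max < min` check, and the handler on the general Error class is an assumption made for brevity.

```smalltalk
"{<min>,<max>}: between two and three repetitions."
'aaa' matchesRegex: 'a{2,3}'. "--> true"
'a' matchesRegex: 'a{2,3}'. "--> false"

"{<minmax>}: exactly two repetitions."
'aa' matchesRegex: 'a{2}'. "--> true"

"{<min>,}: at least two repetitions."
'aaaaa' matchesRegex: 'a{2,}'. "--> true"

"{,<max>}: at most three repetitions, same as {0,3}."
'aa' matchesRegex: 'a{,3}'. "--> true"

"max < min is rejected at parse time; the message is built in #quantifiedAtom:."
[ 'a{3,2}' asRegex ] on: Error do: [ :error | error messageText ].
```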

RxParser>>signalNullableClosureParserError {private} · ct 10/27/2021 22:00 (changed)
signalNullableClosureParserError
-     self signalParseError: ' nullable closure'.
+     self signalParseError: 'nullable closure'.

RxParser>>signalParseError {private} · CamilloBruni 10/7/2012 22:50 (removed)
- signalParseError
- 
-     self class 
-         signalSyntaxException: 'Regex syntax error' at: input position

RxParser>>signalParseError: {private} · CamilloBruni 10/7/2012 22:49 (removed)
- signalParseError: aString
- 
-     self class signalSyntaxException: aString at: input position

RxsCharSet>>basicMaximumCharacterCodeIgnoringCase: {accessing} · ct 10/27/2021 08:59
+ basicMaximumCharacterCodeIgnoringCase: aBoolean
+ 
+     ^ elements inject: -1 into: [ :max :each |
+         (each maximumCharacterCodeIgnoringCase: aBoolean) max: max ]

RxsCharSet>>enumerableSetIgnoringCase: {privileged} · ct 10/27/2021 08:59 (changed)
enumerableSetIgnoringCase: aBoolean
    "Answer a collection of characters that make up the portion of me that can be enumerated, or nil if there are no such characters. The case check is only used to determine the type of set to be used. The returned set won't contain characters of both cases, because this way the senders of this method can create more efficient checks."

    | highestCharacterCode set |
-     highestCharacterCode := elements inject: -1 into: [ :max :each |
-         (each maximumCharacterCodeIgnoringCase: aBoolean) max: max ].
+     highestCharacterCode := self basicMaximumCharacterCodeIgnoringCase: aBoolean.
    highestCharacterCode = -1 ifTrue: [ ^nil ].
    set := highestCharacterCode <= 255
        ifTrue: [ CharacterSet new ]
        ifFalse: [ WideCharacterSet new ].
    elements do: [ :each | each enumerateTo: set ].
    ^set

RxsCharSet>>enumerateTo: {accessing} · ct 10/27/2021 08:36
+ enumerateTo: aSet
+ 
+     negated ifTrue: [^ self "Not enumerable"].
+     ^ elements do: [:each | each enumerateTo: aSet]

RxsCharSet>>isEnumerable {testing} · ct 10/27/2021 08:50 (changed)
isEnumerable

+     negated ifTrue: [^ false].
    ^elements anySatisfy: [:some | some isEnumerable ]

RxsCharSet>>maximumCharacterCodeIgnoringCase: {accessing} · ct 10/27/2021 08:59
+ maximumCharacterCodeIgnoringCase: aBoolean
+     "Return the largest character code among the characters I represent."
+ 
+     negated ifTrue: [^ -1 "not enumerable"].
+     ^ self basicMaximumCharacterCodeIgnoringCase: aBoolean

RxsCharSet>>negated {converting} · ct 10/27/2021 08:35
+ negated
+ 
+     ^ self class new
+         initializeElements: elements
+         negated: negated not

RxsCharSet>>predicateIgnoringCase: {accessing} · ct 10/27/2021 08:52 (changed)
predicateIgnoringCase: aBoolean

    | enumerable predicate |
    enumerable := self enumerablePartPredicateIgnoringCase: aBoolean.
-     predicate := self predicatePartPredicate ifNil: [ 
+     predicate := (self predicatePartPredicateIgnoringCase: aBoolean) ifNil: [ 
        "There are no predicates in this set."
        ^enumerable ifNil: [ 
            "This set is empty."
            [ :char | negated ] ] ].
    enumerable ifNil: [ ^predicate ].
    negated ifTrue: [
        "enumerable and predicate already negate the result, that's why #not is not needed here."
        ^[ :char | (enumerable value: char) and: [ predicate value: char ] ] ].
    ^[ :char | (enumerable value: char) or: [ predicate value: char ] ]
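
The reason the negated branch combines its parts with #and: rather than #or: is De Morgan's law: not (A or B) equals (not A) and (not B). The following sketch uses plain blocks as hypothetical stand-ins for the enumerable and predicate parts of a set such as [abc\d]; none of these names come from the package.

```smalltalk
| enumerable predicate plain negatedEnumerable negatedPredicate negatedSet |
"Hypothetical stand-ins for the two parts of the set."
enumerable := [ :char | 'abc' includes: char ].
predicate := [ :char | char isDigit ].

"Non-negated set: a character matches if either part accepts it."
plain := [ :char | (enumerable value: char) or: [ predicate value: char ] ].

"Negated set: each part is already negated, so both must accept."
negatedEnumerable := [ :char | (enumerable value: char) not ].
negatedPredicate := [ :char | (predicate value: char) not ].
negatedSet := [ :char |
	(negatedEnumerable value: char) and: [ negatedPredicate value: char ] ].

"The plain and the negated set disagree on every character."
'a7x ' allSatisfy: [ :char | (plain value: char) ~= (negatedSet value: char) ]. "--> true"
```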

RxsCharSet>>predicatePartPredicate {privileged} · ul 5/16/2015 01:37 (removed)
- predicatePartPredicate
-     "Answer a predicate that tests all of my elements that cannot be enumerated, or nil if such elements don't exist."
- 
-     | predicates size |
-     predicates := elements reject: [ :some | some isEnumerable ].
-     (size := predicates size) = 0 ifTrue: [ 
-         "We could return a real predicate block - like [ :char | negated ] - here, but it wouldn't be used anyway. This way we signal that this character set has no predicates."
-         ^nil ].
-     size = 1 ifTrue: [
-         negated ifTrue: [ ^predicates first predicateNegation ].
-         ^predicates first predicate ].
-     predicates replace: [ :each | each predicate ].
-     negated ifTrue: [ ^[ [: char | predicates noneSatisfy: [ :some | some value: char ] ] ] ].
-     ^[ :char | predicates anySatisfy: [ :some | some value: char ] ]
-     

RxsCharSet>>predicatePartPredicateIgnoringCase: {privileged} · ct 10/27/2021 08:52
+ predicatePartPredicateIgnoringCase: aBoolean
+     "Answer a predicate that tests all of my elements that cannot be enumerated, or nil if such elements don't exist."
+ 
+     | predicates size |
+     predicates := elements reject: [ :some | some isEnumerable ].
+     (size := predicates size) = 0 ifTrue: [ 
+         "We could return a real predicate block - like [ :char | negated ] - here, but it wouldn't be used anyway. This way we signal that this character set has no predicates."
+         ^nil ].
+     size = 1 ifTrue: [
+         negated ifTrue: [ ^predicates first predicateNegationIgnoringCase: aBoolean ].
+         ^predicates first predicateIgnoringCase: aBoolean ].
+     predicates replace: [ :each | each predicateIgnoringCase: aBoolean ].
+     negated ifTrue: [ ^[ :char | predicates noneSatisfy: [ :some | some value: char ] ] ].
+     ^[ :char | predicates anySatisfy: [ :some | some value: char ] ]

RxsCharSet>>predicates {accessing} · ul 5/16/2015 01:29 (removed)
- predicates
- 
-     | predicates |
-     predicates := elements reject: [ :some | some isEnumerable ].
-     predicates isEmpty ifTrue: [ ^nil ].
-     ^predicates replace: [ :each | each predicate ]

RxsCharSet>>predicatesIgnoringCase: {accessing} · ct 10/27/2021 08:55
+ predicatesIgnoringCase: aBoolean
+ 
+     | predicates |
+     predicates := elements reject: [ :some | some isEnumerable ].
+     predicates isEmpty ifTrue: [ ^nil ].
+     ^predicates replace: [ :each | each predicateIgnoringCase: aBoolean ]

RxsCharacter>>isRegexCharacter {testing} · ct 10/27/2021 20:38
+ isRegexCharacter
+ 
+     ^ true

RxsCharacter>>negated {converting} · ct 10/27/2021 08:32
+ negated
+ 
+     ^ RxsCharSet new
+         initializeElements: {self}
+         negated: true
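
A small sketch of what the new #negated, #isEnumerable and #isRegexCharacter methods imply when read together. It assumes the existing constructor RxsCharacter class >> #with: and that RxsCharSet does not override #isRegexCharacter; the expected answers follow from the method bodies in this diff.

```smalltalk
| node inverted |
node := RxsCharacter with: $a.
node isRegexCharacter. "--> true"

"Negating a single character wraps it into a negated character set."
inverted := node negated.
inverted isRegexCharacter. "--> false (inherited from RxsNode)"

"Negated sets are treated as not enumerable."
inverted isEnumerable. "--> false"
inverted maximumCharacterCodeIgnoringCase: false. "--> -1"
```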

RxsNode>>isRegexCharacter {testing} · ct 10/27/2021 20:38
+ isRegexCharacter
+ 
+     ^ false

RxsPredicate (changed)
RxsNode subclass: #RxsPredicate
    instanceVariableNames: 'predicate negation'
-     classVariableNames: 'EscapedLetterSelectors NamedClassSelectors'
+     classVariableNames: 'NamedClassSelectors'
    poolDictionaries: ''
    category: 'Regex-Core'

RxsPredicate class 
    instanceVariableNames: ''

"-- Regular Expression Matcher v 1.1 (C) 1996, 1999 Vassili Bykov
--
This represents a character that satisfies a certain predicate.

Instance Variables:

    predicate    <BlockClosure>    A one-argument block. If it evaluates to the value defined by <negated> when it is passed a character, the predicate is considered to match.
    negation    <BlockClosure>    A one-argument block that is a negation of <predicate>."

RxsPredicate class>>forEscapedLetter: {instance creation} · ct 10/27/2021 08:16 (changed)
forEscapedLetter: aCharacter
    "Return a predicate instance for the given character, or nil if there's no such predicate."

-     ^EscapedLetterSelectors
-         at: aCharacter
-         ifPresent: [ :selector | self new perform: selector ]
+     self deprecated.
+     ^ RxParser new
+         initialize: {aCharacter} readStream;
+         backslashPredicate

RxsPredicate class>>initialize {class initialization} · ct 10/27/2021 08:16 (changed)
initialize
    "self initialize"

-     self
-         initializeNamedClassSelectors;
-         initializeEscapedLetterSelectors
+     self initializeNamedClassSelectors

RxsPredicate class>>initializeEscapedLetterSelectors {class initialization} · ul 9/25/2015 09:25 (removed)
- initializeEscapedLetterSelectors
-     "self initializeEscapedLetterSelectors"
- 
-     EscapedLetterSelectors := Dictionary new
-         at: $w put: #beWordConstituent;
-         at: $W put: #beNotWordConstituent;
-         at: $d put: #beDigit;
-         at: $D put: #beNotDigit;
-         at: $s put: #beSpace;
-         at: $S put: #beNotSpace;
-         yourself

RxsPredicate>>beUnicodeCategory: {initialize-release} · ct 8/23/2021 20:50
+ beUnicodeCategory: categoryName
+ 
+     self predicate: [:char |
+         (Unicode generalTagOf: char asUnicode) beginsWith: categoryName].
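
Because the category test uses #beginsWith:, a one-letter name such as L should cover every subcategory starting with that letter, while a two-letter name selects a single subcategory. The sketch below repeats one example from the commit summary (it needs this version loaded) and then shows the prefix idea on hypothetical tag strings that do not come from the Unicode class.

```smalltalk
"Example from the commit summary: \p{L} matches letters of any subcategory."
'Carpe Squeak!' allRegexMatches: '\p{L}+'. "--> #('Carpe' 'Squeak')"

"The prefix test on hypothetical general category tags."
'Lu' beginsWith: 'L'. "--> true"
'Lu' beginsWith: 'Lu'. "--> true"
'Nd' beginsWith: 'L'. "--> false"
```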

RxsPredicate>>predicate {accessing} · vb 4/11/09 21:56 (removed)
- predicate
- 
-     ^predicate

RxsPredicate>>predicate: {initialize-release} · ct 8/23/2021 20:50
+ predicate: aBlock
+ 
+     predicate := aBlock.
+     negation := [:char | (predicate value: char) not].

RxsPredicate>>predicateIgnoringCase: {accessing} · ct 10/27/2021 08:53
+ predicateIgnoringCase: aBoolean
+ 
+     ^predicate

RxsPredicate>>predicateNegation {accessing} · vb 4/11/09 21:56 (removed)
- predicateNegation
- 
-     ^negation

RxsPredicate>>predicateNegationIgnoringCase: {accessing} · ct 10/27/2021 08:53
+ predicateNegationIgnoringCase: aBoolean
+ 
+     ^negation

Regex-Core package postscript (changed)
- RxsPredicate initializeEscapedLetterSelectors.
+ RxParser initialize.


---
Sent from Squeak Inbox Talk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Regex-Core-ct.71.mcz
Type: application/octet-stream
Size: 80990 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20211028/776ce3e3/attachment-0001.obj>

