[squeak-dev] The Trunk: Collections-nice.771.mcz

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Thu Nov 30 21:01:59 UTC 2017


2017-11-30 21:57 GMT+01:00 <commits at source.squeak.org>:

> Nicolas Cellier uploaded a new version of Collections to project The Trunk:
> http://source.squeak.org/trunk/Collections-nice.771.mcz
>
> ==================== Summary ====================
>
> Name: Collections-nice.771
> Author: nice
> Time: 30 November 2017, 9:57:27.873964 pm
> UUID: d8b64711-6119-429b-b3f5-259b46ef864b
> Ancestors: Collections-nice.770
>
> Fix awfully broken CharacterSetComplement select:/reject:
>
> If we want to select:/reject:, we must not only enumerate the absent
> characters, but rather all the characters in the complement.
>
> That's way two many, thus we prefer to do it with a LazyCharacterSet,
> anything else is unfeasible (we don't even know the upper limit of the set
> of characters...).
>
> Introduce an AbstractCharacterSet superclass of all the CharacterSet
> family in order to begin factoring some behavior.
>
> TODO: we should better rename
> CharacterSet -> ByteCharacterSet
> AbstractCharacterSet -> CharacterSet.
>
> We delay this quite technical operation, because we don't want to break
> existing instances, AND we want to redirect (Byte)CharacterSet references
> to (Abstract)CharacterSet, the Abstract one becoming a factory.
>
> =============== Diff against Collections-nice.770 ===============
>
> Item was added:
> + Collection subclass: #AbstractCharacterSet
> +       instanceVariableNames: ''
> +       classVariableNames: ''
> +       poolDictionaries: ''
> +       category: 'Collections-Support'!
>
> Item was added:
> + ----- Method: AbstractCharacterSet>>byteArrayMap (in category
> 'accessing') -----
> + byteArrayMap
> +       ^self subclassReponsibility!
>
> Item was added:
> + ----- Method: AbstractCharacterSet>>findFirstInByteString:startingAt:
> (in category 'enumerating') -----
> + findFirstInByteString: aByteString startingAt: startIndex
> +       "Double dispatching: since we know this is a ByteString, we can
> use a superfast primitive using a ByteArray map with 0 slots for byte
> characters not included and 1 for byte characters included in the receiver."
> +       ^ByteString
> +               findFirstInString: aByteString
> +               inSet: self byteArrayMap
> +               startingAt: startIndex!
>
> Item was added:
> + ----- Method: AbstractCharacterSet>>occurrencesOf: (in category
> 'enumerating') -----
> + occurrencesOf: anObject
> +       "Answer how many of the receiver's elements are equal to anObject.
> Optimized version."
> +
> +       (self includes: anObject) ifTrue: [ ^1 ].
> +       ^0!
>
> Item was added:
> + ----- Method: AbstractCharacterSet>>removeAll (in category 'removing')
> -----
> + removeAll
> +       self becomeForward: CharacterSet new!
>
> Item was changed:
> + AbstractCharacterSet subclass: #CharacterSet
> - Collection subclass: #CharacterSet
>         instanceVariableNames: 'map tally'
>         classVariableNames: 'CrLf NonSeparators Separators'
>         poolDictionaries: ''
>         category: 'Collections-Support'!
>
>   !CharacterSet commentStamp: '<historical>' prior: 0!
>   A set of characters.  Lookups for inclusion are very fast.!
>
> Item was changed:
> + ----- Method: CharacterSet>>= (in category 'comparing') -----
> - ----- Method: CharacterSet>>= (in category 'comparison') -----
>   = anObject
>
>         self species == anObject species ifFalse: [ ^false ].
>         anObject size = tally ifFalse: [ ^false ].
>         ^self byteArrayMap = anObject byteArrayMap!
>
> Item was changed:
> + ----- Method: CharacterSet>>add: (in category 'adding') -----
> - ----- Method: CharacterSet>>add: (in category 'collection ops') -----
>   add: aCharacter
>         "I automatically become a WideCharacterSet if you add a wide
> character to myself"
>
>         | index |
>         (index := aCharacter asInteger + 1) <= 256 ifFalse: [
>                 | wide |
>                 wide := WideCharacterSet new.
>                 wide addAll: self.
>                 wide add: aCharacter.
>                 self becomeForward: wide.
>                 ^aCharacter ].
>         (map at: index) = 1 ifFalse: [
>                 map at: index put: 1.
>                 tally := tally + 1 ].
>         ^aCharacter!
>
> Item was changed:
> + ----- Method: CharacterSet>>do: (in category 'enumerating') -----
> - ----- Method: CharacterSet>>do: (in category 'collection ops') -----
>   do: aBlock
>         "evaluate aBlock with each character in the set"
>
>         | index |
>         tally >= 128 ifTrue: [ "dense"
>                 index := 0.
>                 [ (index := index + 1) <= 256 ] whileTrue: [
>                         (map at: index) = 1 ifTrue: [
>                                 aBlock value: (Character value: index - 1)
> ] ].
>                 ^self ].
>         "sparse"
>         index := 0.
>         [ (index := map indexOf: 1 startingAt: index + 1) = 0 ]
> whileFalse: [
>                 aBlock value: (Character value: index - 1) ].
>         !
>
> Item was changed:
> + ----- Method: CharacterSet>>findFirstInByteString:startingAt: (in
> category 'enumerating') -----
> - ----- Method: CharacterSet>>findFirstInByteString:startingAt: (in
> category 'collection ops') -----
>   findFirstInByteString: aByteString startingAt: startIndex
>         "Double dispatching: since we know this is a ByteString, we can
> use a superfast primitive using a ByteArray map with 0 slots for byte
> characters not included and 1 for byte characters included in the receiver."
>         ^ByteString
>                 findFirstInString: aByteString
>                 inSet: self byteArrayMap
>                 startingAt: startIndex!
>
> Item was changed:
> + ----- Method: CharacterSet>>hash (in category 'comparing') -----
> - ----- Method: CharacterSet>>hash (in category 'comparison') -----
>   hash
>         ^self byteArrayMap hash!
>
> Item was changed:
> + ----- Method: CharacterSet>>includes: (in category 'testing') -----
> - ----- Method: CharacterSet>>includes: (in category 'collection ops')
> -----
>   includes: anObject
>
>         | index |
>         anObject isCharacter ifFalse: [ ^false ].
>         (index := anObject asInteger + 1) > 256 ifTrue: [ ^false ].
>         ^(map at: index) > 0!
>
> Item was removed:
> - ----- Method: CharacterSet>>occurrencesOf: (in category 'enumerating')
> -----
> - occurrencesOf: anObject
> -       "Answer how many of the receiver's elements are equal to anObject.
> Optimized version."
> -
> -       (self includes: anObject) ifTrue: [ ^1 ].
> -       ^0!
>
> Item was changed:
> + ----- Method: CharacterSet>>remove: (in category 'removing') -----
> - ----- Method: CharacterSet>>remove: (in category 'collection ops') -----
>   remove: aCharacter
>
>         ^self remove: aCharacter ifAbsent: aCharacter!
>
> Item was changed:
> + ----- Method: CharacterSet>>remove:ifAbsent: (in category 'removing')
> -----
> - ----- Method: CharacterSet>>remove:ifAbsent: (in category 'collection
> ops') -----
>   remove: aCharacter ifAbsent: aBlock
>
>         | index |
>         (index := aCharacter asciiValue + 1) <= 256 ifFalse: [ ^aBlock
> value ].
>         (map at: index) = 0 ifTrue: [ ^aBlock value ].
>         map at: index put: 0.
>         tally := tally - 1.
>         ^aCharacter!
>
> Item was changed:
> + ----- Method: CharacterSet>>size (in category 'accessing') -----
> - ----- Method: CharacterSet>>size (in category 'collection ops') -----
>   size
>
>         ^tally!
>
> Item was changed:
> + ----- Method: CharacterSet>>species (in category 'comparing') -----
> - ----- Method: CharacterSet>>species (in category 'comparison') -----
>   species
>         ^CharacterSet!
>
> Item was changed:
> + AbstractCharacterSet subclass: #CharacterSetComplement
> - Collection subclass: #CharacterSetComplement
>         instanceVariableNames: 'absent byteArrayMapCache'
>         classVariableNames: ''
>         poolDictionaries: ''
>         category: 'Collections-Support'!
>
>   !CharacterSetComplement commentStamp: 'nice 8/31/2008 14:53' prior: 0!
>   CharacterSetComplement is a space efficient implementation of
> (CharacterSet complement) taking care of WideCharacter (code > 255)
>
>   However, it will maintain a byteArrayMap for character <= 255 in a cache
> keeping
>
>   instance variables:
>         absent <CharacterSet> contains character that are not in the set
> (i.e. my complement)
>         byteArrayMapCache <ByteArray | nil> cache this information because
> it has to be used in tight loops where efficiency matters!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>add: (in category 'adding') -----
> - ----- Method: CharacterSetComplement>>add: (in category 'collection
> ops') -----
>   add: aCharacter
>         "a character is present if not absent, so adding a character is
> removing it from the absent"
>
>         (absent includes: aCharacter)
>                 ifTrue:
>                         [byteArrayMapCache := nil.
>                         absent remove: aCharacter].
>         ^ aCharacter!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>do: (in category 'enumerating')
> -----
> - ----- Method: CharacterSetComplement>>do: (in category 'collection ops')
> -----
>   do: aBlock
>         "evaluate aBlock with each character in the set.
>         don't do it, there are too many..."
>
>         self shouldNotImplement!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>findFirstInByteString:startingAt:
> (in category 'enumerating') -----
> - ----- Method: CharacterSetComplement>>findFirstInByteString:startingAt:
> (in category 'collection ops') -----
>   findFirstInByteString: aByteString startingAt: startIndex
>         "Double dispatching: since we know this is a ByteString, we can
> use a superfast primitive using a ByteArray map with 0 slots for byte
> characters not included and 1 for byte characters included in the receiver."
>         ^ByteString
>                 findFirstInString: aByteString
>                 inSet: self byteArrayMap
>                 startingAt: startIndex!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>includes: (in category 'testing')
> -----
> - ----- Method: CharacterSetComplement>>includes: (in category
> 'collection ops') -----
>   includes: anObject
>
>         anObject isCharacter ifFalse: [ ^false ].
>         (absent includes: anObject) ifTrue: [ ^false ].
>         ^true!
>
> Item was removed:
> - ----- Method: CharacterSetComplement>>occurrencesOf: (in category
> 'enumerating') -----
> - occurrencesOf: anObject
> -       "Answer how many of the receiver's elements are equal to anObject.
> Optimized version."
> -
> -       (self includes: anObject) ifTrue: [ ^1 ].
> -       ^0!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>reject: (in category
> 'enumerating') -----
> - ----- Method: CharacterSetComplement>>reject: (in category 'collection
> ops') -----
>   reject: aBlock
> +       ^LazyCharacterSet including: [:c | (absent includes: c) not and:
> [(aBlock value: c) not]]!
> -       "Implementation note: rejecting present is selecting absent"
> -
> -       ^(absent select: aBlock) complement!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>remove: (in category 'removing')
> -----
> - ----- Method: CharacterSetComplement>>remove: (in category 'collection
> ops') -----
>   remove: aCharacter
>         "This means aCharacter is now absent from myself.
>         It must be added to my absent."
>
>         byteArrayMapCache := nil.
>         ^absent add: aCharacter!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>remove:ifAbsent: (in category
> 'removing') -----
> - ----- Method: CharacterSetComplement>>remove:ifAbsent: (in category
> 'collection ops') -----
>   remove: aCharacter ifAbsent: aBlock
>         (self includes: aCharacter) ifFalse: [^aBlock value].
>         ^self remove: aCharacter!
>
> Item was removed:
> - ----- Method: CharacterSetComplement>>removeAll (in category
> 'collection ops') -----
> - removeAll
> -
> -       self becomeForward: CharacterSet new!
>
> Item was changed:
> + ----- Method: CharacterSetComplement>>select: (in category
> 'enumerating') -----
> - ----- Method: CharacterSetComplement>>select: (in category 'collection
> ops') -----
>   select: aBlock
> +       ^LazyCharacterSet including: [:c | (absent includes: c) not and:
> [aBlock value: c]]!
> -       "Implementation note: selecting present is rejecting absent"
> -
> -       ^(absent reject: aBlock) complement!
>
> Item was removed:
> - ----- Method: CharacterSetComplement>>size (in category 'collection
> ops') -----
> - size
> -       "Is this 2**32-absent size ?"
> -
> -       ^self shouldNotImplement!
>
> Item was added:
> + ----- Method: Interval>>copyFrom:to: (in category 'copying') -----
> + copyFrom: startIndex to: stopIndex
> +       stopIndex < startIndex ifTrue: [^self copyEmpty].
> +       ^(self at: startIndex) to: (self at: stopIndex) by: step!
>
>
Arghh, sorry, the Interval>>copyFrom:to: was a totally unrelated experiment

I think it is a good change, but it might break some tests and should have
better gone in its own change set/commit.

By now I let it there, but if any one bark, I will revert.

Item was added:
> + AbstractCharacterSet subclass: #LazyCharacterSet
> +       instanceVariableNames: 'block byteArrayMapCache'
> +       classVariableNames: ''
> +       poolDictionaries: ''
> +       category: 'Collections-Support'!
> +
> + !LazyCharacterSet commentStamp: 'nice 11/30/2017 21:40' prior: 0!
> + A LazyCharacterSet is a kind of CharacterSet which does not know in
> advance which Character it contains or not.
> + If will lazily evaluate a block on demand if ever one ask whether it
> includes: a character.
> + It is not feasible to enumerate a LazyCharacterSet, because there are
> way too many characters.
> +
> + Instance Variables
> +       block:          <BlockContext | Symbol>
> +       byteArrayMapCache:              <ByteArray | nil>
> +
> + block
> +       - a valuable, answering either true or false when sent the message
> value: - true means that this set includes the character passed as value:
> argument.
> +
> + byteArrayMapCache
> +       - a cache holding 0 or 1 for the first 256 character codes - 0
> meaning not included, 1 included. This is used in some priitives
> + !
>
> Item was added:
> + ----- Method: LazyCharacterSet class>>including: (in category 'instance
> creation') -----
> + including: aBlock
> +       "Create the set of Character for which aBlock evaluates to true"
> +       ^self class new block: aBlock!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>add: (in category 'adding') -----
> + add: aCharacter
> +       self block: [:c | c = aCharacter or: [block value: c]].
> +       ^aCharacter!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>addAll: (in category 'adding') -----
> + addAll: aCollection
> +       self block: [:c | (aCollection includes: c) or: [block value: c]].
> +       ^aCollection!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>block (in category 'accessing') -----
> + block
> +       ^block!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>block: (in category 'accessing') -----
> + block: aValuable
> +       "Set the block used to determine if I include a Character or not.
> +       aValuable is an object that shoud answer true or false when sent
> value:"
> +
> +       byteArrayMapCache := nil.
> +       ^block := aValuable!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>byteArrayMap (in category 'accessing')
> -----
> + byteArrayMap
> +       "return a ByteArray mapping each ascii value to a 1 if that ascii
> value is in the set, and a 0 if it isn't.  Intended for use by primitives
> only"
> +
> +       ^byteArrayMapCache ifNil: [byteArrayMapCache := (0 to: 255)
> collect: [:i | self includes: (Character value: i)]]!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>complement (in category 'converting')
> -----
> + complement
> +       ^self class including: [:char | (block value: char) not]!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>do: (in category 'enumerating') -----
> + do: aBlock
> +       "evaluate aBlock with each character in the set.
> +       don't do it, there are too many loop..."
> +
> +       self shouldNotImplement!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>includes: (in category 'testing') -----
> + includes: aCharacter
> +       ^block value: aCharacter!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>reject: (in category 'enumerating') -----
> + reject: aBlock
> +       ^self class including: [:char | (aBlock value: char) not and:
> [block value: char]]!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>remove: (in category 'removing') -----
> + remove: aCharacter
> +       self block: [:c | (c = aCharacter) not and: [block value: c]].
> +       ^aCharacter!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>remove:ifAbsent: (in category
> 'removing') -----
> + remove: aCharacter ifAbsent: aBlock
> +       (self includes: aCharacter) ifFalse: [^aBlock value].
> +       ^self remove: aCharacter!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>removeAll: (in category 'removing') -----
> + removeAll: aCollection
> +       self block: [:c | (aCollection include: c) not and: [block value:
> c]].
> +       ^aCollection!
>
> Item was added:
> + ----- Method: LazyCharacterSet>>select: (in category 'enumerating') -----
> + select: aBlock
> +       ^self class including: [:char | (block value: char) and: [aBlock
> value: char]]!
>
> Item was changed:
> + AbstractCharacterSet subclass: #WideCharacterSet
> - Collection subclass: #WideCharacterSet
>         instanceVariableNames: 'map byteArrayMap bitsetCapacity
> highBitsShift lowBitsMask'
>         classVariableNames: ''
>         poolDictionaries: ''
>         category: 'Collections-Support'!
>
>   !WideCharacterSet commentStamp: 'nice 12/10/2009 19:17' prior: 0!
>   WideCharacterSet is used to store a Set of WideCharacter with fast
> access and inclusion test.
>
>   Implementation should be efficient in memory if sets are sufficently
> sparse.
>
>   Wide Characters are at most 32bits.
>   We split them into 16 highBits and 16 lowBits.
>
>   map is a dictionary key: 16 highBits value: map of 16 lowBits.
>
>   Maps of lowBits  are stored as arrays of bits in a ByteArray.
>   If a bit is set to 1, this indicate that corresponding character is
> present.
>   8192 bytes are necessary in each lowmap.
>   Empty lowmap are removed from the map Dictionary.
>
>   A byteArrayMap is maintained in parallel with map for fast handling of
> ByteString.
>   (byteArrayMap at: i+1) = 0 means that character of asciiValue i is
> absent, = 1 means present.!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>add: (in category 'adding') -----
> - ----- Method: WideCharacterSet>>add: (in category 'collection ops') -----
>   add: aCharacter
>
>         | value highBits lowBits |
>         (value := aCharacter asInteger) < 256 ifTrue: [
>                 byteArrayMap at: value + 1 put: 1 ].
>         highBits := value bitShift: highBitsShift.
>         lowBits := value bitAnd: lowBitsMask.
>         (map at: highBits ifAbsentPut: [ Bitset new: bitsetCapacity ])
>                 setBitAt: lowBits.
>         ^aCharacter!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>do: (in category 'enumerating') -----
> - ----- Method: WideCharacterSet>>do: (in category 'collection ops') -----
>   do: aBlock
>
>         map keysAndValuesDo: [ :index :bitset |
>                 | highBits |
>                 highBits := index * bitsetCapacity.
>                 bitset do: [ :lowBits |
>                         aBlock value: (Character value: highBits +
> lowBits) ] ]!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>findFirstInByteString:startingAt: (in
> category 'enumerating') -----
> - ----- Method: WideCharacterSet>>findFirstInByteString:startingAt: (in
> category 'collection ops') -----
>   findFirstInByteString: aByteString startingAt: startIndex
>         "Double dispatching: since we know this is a ByteString, we can
> use a superfast primitive using a ByteArray map with 0 slots for byte
> characters not included and 1 for byte characters included in the receiver."
>
>         ^ByteString
>                 findFirstInString: aByteString
>                 inSet: byteArrayMap
>                 startingAt: startIndex!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>includes: (in category 'testing') -----
> - ----- Method: WideCharacterSet>>includes: (in category 'collection ops')
> -----
>   includes: anObject
>
>         | value |
>         anObject isCharacter ifFalse: [ ^false ].
>         (value := anObject asInteger) < 256 ifTrue: [
>                 ^(byteArrayMap at: value + 1) ~= 0 ].
>         ^((map at: (value bitShift: highBitsShift) ifAbsent: nil) ifNil: [
> ^false ])
>                 includes: (value bitAnd: lowBitsMask)!
>
> Item was removed:
> - ----- Method: WideCharacterSet>>occurrencesOf: (in category
> 'enumerating') -----
> - occurrencesOf: anObject
> -       "Answer how many of the receiver's elements are equal to anObject.
> Optimized version."
> -
> -       (self includes: anObject) ifTrue: [ ^1 ].
> -       ^0!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>remove: (in category 'removing') -----
> - ----- Method: WideCharacterSet>>remove: (in category 'collection ops')
> -----
>   remove: aCharacter
>         "Don't signal an error when aCharacter is not present."
>
>         ^self remove: aCharacter ifAbsent: aCharacter!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>remove:ifAbsent: (in category
> 'removing') -----
> - ----- Method: WideCharacterSet>>remove:ifAbsent: (in category
> 'collection ops') -----
>   remove: aCharacter ifAbsent: aBlock
>
>         | value highBits lowBits bitset |
>         (value := aCharacter asInteger) < 256 ifTrue: [
>                 (byteArrayMap at: value + 1) = 0 ifTrue: [ ^aBlock value ].
>                 byteArrayMap at: value + 1 put: 0 ].
>         highBits := value bitShift: highBitsShift.
>         lowBits := value bitAnd: lowBitsMask.
>         bitset := (map at: highBits ifAbsent: nil) ifNil: [ ^aBlock value
> ].
>         ((bitset clearBitAt: lowBits) and: [ bitset size = 0 ]) ifTrue: [
>                 map removeKey: highBits ].
>         ^aCharacter!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>removeAll (in category 'removing') -----
> - ----- Method: WideCharacterSet>>removeAll (in category 'collection ops')
> -----
>   removeAll
>
>         map isEmpty ifTrue: [ ^self ].
>         map removeAll.
>         byteArrayMap atAllPut: 0!
>
> Item was changed:
> + ----- Method: WideCharacterSet>>size (in category 'accessing') -----
> - ----- Method: WideCharacterSet>>size (in category 'collection ops') -----
>   size
>
>         ^map detectSum: [ :each | each size ]!
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20171130/e326641d/attachment-0001.html>


More information about the Squeak-dev mailing list