[squeak-dev] The Trunk: Collections-nice.771.mcz

commits at source.squeak.org commits at source.squeak.org
Thu Nov 30 20:57:47 UTC 2017


Nicolas Cellier uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-nice.771.mcz

==================== Summary ====================

Name: Collections-nice.771
Author: nice
Time: 30 November 2017, 9:57:27.873964 pm
UUID: d8b64711-6119-429b-b3f5-259b46ef864b
Ancestors: Collections-nice.770

Fix awfully broken CharacterSetComplement select:/reject:

If we want to select:/reject:, we must not only enumerate the absent characters, but rather all the characters in the complement.

That's way two many, thus we prefer to do it with a LazyCharacterSet, anything else is unfeasible (we don't even know the upper limit of the set of characters...).

Introduce an AbstractCharacterSet superclass of all the CharacterSet family in order to begin factoring some behavior.

TODO: we should better rename
CharacterSet -> ByteCharacterSet
AbstractCharacterSet -> CharacterSet.

We delay this quite technical operation, because we don't want to break existing instances, AND we want to redirect (Byte)CharacterSet references to (Abstract)CharacterSet, the Abstract one becoming a factory.

=============== Diff against Collections-nice.770 ===============

Item was added:
+ Collection subclass: #AbstractCharacterSet
+ 	instanceVariableNames: ''
+ 	classVariableNames: ''
+ 	poolDictionaries: ''
+ 	category: 'Collections-Support'!

Item was added:
+ ----- Method: AbstractCharacterSet>>byteArrayMap (in category 'accessing') -----
+ byteArrayMap
+ 	^self subclassReponsibility!

Item was added:
+ ----- Method: AbstractCharacterSet>>findFirstInByteString:startingAt: (in category 'enumerating') -----
+ findFirstInByteString: aByteString startingAt: startIndex
+ 	"Double dispatching: since we know this is a ByteString, we can use a superfast primitive using a ByteArray map with 0 slots for byte characters not included and 1 for byte characters included in the receiver."
+ 	^ByteString
+ 		findFirstInString: aByteString
+ 		inSet: self byteArrayMap
+ 		startingAt: startIndex!

Item was added:
+ ----- Method: AbstractCharacterSet>>occurrencesOf: (in category 'enumerating') -----
+ occurrencesOf: anObject
+ 	"Answer how many of the receiver's elements are equal to anObject. Optimized version."
+ 
+ 	(self includes: anObject) ifTrue: [ ^1 ].
+ 	^0!

Item was added:
+ ----- Method: AbstractCharacterSet>>removeAll (in category 'removing') -----
+ removeAll
+ 	self becomeForward: CharacterSet new!

Item was changed:
+ AbstractCharacterSet subclass: #CharacterSet
- Collection subclass: #CharacterSet
  	instanceVariableNames: 'map tally'
  	classVariableNames: 'CrLf NonSeparators Separators'
  	poolDictionaries: ''
  	category: 'Collections-Support'!
  
  !CharacterSet commentStamp: '<historical>' prior: 0!
  A set of characters.  Lookups for inclusion are very fast.!

Item was changed:
+ ----- Method: CharacterSet>>= (in category 'comparing') -----
- ----- Method: CharacterSet>>= (in category 'comparison') -----
  = anObject
  	
  	self species == anObject species ifFalse: [ ^false ].
  	anObject size = tally ifFalse: [ ^false ].
  	^self byteArrayMap = anObject byteArrayMap!

Item was changed:
+ ----- Method: CharacterSet>>add: (in category 'adding') -----
- ----- Method: CharacterSet>>add: (in category 'collection ops') -----
  add: aCharacter
  	"I automatically become a WideCharacterSet if you add a wide character to myself"
  	
  	| index |
  	(index := aCharacter asInteger + 1) <= 256 ifFalse: [
  		| wide |
  		wide := WideCharacterSet new.
  		wide addAll: self.
  		wide add: aCharacter.
  		self becomeForward: wide.
  		^aCharacter ].
  	(map at: index) = 1 ifFalse: [
  		map at: index put: 1.
  		tally := tally + 1 ].
  	^aCharacter!

Item was changed:
+ ----- Method: CharacterSet>>do: (in category 'enumerating') -----
- ----- Method: CharacterSet>>do: (in category 'collection ops') -----
  do: aBlock
  	"evaluate aBlock with each character in the set"
  
  	| index |
  	tally >= 128 ifTrue: [ "dense"
  		index := 0.
  		[ (index := index + 1) <= 256 ] whileTrue: [
  			(map at: index) = 1 ifTrue: [
  				aBlock value: (Character value: index - 1) ] ].
  		^self ].
  	"sparse"
  	index := 0.
  	[ (index := map indexOf: 1 startingAt: index + 1) = 0 ] whileFalse: [
  		aBlock value: (Character value: index - 1) ].
  	!

Item was changed:
+ ----- Method: CharacterSet>>findFirstInByteString:startingAt: (in category 'enumerating') -----
- ----- Method: CharacterSet>>findFirstInByteString:startingAt: (in category 'collection ops') -----
  findFirstInByteString: aByteString startingAt: startIndex
  	"Double dispatching: since we know this is a ByteString, we can use a superfast primitive using a ByteArray map with 0 slots for byte characters not included and 1 for byte characters included in the receiver."
  	^ByteString
  		findFirstInString: aByteString
  		inSet: self byteArrayMap
  		startingAt: startIndex!

Item was changed:
+ ----- Method: CharacterSet>>hash (in category 'comparing') -----
- ----- Method: CharacterSet>>hash (in category 'comparison') -----
  hash
  	^self byteArrayMap hash!

Item was changed:
+ ----- Method: CharacterSet>>includes: (in category 'testing') -----
- ----- Method: CharacterSet>>includes: (in category 'collection ops') -----
  includes: anObject
  
  	| index |
  	anObject isCharacter ifFalse: [ ^false ].
  	(index := anObject asInteger + 1) > 256 ifTrue: [ ^false ].
  	^(map at: index) > 0!

Item was removed:
- ----- Method: CharacterSet>>occurrencesOf: (in category 'enumerating') -----
- occurrencesOf: anObject
- 	"Answer how many of the receiver's elements are equal to anObject. Optimized version."
- 
- 	(self includes: anObject) ifTrue: [ ^1 ].
- 	^0!

Item was changed:
+ ----- Method: CharacterSet>>remove: (in category 'removing') -----
- ----- Method: CharacterSet>>remove: (in category 'collection ops') -----
  remove: aCharacter
  
  	^self remove: aCharacter ifAbsent: aCharacter!

Item was changed:
+ ----- Method: CharacterSet>>remove:ifAbsent: (in category 'removing') -----
- ----- Method: CharacterSet>>remove:ifAbsent: (in category 'collection ops') -----
  remove: aCharacter ifAbsent: aBlock
  
  	| index |
  	(index := aCharacter asciiValue + 1) <= 256 ifFalse: [ ^aBlock value ].
  	(map at: index) = 0 ifTrue: [ ^aBlock value ].
  	map at: index put: 0.
  	tally := tally - 1.
  	^aCharacter!

Item was changed:
+ ----- Method: CharacterSet>>size (in category 'accessing') -----
- ----- Method: CharacterSet>>size (in category 'collection ops') -----
  size
  
  	^tally!

Item was changed:
+ ----- Method: CharacterSet>>species (in category 'comparing') -----
- ----- Method: CharacterSet>>species (in category 'comparison') -----
  species
  	^CharacterSet!

Item was changed:
+ AbstractCharacterSet subclass: #CharacterSetComplement
- Collection subclass: #CharacterSetComplement
  	instanceVariableNames: 'absent byteArrayMapCache'
  	classVariableNames: ''
  	poolDictionaries: ''
  	category: 'Collections-Support'!
  
  !CharacterSetComplement commentStamp: 'nice 8/31/2008 14:53' prior: 0!
  CharacterSetComplement is a space efficient implementation of (CharacterSet complement) taking care of WideCharacter (code > 255)
  
  However, it will maintain a byteArrayMap for character <= 255 in a cache keeping 
  
  instance variables:
  	absent <CharacterSet> contains character that are not in the set (i.e. my complement)
  	byteArrayMapCache <ByteArray | nil> cache this information because it has to be used in tight loops where efficiency matters!

Item was changed:
+ ----- Method: CharacterSetComplement>>add: (in category 'adding') -----
- ----- Method: CharacterSetComplement>>add: (in category 'collection ops') -----
  add: aCharacter 
  	"a character is present if not absent, so adding a character is removing it from the absent"
  	
  	(absent includes: aCharacter)
  		ifTrue:
  			[byteArrayMapCache := nil.
  			absent remove: aCharacter].
  	^ aCharacter!

Item was changed:
+ ----- Method: CharacterSetComplement>>do: (in category 'enumerating') -----
- ----- Method: CharacterSetComplement>>do: (in category 'collection ops') -----
  do: aBlock
  	"evaluate aBlock with each character in the set.
  	don't do it, there are too many..."
  
  	self shouldNotImplement!

Item was changed:
+ ----- Method: CharacterSetComplement>>findFirstInByteString:startingAt: (in category 'enumerating') -----
- ----- Method: CharacterSetComplement>>findFirstInByteString:startingAt: (in category 'collection ops') -----
  findFirstInByteString: aByteString startingAt: startIndex
  	"Double dispatching: since we know this is a ByteString, we can use a superfast primitive using a ByteArray map with 0 slots for byte characters not included and 1 for byte characters included in the receiver."
  	^ByteString
  		findFirstInString: aByteString
  		inSet: self byteArrayMap
  		startingAt: startIndex!

Item was changed:
+ ----- Method: CharacterSetComplement>>includes: (in category 'testing') -----
- ----- Method: CharacterSetComplement>>includes: (in category 'collection ops') -----
  includes: anObject
  
  	anObject isCharacter ifFalse: [ ^false ].
  	(absent includes: anObject) ifTrue: [ ^false ].
  	^true!

Item was removed:
- ----- Method: CharacterSetComplement>>occurrencesOf: (in category 'enumerating') -----
- occurrencesOf: anObject
- 	"Answer how many of the receiver's elements are equal to anObject. Optimized version."
- 
- 	(self includes: anObject) ifTrue: [ ^1 ].
- 	^0!

Item was changed:
+ ----- Method: CharacterSetComplement>>reject: (in category 'enumerating') -----
- ----- Method: CharacterSetComplement>>reject: (in category 'collection ops') -----
  reject: aBlock
+ 	^LazyCharacterSet including: [:c | (absent includes: c) not and: [(aBlock value: c) not]]!
- 	"Implementation note: rejecting present is selecting absent"
- 	
- 	^(absent select: aBlock) complement!

Item was changed:
+ ----- Method: CharacterSetComplement>>remove: (in category 'removing') -----
- ----- Method: CharacterSetComplement>>remove: (in category 'collection ops') -----
  remove: aCharacter
  	"This means aCharacter is now absent from myself.
  	It must be added to my absent."
  	
  	byteArrayMapCache := nil.
  	^absent add: aCharacter!

Item was changed:
+ ----- Method: CharacterSetComplement>>remove:ifAbsent: (in category 'removing') -----
- ----- Method: CharacterSetComplement>>remove:ifAbsent: (in category 'collection ops') -----
  remove: aCharacter ifAbsent: aBlock
  	(self includes: aCharacter) ifFalse: [^aBlock value].
  	^self remove: aCharacter!

Item was removed:
- ----- Method: CharacterSetComplement>>removeAll (in category 'collection ops') -----
- removeAll
- 
- 	self becomeForward: CharacterSet new!

Item was changed:
+ ----- Method: CharacterSetComplement>>select: (in category 'enumerating') -----
- ----- Method: CharacterSetComplement>>select: (in category 'collection ops') -----
  select: aBlock
+ 	^LazyCharacterSet including: [:c | (absent includes: c) not and: [aBlock value: c]]!
- 	"Implementation note: selecting present is rejecting absent"
- 	
- 	^(absent reject: aBlock) complement!

Item was removed:
- ----- Method: CharacterSetComplement>>size (in category 'collection ops') -----
- size
- 	"Is this 2**32-absent size ?"
- 	
- 	^self shouldNotImplement!

Item was added:
+ ----- Method: Interval>>copyFrom:to: (in category 'copying') -----
+ copyFrom: startIndex to: stopIndex
+ 	stopIndex < startIndex ifTrue: [^self copyEmpty].
+ 	^(self at: startIndex) to: (self at: stopIndex) by: step!

Item was added:
+ AbstractCharacterSet subclass: #LazyCharacterSet
+ 	instanceVariableNames: 'block byteArrayMapCache'
+ 	classVariableNames: ''
+ 	poolDictionaries: ''
+ 	category: 'Collections-Support'!
+ 
+ !LazyCharacterSet commentStamp: 'nice 11/30/2017 21:40' prior: 0!
+ A LazyCharacterSet is a kind of CharacterSet which does not know in advance which Character it contains or not.
+ If will lazily evaluate a block on demand if ever one ask whether it includes: a character.
+ It is not feasible to enumerate a LazyCharacterSet, because there are way too many characters.
+ 
+ Instance Variables
+ 	block:		<BlockContext | Symbol>
+ 	byteArrayMapCache:		<ByteArray | nil>
+ 
+ block
+ 	- a valuable, answering either true or false when sent the message value: - true means that this set includes the character passed as value: argument.
+ 
+ byteArrayMapCache
+ 	- a cache holding 0 or 1 for the first 256 character codes - 0 meaning not included, 1 included. This is used in some priitives
+ !

Item was added:
+ ----- Method: LazyCharacterSet class>>including: (in category 'instance creation') -----
+ including: aBlock
+ 	"Create the set of Character for which aBlock evaluates to true"
+ 	^self class new block: aBlock!

Item was added:
+ ----- Method: LazyCharacterSet>>add: (in category 'adding') -----
+ add: aCharacter
+ 	self block: [:c | c = aCharacter or: [block value: c]].
+ 	^aCharacter!

Item was added:
+ ----- Method: LazyCharacterSet>>addAll: (in category 'adding') -----
+ addAll: aCollection
+ 	self block: [:c | (aCollection includes: c) or: [block value: c]].
+ 	^aCollection!

Item was added:
+ ----- Method: LazyCharacterSet>>block (in category 'accessing') -----
+ block
+ 	^block!

Item was added:
+ ----- Method: LazyCharacterSet>>block: (in category 'accessing') -----
+ block: aValuable
+ 	"Set the block used to determine if I include a Character or not.
+ 	aValuable is an object that shoud answer true or false when sent value:"
+ 	
+ 	byteArrayMapCache := nil.
+ 	^block := aValuable!

Item was added:
+ ----- Method: LazyCharacterSet>>byteArrayMap (in category 'accessing') -----
+ byteArrayMap
+ 	"return a ByteArray mapping each ascii value to a 1 if that ascii value is in the set, and a 0 if it isn't.  Intended for use by primitives only"
+ 
+ 	^byteArrayMapCache ifNil: [byteArrayMapCache := (0 to: 255) collect: [:i | self includes: (Character value: i)]]!

Item was added:
+ ----- Method: LazyCharacterSet>>complement (in category 'converting') -----
+ complement
+ 	^self class including: [:char | (block value: char) not]!

Item was added:
+ ----- Method: LazyCharacterSet>>do: (in category 'enumerating') -----
+ do: aBlock
+ 	"evaluate aBlock with each character in the set.
+ 	don't do it, there are too many loop..."
+ 
+ 	self shouldNotImplement!

Item was added:
+ ----- Method: LazyCharacterSet>>includes: (in category 'testing') -----
+ includes: aCharacter
+ 	^block value: aCharacter!

Item was added:
+ ----- Method: LazyCharacterSet>>reject: (in category 'enumerating') -----
+ reject: aBlock
+ 	^self class including: [:char | (aBlock value: char) not and: [block value: char]]!

Item was added:
+ ----- Method: LazyCharacterSet>>remove: (in category 'removing') -----
+ remove: aCharacter
+ 	self block: [:c | (c = aCharacter) not and: [block value: c]].
+ 	^aCharacter!

Item was added:
+ ----- Method: LazyCharacterSet>>remove:ifAbsent: (in category 'removing') -----
+ remove: aCharacter ifAbsent: aBlock
+ 	(self includes: aCharacter) ifFalse: [^aBlock value].
+ 	^self remove: aCharacter!

Item was added:
+ ----- Method: LazyCharacterSet>>removeAll: (in category 'removing') -----
+ removeAll: aCollection
+ 	self block: [:c | (aCollection include: c) not and: [block value: c]].
+ 	^aCollection!

Item was added:
+ ----- Method: LazyCharacterSet>>select: (in category 'enumerating') -----
+ select: aBlock
+ 	^self class including: [:char | (block value: char) and: [aBlock value: char]]!

Item was changed:
+ AbstractCharacterSet subclass: #WideCharacterSet
- Collection subclass: #WideCharacterSet
  	instanceVariableNames: 'map byteArrayMap bitsetCapacity highBitsShift lowBitsMask'
  	classVariableNames: ''
  	poolDictionaries: ''
  	category: 'Collections-Support'!
  
  !WideCharacterSet commentStamp: 'nice 12/10/2009 19:17' prior: 0!
  WideCharacterSet is used to store a Set of WideCharacter with fast access and inclusion test.
  
  Implementation should be efficient in memory if sets are sufficently sparse.
  
  Wide Characters are at most 32bits.
  We split them into 16 highBits and 16 lowBits.
  
  map is a dictionary key: 16 highBits value: map of 16 lowBits.
  
  Maps of lowBits  are stored as arrays of bits in a ByteArray.
  If a bit is set to 1, this indicate that corresponding character is present.
  8192 bytes are necessary in each lowmap.
  Empty lowmap are removed from the map Dictionary.
  
  A byteArrayMap is maintained in parallel with map for fast handling of ByteString.
  (byteArrayMap at: i+1) = 0 means that character of asciiValue i is absent, = 1 means present.!

Item was changed:
+ ----- Method: WideCharacterSet>>add: (in category 'adding') -----
- ----- Method: WideCharacterSet>>add: (in category 'collection ops') -----
  add: aCharacter
  
  	| value highBits lowBits |
  	(value := aCharacter asInteger) < 256 ifTrue: [
  		byteArrayMap at: value + 1 put: 1 ].
  	highBits := value bitShift: highBitsShift.
  	lowBits := value bitAnd: lowBitsMask.
  	(map at: highBits ifAbsentPut: [ Bitset new: bitsetCapacity ])
  		setBitAt: lowBits.
  	^aCharacter!

Item was changed:
+ ----- Method: WideCharacterSet>>do: (in category 'enumerating') -----
- ----- Method: WideCharacterSet>>do: (in category 'collection ops') -----
  do: aBlock
   
  	map keysAndValuesDo: [ :index :bitset |
  		| highBits |
  		highBits := index * bitsetCapacity.
  		bitset do: [ :lowBits |
  			aBlock value: (Character value: highBits + lowBits) ] ]!

Item was changed:
+ ----- Method: WideCharacterSet>>findFirstInByteString:startingAt: (in category 'enumerating') -----
- ----- Method: WideCharacterSet>>findFirstInByteString:startingAt: (in category 'collection ops') -----
  findFirstInByteString: aByteString startingAt: startIndex
  	"Double dispatching: since we know this is a ByteString, we can use a superfast primitive using a ByteArray map with 0 slots for byte characters not included and 1 for byte characters included in the receiver."
  
  	^ByteString
  		findFirstInString: aByteString
  		inSet: byteArrayMap
  		startingAt: startIndex!

Item was changed:
+ ----- Method: WideCharacterSet>>includes: (in category 'testing') -----
- ----- Method: WideCharacterSet>>includes: (in category 'collection ops') -----
  includes: anObject 
  
  	| value |
  	anObject isCharacter ifFalse: [ ^false ].
  	(value := anObject asInteger) < 256 ifTrue: [
  		^(byteArrayMap at: value + 1) ~= 0 ].
  	^((map at: (value bitShift: highBitsShift) ifAbsent: nil) ifNil: [ ^false ])
  		includes: (value bitAnd: lowBitsMask)!

Item was removed:
- ----- Method: WideCharacterSet>>occurrencesOf: (in category 'enumerating') -----
- occurrencesOf: anObject
- 	"Answer how many of the receiver's elements are equal to anObject. Optimized version."
- 
- 	(self includes: anObject) ifTrue: [ ^1 ].
- 	^0!

Item was changed:
+ ----- Method: WideCharacterSet>>remove: (in category 'removing') -----
- ----- Method: WideCharacterSet>>remove: (in category 'collection ops') -----
  remove: aCharacter
  	"Don't signal an error when aCharacter is not present."
  
  	^self remove: aCharacter ifAbsent: aCharacter!

Item was changed:
+ ----- Method: WideCharacterSet>>remove:ifAbsent: (in category 'removing') -----
- ----- Method: WideCharacterSet>>remove:ifAbsent: (in category 'collection ops') -----
  remove: aCharacter ifAbsent: aBlock
  
  	| value highBits lowBits bitset |
  	(value := aCharacter asInteger) < 256 ifTrue: [
  		(byteArrayMap at: value + 1) = 0 ifTrue: [ ^aBlock value ].
  		byteArrayMap at: value + 1 put: 0 ].
  	highBits := value bitShift: highBitsShift.
  	lowBits := value bitAnd: lowBitsMask.
  	bitset := (map at: highBits ifAbsent: nil) ifNil: [ ^aBlock value ].
  	((bitset clearBitAt: lowBits) and: [ bitset size = 0 ]) ifTrue: [
  		map removeKey: highBits ].
  	^aCharacter!

Item was changed:
+ ----- Method: WideCharacterSet>>removeAll (in category 'removing') -----
- ----- Method: WideCharacterSet>>removeAll (in category 'collection ops') -----
  removeAll
  
  	map isEmpty ifTrue: [ ^self ].
  	map removeAll.
  	byteArrayMap atAllPut: 0!

Item was changed:
+ ----- Method: WideCharacterSet>>size (in category 'accessing') -----
- ----- Method: WideCharacterSet>>size (in category 'collection ops') -----
  size
  
  	^map detectSum: [ :each | each size ]!



More information about the Squeak-dev mailing list