[squeak-dev] The Inbox: Collections-ul.871.mcz

commits at source.squeak.org commits at source.squeak.org
Tue Jan 21 02:06:16 UTC 2020


Levente Uzonyi uploaded a new version of Collections to project The Inbox:
http://source.squeak.org/inbox/Collections-ul.871.mcz

==================== Summary ====================

Name: Collections-ul.871
Author: ul
Time: 21 January 2020, 2:57:28.300065 am
UUID: 8f6d668d-19ff-434b-827c-345862901105
Ancestors: Collections-nice.870

HashedCollection changes:
- speed up #goodPrimeAtLeast: when its argument is small
- let #sizeFor: return the smallest possible capacity (3)
- use #sizeFor: instead of #goodPrimeAtLeast: in #compact. This fixes the case when a 4 element dictionary is compacted to 80% capacity (array size = 5), which is higher than the advertised maximum: 75%.
- fixed the formula in #sizeFor:, which made the shortcut for small values unnecessary. Removed the shortcut. The fix often yields smaller dictionaries.
- store good primes in a class variable GoodPrimes for faster access and possible customization
- use blocks instead of symbols in #compactAll and #rehashAll

=============== Diff against Collections-nice.870 ===============

Item was changed:
  Collection subclass: #HashedCollection
  	instanceVariableNames: 'tally array'
+ 	classVariableNames: 'GoodPrimes'
- 	classVariableNames: ''
  	poolDictionaries: ''
  	category: 'Collections-Abstract'!
  
  !HashedCollection commentStamp: 'ul 4/12/2010 22:37' prior: 0!
  I am an abstract collection of objects that implement hash and equality in a consitent way. This means that whenever two objects are equal, their hashes have to be equal too. If two objects are equal then I can only store one of them. Hashes are expected to be integers (preferably SmallIntegers). I also expect that the objects contained by me do not change their hashes. If that happens, hash invariants have to be re-established, which can be done by #rehash.
  
  Since I'm abstract, no instances of me should exist. My subclasses should implement #scanFor:, #fixCollisionsFrom: and #noCheckNoGrowFillFrom:.
  
  Instance Variables
  	array:		<ArrayedCollection> (typically Array or WeakArray)
  	tally:		<Integer> (non-negative)
  
  array
  	- An array whose size is a prime number, it's non-nil elements are the elements of the collection, and whose nil elements are empty slots. There is always at least one nil. In fact I try to keep my "load" at 75% or less so that hashing will work well.
  
  tally
  	- The number of elements in the collection. The array size is always greater than this.
  
  Implementation details:
  I implement a hash table which uses open addressing with linear probing as the method of collision resolution. Searching for an element or a free slot for an element is done by #scanFor: which should return the index of the slot in array corresponding to it's argument. When an element is removed #fixCollisionsFrom: should rehash all elements in array between the original index of the removed element, wrapping around after the last slot until reaching an empty slot. My maximum load factor (75%) is hardcoded in #atNewIndex:put:, so it can only be changed by overriding that method. When my load factor reaches this limit I replace my array with a larger one (see #grow) ensuring that my load factor will be less than or equal to 50%. The new array is filled by #noCheckNoGrowFillFrom: which should use #scanForEmptySlotFor: instead of #scanFor: for better performance. I do not shrink.
  !

Item was changed:
  ----- Method: HashedCollection class>>compactAll (in category 'initialize-release') -----
  compactAll
  	"HashedCollection compactAll"	
  		
+ 	self allSubclassesDo: [ :each | each compactAllInstances ]!
- 	self allSubclassesDo: #compactAllInstances!

Item was changed:
  ----- Method: HashedCollection class>>goodPrimeAtLeast: (in category 'sizing') -----
  goodPrimeAtLeast: lowerLimit
+ 	"Answer the smallest good prime >= lowerlimit.
+ 	If lowerLimit is larger than the largest known good prime, just make it odd.
+ 	Use linear search, and exponential search to speed up cases when lowerLimit is small (<2500 and <100000, respectively).
+ 	Assume that there are goodPrimes greater than 100000."
- 	"Answer the next good prime >= lowerlimit.
- 	If lowerLimit is larger than the largest known good prime,
- 	just make it odd."
  	
+ 	| highIndex midIndex lowIndex prime |
+ 	lowerLimit <= 7 ifTrue: [
+ 		lowerLimit <= 3 ifTrue: [ ^3 ].
+ 		lowerLimit <= 5 ifTrue: [ ^5 ].
+ 		^7 ].
+ 	GoodPrimes ifNil: [ "migration only"
+ 		self initializeGoodPrimes ].
+ 	lowerLimit < 2500 ifTrue: [
+ 		"Use linear search when the limit is small. The boundary is based on measurements."
+ 		highIndex := 4. "skip 3 5 and 7"
+ 		[ (GoodPrimes at: highIndex) < lowerLimit ] whileTrue: [
+ 			highIndex := highIndex + 1 ].
+ 		^GoodPrimes at: highIndex ].
+ 	lowerLimit < 100000 
+ 		ifTrue: [
+ 			"Use exponential search when the limit is not too large. The boundary is based on measurements."
+ 			highIndex := 1.
+ 			[ (GoodPrimes at: highIndex) < lowerLimit ] whileTrue: [
+ 				highIndex := highIndex * 2 ].
+ 			lowIndex := highIndex // 2 + 1. "highIndex // 2 was smaller than lowerLimit" ]
+ 		ifFalse: [
+ 			"Regular binary search."
+ 			lowIndex := 1.
+ 			highIndex := GoodPrimes size.
+ 			"Check whether the largest prime would fit"
+ 			(GoodPrimes at: highIndex) < lowerLimit ifTrue: [
+ 				^lowerLimit bitOr: 1 ]. ].
+ 	[ highIndex - lowIndex <= 1 ] whileFalse: [
+ 		midIndex := highIndex + lowIndex // 2.
+ 		prime := GoodPrimes at: midIndex.
- 	| low mid high prime primes |
- 	primes := self goodPrimes.
- 	high := primes size.
- 	lowerLimit > (primes at: high) ifTrue: [
- 		^lowerLimit bitOr: 1 ].
- 	low := 1.
- 	[ high - low <= 1 ] whileFalse: [
- 		mid := high + low // 2.
- 		prime := primes at: mid.
  		lowerLimit < prime
+ 			ifTrue: [ highIndex := midIndex ]
- 			ifTrue: [ high := mid ]
  			ifFalse: [
  				lowerLimit > prime
+ 					ifTrue: [ lowIndex := midIndex ]
- 					ifTrue: [ low := mid ]
  					ifFalse: [ ^prime ] ] ].
+ 	(GoodPrimes at: lowIndex) >= lowerLimit ifTrue: [ ^GoodPrimes at: lowIndex ].
+ 	^GoodPrimes at: highIndex!
- 	(primes at: low) >= lowerLimit ifTrue: [ ^primes at: low ].
- 	^primes at: high!

Item was changed:
  ----- Method: HashedCollection class>>goodPrimes (in category 'sizing') -----
  goodPrimes
+ 	"Answer a sorted array of prime numbers less than one billion that make good hash table sizes. See #initializeGoodPrimes."
- 	"Answer a sorted array of prime numbers less than one billion that make good
- 	hash table sizes. Should be expanded as needed.  See comments below code"
  	
+ 	^GoodPrimes ifNil: [
+ 		self initializeGoodPrimes.
+ 		GoodPrimes ]!
- 	^#(3 5 7 11 13 17 23 31 43 59 79 107 149 199 269 359 479 641 857 1151 1549 2069
- 		2237 2423 2617 2797 2999 3167 3359 3539 3727 3911
- 		4441 4787 5119 5471 5801 6143 6521 6827 7177 7517 7853
- 		8783 9601 10243 10867 11549 12239 12919 13679 14293 15013 15731
- 		17569 19051 20443 21767 23159 24611 25847 27397 28571 30047 31397
- 		35771 38201 40841 43973 46633 48989 51631 54371 57349 60139 62969
- 		70589 76091 80347 85843 90697 95791 101051 106261 111143 115777 120691 126311
- 		140863 150523 160969 170557 181243 190717 201653 211891 221251 232591 242873 251443
- 		282089 300869 321949 341227 362353 383681 401411 422927 443231 464951 482033 504011
- 		562621 605779 647659 681607 723623 763307 808261 844709 886163 926623 967229 1014617
- 		1121987 1201469 1268789 1345651 1429531 1492177 1577839 1651547 1722601 1800377 1878623 1942141 2028401
- 		2242727 2399581 2559173 2686813 2836357 3005579 3144971 3283993 3460133 3582923 3757093 3903769 4061261
- 		4455361 4783837 5068529 5418079 5680243 6000023 6292981 6611497 6884641 7211599 7514189 7798313 8077189
- 		9031853 9612721 10226107 10745291 11338417 11939203 12567671 13212697 13816333 14337529 14938571 15595673 16147291
- 		17851577 18993941 20180239 21228533 22375079 23450491 24635579 25683871 26850101 27921689 29090911 30153841 31292507 32467307
- 		35817611 37983761 40234253 42457253 44750177 46957969 49175831 51442639 53726417 55954637 58126987 60365939 62666977 64826669
- 		71582779 76039231 80534381 84995153 89500331 93956777 98470819 102879613 107400389 111856841 116365721 120819287 125246581 129732203
- 		143163379 152076289 161031319 169981667 179000669 187913573 196826447 205826729 214748357 223713691 232679021 241591901 250504801 259470131
- 		285162679 301939921 318717121 335494331 352271573 369148753 385926017 402603193 419480419 436157621 453034849 469712051 486589307 503366497 520043707 
- 		570475349 603929813 637584271 671138659 704693081 738247541 771801929 805356457 838910803 872365267 905919671 939574117 973128521 1006682977 1040137411 
- 		1073741833)
- 
- "The above primes past 2069 were chosen carefully so that they do not interact badly with 1664525 (used by hashMultiply), and so that gcd(p, (256^k) +/- a) = 1, for 0<a<=32 and 0<k<=8.  See Knuth's TAOCP for details."
- 
- "The above primes also try to map the values of ((0 to: 4095) collect: [ :each | each << 18 \\ prime ]) sort to an equidistant sequence of numbers. This helps to avoid the collision of chains in identity-based hashed collections. To do that  they were chosen to return a low value when the following block is evaluated with them as argument:
-  [ :prime |
- 	| n slots cost optimalDistance |
- 	n := 1 bitShift: 22.
- 	slots := Array new: n + 1.
- 	0 to: n - 1 do: [ :ea | slots at: ea + 1 put: (ea bitShift: 8) \\ prime ].
- 	slots at: n + 1 put: prime.
- 	slots sort.
- 	cost := 0.
- 	optimalDistance := prime // n.
- 	2 to: n + 1 do: [ :index |
- 		| newCost |
- 		newCost := optimalDistance - ((slots at: index) - (slots at: index - 1)).
- 		newCost > cost ifTrue: [ cost := newCost ] ].
- 	result add: prime -> cost ]
- 
- The shifts in the block relate to the numer of bits the #identityHash consists of (22) and the number of bits #scaledIdentityHash shifts it (8)"!

Item was added:
+ ----- Method: HashedCollection class>>initialize (in category 'sizing') -----
+ initialize
+ 
+ 	self initializeGoodPrimes!

Item was added:
+ ----- Method: HashedCollection class>>initializeGoodPrimes (in category 'sizing') -----
+ initializeGoodPrimes
+ 	"GoodPrimes is a sorted array of prime numbers less than one billion that make good hash table sizes. Should be expanded as needed. See comments below code."
+ 	
+ 	GoodPrimes := #(3 5 7 11 13 17 23 31 43 59 79 107 149 199 269 359 479 641 857 1151 1549 2069
+ 		2237 2423 2617 2797 2999 3167 3359 3539 3727 3911
+ 		4441 4787 5119 5471 5801 6143 6521 6827 7177 7517 7853
+ 		8783 9601 10243 10867 11549 12239 12919 13679 14293 15013 15731
+ 		17569 19051 20443 21767 23159 24611 25847 27397 28571 30047 31397
+ 		35771 38201 40841 43973 46633 48989 51631 54371 57349 60139 62969
+ 		70589 76091 80347 85843 90697 95791 101051 106261 111143 115777 120691 126311
+ 		140863 150523 160969 170557 181243 190717 201653 211891 221251 232591 242873 251443
+ 		282089 300869 321949 341227 362353 383681 401411 422927 443231 464951 482033 504011
+ 		562621 605779 647659 681607 723623 763307 808261 844709 886163 926623 967229 1014617
+ 		1121987 1201469 1268789 1345651 1429531 1492177 1577839 1651547 1722601 1800377 1878623 1942141 2028401
+ 		2242727 2399581 2559173 2686813 2836357 3005579 3144971 3283993 3460133 3582923 3757093 3903769 4061261
+ 		4455361 4783837 5068529 5418079 5680243 6000023 6292981 6611497 6884641 7211599 7514189 7798313 8077189
+ 		9031853 9612721 10226107 10745291 11338417 11939203 12567671 13212697 13816333 14337529 14938571 15595673 16147291
+ 		17851577 18993941 20180239 21228533 22375079 23450491 24635579 25683871 26850101 27921689 29090911 30153841 31292507 32467307
+ 		35817611 37983761 40234253 42457253 44750177 46957969 49175831 51442639 53726417 55954637 58126987 60365939 62666977 64826669
+ 		71582779 76039231 80534381 84995153 89500331 93956777 98470819 102879613 107400389 111856841 116365721 120819287 125246581 129732203
+ 		143163379 152076289 161031319 169981667 179000669 187913573 196826447 205826729 214748357 223713691 232679021 241591901 250504801 259470131
+ 		285162679 301939921 318717121 335494331 352271573 369148753 385926017 402603193 419480419 436157621 453034849 469712051 486589307 503366497 520043707 
+ 		570475349 603929813 637584271 671138659 704693081 738247541 771801929 805356457 838910803 872365267 905919671 939574117 973128521 1006682977 1040137411 
+ 		1073741833)
+ 
+ "The above primes past 2069 were chosen carefully so that they do not interact badly with 1664525 (used by hashMultiply), and so that gcd(p, (256^k) +/- a) = 1, for 0<a<=32 and 0<k<=8.  See Knuth's TAOCP for details."
+ 
+ "The above primes also try to map the values of ((0 to: 4095) collect: [ :each | each << 18 \\ prime ]) sort to an equidistant sequence of numbers. This helps to avoid the collision of chains in identity-based hashed collections. To do that  they were chosen to return a low value when the following block is evaluated with them as argument:
+  [ :prime |
+ 	| n slots cost optimalDistance |
+ 	n := 1 bitShift: 22.
+ 	slots := Array new: n + 1.
+ 	0 to: n - 1 do: [ :ea | slots at: ea + 1 put: (ea bitShift: 8) \\ prime ].
+ 	slots at: n + 1 put: prime.
+ 	slots sort.
+ 	cost := 0.
+ 	optimalDistance := prime // n.
+ 	2 to: n + 1 do: [ :index |
+ 		| newCost |
+ 		newCost := optimalDistance - ((slots at: index) - (slots at: index - 1)).
+ 		newCost > cost ifTrue: [ cost := newCost ] ].
+ 	result add: prime -> cost ]
+ 
+ The shifts in the block relate to the numer of bits the #identityHash consists of (22) and the number of bits #scaledIdentityHash shifts it (8)"!

Item was changed:
  ----- Method: HashedCollection class>>rehashAll (in category 'initialize-release') -----
  rehashAll
  	"HashedCollection rehashAll"	
  		
+ 	self allSubclassesDo: [ :each | each rehashAllInstances ]!
- 	self allSubclassesDo: #rehashAllInstances!

Item was changed:
  ----- Method: HashedCollection class>>sizeFor: (in category 'sizing') -----
+ sizeFor: numberOfElements
+ 	"Large enough prime (or odd if too large) size to hold numberOfElements with some slop."
- sizeFor: nElements
- 	"Large enough prime (or odd if too large) size to hold nElements with some slop (see fullCheck)"
  	
+ 	^self goodPrimeAtLeast: numberOfElements * 4 + 2 // 3 "Optimized version of (numberOfElements * 4 / 3) ceiling."!
- 	nElements < 4 ifTrue: [ ^5 ].
- 	^self goodPrimeAtLeast: nElements + 1 * 4 // 3!



More information about the Squeak-dev mailing list