[squeak-dev] The Inbox: Collections-ul.873.mcz
Chris Muller
asqueaker at gmail.com
Wed Feb 5 23:24:32 UTC 2020
Thanks Levente. I really appreciate it.
- Chris
On Wed, Feb 5, 2020 at 9:47 AM <commits at source.squeak.org> wrote:
> Levente Uzonyi uploaded a new version of Collections to project The Inbox:
> http://source.squeak.org/inbox/Collections-ul.873.mcz
>
> ==================== Summary ====================
>
> Name: Collections-ul.873
> Author: ul
> Time: 5 February 2020, 4:29:23.449183 pm
> UUID: fc9428f9-07fc-43fb-90b0-dc7368b64cec
> Ancestors: Collections-ul.872
>
> HashedCollection changes:
> - speed up #goodPrimeAtLeast: when its argument is small
> - let #sizeFor: return the smallest possible capacity (3)
> - use #sizeFor: instead of #goodPrimeAtLeast: in #compact. This fixes the
> case when a 4 element dictionary is compacted to 80% capacity (array size =
> 5), which is higher than the advertised maximum: 75%.
> - fixed the formula in #sizeFor:, which made the shortcut for small values
> unnecessary. Removed the shortcut. The fix often yields smaller
> dictionaries.
> - store good primes in a class variable GoodPrimes for faster access and
> possible customization
> - use blocks instead of symbols in #compactAll and #rehashAll
> - introduced #slowSize in HashedCollection, which makes it possible to not
> override #growSize and #compact in weak subclasses
>
> =============== Diff against Collections-ul.872 ===============
>
> Item was changed:
> Collection subclass: #HashedCollection
> instanceVariableNames: 'tally array'
> + classVariableNames: 'GoodPrimes'
> - classVariableNames: ''
> poolDictionaries: ''
> category: 'Collections-Abstract'!
>
> !HashedCollection commentStamp: 'ul 4/12/2010 22:37' prior: 0!
> I am an abstract collection of objects that implement hash and equality
> in a consitent way. This means that whenever two objects are equal, their
> hashes have to be equal too. If two objects are equal then I can only store
> one of them. Hashes are expected to be integers (preferably SmallIntegers).
> I also expect that the objects contained by me do not change their hashes.
> If that happens, hash invariants have to be re-established, which can be
> done by #rehash.
>
> Since I'm abstract, no instances of me should exist. My subclasses
> should implement #scanFor:, #fixCollisionsFrom: and #noCheckNoGrowFillFrom:.
>
> Instance Variables
> array: <ArrayedCollection> (typically Array or WeakArray)
> tally: <Integer> (non-negative)
>
> array
> - An array whose size is a prime number, it's non-nil elements are
> the elements of the collection, and whose nil elements are empty slots.
> There is always at least one nil. In fact I try to keep my "load" at 75% or
> less so that hashing will work well.
>
> tally
> - The number of elements in the collection. The array size is
> always greater than this.
>
> Implementation details:
> I implement a hash table which uses open addressing with linear probing
> as the method of collision resolution. Searching for an element or a free
> slot for an element is done by #scanFor: which should return the index of
> the slot in array corresponding to it's argument. When an element is
> removed #fixCollisionsFrom: should rehash all elements in array between the
> original index of the removed element, wrapping around after the last slot
> until reaching an empty slot. My maximum load factor (75%) is hardcoded in
> #atNewIndex:put:, so it can only be changed by overriding that method. When
> my load factor reaches this limit I replace my array with a larger one (see
> #grow) ensuring that my load factor will be less than or equal to 50%. The
> new array is filled by #noCheckNoGrowFillFrom: which should use
> #scanForEmptySlotFor: instead of #scanFor: for better performance. I do not
> shrink.
> !
>
> Item was changed:
> ----- Method: HashedCollection class>>compactAll (in category
> 'initialize-release') -----
> compactAll
> "HashedCollection compactAll"
>
> + self allSubclassesDo: [ :each | each compactAllInstances ]!
> - self allSubclassesDo: #compactAllInstances!
>
> Item was changed:
> ----- Method: HashedCollection class>>goodPrimeAtLeast: (in category
> 'sizing') -----
> goodPrimeAtLeast: lowerLimit
> + "Answer the smallest good prime >= lowerlimit.
> + If lowerLimit is larger than the largest known good prime, just
> make it odd.
> + Use linear search, and exponential search to speed up cases when
> lowerLimit is small (<2500 and <100000, respectively).
> + Assume that there are goodPrimes greater than 100000."
> - "Answer the next good prime >= lowerlimit.
> - If lowerLimit is larger than the largest known good prime,
> - just make it odd."
>
> + | highIndex midIndex lowIndex prime |
> + lowerLimit <= 7 ifTrue: [
> + lowerLimit <= 3 ifTrue: [ ^3 ].
> + lowerLimit <= 5 ifTrue: [ ^5 ].
> + ^7 ].
> + GoodPrimes ifNil: [ "migration only"
> + self initializeGoodPrimes ].
> + lowerLimit < 2500 ifTrue: [
> + "Use linear search when the limit is small. The boundary
> is based on measurements."
> + highIndex := 4. "skip 3 5 and 7"
> + [ (GoodPrimes at: highIndex) < lowerLimit ] whileTrue: [
> + highIndex := highIndex + 1 ].
> + ^GoodPrimes at: highIndex ].
> + lowerLimit < 100000
> + ifTrue: [
> + "Use exponential search when the limit is not too
> large. The boundary is based on measurements."
> + highIndex := 1.
> + [ (GoodPrimes at: highIndex) < lowerLimit ]
> whileTrue: [
> + highIndex := highIndex * 2 ].
> + lowIndex := highIndex // 2 + 1. "highIndex // 2
> was smaller than lowerLimit" ]
> + ifFalse: [
> + "Regular binary search."
> + lowIndex := 1.
> + highIndex := GoodPrimes size.
> + "Check whether the largest prime would fit"
> + (GoodPrimes at: highIndex) < lowerLimit ifTrue: [
> + ^lowerLimit bitOr: 1 ]. ].
> + [ highIndex - lowIndex <= 1 ] whileFalse: [
> + midIndex := highIndex + lowIndex // 2.
> + prime := GoodPrimes at: midIndex.
> - | low mid high prime primes |
> - primes := self goodPrimes.
> - high := primes size.
> - lowerLimit > (primes at: high) ifTrue: [
> - ^lowerLimit bitOr: 1 ].
> - low := 1.
> - [ high - low <= 1 ] whileFalse: [
> - mid := high + low // 2.
> - prime := primes at: mid.
> lowerLimit < prime
> + ifTrue: [ highIndex := midIndex ]
> - ifTrue: [ high := mid ]
> ifFalse: [
> lowerLimit > prime
> + ifTrue: [ lowIndex := midIndex ]
> - ifTrue: [ low := mid ]
> ifFalse: [ ^prime ] ] ].
> + (GoodPrimes at: lowIndex) >= lowerLimit ifTrue: [ ^GoodPrimes at:
> lowIndex ].
> + ^GoodPrimes at: highIndex!
> - (primes at: low) >= lowerLimit ifTrue: [ ^primes at: low ].
> - ^primes at: high!
>
> Item was changed:
> ----- Method: HashedCollection class>>goodPrimes (in category 'sizing')
> -----
> goodPrimes
> + "Answer a sorted array of prime numbers less than one billion that
> make good hash table sizes. See #initializeGoodPrimes."
> - "Answer a sorted array of prime numbers less than one billion that
> make good
> - hash table sizes. Should be expanded as needed. See comments
> below code"
>
> + ^GoodPrimes ifNil: [
> + self initializeGoodPrimes.
> + GoodPrimes ]!
> - ^#(3 5 7 11 13 17 23 31 43 59 79 107 149 199 269 359 479 641 857
> 1151 1549 2069
> - 2237 2423 2617 2797 2999 3167 3359 3539 3727 3911
> - 4441 4787 5119 5471 5801 6143 6521 6827 7177 7517 7853
> - 8783 9601 10243 10867 11549 12239 12919 13679 14293 15013
> 15731
> - 17569 19051 20443 21767 23159 24611 25847 27397 28571
> 30047 31397
> - 35771 38201 40841 43973 46633 48989 51631 54371 57349
> 60139 62969
> - 70589 76091 80347 85843 90697 95791 101051 106261 111143
> 115777 120691 126311
> - 140863 150523 160969 170557 181243 190717 201653 211891
> 221251 232591 242873 251443
> - 282089 300869 321949 341227 362353 383681 401411 422927
> 443231 464951 482033 504011
> - 562621 605779 647659 681607 723623 763307 808261 844709
> 886163 926623 967229 1014617
> - 1121987 1201469 1268789 1345651 1429531 1492177 1577839
> 1651547 1722601 1800377 1878623 1942141 2028401
> - 2242727 2399581 2559173 2686813 2836357 3005579 3144971
> 3283993 3460133 3582923 3757093 3903769 4061261
> - 4455361 4783837 5068529 5418079 5680243 6000023 6292981
> 6611497 6884641 7211599 7514189 7798313 8077189
> - 9031853 9612721 10226107 10745291 11338417 11939203
> 12567671 13212697 13816333 14337529 14938571 15595673 16147291
> - 17851577 18993941 20180239 21228533 22375079 23450491
> 24635579 25683871 26850101 27921689 29090911 30153841 31292507 32467307
> - 35817611 37983761 40234253 42457253 44750177 46957969
> 49175831 51442639 53726417 55954637 58126987 60365939 62666977 64826669
> - 71582779 76039231 80534381 84995153 89500331 93956777
> 98470819 102879613 107400389 111856841 116365721 120819287 125246581
> 129732203
> - 143163379 152076289 161031319 169981667 179000669
> 187913573 196826447 205826729 214748357 223713691 232679021 241591901
> 250504801 259470131
> - 285162679 301939921 318717121 335494331 352271573
> 369148753 385926017 402603193 419480419 436157621 453034849 469712051
> 486589307 503366497 520043707
> - 570475349 603929813 637584271 671138659 704693081
> 738247541 771801929 805356457 838910803 872365267 905919671 939574117
> 973128521 1006682977 1040137411
> - 1073741833)
> -
> - "The above primes past 2069 were chosen carefully so that they do not
> interact badly with 1664525 (used by hashMultiply), and so that gcd(p,
> (256^k) +/- a) = 1, for 0<a<=32 and 0<k<=8. See Knuth's TAOCP for details."
> -
> - "The above primes also try to map the values of ((0 to: 4095) collect: [
> :each | each << 18 \\ prime ]) sort to an equidistant sequence of numbers.
> This helps to avoid the collision of chains in identity-based hashed
> collections. To do that they were chosen to return a low value when the
> following block is evaluated with them as argument:
> - [ :prime |
> - | n slots cost optimalDistance |
> - n := 1 bitShift: 22.
> - slots := Array new: n + 1.
> - 0 to: n - 1 do: [ :ea | slots at: ea + 1 put: (ea bitShift: 8) \\
> prime ].
> - slots at: n + 1 put: prime.
> - slots sort.
> - cost := 0.
> - optimalDistance := prime // n.
> - 2 to: n + 1 do: [ :index |
> - | newCost |
> - newCost := optimalDistance - ((slots at: index) - (slots
> at: index - 1)).
> - newCost > cost ifTrue: [ cost := newCost ] ].
> - result add: prime -> cost ]
> -
> - The shifts in the block relate to the numer of bits the #identityHash
> consists of (22) and the number of bits #scaledIdentityHash shifts it (8)"!
>
> Item was added:
> + ----- Method: HashedCollection class>>initialize (in category 'sizing')
> -----
> + initialize
> +
> + self initializeGoodPrimes!
>
> Item was added:
> + ----- Method: HashedCollection class>>initializeGoodPrimes (in category
> 'sizing') -----
> + initializeGoodPrimes
> + "GoodPrimes is a sorted array of prime numbers less than one
> billion that make good hash table sizes. Should be expanded as needed. See
> comments below code."
> +
> + GoodPrimes := #(3 5 7 11 13 17 23 31 43 59 79 107 149 199 269 359
> 479 641 857 1151 1549 2069
> + 2237 2423 2617 2797 2999 3167 3359 3539 3727 3911
> + 4441 4787 5119 5471 5801 6143 6521 6827 7177 7517 7853
> + 8783 9601 10243 10867 11549 12239 12919 13679 14293 15013
> 15731
> + 17569 19051 20443 21767 23159 24611 25847 27397 28571
> 30047 31397
> + 35771 38201 40841 43973 46633 48989 51631 54371 57349
> 60139 62969
> + 70589 76091 80347 85843 90697 95791 101051 106261 111143
> 115777 120691 126311
> + 140863 150523 160969 170557 181243 190717 201653 211891
> 221251 232591 242873 251443
> + 282089 300869 321949 341227 362353 383681 401411 422927
> 443231 464951 482033 504011
> + 562621 605779 647659 681607 723623 763307 808261 844709
> 886163 926623 967229 1014617
> + 1121987 1201469 1268789 1345651 1429531 1492177 1577839
> 1651547 1722601 1800377 1878623 1942141 2028401
> + 2242727 2399581 2559173 2686813 2836357 3005579 3144971
> 3283993 3460133 3582923 3757093 3903769 4061261
> + 4455361 4783837 5068529 5418079 5680243 6000023 6292981
> 6611497 6884641 7211599 7514189 7798313 8077189
> + 9031853 9612721 10226107 10745291 11338417 11939203
> 12567671 13212697 13816333 14337529 14938571 15595673 16147291
> + 17851577 18993941 20180239 21228533 22375079 23450491
> 24635579 25683871 26850101 27921689 29090911 30153841 31292507 32467307
> + 35817611 37983761 40234253 42457253 44750177 46957969
> 49175831 51442639 53726417 55954637 58126987 60365939 62666977 64826669
> + 71582779 76039231 80534381 84995153 89500331 93956777
> 98470819 102879613 107400389 111856841 116365721 120819287 125246581
> 129732203
> + 143163379 152076289 161031319 169981667 179000669
> 187913573 196826447 205826729 214748357 223713691 232679021 241591901
> 250504801 259470131
> + 285162679 301939921 318717121 335494331 352271573
> 369148753 385926017 402603193 419480419 436157621 453034849 469712051
> 486589307 503366497 520043707
> + 570475349 603929813 637584271 671138659 704693081
> 738247541 771801929 805356457 838910803 872365267 905919671 939574117
> 973128521 1006682977 1040137411
> + 1073741833)
> +
> + "The above primes past 2069 were chosen carefully so that they do not
> interact badly with 1664525 (used by hashMultiply), and so that gcd(p,
> (256^k) +/- a) = 1, for 0<a<=32 and 0<k<=8. See Knuth's TAOCP for details."
> +
> + "The above primes also try to map the values of ((0 to: 4095) collect: [
> :each | each << 18 \\ prime ]) sort to an equidistant sequence of numbers.
> This helps to avoid the collision of chains in identity-based hashed
> collections. To do that they were chosen to return a low value when the
> following block is evaluated with them as argument:
> + [ :prime |
> + | n slots cost optimalDistance |
> + n := 1 bitShift: 22.
> + slots := Array new: n + 1.
> + 0 to: n - 1 do: [ :ea | slots at: ea + 1 put: (ea bitShift: 8) \\
> prime ].
> + slots at: n + 1 put: prime.
> + slots sort.
> + cost := 0.
> + optimalDistance := prime // n.
> + 2 to: n + 1 do: [ :index |
> + | newCost |
> + newCost := optimalDistance - ((slots at: index) - (slots
> at: index - 1)).
> + newCost > cost ifTrue: [ cost := newCost ] ].
> + result add: prime -> cost ]
> +
> + The shifts in the block relate to the numer of bits the #identityHash
> consists of (22) and the number of bits #scaledIdentityHash shifts it (8)"!
>
> Item was changed:
> ----- Method: HashedCollection class>>new (in category 'instance
> creation') -----
> new
> + "Create a HashedCollection large enough to hold 3 different
> objects without growing."
> +
> + ^self basicNew initialize: 5 "For performance, inline the value 5
> which would normally be returned by #sizeFor:."!
> - ^ self basicNew initialize: 5!
>
> Item was changed:
> ----- Method: HashedCollection class>>new: (in category 'instance
> creation') -----
> + new: numberOfElements
> + "Create a HashedCollection large enough to hold numberOfElements
> different objects without growing."
> +
> + ^self basicNew initialize: (numberOfElements <= 3
> + ifFalse: [ self sizeFor: numberOfElements ]
> + ifTrue: [ "Inline values returned by #sizeFor: to ensure
> that #new: is not significantly slower than #new for small values."
> + numberOfElements < 3
> + ifTrue: [ 3 ]
> + ifFalse: [ 5 ] ])!
> - new: nElements
> - "Create a Set large enough to hold nElements without growing"
> - ^ self basicNew initialize: (self sizeFor: nElements)!
>
> Item was changed:
> ----- Method: HashedCollection class>>rehashAll (in category
> 'initialize-release') -----
> rehashAll
> "HashedCollection rehashAll"
>
> + self allSubclassesDo: [ :each | each rehashAllInstances ]!
> - self allSubclassesDo: #rehashAllInstances!
>
> Item was changed:
> ----- Method: HashedCollection class>>sizeFor: (in category 'sizing')
> -----
> + sizeFor: numberOfElements
> + "Return a large enough prime (or odd if too large), the size of
> the internal array to hold numberOfElements with at most 75% load factor."
> - sizeFor: nElements
> - "Large enough prime (or odd if too large) size to hold nElements
> with some slop (see fullCheck)"
>
> + ^self goodPrimeAtLeast: numberOfElements * 4 + 2 // 3 "Optimized
> version of (numberOfElements * 4 / 3) ceiling."!
> - nElements < 4 ifTrue: [ ^5 ].
> - ^self goodPrimeAtLeast: nElements + 1 * 4 // 3!
>
> Item was changed:
> ----- Method: HashedCollection>>compact (in category 'private') -----
> compact
> "Reduce the size of array so that the load factor will be ~75%."
>
> | newCapacity |
> + newCapacity := self class sizeFor: self slowSize.
> - newCapacity := self class goodPrimeAtLeast: tally * 4 // 3.
> self growTo: newCapacity!
>
> Item was changed:
> ----- Method: HashedCollection>>growSize (in category 'private') -----
> growSize
> "Answer what my next higher table size should be"
>
> + ^self class sizeFor: self slowSize * 2!
> - ^self class goodPrimeAtLeast: array size * 3 // 2 + 2!
>
> Item was added:
> + ----- Method: HashedCollection>>slowSize (in category 'accessing') -----
> + slowSize
> + "Answer an upper bound of the number of elements in this
> collection. For regular collections, this can simply be the value of tally,
> but for collections that cannot maintain an exact value, like current weak
> collections, this has to be calculated on the fly."
> +
> + ^tally!
>
> Item was removed:
> - ----- Method: WeakIdentityDictionary>>compact (in category 'private')
> -----
> - compact
> - "Reduce the size of array so that the load factor will be ~75%."
> -
> - | newCapacity |
> - tally := self slowSize.
> - newCapacity := self class goodPrimeAtLeast: tally * 4 // 3.
> - self growTo: newCapacity!
>
> Item was removed:
> - ----- Method: WeakKeyDictionary>>compact (in category 'private') -----
> - compact
> - "Reduce the size of array so that the load factor will be ~75%."
> -
> - | newCapacity |
> - newCapacity := self class goodPrimeAtLeast: self slowSize * 4 // 3.
> - self growTo: newCapacity!
>
> Item was removed:
> - ----- Method: WeakKeyDictionary>>growSize (in category 'private') -----
> - growSize
> - "Answer what my next table size should be.
> - Note that, it can be less than the current."
> -
> - ^self class goodPrimeAtLeast: self slowSize * 2 + 2!
>
> Item was removed:
> - ----- Method: WeakSet>>compact (in category 'private') -----
> - compact
> - "Reduce the size of array so that the load factor will be ~75%."
> -
> - | newCapacity |
> - newCapacity := self class goodPrimeAtLeast: self slowSize * 4 // 3.
> - self growTo: newCapacity!
>
> Item was removed:
> - ----- Method: WeakSet>>growSize (in category 'private') -----
> - growSize
> - "Answer what my next table size should be.
> - Note that, it can be less than the current."
> -
> - ^self class goodPrimeAtLeast: self slowSize * 2 + 2!
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20200205/50de8ec2/attachment.html>
More information about the Squeak-dev
mailing list
|