[squeak-dev] Re: Hash changes

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Tue Dec 1 07:42:21 UTC 2009


2009/12/1 Andreas Raab <andreas.raab at gmx.de>:
> Thanks. Now I see what's happening. The problem is that adding objects
> without scaled hashes to a "regular" Dictionary creates hugely disproportionate
> collision chains as soon as we get north of 4096 objects. At that point
> all the possible slots that any new object could hash to by default are
> already taken, and since we use a linear collision chain they combine into
> one gigantonormeous (hope I spelled that correctly :-) collision chain.
>
> That explains the difference: for larger numbers of elements we end up (with
> your changes) with an average collision chain length of (dict size // 4096)
> for both adding and accessing, whereas without the scale we end up with a
> collision chain of length (dict size) for adding and (dict size // 2) for
> accessing, once all possible initial slots are filled up and spill over.
>
> (and we should probably add a comment explaining the issue)
>
> Cheers,
>  - Andreas
>

Yes, that's what I tried to explain previously...
The change does not change the RATE of collisions, only the COST of collisions
(the small simulation below illustrates the difference).
But I must just be bad at English ;)
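To make the cost difference concrete, here is a small, hedged simulation (not
the actual change-set; the 2^18 scale factor and the prime table size are
assumptions for illustration only). It inserts 10000 fresh Objects into a
linearly probed table, once hashing with the raw 12-bit identityHash and once
with a scaled hash. The raw hashes all fall into 0..4095, so every home slot
lies in the first 4096 positions of the table and the chains merge into one
long run; the scaled hashes spread the very same set of colliding values over
the whole table, so probing stays cheap.

| prime keys countProbes rawProbes scaledProbes |
prime := 20011. "hypothetical prime table size, comfortably above 10000 keys"
keys := (1 to: 10000) collect: [ :i | Object new ].
countProbes := [ :hashBlock | | table probes |
    table := Array new: prime.
    probes := 0.
    keys do: [ :key | | index |
        index := (hashBlock value: key) \\ prime + 1.
        "Linear probing: walk forward until a free slot is found."
        [ (table at: index) isNil ] whileFalse: [
            probes := probes + 1.
            index := index \\ prime + 1 ].
        table at: index put: key ].
    probes ].
rawProbes := countProbes value: [ :o | o identityHash ].
scaledProbes := countProbes value: [ :o | o identityHash * 262144 ].
Transcript
    show: 'extra probes with raw identityHash: ', rawProbes printString; cr;
    show: 'extra probes with scaled identityHash: ', scaledProbes printString; cr

The number of keys sharing a home slot (the RATE) is identical in both runs;
only the probing work (the COST) differs.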

Cheers

Nicolas

> Levente Uzonyi wrote:
>>
>> On Mon, 30 Nov 2009, Andreas Raab wrote:
>>
>>>> | test array |
>>>> Transcript open; cr.
>>>> test := Dictionary new.
>>>
>>> "There used to be an old issue with growth of Sets that were initially
>>> sized with goodSizeFor: 3. Try working around it by starting at 5."
>>> test := Dictionary new: 5.
>>>
>>
>> This shouldn't be an issue in this case. The capacities of new
>> HashedCollections have already been modified to be carefully selected primes.
>> AFAIK only hashMultiply doesn't like 5 as a capacity, but it's not involved
>> here. Anyway, I added this to the new test.
>>
>>> "Having sequences of objects with strictly consecutive hashes is not
>>> representative for real systems. Instead, create 4096 objects and choose 100
>>> atRandom."
>>>
>>> objSet := Array new: 4096 streamContents: [ :stream |
>>>    4096 timesRepeat: [ stream nextPut: Object new ] ].
>>> array := (1 to: 100) collect: [ :i | objSet atRandom ].
>>>
>>
>> Good point, but the hashes are not consecutive (at least with my VM). Since
>> we need 10000 different Objects, I added this idea to the test with some
>> modifications: 10000 Objects are created, then shuffled and used as needed.
>>
>> The modified test:
>>
>> | test array objects |
>> Transcript open; cr.
>> test := Dictionary new: 5.
>> objects := ((1 to: 10000) collect: [ :each | Object new ]) shuffled readStream.
>> [ test size >= 10000 ] whileFalse: [
>>    Smalltalk garbageCollect.
>>    array := Array new: 100 streamContents: [ :stream |
>>        100 timesRepeat: [ stream nextPut: objects next ] ]. "Take the next 100 Objects from the shuffled stream"
>>    Transcript
>>        print: [ "Add the 100 new Objects to the Dictionary as keys with nil values"
>>            1 to: array size do: [ :index |
>>                test at: (array at: index) put: nil ] ] timeToRun;
>>        tab;
>>        print: test size;
>>        tab;
>>        print: [ "Access the added Objects 1000 times"
>>            1 to: 1000 do: [ :iteration |
>>                1 to: array size do: [ :index |
>>                    test at: (array at: index) ] ] ] timeToRun;
>>        tab;
>>        print: [ "This loop overhead should be subtracted from the previous value"
>>            1 to: 1000 do: [ :iteration |
>>                1 to: array size do: [ :index |
>>                    array at: index ] ] ] timeToRun;
>>        cr;
>>        flush ]
>>
>> The results (columns: #at:put: time in ms, Dictionary size, time of the
>> 1000 x 100 #at: sends in ms, loop overhead in ms):
>> Updated image:
>> 0    100    93    4
>> 0    200    144    4
>> 0    300    125    4
>> 0    400    105    3
>> 0    500    111    4
>> 0    600    150    4
>> 0    700    115    3
>> -- snip --
>> 0    9600    297    3
>> 0    9700    296    4
>> 0    9800    299    5
>> 1    9900    312    4
>> 0    10000    309    4
>>
>> Values in the third column increase somewhat, but not significantly, and
>> there are no spikes.
>>
>> Current image:
>> 0    100    76    3
>> 1    200    119    4
>> 0    300    105    4
>> 0    400    80    4
>> 0    500    112    4
>> 0    600    111    3
>> 1    700    98    4
>> -- snip --
>> 1    3700    1279    4
>> 2    3800    1998    4
>> 3    3900    3196    4
>> 5    4000    4473    4
>> 11    4100    11301    4
>> 19    4200    18823    4
>> 36    4300    36391    3
>> 93    4400    45294    4
>> 47    4500    46674    3
>> ...
>>
>> Around 4000 things get really slow. There are also smaller spikes near
>> 2500. I didn't wait for 4600. As you can see in the first column, #at:put:
>> consumes more time with every iteration.
>>
>>
>> Levente
>>
>>
>
>
>
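P.S. Regarding the prime capacities Levente mentions above: a quick way to
check this from a workspace (assuming HashedCollection>>#capacity answers the
size of the underlying array) would be something like

#(3 5 10 100) collect: [ :requested | (Dictionary new: requested) capacity ]

which should answer one of the carefully selected primes for each requested
size.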


