[squeak-dev] Re: Hash changes

Andreas Raab andreas.raab at gmx.de
Tue Dec 1 01:45:27 UTC 2009


Thanks. Now I see what's happening. The problem is that adding objects 
without scaled hashes to a "regular" Dictionary creates hugely 
disproportionate collision chains as soon as we get north of 4096 
objects. At that point, all the possible initial slots that any new 
object could claim by default are already taken, and since we use a 
linear collision chain, they combine into one gigantonormeous (hope I 
spelled that correctly :-) collision chain.
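
For concreteness, here's a quick way to see the ceiling (a sketch; it 
assumes the 12-bit identity hash of the standard VM, i.e. values in 
0..4095):

| hashes |
hashes := Set new.
10000 timesRepeat: [ hashes add: Object new identityHash ].
hashes size. "never exceeds 4096, so every further object must collide"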

That explains the difference: for larger numbers of elements we end up 
(with your changes) with an average collision chain length of (dict 
size // 4096) for both adding and accessing, whereas without the scale 
we end up with a collision chain of (dict size) length for adding and 
(dict size // 2) for accessing, once all possible initial slots are 
filled up and spill over. (At 10000 elements that is an average chain 
of about 2 with the scale, versus roughly 5000 probes per access 
without it.)
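
To make the index arithmetic concrete, here is a sketch of the two 
computations (the scale factor below is illustrative only; the actual 
changeset may use a different shift):

| capacity hash |
capacity := 10007. "a prime, as the new HashedCollections use"
hash := Object new identityHash. "0..4095 on this VM"
hash \\ capacity + 1. "unscaled: only slots 1..4096 can ever be the initial index"
hash * 4096 \\ capacity + 1. "scaled: initial indices spread across the whole table"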

(and we should probably add a comment explaining the issue)

Cheers,
   - Andreas

Levente Uzonyi wrote:
> On Mon, 30 Nov 2009, Andreas Raab wrote:
> 
>>> | test array |
>>> Transcript open; cr.
>>> test := Dictionary new.
>>
>> "There used to be an old issue with growth of Sets that were initially
>> sized with goodSizeFor: 3. Try working around it by starting at 5."
>> test := Dictionary new: 5.
>>
> 
> This shouldn't be an issue in this case. The capacities of new 
> HashedCollections are already modified to be carefully selected 
> primes. AFAIK only hashMultiply doesn't like 5 as a capacity, but 
> it's not involved here. Anyway, I added this to the new test.
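> 
> For example (a quick check; #capacity answers the size of the 
> internal array):
> 
> (Dictionary new: 5) capacity. "one of the selected primes"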
> 
>> "Having sequences of objects with strictly consecutive hashes is not 
>> representative for real systems. Instead, create 4096 objects and 
>> choose 100 atRandom."
>>
>> | objSet |
>> objSet := Array new: 4096 streamContents: [ :stream |
>>     4096 timesRepeat: [ stream nextPut: Object new ] ].
>> array := (1 to: 100) collect: [ :i | objSet atRandom ].
>>
> 
> Good point, but the hashes are not consecutive (at least with my VM). 
> Since we need 10000 different Objects, I added this idea to the test 
> with some modifications: 10000 Objects are created up front, then 
> shuffled and used as needed.
> 
> The modified test:
> 
> | test array objects |
> Transcript open; cr.
> test := Dictionary new: 5.
> objects := ((1 to: 10000) collect: [ :each | Object new ]) shuffled readStream.
> [ test size >= 10000 ] whileFalse: [
>     Smalltalk garbageCollect.
>     array := Array new: 100 streamContents: [ :stream |
>         100 timesRepeat: [ stream nextPut: objects next ] ]. "Take the next 100 Objects"
>     Transcript
>         print: [ "Add the 100 Objects to the Dictionary as keys with nil values"
>             1 to: array size do: [ :index |
>                 test at: (array at: index) put: nil ] ] timeToRun;
>         tab;
>         print: test size;
>         tab;
>         print: [ "Access the added Objects 1000 times"
>             1 to: 1000 do: [ :iteration |
>                 1 to: array size do: [ :index |
>                     test at: (array at: index) ] ] ] timeToRun;
>         tab;
>         print: [ "This should be subtracted from the previous value"
>             1 to: 1000 do: [ :iteration |
>                 1 to: array size do: [ :index |
>                     array at: index ] ] ] timeToRun;
>         cr;
>         flush ]
> 
> The results (columns: ms to add 100 keys; dictionary size; ms for 
> 1000x100 accesses; ms for the baseline loop to subtract from column 3):
> Updated image:
> 0    100    93    4
> 0    200    144    4
> 0    300    125    4
> 0    400    105    3
> 0    500    111    4
> 0    600    150    4
> 0    700    115    3
> -- snip --
> 0    9600    297    3
> 0    9700    296    4
> 0    9800    299    5
> 1    9900    312    4
> 0    10000    309    4
> 
> Values in the third column increase somewhat, but not significantly; 
> there are no spikes.
> 
> Current image:
> 0    100    76    3
> 1    200    119    4
> 0    300    105    4
> 0    400    80    4
> 0    500    112    4
> 0    600    111    3
> 1    700    98    4
> -- snip --
> 1    3700    1279    4
> 2    3800    1998    4
> 3    3900    3196    4
> 5    4000    4473    4
> 11    4100    11301    4
> 19    4200    18823    4
> 36    4300    36391    3
> 93    4400    45294    4
> 47    4500    46674    3
> ...
> 
> Around 4000 entries things get really slow. There are also smaller 
> spikes near 2500. I didn't wait for 4600. As you can see from the 
> first column, #at:put: consumes more time with every iteration.
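> 
> A quick way to see the ceiling directly (a sketch, not part of the 
> measurements above; run it after interrupting the loop):
> 
> (test keys asArray collect: [ :each | each identityHash ]) asSet size.
> "at most 4096, no matter how many keys the Dictionary holds"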
> 
> 
> Levente
> 
> 



