[squeak-dev] 4.1 - hashed collections still a problem

Levente Uzonyi leves at elte.hu
Sun Mar 28 22:42:51 UTC 2010

On Sun, 28 Mar 2010, Levente Uzonyi wrote:

> On Sun, 28 Mar 2010, Chris Muller wrote:
>> Hi Levente, thanks a lot for all of your great help with the hashed
>> collections.  I would like to test your LargeIdentitySet and
>> LargeIdentityDictionary with Magma and some of my proprietary
>> applications.  May I assume use of them under a MIT license?
> Sure. Note that LargeIdentityDictionary cannot have nil as key at the moment.

I was a bit too fast. It does support nil as key. :)


> Levente
>> - Chris
>> 2010/3/23 Levente Uzonyi <leves at elte.hu>:
>>> On Mon, 22 Mar 2010, Chris Muller wrote:
>>>> 4.1 hashed collections, across the board, small to large, are slower
>>>> by a factor of 2?!  I just don't think we can keep doing this; getting
>>>> slower and slower and slower, like molasses..  I'm sorry, but I really
>>>> care about this and I know you do too because speed was the whole
>>>> premise of putting these changes in.
>>>> What went wrong?  More importantly, how can we fix this?
>>> What went wrong?
>>> I think nothing. :) IdentitySet and IdentityDictionary wasn't ment to
>>> support really large collections. The main reason is that there are only
>>> 4096 different hash values. So practically these collections are growing
>>> 4096 lists in a single array. In 3.9 and 3.10 these collections used a 
>>> hash
>>> expansion technique which distributed the elements uniformly. This was
>>> changed when we integrated Andrés' hash changes. As you noticed some of 
>>> the
>>> primes didn't work well with #scaledIdentityHash, it's far better now,
>>> though there may be even better primes. Finding such primes is a
>>> computationally intensive task and the current ones (up to 10000000) are
>>> pretty close to optimal.
>>> Other than that there are two things that cause slowdown:
>>>  +1 extra message send/scanning: #scaledIdentityHash (the new hash 
>>> expansion
>>> scheme, but we save one by not using #findElementOrNil:)
>>>  +k (worst case) extra message send/scanning: #enclosedSetElement (OO nil
>>> support, this only applies for IdentitySet)
>>> Where k is the length of the list. Since there are only 4096 different
>>> identity hash values for n = 250000 k will be ~61 (if the identity hashes
>>> have a uniform distribution). For n = 1000000 it will be ~244. Note that
>>> your benchmark exploits the worst case.
>>> The long lists are bad, because HashedCollection is optimized for O(1) 
>>> list
>>> length. In this case the length of the list is not O(1), but O(n) (with a
>>> very small constant).
>>> How can we fix this?
>>> I see two possible solutions for the problem:
>>> 1) use #largeHash instead of #identityHash, which is the identity hash of
>>> the object mixed with the identity hash of its class. This helps if there
>>> are objects from different classes in the set, but it doesn't help with 
>>> your
>>> benchmark. SystemTracer uses this method.
>>> 2) use differently implemented collections which are optimized for your 
>>> use
>>> case. For example I wrote LargeIdentitySet which probably has the best
>>> performance you can have:
>>> http://leves.web.elte.hu/squeak/LargeIdentitySet.st
>>> (note that it's hardly tested, probably contains bugs)
>>> Levente

More information about the Squeak-dev mailing list