Re: [squeak-dev] 4.1 - hashed collections still a problem

28 Mar 2010

      On Sun, 28 Mar 2010, Chris Muller wrote:
...
Hi Levente, thanks a lot for all of your great help with the hashed
collections.  I would like to test your LargeIdentitySet and
LargeIdentityDictionary with Magma and some of my proprietary
applications.  May I assume use of them under a MIT license?
Sure. Note that LargeIdentityDictionary cannot have nil as key at the 
moment.
Levente
...

Chris

2010/3/23 Levente Uzonyi leves@elte.hu:
...
On Mon, 22 Mar 2010, Chris Muller wrote:
...
4.1 hashed collections, across the board, small to large, are slower
by a factor of 2?!  I just don't think we can keep doing this; getting
slower and slower and slower, like molasses..  I'm sorry, but I really
care about this and I know you do too because speed was the whole
premise of putting these changes in.
What went wrong?  More importantly, how can we fix this?
What went wrong?
I think nothing. :) IdentitySet and IdentityDictionary wasn't ment to
support really large collections. The main reason is that there are only
4096 different hash values. So practically these collections are growing
4096 lists in a single array. In 3.9 and 3.10 these collections used a hash
expansion technique which distributed the elements uniformly. This was
changed when we integrated Andrés' hash changes. As you noticed some of the
primes didn't work well with #scaledIdentityHash, it's far better now,
though there may be even better primes. Finding such primes is a
computationally intensive task and the current ones (up to 10000000) are
pretty close to optimal.
Other than that there are two things that cause slowdown:
 +1 extra message send/scanning: #scaledIdentityHash (the new hash expansion
scheme, but we save one by not using #findElementOrNil:)
 +k (worst case) extra message send/scanning: #enclosedSetElement (OO nil
support, this only applies for IdentitySet)
Where k is the length of the list. Since there are only 4096 different
identity hash values for n = 250000 k will be ~61 (if the identity hashes
have a uniform distribution). For n = 1000000 it will be ~244. Note that
your benchmark exploits the worst case.
The long lists are bad, because HashedCollection is optimized for O(1) list
length. In this case the length of the list is not O(1), but O(n) (with a
very small constant).
How can we fix this?
I see two possible solutions for the problem:

use #largeHash instead of #identityHash, which is the identity hash of

the object mixed with the identity hash of its class. This helps if there
are objects from different classes in the set, but it doesn't help with your
benchmark. SystemTracer uses this method.
2) use differently implemented collections which are optimized for your use
case. For example I wrote LargeIdentitySet which probably has the best
performance you can have:
http://leves.web.elte.hu/squeak/LargeIdentitySet.st
(note that it's hardly tested, probably contains bugs)
Levente