[squeak-dev] 4.1 - hashed collections still a problem
avalloud at smalltalk.comcastbiz.net
Wed Mar 24 01:18:23 UTC 2010
IMO, it's not worth the hassle for hashed collections to store nil.
Finding good primes is not that expensive either, usually it takes no
more than a couple seconds. You can look at the bottom of the prime
table in VisualWorks and see an expression that finds them from scratch
(but note that isPrime *MUST BE DETERMINISTIC*).
In the case examined, just refine #hash or #identityHash as needed for
your application, and everything should be just fine. For more
information, get the hash book here: www.lulu.com/avSmalltalkBooks.
On 3/23/10 3:36 , Levente Uzonyi wrote:
> On Mon, 22 Mar 2010, Chris Muller wrote:
>> 4.1 hashed collections, across the board, small to large, are slower
>> by a factor of 2?! I just don't think we can keep doing this; getting
>> slower and slower and slower, like molasses.. I'm sorry, but I really
>> care about this and I know you do too because speed was the whole
>> premise of putting these changes in.
>> What went wrong? More importantly, how can we fix this?
> What went wrong?
> I think nothing. :) IdentitySet and IdentityDictionary wasn't ment to
> support really large collections. The main reason is that there are only
> 4096 different hash values. So practically these collections are growing
> 4096 lists in a single array. In 3.9 and 3.10 these collections used a
> hash expansion technique which distributed the elements uniformly. This
> was changed when we integrated Andrés' hash changes. As you noticed some
> of the primes didn't work well with #scaledIdentityHash, it's far better
> now, though there may be even better primes. Finding such primes is a
> computationally intensive task and the current ones (up to 10000000) are
> pretty close to optimal.
> Other than that there are two things that cause slowdown:
> +1 extra message send/scanning: #scaledIdentityHash (the new hash
> expansion scheme, but we save one by not using #findElementOrNil:)
> +k (worst case) extra message send/scanning: #enclosedSetElement (OO nil
> support, this only applies for IdentitySet)
> Where k is the length of the list. Since there are only 4096 different
> identity hash values for n = 250000 k will be ~61 (if the identity hashes
> have a uniform distribution). For n = 1000000 it will be ~244. Note that
> your benchmark exploits the worst case.
> The long lists are bad, because HashedCollection is optimized for O(1)
> list length. In this case the length of the list is not O(1), but
> O(n) (with a very small constant).
> How can we fix this?
> I see two possible solutions for the problem:
> 1) use #largeHash instead of #identityHash, which is the identity hash of
> the object mixed with the identity hash of its class. This helps if there
> are objects from different classes in the set, but it doesn't help with
> your benchmark. SystemTracer uses this method.
> 2) use differently implemented collections which are optimized for your
> use case. For example I wrote LargeIdentitySet which probably has the best
> performance you can have: http://leves.web.elte.hu/squeak/LargeIdentitySet.st
> (note that it's hardly tested, probably contains bugs)
More information about the Squeak-dev