Sets, hashing and #fullCheck

danielv at technion.ac.il
Thu Mar 11 09:11:09 UTC 2004


Hi Andreas, everyone.

Maybe I can shed a little light on this. The key visible part of the
problem is this -
IDSet>>scanFor: (excerpt)
	finish > 4096
		ifTrue: [hash := anObject identityHash * (finish // 4096)]
		ifFalse: [hash := anObject identityHash].
	start := (hash \\ array size) + 1.

What this code is trying to do is cope with the fact that identity
hashes in Squeak only take 4096 distinct values (the identityHash is
just 12 bits). Before this code was there, as soon as IDCollections
grew past 4096 elements, they were doing linear searches a few
thousand slots long just to reach a free slot beyond the range
occupied by the previous identityHashes. After this change, assuming
perfectly distributed identityHashes, beyond 4096 elements we're still
doing linear searches, but they should be about 4096 times shorter:
elementsCount//4096 elements collide into each hash value, and each
such group gets an arraySize//4096-slot portion of the array to live
in. But that's still linear, so you can't expect constant time queries
even with this improvement.
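
To see the 4096-value ceiling directly, try something like this in a
workspace (all stock-image messages) -

| objs hashes |
objs := CompiledMethod allInstances.
hashes := (objs collect: [:each | each identityHash]) asSet.
Transcript
	show: 'distinct identityHash values: ', hashes size printString; cr;
	show: 'objects per value, on average: ',
		(objs size // hashes size) printString; cr.
"The first number can never exceed 4096, however many
CompiledMethods the image holds."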

And then there's that sneaky assumption about the distribution of the
hashes. If the distribution of the 12 bit hash values is anything but
perfectly regular, you'll get areas of the array where more than
elementsCount//4096 items are trying to live in arraySize//4096 spots,
which makes for congestion. The catastrophic deterioration you
demonstrate so visibly might be explained by these areas spilling into
one another: scanFor:, not finding a free spot, simply keeps
searching, clogging up the free spots for the next hash value, too.
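
You can watch the congestion happen with a toy simulation of the
probing (hypothetical illustration code, not scanFor: itself; the
"skewed" run deliberately drops one hash bit to model an irregular
distribution) -

| probeCount |
probeCount := [:mask | | size table probes |
	size := 8192.
	table := Array new: size.
	probes := 0.
	1 to: 4000 do: [:k | | h i |
		h := (k * 2741 \\ 4096) bitAnd: mask.	"pseudo-random 12 bit hash"
		i := (h * (size // 4096) \\ size) + 1.	"scaled as in scanFor:"
		[(table at: i) isNil] whileFalse:
			[probes := probes + 1.
			 i := i \\ size + 1].
		table at: i put: h].
	probes].
Transcript
	show: 'probes, uniform hashes: ', (probeCount value: 16rFFF) printString; cr;
	show: 'probes, skewed hashes: ', (probeCount value: 16rFFE) printString; cr.

With all 4096 values available the home slots stay distinct and the
probe count is essentially zero; halve the number of hash values and
the clusters start spilling into each other.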

I don't know of a good and simple *general* improvement, other than
having more identity hash bits. However, most applications don't need
sets of >10000 elements. Your application might benefit from a
customized hash function, as you can find in PluggableIdentitySet
integerSet, which is how I solved this problem for Celeste (which
works comfortably for me with 66252 emails in one Dictionary). If we
do think a general improvement is important, then separate growing
arrays for hash collision overflow would prevent the spillover effect.

Daniel Vainsencher

Andreas Raab <andreas.raab at gmx.de> wrote:
> Date: Thu, 11 Mar 2004 00:01:19 +0100
> From: Andreas Raab <andreas.raab at gmx.de>
> Subject: Sets, hashing and #fullCheck
> To: The general-purpose Squeak developers list <squeak-dev at lists.squeakfoundation.org>
> reply-to: The general-purpose Squeak developers list <squeak-dev at lists.squeakfoundation.org>
> 
> Hi Guys,
> 
> I recently ran into some situations where identity set operations became
> horribly slow and since I know there are some people out there who really
> understand about this, I thought maybe there's something we can do.
> 
> First of all, here's a little benchmark which you can run to see what I'm
> talking about:
> 
> objList := CompiledMethod allInstances.
> objList size < 10000 ifTrue:[self error:'Not enough objects'].
> times := Array new: 1000.
> 1 to: 1000 do:[:max|
>     set := IdentitySet new.
>     1 to: max * 10 do:[:i| set add: (objList at: i)].
>     time := [objList do:[:each| set includes: each]] timeToRun.
>     times at: max put: time.
>     Display getCanvas line: max@0 to: max@(time//10) color: Color black.
> ].
> 
> The test simply puts a varying number of objects into an identity set and
> then does a (constant) number of inclusion queries. If you run it, you will
> notice that after a while we get some *really* huge spikes in there. Playing
> with it more by replacing the "IdentitySet new" with "IdentitySet new:
> 20000" shows that the queries can be run in pretty much constant time (as
> they should) so it seems that this has to do with the number of free slots
> in the set.
> 
> I then went further and made a custom identity set class that used 1/3rd
> (instead of the standard 1/4th) of the size as the trigger for growing
> inside #fullCheck. It helped a little bit, but basically the spikes were
> still there. Changing it to 1/2 however did the trick nicely.
> 
> The questions I have are basically: Is there a reasonable explanation for
> the behavior I'm seeing when changing it to 1/2 or is this just something
> that worked for this specific experiment? More generally, is it possible to
> establish a reasonable upper bound for what the "expected fill rate" of an
> IDSet should be (considering things like number of hash bits, hash function
> used etc)? And finally, would it be worthwhile to change the identity based
> sets/dictionaries to grow more aggressively?
> 
> Thanks for any insights,
>   - Andreas
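
P.S. For reference, the grow trigger Andreas is tuning lives in
HashedCollection>>fullCheck. The 1/2-free variant he describes would
look about like this (method body reconstructed from memory, so treat
it as a sketch rather than a patch) -

fullCheck
	"Keep the array at least half free; the stock version keeps
	only a quarter free (array size // 4)."
	array size - tally < (array size // 2 max: 1)
		ifTrue: [self grow]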
