Sets, hashing and #fullCheck
Maurice Rabb
m3rabb at microlingua.com
Thu Mar 11 10:33:06 UTC 2004
Hi Andreas,
You have hit a topic near and dear to me. Poor hashing is one of
Smalltalk's dirtiest little secrets.
Your problem is a combination of two things -- Set's internal collision
handling algorithm, and the identity hashing algorithm. The second is
usually the more important in determining hash performance in a hash
table. However, two poor algorithms together can be downright deadly.
First, a little review: sets and dictionaries are normally implemented
as hash tables. In Squeak's case, the internal content array is used as
the set's or dictionary's hash table.
The goal of a hash table is to provide constant-time lookup. It does
this by applying a hash algorithm (a hash for short) to each element
that will be included in the table. An ideal table hashes each unique
element to its own indexed slot in the table. Unfortunately, sometimes
two or more objects being included share the same hash value and
"collide" when they hash to the same slot.
For #identityHash, Squeak defaults to using the hash value that was
generated internally when the object was first allocated. This hash
value is immutable, and stays with the object for its entire life.
Unfortunately (in this case) that hash value is only 12 bits wide
(IIRC). This means it can represent only 4096 different values, so in
hash tables with more than 4K items there are going to be collisions.
In your set with 10K elements, each hash value will be shared by ~2.5
objects on average.
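To make the arithmetic concrete, a Python back-of-envelope (the 12-bit width is taken from the text above; the variable names are mine):

```python
# With only 12 bits of identity hash there are 2**12 = 4096 possible
# values. Putting 10,000 objects in a table therefore forces sharing:
# about 10000/4096 objects per hash value even with a perfect spread.

HASH_BITS = 12
N_OBJECTS = 10_000

n_values = 2 ** HASH_BITS
avg_per_value = N_OBJECTS / n_values
print(n_values)                  # 4096
print(round(avg_per_value, 2))   # 2.44
```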
If the hash values are not evenly distributed across the range, it can
be much worse.
Now this is where the collision handling method comes into play.
Squeak's Sets use a linear collision handling method (linear probing).
If a collision is detected, it goes to the next slot and tries that.
If that one is occupied too, it tries the next, and so on. When
objects clump together in the hash table, lookup degenerates to an
expensive linear search.
As a hash table fills, particularly one that uses a poor hash
algorithm, it doesn't need to be very full for lots of collisions to
happen regularly. That is why changing the growth trigger "solved"
your problem.
A better collision handling method is known as double hashing. If
there is a collision, a second hash algorithm is used to select an
offset, which in turn selects the second (and, if needed, further)
slot(s) to check. Again, however, with a bad hash algorithm double
hashing won't save you.
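Double hashing can be sketched the same way (my own illustrative code; the prime table size ensures every step size cycles through all slots):

```python
# Sketch of double hashing: on a collision, step by an offset derived
# from a second hash function instead of always stepping by 1, which
# breaks up the clusters that linear probing creates.

def insert_double(table, key, h1, h2):
    size = len(table)
    i = h1(key) % size
    step = (h2(key) % (size - 1)) + 1   # step must be nonzero
    while table[i] is not None:
        i = (i + step) % size
    table[i] = key

# Table size is prime, so any step size visits every slot eventually.
table = [None] * 13
for k in (0, 13, 26, 39):               # all collide under h1 = k mod 13
    insert_double(table, k, lambda k: k, lambda k: k // 13 + 1)
print(sorted(k for k in table if k is not None))  # [0, 13, 26, 39]
```

Each colliding key gets a different step size, so they scatter instead of piling into one run of adjacent slots. But if the second hash is as degenerate as the first, the scatter disappears, which is the "won't save you" point above.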
I find the timing of your question really interesting. I am planning
to release for Squeak an excellent hashing and equality framework that
I developed a few years back. It was proprietary, but I am
open-sourcing it. It has a few cobwebs that need to be dusted off
first. As you know, adjusting #identityHash willy-nilly can be
disastrous for existing IdentitySets. I am traveling through this
weekend, but I will get on it when I return. I don't want to rush, so
I can be careful.
Now if you can't wait...
You can improve your results dramatically by increasing the bitwidth
of your range of hash values to be greater than the bitwidth of the
size of your hash table. To do so, you include other attributes of the
object in calculating its hash. However, you must be careful in your
selection: these attributes must be immutable while the object is in
the table. For inclusion in an IdentitySet they must be completely
immutable characteristics of the object. If the target objects are
CompiledMethods, as in your example, it should be easy. They have a
lot of immutable meat to play with, and they vary quite a bit from one
to another. Check out String>>#hash for ideas. If not, try this:
identityHash
    ^((self class asOop bitShift: 8)
        bitXor: (self size bitShift: 4))
        bitXor: self asOop
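The same bit-mixing idea in Python, for readers without a Squeak image handy (the parameters `class_id`, `size`, and `oop` are illustrative stand-ins for `self class asOop`, `self size`, and `self asOop`):

```python
# Python analogue of the recipe above: widen a narrow identity hash by
# XOR-ing in shifted, immutable attributes of the object. Shifting
# before XOR-ing keeps each attribute in mostly separate bit positions.

def widened_hash(class_id, size, oop):
    return ((class_id << 8) ^ (size << 4)) ^ oop

# Two objects with identical oop-derived hashes now differ as soon as
# their class or size differs:
print(widened_hash(7, 10, 42))
print(widened_hash(8, 10, 42))
```

The shifts stagger each attribute's bits so the combined value spans far more than 12 bits, which is exactly what spreads entries across a large table.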
With regard to your question about when to trigger a hash table to
grow: in my hashing framework, tables would easily fill to >80% before
collisions would cause them to thrash. Again, if you can wait until I
get back, we will get to see for sure. ;-)
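That >80% figure is easy to sanity-check with a small experiment (my own Python sketch of a linear-probing table with a well-spread hash; it says nothing about Squeak's 12-bit hash, only about load factor in the good-hash case):

```python
# Sketch: average probes per successful lookup in a linear-probing table
# as the load factor rises. With a well-spread hash, lookups stay cheap
# well past 50% full and only degrade as the table nears saturation.

import random

def avg_probes(n_keys, table_size):
    table = [None] * table_size
    keys = random.sample(range(10**6), n_keys)
    for k in keys:                       # insert with linear probing
        i = hash(k) % table_size
        while table[i] is not None:
            i = (i + 1) % table_size
        table[i] = k
    total = 0
    for k in keys:                       # count probes to re-find each key
        i = hash(k) % table_size
        probes = 1
        while table[i] != k:
            i = (i + 1) % table_size
            probes += 1
        total += probes
    return total / n_keys

random.seed(42)
for load in (0.25, 0.5, 0.8, 0.95):
    print(load, round(avg_probes(int(4096 * load), 4096), 2))
```

The probe count stays near 1 through moderate loads and climbs steeply only at very high fill, which is consistent with tables tolerating >80% occupancy when the hash spreads well.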
Whatever you do, if you decide to change Object>>#identityHash, do it
carefully! The affected classes include:
Set
IdentitySet
IdentityDictionary
SystemDictionary
Environment
SmalltalkEnvironment
In particular, IdentityDictionary has some pretty scary children!
Now this you know far better than I: as soon as you change
Object>>#identityHash, you will need to call:
IdentitySet rehashAllSets.
IdentityDictionary rehashAllSets.
Is the safest way to do it via a file-in that changes the method and
then calls the rehashing?
Good luck!
Maurice
On Mar 10, 2004, at 17:01, Andreas Raab wrote:
> Hi Guys,
>
> I recently ran into some situations where identity set operations
> became horribly slow, and since I know there are some people out
> there who really understand about this, I thought maybe there's
> something we can do.
>
> First of all, here's a little benchmark which you can run to see
> what I'm talking about:
>
> objList := CompiledMethod allInstances.
> objList size < 10000 ifTrue:[self error:'Not enough objects'].
> times := Array new: 1000.
> 1 to: 1000 do:[:max|
>     set := IdentitySet new.
>     1 to: max * 10 do:[:i| set add: (objList at: i)].
>     time := [objList do:[:each| set includes: each]] timeToRun.
>     times at: max put: time.
>     Display getCanvas line: max@0 to: max@(time//10) color: Color black.
> ].
>
> The test simply puts a varying number of objects into an identity
> set and then does a (constant) number of inclusion queries. If you
> run it, you will notice that after a while we get some *really* huge
> spikes in there. Playing with it more by replacing the "IdentitySet
> new" with "IdentitySet new: 20000" shows that the queries can be run
> in pretty much constant time (as they should), so it seems that this
> has to do with the number of free slots in the set.
>
> I then went further and made a custom identity set class that used
> 1/3rd (instead of the standard 1/4th) of the size as the trigger for
> growing inside #fullCheck. It helped a little bit, but basically the
> spikes were still there. Changing it to 1/2 however did the trick
> nicely.
>
> The questions I have are basically: Is there a reasonable
> explanation for the behavior I'm seeing when changing it to 1/2, or
> is this just something that worked for this specific experiment?
> More generally, is it possible to establish a reasonable upper bound
> for what the "expected fill rate" of an IDSet should be (considering
> things like number of hash bits, hash function used, etc)? And
> finally, would it be worthwhile to change the identity based
> sets/dictionaries to grow more aggressively?
>
> Thanks for any insights,
> - Andreas
>
>