Sets, hashing and #fullCheck
Maurice Rabb
m3rabb at microlingua.com
Thu Mar 11 10:33:06 UTC 2004
Hi Andreas,
You have hit a topic near and dear to me. Poor hashing is one of
Smalltalk's dirtiest little secrets.
Your problem is a combination of two things -- Set's internal collision
handling algorithm, and the identity hashing algorithm. The second is
usually the more important in determining hash performance in a hash
table. However, two poor algorithms together can be downright deadly.
First, a little review: sets and dictionaries are normally implemented
as hash tables. In Squeak's case, the internal content array is used as
the set's or dictionary's hash table.
The goal of a hash table is to provide constant-time lookup. It does
this by applying a hash algorithm (a hash for short) to each element
that will be included in the table. An ideal table hashes each unique
element to its own indexed slot in the table. Unfortunately, sometimes
two or more objects being included share the same hash value and
"collide" when they hash to the same slot.
For #identityHash, Squeak defaults to using the hash value that was
generated internally when the object was first allocated. This hash
value is immutable, and stays with the object for its entire life.
Unfortunately (in this case) that hash value is only 12 bits wide
(IIRC). This means it can represent only 4096 different values, so in
hash tables with more than 4K items there are going to be collisions.
In your set with 10K elements, each hash value will be shared by ~2.5
objects on average.
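To make the arithmetic concrete, a Python back-of-envelope (the 12-bit width is taken from the text above; the variable names are mine):

```python
# With only 12 bits of identity hash there are 2**12 = 4096 possible
# values. Putting 10,000 objects in a table therefore forces sharing:
# about 10000/4096 objects per hash value even with a perfect spread.

HASH_BITS = 12
N_OBJECTS = 10_000

n_values = 2 ** HASH_BITS
avg_per_value = N_OBJECTS / n_values
print(n_values)                  # 4096
print(round(avg_per_value, 2))   # 2.44
```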
If the hash values are not evenly distributed across the range, it can
be much worse.
Now this is where the collision handling method comes into play.
Squeak's Sets use a linear collision handling method (linear probing).
If a collision is detected, it goes to the next slot and tries that.
If that one is occupied too, it tries the next, and so on. When
objects clump together in the hash table, lookup degenerates to an
expensive linear search.
As a hash table fills, particularly one that uses a poor hash
algorithm, it doesn't need to be very full for lots of collisions to
happen regularly. That is why changing the growth trigger "solved"
your problem.
A better collision handling method is known as double hashing. If
there is a collision, a second hash algorithm is used to select an
offset, which in turn selects the second (and, if needed, further)
slot(s) to check. Again, however, with a bad hash algorithm double
hashing won't save you.
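Double hashing can be sketched the same way (my own illustrative code; the prime table size ensures every step size cycles through all slots):

```python
# Sketch of double hashing: on a collision, step by an offset derived
# from a second hash function instead of always stepping by 1, which
# breaks up the clusters that linear probing creates.

def insert_double(table, key, h1, h2):
    size = len(table)
    i = h1(key) % size
    step = (h2(key) % (size - 1)) + 1   # step must be nonzero
    while table[i] is not None:
        i = (i + step) % size
    table[i] = key

# Table size is prime, so any step size visits every slot eventually.
table = [None] * 13
for k in (0, 13, 26, 39):               # all collide under h1 = k mod 13
    insert_double(table, k, lambda k: k, lambda k: k // 13 + 1)
print(sorted(k for k in table if k is not None))  # [0, 13, 26, 39]
```

Each colliding key gets a different step size, so they scatter instead of piling into one run of adjacent slots. But if the second hash is as degenerate as the first, the scatter disappears, which is the "won't save you" point above.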
I find the timing of your question really interesting. I am planning
to release for Squeak an excellent hashing and equality framework that
I developed a few years back. It was proprietary, but I am
open-sourcing it. It has a few cobwebs that need to be dusted off
first. As you know, adjusting #identityHash willy-nilly can be
disastrous for existing IdentitySets. I am traveling through this
weekend, but I will get on it when I return. I don't want to rush, so
I can be careful.
Now if you can't wait...
You can improve your results dramatically by increasing the bitwidth
of your range of hash values to be greater than the bitwidth of the
size of your hash table. To do so, you include other attributes of the
object in calculating its hash. However, you must be careful in your
selection: these attributes must be immutable while the object is in
the table. For inclusion in an IdentitySet they must be completely
immutable characteristics of the object. If the target objects are
CompiledMethods, as in your example, it should be easy. They have a
lot of immutable meat to play with, and they vary quite a bit from one
to another. Check out String>>#hash for ideas. If not, try this:
identityHash
    ^((self class asOop bitShift: 8)
        bitXor: (self size bitShift: 4))
        bitXor: self asOop
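The same bit-mixing idea in Python, for readers without a Squeak image handy (the parameters `class_id`, `size`, and `oop` are illustrative stand-ins for `self class asOop`, `self size`, and `self asOop`):

```python
# Python analogue of the recipe above: widen a narrow identity hash by
# XOR-ing in shifted, immutable attributes of the object. Shifting
# before XOR-ing keeps each attribute in mostly separate bit positions.

def widened_hash(class_id, size, oop):
    return ((class_id << 8) ^ (size << 4)) ^ oop

# Two objects with identical oop-derived hashes now differ as soon as
# their class or size differs:
print(widened_hash(7, 10, 42))
print(widened_hash(8, 10, 42))
```

The shifts stagger each attribute's bits so the combined value spans far more than 12 bits, which is exactly what spreads entries across a large table.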
With regard to your question about when to trigger a hash table to
grow: in my hashing framework, tables would easily fill to >80% before
collisions would cause them to thrash. Again, if you can wait until I
get back, we will get to see for sure. ;-)
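That >80% figure is easy to sanity-check with a small experiment (my own Python sketch of a linear-probing table with a well-spread hash; it says nothing about Squeak's 12-bit hash, only about load factor in the good-hash case):

```python
# Sketch: average probes per successful lookup in a linear-probing table
# as the load factor rises. With a well-spread hash, lookups stay cheap
# well past 50% full and only degrade as the table nears saturation.

import random

def avg_probes(n_keys, table_size):
    table = [None] * table_size
    keys = random.sample(range(10**6), n_keys)
    for k in keys:                       # insert with linear probing
        i = hash(k) % table_size
        while table[i] is not None:
            i = (i + 1) % table_size
        table[i] = k
    total = 0
    for k in keys:                       # count probes to re-find each key
        i = hash(k) % table_size
        probes = 1
        while table[i] != k:
            i = (i + 1) % table_size
            probes += 1
        total += probes
    return total / n_keys

random.seed(42)
for load in (0.25, 0.5, 0.8, 0.95):
    print(load, round(avg_probes(int(4096 * load), 4096), 2))
```

The probe count stays near 1 through moderate loads and climbs steeply only at very high fill, which is consistent with tables tolerating >80% occupancy when the hash spreads well.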
Whatever you do, if you decide to change Object>>#identityHash, do it
carefully! The affected classes include:
Set
IdentitySet
IdentityDictionary
SystemDictionary
Environment
SmalltalkEnvironment
In particular, IdentityDictionary has some pretty scary children!
Now this you know far better than I: as soon as you change
Object>>#identityHash, you will need to call:
IdentitySet rehashAllSets.
IdentityDictionary rehashAllSets.
Is the safest way to do it via a file-in that changes the method and
then calls the rehashing?
Good luck!
Maurice
On Mar 10, 2004, at 17:01, Andreas Raab wrote:
> Hi Guys,
>
> I recently ran into some situations where identity set operations
> became horribly slow, and since I know there are some people out
> there who really understand about this, I thought maybe there's
> something we can do.
>
> First of all, here's a little benchmark which you can run to see
> what I'm talking about:
>
> objList := CompiledMethod allInstances.
> objList size < 10000 ifTrue:[self error:'Not enough objects'].
> times := Array new: 1000.
> 1 to: 1000 do:[:max|
>     set := IdentitySet new.
>     1 to: max * 10 do:[:i| set add: (objList at: i)].
>     time := [objList do:[:each| set includes: each]] timeToRun.
>     times at: max put: time.
>     Display getCanvas line: max@0 to: max@(time//10) color: Color black.
> ].
>
> The test simply puts a varying number of objects into an identity
> set and then does a (constant) number of inclusion queries. If you
> run it, you will notice that after a while we get some *really* huge
> spikes in there. Playing with it more by replacing the "IdentitySet
> new" with "IdentitySet new: 20000" shows that the queries can be run
> in pretty much constant time (as they should), so it seems that this
> has to do with the number of free slots in the set.
>
> I then went further and made a custom identity set class that used
> 1/3rd (instead of the standard 1/4th) of the size as the trigger for
> growing inside #fullCheck. It helped a little bit, but basically the
> spikes were still there. Changing it to 1/2 however did the trick
> nicely.
>
> The questions I have are basically: Is there a reasonable
> explanation for the behavior I'm seeing when changing it to 1/2, or
> is this just something that worked for this specific experiment?
> More generally, is it possible to establish a reasonable upper bound
> for what the "expected fill rate" of an IDSet should be (considering
> things like number of hash bits, hash function used, etc)? And
> finally, would it be worthwhile to change the identity based
> sets/dictionaries to grow more aggressively?
>
> Thanks for any insights,
> - Andreas
>
>