Hashing

Thu Apr 16 20:40:34 UTC 1998

All of this talk about hashing, again. :) A while ago a posted what I
thought was a nice solution to this whole problem. I got no response on
it, so I was never sure whether it was deemed a good thing, a weird
thing (green plane idea), or a bad thing.

The idea was to use the same pattern that SortedCollection uses by
giving Set a two block instance variables, one a two argument block to
compare two objects by (equivalenceBlock) and the other a single
argument block (hashBlock) used to compute the hash value of an object.
I never toyed with using symbol to send via perform: and perform:with:,
though that might be faster?

I'm tired of seeing the classical "I forgot to implement hash when I
implemented equality" bug. The Smalltalk library for some reason or
another has interned this notion of hash into Object itself. I don't
understand why that is. I can think of plenty of applications where
hashing an object would never be necessary. IOW, IMO this is not an
essential Object behavior. What really bugs me about the current state
of affairs is the violation of encapsulation. Hashing is a function used
by Sets (and their various subclasses). It has nothing to do with what
the object does itself. How to hash an Object should be bound to a Set.
And not necessarily a subclass type of Set, but the instance itself. If
we used blocks, then we wouldn't have to have IdentitySet and Set (along
with the various subclasses of each). You would just create a Set
initialized for either equality or identity. And you could do other
things too. When I did the implementation, I ran a number of tests to
see what the performance hit was. For a set of random integers, the
speed hit was between 10-20%. For a random set of floats between 0.0 and
1.0 though, I could get as much as 40% speed increase by using a hash
function that multiplied by 1000 and truncated.

Over the last couple of months there have been discussions on how to
hash Floats, Points, and now Collections. Every time we argue the same
sorts of things. Somebody comes up with a nifty way of hashing for their
domain usage of said type of object and proposes that we impelement hash
as such. Someone responds with a domain where proposed hash is not very
good or just plain falls apart. Why does every object type have to have
just one way of hashing. Which object type will it be next?

I have no problem with implementing a hash method in an object for
convenience's sake, but I think we should generalize the Set's
themselves a bit further and put this thing to rest once and for all.
Then you could document the invariant between hash and equality in the
Set's documentation, which is where it belongs since that's who it
affects.

Another thought along this line is that quite often, one's Sets are
typed. In other words I create a Set of Integers, or Numbers, or
Collections, or even Objects. One could add class side accessors that
would return various hash/equality algorithms for that kind of object.

Thoughts?

--
Travis Griggs
Key Technology
tgriggs at keyww.com
To Smalltalk! - and Beyond!