Good Hashing paper

Lex Spoon lex at cc.gatech.edu
Thu Feb 21 05:11:27 UTC 2002


> >By cutting it off at 100, you're assuming that almost all object that are
> >the same for the first 100 bytes are the same throughout.. Which is IMHO a
> >fairly safe assumption, and it avoids stupid cases of, say, trying to hash
> >8kb of email text. :)
> >
> >Yes, it'll suck if someone has hundreds of strings that differ only on the
> >101'st byte, but.... Except for a contrived example, where would you see
> >this? :) [*]
> 
> Well, you'd think so, but beware. I can't think of a non-contrived 
> example, but just because I can't readily see a real-world example 
> doesn't mean there isn't a common one. Folks have been tripped up by 
> this before.
> 

Well, any data with a constant header on it.  For example, a lot of C
files will have big README's in the comments saying what the file is
part of.  Also, email messages in fact tend to have quite a few bytes at
the beginning that are at least very similar, if not exactly the same. 
Oh I dunno, it's just that, when it bites, it will bite hard!


Anyway, if we could all step away from the micro-optimization for a
moment, consider this cool alternative: cache the hash.  This was
described either here or on comp.lang.smalltalk, I don't remember. 
Anyway, the idea (and the guy had implemented it, in a different system)
is that once the hash has been computed, store it in an instance
variable.  Then, whenever an at:put: or replace:from: or whatever else
happens that changes the characters in the string, update the hash.  And
that's it!

In this scheme, #hash itself is of course extremely fast, and the hash
value is sensitive to the entire string's contents.  You do pay some
overhead on update operations, but I'd guess it's only a small constant
overhead in most cases: "is hashCache nil?  yes."  But even the worst
case is still linear in the number of characters being updated.

Pretty neat, scheme, isn't it?  I bet there aren't even all that many
update operations in String, because most should be defering to at:put:
or replaceFrom:to:with:startingAt:.  A tricky bit, sadly, is figuring
out where to stash the hash value; I don't think you can just add an
instance variable to String, can you?


-Lex



More information about the Squeak-dev mailing list