Horrible unscalability, how do I change hashBitsOf: ?

danielv at netvision.net.il
Sat Oct 6 19:49:34 UTC 2001


Hi Scott.
Thanks for worrying about Squeak performance - it saves us all some
money... ;-)

Celeste can hold tens of thousands of messages in a dictionary and still
see pretty reasonable performance. It didn't use to, precisely because of
the problem you note.

But first, note that you overstate your case when saying "every collection" -
most collections aren't hashed, and the huge majority of those that are have
between 0 and 3 items: 14864 of the 18961 Set allSubInstances in my current
image.
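
If you want to run the same kind of census in your own image, a little
workspace snippet like this will do it (the numbers will of course differ):

```
| sets tiny |
sets := Set allSubInstances.
tiny := sets select: [:each | each size <= 3].
tiny size -> sets size  "print it; gives something like 14864 -> 18961 here"
```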

Anyway, two things you can do when you care about the performance of big
hashed collections are to use the Identity* variant or the Pluggable*
variant. The first's #scanFor: has a little built-in trick to get that
O(n/2^b) behavior back for the usual b(its) = 12 case. The second lets
you define a hash function tailored to this particular collection (so
you don't need to change the class of the stored objects to implement
your workaround).
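
For instance, a Pluggable* collection lets you widen that 0-4095 range from
the outside, without touching the class of the keys. A minimal sketch,
assuming the usual hashBlock:/yourself idiom (the bit shift is the same
workaround Scott mentions further down):

```
| bigDict |
bigDict := PluggableDictionary new
	hashBlock: [:key | key hash bitShift: 10];  "spread the 12-bit hashes over a much wider range"
	yourself.
1 to: 10000 do: [:i | bigDict at: Object new put: i].
```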

For a PluggableSet example, see the method #integerSet, whose hash
function is tailored to spread SmallIntegers over their whole domain
without turning them into LargeIntegers (which would be slow), while
keeping sequential numbers well apart. Celeste message IDs are (or at
least were when I last looked) given out sequentially, which made their
hashing performance fragile in other ways.
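
I don't remember #integerSet's exact hash function offhand, so here is only
a sketch of the idea; the constants are made up for illustration, picked so
the result stays a SmallInteger while consecutive IDs land well apart:

```
| messageIDs |
messageIDs := PluggableSet new
	hashBlock: [:id | id \\ 1000003 * 1021];  "stays below SmallInteger maxVal, spreads neighbours apart"
	equalBlock: [:a :b | a = b];
	yourself.
1 to: 50000 do: [:i | messageIDs add: i].
```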

Either of these routes could serve someone whose application requires
scalable speed. If you make dictionaries predictably fast, without
having to mess with this sort of thing, that'll be great. But note that
keeping all the different cases fast is not trivial.

Both routes I pointed out are slower for small, well-hashed
dictionaries, where calculating a hash might be as expensive as the one
or two comparisons actually needed. In my current image, all but one of
the hashed collections are < 4096 items big, and the only rascal is the
SymbolTable, whose hashing still seems to function okay...

BTW, I've looked around, and though Squeak spends some time scanning
those Sets, Dictionaries and Strings, there doesn't seem to be any
primitive for doing the scanning. Maybe one could help the common case?
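
For reference, the loop such a primitive would replace looks roughly like
this; it's a paraphrase of the #scanFor: idea rather than the exact Squeak
source, and it's the same search Scott describes below:

```
scanFor: anObject
	"Answer the index of the slot holding anObject, or of the first empty (nil)
	 slot at or after its hash position; answer 0 if neither is found."
	| element start finish |
	finish := array size.
	start := anObject hash \\ finish + 1.
	"Scan from the hash position to the end of the array..."
	start to: finish do: [:index |
		((element := array at: index) == nil or: [element = anObject])
			ifTrue: [^ index]].
	"...then wrap around and scan from the beginning."
	1 to: start - 1 do: [:index |
		((element := array at: index) == nil or: [element = anObject])
			ifTrue: [^ index]].
	^ 0
```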

Daniel Vainsencher

Scott A Crosby <crosby at qwes.math.cmu.edu> wrote:
> On Sat, 6 Oct 2001, Tim Rowledge wrote:
> 
> >
> > > The fix is to make the following change:
> > >
> > > ```
> > > hashBitsOf: oop
> > >         ^ ((self baseHeader: oop)
> > >                 >> 17 bitAnd: 4095) << 10
> > > ```
> > err, I think that will simply leave the same number of hash values,
> > just multiplied by 1024 or whatever. It's hard to see how that could
> > help with the problem. To really improve things would require a bigger
> 
> You misunderstood.. The reasoning is very subtle. Let me elaborate..
> 
> What the dictionary search does is it searches from
> 
>   'key hash \\ array size' to the end of the array, then wraps around.
> 
> Now assume we have a large set/dictionary. For illustration, let 'array
> size' be 100,000 or more.
> 
> I proceed to insert 10000 objects into it. Because those objects have hash
> values in the range 0-4095, they will all be inserted at the *beginning* of
> the array (since 'key hash \\ array size' is just 'key hash' when the hash
> is smaller than the array size). They'll all be in the first 10k entries;
> the other 90% of the array is empty.
> 
> Thus, most searches will start at an index in the range 0-4095 and have to
> scan 3000 or more objects for a match.
> 
> Now, if we make a single change and use '20 * key hash' to figure out the
> offset into the array at which to start scanning, then we'd expect a
> cluster of about 2.5 objects at each of the offsets 0, 20, 40, 60, ... 79980.
> Thus, after hashing, any linear search has to scan only about 2.5 objects.
> 
> 
> If you have a set with >5000 objects in it, the difference between doing
> that extra multiply and not doing it is roughly a factor of 1000 in the
> size of the linear scan.
> 
> Probing 'N/4000' objects compared to probing 'N/2 - 1000'.
> 
> --
> 
> I simplified things somewhat by assuming that the hash table is relatively
> empty, so I can ignore collisions. Because the table is expected to be at
> most 75% loaded, you'd expect collisions between groups, and to probe a
> little more than I estimated, so the difference will, in practice, be more
> like a factor of 300x-1000x, not 1000x-3000x.
> 
> > hash field in the object headers. Of course, that would cost other
> > useful bits, or an extra header word or some other complication.
> 
> I'm not talking about the number of hash bits. I'm talking about how those
> hash bits are misused in a way that makes them useless. You'd expect that
> with k hash bits, a hash table would save you roughly 2^{k-1} work over a
> linear scan, but for the reason above those bits are wasted.
> 
> I'll take a 100x improvement over a linear scan any day. :)  Until I'm
> faced with a problem that requires sets of tens of thousands of objects,
> I'll not worry about it. IMHO, 12 bits seems more than plenty if they're
> properly used. I am *not* suggesting increasing this. I am suggesting
> that the code multiply that 12-bit value by 1024.
> 
> Also, because strings/symbols have their own hashing algorithm, they're
> not faced with this problem; they generate a full range of hash values.
> 
> --
> 
> I hope you see and understand what I'm talking about here -- I know it's
> subtle -- and why making the suggested change would be a good thing: it's
> simple, short, elegant, and removes a *massive* scalability problem that
> just about every collection faces the moment you ask it to hold more than
> 4000 objects.
> 
> The change is just a bit shift in the right place in the source code, for
> a hundred-fold improvement in any large collection.   :)
> 
> The trick is effecting it without breaking every aspect of the image.
> 
> This is beyond my skill or understanding of the entire Squeak image. If
> the change isn't made, I can work around the lack (for example, define
> 'hash' for my objects to be 'self identityHash bitShift: 10'), but why not
> trivially fix it for everyone?
> 
> >
> > Don't forget that you can implement hash for any object you like; the
> > primitive is there mainly as a convenient default these days. Again
> 
> Exactly, and the moment that I am faced with having to store a million
> objects in a set and the current 12 bits are insufficient, I shall do
> such a thing. Until then, I *like* not having to worry about it.
> 
> >
> > I'm surprised that symbol hashing seemed expensive - didn't Andres
> > Valloud's rework of all that stuff get in some time ago? It was supposed
> > to make a big difference.
> 
> It was a few percent of the runtime; IMHO, this probably isn't something
> to be too worried about.
> 
> 
> Scott
> 
> 
> --
> No DVD movie will ever enter the public domain, nor will any CD. The last CD
> and the last DVD will have moldered away decades before they leave copyright.
> This is not encouraging the creation of knowledge in the public domain.



