## Horrible unscalability, how do I change hashBitsOf: ?

Scott A Crosby crosby at qwes.math.cmu.edu
Sat Oct 6 18:17:37 UTC 2001

```On Sat, 6 Oct 2001, Tim Rowledge wrote:

>
> > The fix is to make the following change:
> >
> > ```
> > hashBitsOf: oop
> >         ^ ((self baseHeader: oop)
> >                 >> 17 bitAnd: 4095) << 10
> err, I think that will simply leave the same number of hash values,
> simply multiplied by 1024 or whatever. It's hard to see how that could
> help with the problem. To really improve things would require a bigger

You misunderstood; the reasoning is subtle. Let me elaborate.

What the dictionary search does is scan from

'key hash \\ array size' to the end of the array, then wrap around to the beginning.

Now assume we have a large set or dictionary; for illustration, let 'array
size' be 100,000 or more.

I proceed to insert 10,000 objects into it. Because those objects have hash
values in the range 0-4095, they will all be inserted at the *beginning* of
the array ('key hash \\ array size' is just 'key hash' whenever the hash is
smaller than the array size). They'll all be in the first 10k entries; the
other 90% of the array is empty.

Thus, most searches will start at some offset in 0 <= hash value <= 4095 and
have to scan 3,000 or more objects for a match.

Now, if we make a single change and use '20 * key hash' to compute the
offset into the array where the scan starts, then we'd expect a cluster of
about 2.5 objects at each of the offsets 0, 20, 40, 60, ..., 81,900. Thus,
after hashing, any linear search only has to scan about 2.5 objects.

If you have a set with more than 5,000 objects in it, the difference between
doing that extra multiply and not doing it is roughly a factor of 1,000 in
the length of the linear scan.

Probing roughly N/4000 objects, compared to probing roughly N/2 - 1000.

--

I simplified things somewhat by assuming that the hash table is relatively
empty, so I could ignore collisions. Since the table is expected to be at
most 75% loaded, you'd expect collisions between groups, and somewhat longer
probes than I described; in practice the difference will be more like a
factor of 300x-1000x, not 1000x-3000x.

> hash field in the object headers. Of course, that would cost other
> useful bits, or an extra header word or some other complication.

I'm not talking about the number of hash bits. I'm talking about how those
hash bits are misused in a way that makes them useless. You'd expect that
with k hash bits, a hash table would save you roughly a factor of 2^{k-1}
over a linear scan, but for the reason above, those bits are wasted.

I'll take a 100x improvement over a linear scan any day. :)  Until I'm
faced with a problem that requires sets of tens of thousands of objects,
I won't worry about it. IMHO, 12 bits is more than plenty, if the bits are
properly used. I am *not* suggesting increasing this. I am suggesting
that the code multiply that 12-bit value by 1024.

Also, because strings and symbols have their own hashing algorithm, they
aren't affected by this problem: they generate a full range of hash values.

--

I hope you see and understand what I'm talking about here; I know it's
subtle. Making the suggested change would be a good thing: it's simple,
short, and elegant, and it removes a *massive* scalability problem that
just about every collection faces the moment you ask it to hold more than
4,000 objects.

The change is just a bit shift in the right place in the source code, for
a hundred-fold improvement in any large collection.   :)

The trick is effecting it without breaking every aspect of the image.

This is beyond my skill, and beyond my understanding of the entire Squeak
image. If the change isn't made, I can work around the lack (for example,
by defining 'hash' for my objects as 'self identityHash bitShift: 10'), but
why not trivially fix it for everyone?

>
> Don't forget that you can implement hash for any object you like; the
> primitive is there mainly as a convenient default these days. Again

Exactly, and the moment that I am faced with having to store a million
objects in a set and the current 12 bits are insufficient, I shall do
such a thing. Until then, I *like* not having to worry about it.

>
> I'm surprised that symbol hashing seemed expensive - didn't Andres
> Valloud's rework of all that stuff get in some time ago? It was supposed
> to make a big difference.

It was a few percent of the runtime; IMHO, this probably isn't something
to worry about.

Scott

--
No DVD movie will ever enter the public domain, nor will any CD. The last CD
and the last DVD will have moldered away decades before they leave copyright.
This is not encouraging the creation of knowledge in the public domain.

```