[BUG] Unscalability in collections.

Scott A Crosby crosby at qwes.math.cmu.edu
Sun Oct 7 15:35:19 UTC 2001


Sorry for being confusing.

My question is: How do I alter a hashing method and at the same time
rehash all sets with the new method?

If you can answer that, then you don't need to know details; I'll do the
work and submit it. :)

If you want the details, see below.

Thanks for the help though.

Scott



--

Sorry for being confusing; I was wrong.

I wanted to change *something*. I had first thought to change hashBitsOf:,
which, as you and others showed, is a bad idea: it is too low-level and the
wrong layer to do this in.

The general problem is that:

 Object>>hash
     ^ self identityHash

has a reduced range: it will only ever return values in the range 0-4095
(4096 distinct values). Given how collections probe for empty slots when
inserting, this behaves very badly for a Set/Dictionary that contains more
than 3500 elements. The fix is fairly simple: multiply the object's
hashBitsOf value by a relatively large integer and use that. (I explain why
in the forwarded message attached below.)
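
To make the effect concrete, here is a rough workspace sketch (my own
illustration, not part of the proposed change; the 100,000-slot table and
the 4097 multiplier are just the numbers used in this mail). It looks at
where 10000 fresh objects would start probing, before and after the
multiply:

    | hashes plainMax spreadMax |
    hashes := (1 to: 10000) collect: [:i | Object new identityHash].
    "Without the multiply: every probe-start slot stays below 4096."
    plainMax := (hashes collect: [:h | h \\ 100000])
                    inject: 0 into: [:m :e | m max: e].
    "With the multiply: start slots spread over nearly the whole table."
    spreadMax := (hashes collect: [:h | h * 4097 \\ 100000])
                    inject: 0 into: [:m :e | m max: e].
    Transcript cr; show: 'largest start slot, plain: ', plainMax printString;
               cr; show: 'largest start slot, spread: ', spreadMax printString.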

The question is where to make that change. One option is:

Object>>hash
    ^ self identityHash * 4097

But note that IdentitySet/IdentityDictionary already have special-case code
to work around the limited range of identityHash (see IdentitySet>>scanFor:).

I can remove that special-case code and clean up Identity*/Pluggable*
with a small refactoring.


Roughly, the plan is to:
  #1 create a ProtoObject>>hashBitsOf, and make method dictionaries use this
method when finding out where to insert a method.
  #2 alter ProtoObject>>identityHash to return '4097 * self hashBitsOf'
(a rough sketch of #1 and #2 follows this list).
  #3 add a new method to the interpreter that does all of this internally.
  #4 sort out some other details: refactorings, changes to existing code, etc.
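
For what it's worth, here is roughly what I picture #1 and #2 looking like
on the image side. This is only a sketch; it assumes hashBitsOf can simply
take over the primitive that identityHash calls today (75 in the images I
have looked at):

    ProtoObject>>hashBitsOf
        "Answer the raw 12-bit header hash. Sketch only: this assumes the
         existing identityHash primitive can simply be reused."
        <primitive: 75>
        self primitiveFailed

    ProtoObject>>identityHash
        "Step #2: spread the 12-bit value over a larger range."
        ^ 4097 * self hashBitsOf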

I've done #1 in my image already. #2 causes the image to die in interesting
ways. I tried using the doOnlyOnce mechanism for #2, but that mechanism uses
dictionaries, which themselves break under the change.

Scott




---------- Forwarded message ----------
Date: Sat, 6 Oct 2001 14:17:37 -0400 (EDT)
From: Scott A Crosby <crosby at qwes.math.cmu.edu>
Reply-To: squeak-dev at lists.squeakfoundation.org
To: Tim Rowledge <tim at sumeru.stanford.edu>
Cc: squeak-dev at lists.squeakfoundation.org
Subject: Re: Horrible unscalability, how do I change hashBitsOf: ?

On Sat, 6 Oct 2001, Tim Rowledge wrote:

>
> > The fix is to make the following change:
> >
> > ```
> > hashBitsOf: oop
> >         ^ ((self baseHeader: oop)
> >                 >> 17 bitAnd: 4095) << 10
> > ```
> err, I think that will simply leave the same number of hash values,
> simply multiplied by 1024 or whatever. It's hard to see how that could
> help with the problem. To really improve things would require a bigger

You misunderstood; the reasoning is subtle. Let me elaborate.

What the dictionary search does is scan from

  'key hash \\ array size' to the end of the array, then wrap around.
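
In code, the probe loop has roughly this shape (a schematic I wrote for this
mail, not lifted from the image; 'array' stands for the collection's
internal storage):

    scanFor: anObject
        "Schematic only: start at 'hash \\ array size', walk forward looking
         for the object or an empty slot, wrapping around once; answer 0 if
         the table is full."
        | start index element |
        start := (anObject hash \\ array size) + 1.
        index := start.
        [element := array at: index.
         (element isNil or: [element = anObject]) ifTrue: [^ index].
         index := index \\ array size + 1.
         index = start] whileFalse.
        ^ 0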

Now assume we have a large set/dictionary. For illustration, let 'array
size' be 100,000 or more.

I proceed to insert 10000 objects into it. Because those objects have hash
values in the range 0-4095, they will all be inserted at the *beginning* of
the array (since 'key hash \\ array size' is just 'key hash' when 'key hash
< array size'). They'll all land in the first 10k or so entries; the other
90% of the array stays empty.

Thus, most searches will start at an offset between 0 and 4095 and have to
scan 3000 or more objects to find a match.

Now, if we make a single change and use '20 * key hash' to compute the
offset at which to start scanning, we'd instead expect a cluster of about
2.5 objects at each of the offsets 0, 20, 40, 60, ..., 81900. Thus, after
hashing, any linear search only has to scan about 2.5 objects.
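
Here is a quick workspace simulation of exactly this situation (my own
sketch: linear probing into a bare 100,000-slot Array, counting how many
occupied slots each insert has to step over):

    | size hashes slots total idx |
    size := 100000.
    hashes := (1 to: 10000) collect: [:i | Object new identityHash].
    "Pass 1: probe start taken from the raw hash."
    slots := Array new: size.
    total := 0.
    hashes do: [:h |
        idx := h \\ size + 1.
        [(slots at: idx) isNil] whileFalse: [
            total := total + 1.
            idx := idx \\ size + 1].
        slots at: idx put: h].
    Transcript cr; show: 'average extra probes, raw hash: ',
        (total / hashes size) asFloat printString.
    "Pass 2: probe start taken from 20 * the raw hash."
    slots := Array new: size.
    total := 0.
    hashes do: [:h |
        idx := 20 * h \\ size + 1.
        [(slots at: idx) isNil] whileFalse: [
            total := total + 1.
            idx := idx \\ size + 1].
        slots at: idx put: h].
    Transcript cr; show: 'average extra probes, 20 * hash: ',
        (total / hashes size) asFloat printString.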


If you have a set with more than 5000 objects in it, the difference between
doing that extra multiply and not doing it is roughly a factor of 1000 in
the length of the linear scan: probing about N/4000 objects instead of about
N/2 - 1000. For N = 10000, that is roughly 2.5 probes versus roughly 4000.

--

I simplified things somewhat by assuming that the hash table is relatively
empty, so I could ignore collisions. Because the table is expected to be at
most 75% loaded, you'd expect collisions between groups and a little more
probing than I've described, so in practice the difference will be more like
a factor of 300x-1000x, not 1000x-3000x.

> hash field in the object headers. Of course, that would cost other
> useful bits, or an extra header word or some other complication.

I'm not talking about the number of hash bits. I'm talking about how those
hash bits are misused in a way that makes them useless. You'd expect that
with k hash bits, a hash table would save you roughly a factor of 2^(k-1)
in work over a linear scan, but for the reason above those bits are wasted.

I'll take a 100x improvement over a linear scan any day. :)  Until I'm
faced with a problem that requires sets of tens of thousands of objects,
I won't worry about it. IMHO, 12 bits are more than plenty if they're
properly used. I am *not* suggesting increasing the number of bits; I am
suggesting that the code multiply that 12-bit value by 1024.

Also, because strings/symbols have their own hashing algorithm, they're not
faced with this problem; they generate a full range of hash values.
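
(If you want to convince yourself of that, an expression like the one below
should answer something well above 4095 in an image with the reworked string
hashing.)

    ((1 to: 1000) collect: [:i | i printString hash])
        inject: 0 into: [:m :e | m max: e]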

--

I hope you see what I'm talking about here; I know it's subtle. I also hope
you see why making the suggested change would be a good thing: it's simple,
short, and elegant, and it removes a *massive* scalability problem that just
about every collection faces the moment you ask it to hold more than 4000
objects.

The change is just a bit shift in the right place in the source code, for a
hundred-fold improvement in any large collection. :)

The trick is effecting it without breaking every aspect of the image.

Doing that is beyond my skill and my understanding of the entire Squeak
image. If the change isn't made, I can work around the lack of it (for
example, by defining 'hash' for my objects to be 'self identityHash
bitShift: 10'), but why not trivially fix it for everyone?
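
Spelled out, that workaround is just a one-liner per class (MyLargeKey is a
made-up example class):

    MyLargeKey>>hash
        "Per-class workaround: spread the 12-bit identityHash before the
         hash tables use it."
        ^ self identityHash bitShift: 10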

>
> Don't forget that you can implement hash for any object you like; the
> primitive is there mainly as a convenient default these days. Again

Exactly; and the moment I am faced with having to store a million objects
in a set and the current 12 bits become insufficient, I shall do exactly
that. Until then, I *like* not having to worry about it.

>
> I'm surprised that symbol hashing seemed expensive - didn't Andres
> Valloud's rework of all that stuff get in some time ago? It was supposed
> to make a big difference.

It was a few percent of the runtime; IMHO, this probably isn't something to
worry too much about.


Scott


--
No DVD movie will ever enter the public domain, nor will any CD. The last CD
and the last DVD will have moldered away decades before they leave copyright.
This is not encouraging the creation of knowledge in the public domain.