identityHash patch discussion.

Scott A Crosby crosby at qwes.math.cmu.edu
Tue Dec 11 03:33:05 UTC 2001


Stephen, to reply to you: there appears to be a small slowdown due to the
Squeak interpreter running the multiplication itself instead of doing it
as a built-in primitive.

On Mon, 10 Dec 2001, Andres Valloud wrote:

> Hi Scott,
>
> > The problem is that those bits are used as-is, to construct a number
> > in the *specific* range 0...4095, which is used as the starting index
> > in Set>>scanFor: This flaw means that *any* set and *any* dictionary
> > that has 4000 elements in it will devolve into a linear search
> > through all of the elements, when it could be 4000 times faster.
>
> Wouldn't this only happen for IdentitySets and IdentityDictionaries?

Any class that does not define its own #hash uses #identityHash by
default; this includes basically all of Morphic. Very few classes define
their own #hash.

>
> It is not very difficult to change #hash if you don't modify
> #identityHash.
>

I subscribe to the motto: fix it once, fix it in the right place.

This fix simplifies the code. If you want to see why, study it. If you
don't care, just trust my word that it's the right thing.


> At any rate, when it comes to the modular arithmetic hashed collections
> use to map hash values into an array, it really doesn't matter if you
> multiply the values by constants or things like that.  After all, after
> \\ sizeOfArray is sent, all those values will be constrained to the
> range 1..sizeOfArray anyway.
>

No, the values will be constrained to '1 ... min(4096, sizeOfArray)'.

Which is the problem....

> If my memory serves me right, when the 4096 possible original
> #identityHash values are well distributed, after being multiplied by a
> constant the results will also be well distributed over a larger range.

... because #identityHash values are *not* well-distributed when taken
modulo any number > 4096.

Contrast the distributions of:
   X='object hashBits \\ array size  +1'
and
   Y='(object hashBits * 4297) \\ array size   +1'

For 'array size' at 100, 1000, 10000, 100000.

And 'object hashBits' evenly distributed in the range 0...4095
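To make the contrast concrete, here is a small simulation (a sketch in Python rather than Squeak; 4297 is the multiplier from this thread, and `slots` is a hypothetical helper standing in for the two expressions above):

```python
# Compare where X and Y land when 'object hashBits' is evenly
# distributed over 0...4095 (the 12-bit identityHash range).
def slots(array_size, multiplier=1):
    """Slot index for every possible 12-bit identityHash value."""
    return [(h * multiplier) % array_size + 1 for h in range(4096)]

for array_size in (100, 1000, 10000, 100000):
    x = slots(array_size)        # X = object hashBits \\ array size + 1
    y = slots(array_size, 4297)  # Y = (object hashBits * 4297) \\ array size + 1
    print(array_size, max(x), max(y))
```

For 'array size' = 100000, X never exceeds slot 4096, while Y reaches slots spread across the whole table; that confinement to the first 4096 slots is exactly the problem.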

This is *NOT* what we want, because the first distribution tickles a bug
in Set, Dictionary, etc. that leads to a 1000x performance degradation
for any Set/Dictionary with more than 4000 elements.

Look at Set>>scanFor:, and how it will react, given the above
distributions of X,Y for the different array sizes.

I.e., what happens if X is ONLY distributed in the range 1...4096, *AND*
'array size' > 10,000?
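The degradation can be sketched with a toy linear-probing insert loop in the spirit of Set>>scanFor: (Python for illustration; this is a simplified model I'm assuming, not a measurement of Squeak itself):

```python
import random

def average_probes(starts, table_size):
    # Toy model of Set>>scanFor:'s probe loop: walk forward from the
    # start index until a free slot is found; count slots examined.
    table = [None] * table_size
    total = 0
    for key, start in enumerate(starts):
        i = start
        probes = 1
        while table[i] is not None:
            i = (i + 1) % table_size
            probes += 1
        table[i] = key
        total += probes
    return total / len(starts)

random.seed(42)
hashes = [random.randrange(4096) for _ in range(4000)]  # 12-bit identityHash
raw  = average_probes([h % 100000 for h in hashes], 100000)
mult = average_probes([h * 4297 % 100000 for h in hashes], 100000)
# 'raw' crowds all 4000 elements into slots 0...4095 of a 100000-slot
# table, so probe chains grow long; 'mult' spreads them across the table.
```

In this model the raw-identityHash inserts average an order of magnitude more probes than the multiplied ones, which is the Set/Dictionary degradation described above.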

> I think that in this kind of situations, it's easier to use Sets and
> Dictionaries instead of their identity counterparts so you can take
> advantage of #hash.  Then you don't have to do anything to achieve
> better lookup performance.

Few classes define their own #hash. The reason to fix it at
#identityHash is to avoid these bugs in future collections. Do it once,
do it right.

>
> I don't have all the previous discussion here.  I'd be interested to
> read about something that is inefficient because of the short
> #identityHash which can't be solved without using a larger
> #identityHash.

Huh? Define 'larger identityHash'. Do you mean an even distribution on
0 ... 2^k-1, for k > 12? This would be nicer, but would require
*SERIOUS* image changes.

Or some new distribution of 4096 values over the range of SmallInteger?
Some distribution other than the first 4096 (for example, something like
[0,3,6,9, ... 12285])?

If so, then I wholly agree; this is what I advocate and what my patch
does.

Read Set>>scanFor: very closely to see why #identityHash is broken. It's
subtle; look at the mailing list archives.

Scott

--
No DVD movie will ever enter the public domain, nor will any CD. The last CD
and the last DVD will have moldered away decades before they leave copyright.
This is not encouraging the creation of knowledge in the public domain.