identityHash patch discussion.

Stephan Rudlof sr at evolgo.de
Tue Dec 11 03:36:38 UTC 2001


Andres Valloud wrote:
> 
> Hi Scott,
> 
> > The problem is that those bits are used as-is, to construct a number
> > in the *specific* range 0...4095, which is used as the starting index
> > in Set>>scanFor: This flaw means that *any* set and *any* dictionary
> > that has 4000 elements in it will devolve into a linear search
> > through all of the elements, when it could be 4000 times faster.
> 
> Wouldn't this only happen for IdentitySets and IdentityDictionaries?

No, for those classes it is already repaired:

From a previous mail of mine:
-----------------------
This problem in
Set>>scanFor: anObject
        "Scan the key array for the first slot containing either a nil
        (indicating an empty slot) or an element that matches anObject.
        Answer the index of that slot or zero if no slot is found. This
        method will be overridden in various subclasses that have
        different interpretations for matching elements."
        | element start finish |
        start _ (anObject hash \\ array size) + 1.
        finish _ array size.
        ...

is repaired in
IdentitySet>>scanFor: anObject
        | finish hash start element |
        finish _ array size.
        finish > 4096
                ifTrue: [hash _ anObject identityHash * (finish // 4096)]
                ifFalse: [hash _ anObject identityHash].
        start _ (hash \\ array size) + 1.
        ...
-----------------------
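The effect can be simulated outside Squeak. A minimal Python sketch (the simulation setup, table sizes, and function names are mine): objects carry only a 12-bit identityHash, so in a plain Set with more than 4096 slots all start indices crowd into the first 4096 slots, while the IdentitySet scaling spreads them over the whole table and shortens the probe chains.

```python
import random

def avg_probes(table_size, n_elements, spread=lambda h, size: h):
    """Insert n_elements objects with random 12-bit identityHash values
    into an open-addressed table and return mean probes per insertion."""
    table = [None] * table_size
    total = 0
    for obj in range(n_elements):
        h = random.randrange(4096)              # 12-bit identityHash
        i = spread(h, table_size) % table_size  # start slot, as in scanFor:
        probes = 1
        while table[i] is not None:             # linear scan for a free slot
            i = (i + 1) % table_size
            probes += 1
        table[i] = obj
        total += probes
    return total / n_elements

random.seed(1)
plain  = avg_probes(8192, 5000)                              # plain Set>>scanFor:
scaled = avg_probes(8192, 5000, lambda h, s: s // 4096 * h)  # IdentitySet repair
print(plain > scaled)   # scaling shortens the average probe chain
```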

But this repair doesn't work well for very large sets: spreading the
identity hash leaves big contiguous chunks of colliding elements that
have to be searched linearly.
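Some back-of-envelope numbers (mine, for illustration) make the chunk problem concrete: on a table of about one million slots the scale factor is 256, so the 4096 possible start slots lie 256 apart, and with 400k elements each start slot anchors a chunk of roughly 98 colliding objects that scanFor: walks linearly.

```python
# Back-of-envelope numbers for the IdentitySet spreading on a very
# large table: every object sharing one identityHash value starts at
# the same slot, so each of the 4096 start slots anchors one contiguous
# chunk that scanFor: must walk linearly.
table_size = 2 ** 20               # about 1M slots
n_elements = 400_000
factor = table_size // 4096        # 256: spacing between start slots
avg_chunk = n_elements / 4096      # about 98 objects per identityHash value
print(factor, round(avg_chunk))
```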

After many mails and some thinking ;-) Scott's last suggestion in
	Subject: [FIX] Re: [FIX] identityHash and fixing collections. (repost)
	Date: Sun, 9 Dec 2001 19:18:52 -0500 (EST)
to multiply the hash bits by the factor p = 135859 solves this problem
for Sets of up to about 100k objects that use identityHash as their
hash, with an effect similar to the repair (for IdentitySets) above.
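A sketch of what the multiplication does (p = 135859 is from Scott's mail; the table size and the rest of the setup are mine): taking the hash modulo the table size directly confines all 4096 possible identityHash values to the first 4096 slots, whereas multiplying by p first scatters the start slots across the whole table.

```python
# Start indices produced by 12-bit identityHash values (0..4095) in a
# table of 100k slots, with and without Scott's multiplier p = 135859.
size = 100_000
plain  = {h % size for h in range(4096)}           # crowded into slots 0..4095
primed = {h * 135859 % size for h in range(4096)}  # scattered over the table

print(max(plain))                    # all plain starts land in the first 4096 slots
print(sum(s > 4095 for s in primed)) # most multiplied starts land beyond them
```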

For sets larger than this 100k limit a better hash function is needed in
any case, since further spreading decreases performance too much.


Greetings,

Stephan
-- 
Stephan Rudlof (sr at evolgo.de)
   "Genius doesn't work on an assembly line basis.
    You can't simply say, 'Today I will be brilliant.'"
    -- Kirk, "The Ultimate Computer", stardate 4731.3




More information about the Squeak-dev mailing list