New method cache, 30% faster macrobenchmarks and ineffeciencies.

Scott A Crosby crosby at qwes.math.cmu.edu
Mon Dec 10 06:49:02 UTC 2001


On Mon, 10 Dec 2001, Stephan Rudlof wrote:

> Scott A Crosby wrote:
> >
> > On Sun, 9 Dec 2001, Stephan Rudlof wrote:
> >
> >
> > > But for ordinary - not VM related - hashed collections an alternative hash
> > > scheme with VM support (named #longHash?) could make sense. And the use of
> >
> > Its probably not relevant. My hashed collection stuff avoids a nasty big
> > performance degradation for when you get to >4000 elements (where
> > performace will fall by a factor of 1000 as the algorithm devolves into a
> > linear search).. If someone needs to store >100k objects in a hash table,
> > they'll probably be far better served by writing a custom 'hash' function
> > and store a 30-bit hash value in a slot.
>
> I think I've taken the point: spreading a 12 bit hash to *much* larger
> values doesn't help for very large collections: it leads to big leaks in the
> collections just reachable by linear search...
>

Not quite. In theory, if you use it right, a hash-table where the hashes
can collect (say) 2^12 distinct values should be about 2^12 times faster
than a linear search.

The problem (see my earlier posts) is that large (say) 40000 element Set
SHOULD be about 4000 times faster than a linear search on 40000 elements.
(You can think of it as the hash value breaks things up into 4000 chains
of about 10 elements each.).. Squeak Set's and dictionaries do NOT have
this behaivor, ANY set over 4000 elements will always devolve into a
linear search, when it should be about 4000 times faster than a linear
search. The *why* is a subtle design flaw. See my earlier post.

In your other email. 4097 as the multiplier is not a problem except that
if use ``object hashBits * 4097 \\ array size '' to index into an array,
if 'array size' is not coprime with 4097, this will map several values
into the same bucket. For example, ``array size'' was 4097, then this will
map all values into the same bucket. Since 4097 is not prime, the odds of
'array size' having a common factor with it are 6%.

Ergo, using prime numbers, preferably big prime numbers, makes
things safer.

If you want wisdom, read and understand the subtle flaw in Set. :) Its
far more interesting than primality of hash-multipliers.


Scott






More information about the Squeak-dev mailing list