[squeak-dev] 32 vs 64 bits and large integer hash

Sun Nov 25 05:06:50 UTC 2018

Hi Eliot!

On Sat, Nov 24, 2018 at 9:18 PM Eliot Miranda <eliot.miranda at gmail.com>
wrote:

>
> On Sat, Nov 24, 2018 at 7:40 AM Luciano Notarfrancesco <luchiano at gmail.com>
> wrote:
>
>>
>> Yes! I had this problem blow in my face a couple of years ago, and Juan
>> agreed to include the change in Cuis to make SmallIntegers NOT their own
>> hash.
>>
>
> This seems to be a basic error.  The idea of a hash function is to produce
> a well-distributed set of integers for some set of values.  Since the
> SmallIntegers are themselves perfectly distributed (each unique
> SmallInteger is a unique value), it is impossible to produce a better
> distributed hash function than  the integers themselves. For some
> application it may indeed be possible to produce a better distributed set
> of hashes for the integers; for example an application which considers only
> powers of two could use the log base 2 to produce a smaller and better
> distributed set fo values modulo N than the integers themselves.  But in
> general the integers are definitionally well-distributed.  In fact, unless
> one has a perfect hash function one is in danger of producing a less
> well-distributed set of values from a hash function than the SmallIntegers
> themselves.
>
>
The problem I see is that sequences of SmallIntegers that we humans tend
add to hashed collections are not usually uniformly distributed. Hopefully
they have some meaning, and thus some patterns to them, most of the time
they don't look random, and thus if they are their own hash the performance
of hashed collections will most likely suffer.

I found this problem when profiling my code. I guess some people might
assume that Sets or Dictionaries are fast and use them with SmallIntegers
without much care, and never discover that their code could be orders of
magnitude faster. So, another option might be to modify scanFor: to
randomize the hash of the argument (for example sending an additional
hashMultiply). This is because scanFor: is assuming the arguments of
successive calls will have uniformly distributed hashes:
    start := anObject hash \\ array size + 1.
the algorithm needs 'start' to be uniformly distributed between 1 and array
size, and in the case of SmallIntegers current hash this is very unlikely
and lookups degenerate in linear searches.

Also, any cryptographic hash will do great in pretty much ALL use cases.
Finding a sequence of integers that produces non-uniform hashes is very
hard, and is equivalent to breaking the cryptographic hash and make it
unusable for cryptography. This is a sort of silver bullet, I seem to
remember that Java does this now, but I think it is overkill and something
like hashMultiply is enough (plus, it is cheaper).

My 2 maos,
Luciano
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20181125/650306cc/attachment.html>