[squeak-dev] 32 vs 64 bits and large integer hash

Sun Nov 25 23:22:13 UTC 2018

Le dim. 25 nov. 2018 à 22:21, Eliot Miranda <eliot.miranda at gmail.com> a
écrit :

> Hi Luciano,
>
> On Nov 24, 2018, at 9:06 PM, Luciano Notarfrancesco <luchiano at gmail.com>
> wrote:
>
> Hi Eliot!
>
> On Sat, Nov 24, 2018 at 9:18 PM Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
>>
>> On Sat, Nov 24, 2018 at 7:40 AM Luciano Notarfrancesco <
>> luchiano at gmail.com> wrote:
>>
>>>
>>> Yes! I had this problem blow in my face a couple of years ago, and Juan
>>> agreed to include the change in Cuis to make SmallIntegers NOT their own
>>> hash.
>>>
>>
>> This seems to be a basic error.  The idea of a hash function is to
>> produce a well-distributed set of integers for some set of values.  Since
>> the SmallIntegers are themselves perfectly distributed (each unique
>> SmallInteger is a unique value), it is impossible to produce a better
>> distributed hash function than  the integers themselves. For some
>> application it may indeed be possible to produce a better distributed set
>> of hashes for the integers; for example an application which considers only
>> powers of two could use the log base 2 to produce a smaller and better
>> distributed set fo values modulo N than the integers themselves.  But in
>> general the integers are definitionally well-distributed.  In fact, unless
>> one has a perfect hash function one is in danger of producing a less
>> well-distributed set of values from a hash function than the SmallIntegers
>> themselves.
>>
>>
> The problem I see is that sequences of SmallIntegers that we humans tend
> add to hashed collections are not usually uniformly distributed. Hopefully
> they have some meaning, and thus some patterns to them, most of the time
> they don't look random, and thus if they are their own hash the performance
> of hashed collections will most likely suffer.
>
>
> As I said, god specific cases where the distribution of interludes used
> fits some pattern, there is PluggableDictionary and PluggableSet.  In
> cases where integers are randomly or evenly distributed one cannot improve
> on integers being their own hashes, so KISS, and use the available
> alternatives.
>
> Hi Eliot,
my bet is as Luciano: set of integers with uniform distribution is the
least probable in my experience: it's just noise. Every other data of
interest will show different patterns.
So I'm not sure that forcing usage of Pluggable* is that simple, for
preserving theoretical property that we don't encounter...

> I found this problem when profiling my code. I guess some people might
> assume that Sets or Dictionaries are fast and use them with SmallIntegers
> without much care, and never discover that their code could be orders of
> magnitude faster. So, another option might be to modify scanFor: to
> randomize the hash of the argument (for example sending an additional
> hashMultiply). This is because scanFor: is assuming the arguments of
> successive calls will have uniformly distributed hashes:
>     start := anObject hash \\ array size + 1.
> the algorithm needs 'start' to be uniformly distributed between 1 and
> array size, and in the case of SmallIntegers current hash this is very
> unlikely and lookups degenerate in linear searches.
>
>
> Whether it is likely or unlikely depends on the specific set of integers.
> Any hash function can suffer from the same issue; and if a hash function
> produces fewer values than its input set then that hash function may make
> things worse.  Hence it makes sense to keep things simple and be aware of
> alternatives for specific situations.
>
>
But that is the case for any other hash. What is important is the length of
hash bits compared to the size of the Set.
If hashMultiply result in too short hashes, then we get a problem for Set
of any other objects...

> Also, any cryptographic hash will do great in pretty much ALL use cases.
> Finding a sequence of integers that produces non-uniform hashes is very
> hard,
>
>
> Since one takes the result of the hash modulo N I think this statement is
> false.  It depends on the hash table size and the specific set of integers
> one is hashing.
>
> Hmm number theory, let's see... Hmm, that's really a question for Luciano
or Andres ;)

> and is equivalent to breaking the cryptographic hash and make it unusable
> for cryptography. This is a sort of silver bullet, I seem to remember that
> Java does this now, but I think it is overkill and something like
> hashMultiply is enough (plus, it is cheaper).
>
> My 2 maos,
> Luciano
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20181126/516bcb3e/attachment.html>