[squeak-dev] 32 vs 64 bits and large integer hash

Sat Nov 24 21:18:14 UTC 2018

On Sat, Nov 24, 2018 at 7:40 AM Luciano Notarfrancesco <luchiano at gmail.com>
wrote:

> On Fri, Nov 23, 2018 at 12:36 AM Chris Muller <asqueaker at gmail.com> wrote:
>
>> > >> I'm a bit puzzled. I thought  (small)Integers being their own hash
>> is a good thing?
>>
>> I was wondering exactly the same thing!
>>
>> > > I would call it simple but not necessarily good.
>> > > The problem with it is that consecutive numbers generate long chains
>> in HashedCollections:
>> > >
>> > > a := (1 to: 1000) asArray.
>> > > s := Set withAll: a.
>> > > [ 1 to: 1000000 do: [ :each | s includes: each ] ] timeToRun.
>> > > "==> 7014"
>> > >
>> > > The solution in Squeak is to use PluggableSet instead of Set, because
>> it applies #hashMultiply on the hash value:
>> > >
>> > > ps := PluggableSet integerSet.
>> > > ps addAll: a.
>> > > [ 1 to: 1000000 do: [ :each | ps includes: each ] ] timeToRun.
>> > > "==> 95"
>> > >
>> > > IIRC in Pharo SmallInteger's hash is based on #hashMultiply to avoid
>> the long chains. That was probably the main reason for the push to make
>> #hashMultply a numbered primitive.
>> > >
>> >
>> > Interesting!
>>
>> Indeed!  When making a #hash methods, one always focuses on hash
>> distribution and finding the elements, but its easy to forget about
>> performance of NOT finding an element.
>>
>>
> Yes! I had this problem blow in my face a couple of years ago, and Juan
> agreed to include the change in Cuis to make SmallIntegers NOT their own
> hash.
>

This seems to be a basic error.  The idea of a hash function is to produce
a well-distributed set of integers for some set of values.  Since the
SmallIntegers are themselves perfectly distributed (each unique
SmallInteger is a unique value), it is impossible to produce a better
distributed hash function than  the integers themselves. For some
application it may indeed be possible to produce a better distributed set
of hashes for the integers; for example an application which considers only
powers of two could use the log base 2 to produce a smaller and better
distributed set fo values modulo N than the integers themselves.  But in
general the integers are definitionally well-distributed.  In fact, unless
one has a perfect hash function one is in danger of producing a less
well-distributed set of values from a hash function than the SmallIntegers
themselves.

This argument doesn't apply as integers grow beyond the SmallInteger, but
not because we want better distribution of values than the large integers
themselves, but b because we want to avoid large integer arithmetic.

> I think it is a good idea in general when programming a hash method in
> Smalltalk to make it somewhat random.
>

As I've indicated, I think this is impossible in general.  It is only in
specific applications, using specific subsets of the SmallIn tigers for
which a better hash function could be derived, but this would be specific
to that application.  For purposes such as these we have
PluggableDictionary et al which can exploit an application-specific hash.
But in general the SmallIntegers are ideal values for their own hashes.

> It will never be perfect for every use case, and it doesn't need to be
> cryptographically secure, but there shouldn't be simple use cases (e.g.,
> consecutive integers) that produce hashes that are not uniformly
> distributed. This can be achieved with minimal performance impact (and
> potentially big performance gains in hashed collections) quite simply by
> sending some hashMultiply message in your hash method.
>
> Regards,
> Luciano
>

_,,,^..^,,,_
best, Eliot
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20181124/aabecf65/attachment.html>