[squeak-dev] 32 vs 64 bits and large integer hash
eliot.miranda at gmail.com
Sat Nov 24 21:18:14 UTC 2018
On Sat, Nov 24, 2018 at 7:40 AM Luciano Notarfrancesco <luchiano at gmail.com>
> On Fri, Nov 23, 2018 at 12:36 AM Chris Muller <asqueaker at gmail.com> wrote:
>> > >> I'm a bit puzzled. I thought (small)Integers being their own hash
>> is a good thing?
>> I was wondering exactly the same thing!
>> > > I would call it simple but not necessarily good.
>> > > The problem with it is that consecutive numbers generate long chains
>> in HashedCollections:
>> > >
>> > > a := (1 to: 1000) asArray.
>> > > s := Set withAll: a.
>> > > [ 1 to: 1000000 do: [ :each | s includes: each ] ] timeToRun.
>> > > "==> 7014"
>> > >
>> > > The solution in Squeak is to use PluggableSet instead of Set, because
>> it applies #hashMultiply on the hash value:
>> > >
>> > > ps := PluggableSet integerSet.
>> > > ps addAll: a.
>> > > [ 1 to: 1000000 do: [ :each | ps includes: each ] ] timeToRun.
>> > > "==> 95"
>> > >
>> > > IIRC in Pharo SmallInteger's hash is based on #hashMultiply to avoid
>> the long chains. That was probably the main reason for the push to make
>> #hashMultply a numbered primitive.
>> > >
>> > Interesting!
>> Indeed! When making a #hash methods, one always focuses on hash
>> distribution and finding the elements, but its easy to forget about
>> performance of NOT finding an element.
> Yes! I had this problem blow in my face a couple of years ago, and Juan
> agreed to include the change in Cuis to make SmallIntegers NOT their own
This seems to be a basic error. The idea of a hash function is to produce
a well-distributed set of integers for some set of values. Since the
SmallIntegers are themselves perfectly distributed (each unique
SmallInteger is a unique value), it is impossible to produce a better
distributed hash function than the integers themselves. For some
application it may indeed be possible to produce a better distributed set
of hashes for the integers; for example an application which considers only
powers of two could use the log base 2 to produce a smaller and better
distributed set fo values modulo N than the integers themselves. But in
general the integers are definitionally well-distributed. In fact, unless
one has a perfect hash function one is in danger of producing a less
well-distributed set of values from a hash function than the SmallIntegers
This argument doesn't apply as integers grow beyond the SmallInteger, but
not because we want better distribution of values than the large integers
themselves, but b because we want to avoid large integer arithmetic.
> I think it is a good idea in general when programming a hash method in
> Smalltalk to make it somewhat random.
As I've indicated, I think this is impossible in general. It is only in
specific applications, using specific subsets of the SmallIn tigers for
which a better hash function could be derived, but this would be specific
to that application. For purposes such as these we have
PluggableDictionary et al which can exploit an application-specific hash.
But in general the SmallIntegers are ideal values for their own hashes.
> It will never be perfect for every use case, and it doesn't need to be
> cryptographically secure, but there shouldn't be simple use cases (e.g.,
> consecutive integers) that produce hashes that are not uniformly
> distributed. This can be achieved with minimal performance impact (and
> potentially big performance gains in hashed collections) quite simply by
> sending some hashMultiply message in your hash method.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Squeak-dev