[squeak-dev] 32 vs 64 bits and large integer hash
eliot.miranda at gmail.com
Sun Nov 25 21:21:29 UTC 2018
> On Nov 24, 2018, at 9:06 PM, Luciano Notarfrancesco <luchiano at gmail.com> wrote:
> Hi Eliot!
>> On Sat, Nov 24, 2018 at 9:18 PM Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>> On Sat, Nov 24, 2018 at 7:40 AM Luciano Notarfrancesco <luchiano at gmail.com> wrote:
>>> Yes! I had this problem blow in my face a couple of years ago, and Juan agreed to include the change in Cuis to make SmallIntegers NOT their own hash.
>> This seems to be a basic error. The idea of a hash function is to produce a well-distributed set of integers for some set of values. Since the SmallIntegers are themselves perfectly distributed (each unique SmallInteger is a unique value), it is impossible to produce a better distributed hash function than the integers themselves. For some application it may indeed be possible to produce a better distributed set of hashes for the integers; for example an application which considers only powers of two could use the log base 2 to produce a smaller and better distributed set fo values modulo N than the integers themselves. But in general the integers are definitionally well-distributed. In fact, unless one has a perfect hash function one is in danger of producing a less well-distributed set of values from a hash function than the SmallIntegers themselves.
> The problem I see is that sequences of SmallIntegers that we humans tend add to hashed collections are not usually uniformly distributed. Hopefully they have some meaning, and thus some patterns to them, most of the time they don't look random, and thus if they are their own hash the performance of hashed collections will most likely suffer.
As I said, god specific cases where the distribution of interludes used fits some pattern, there is PluggableDictionary and PluggableSet. In cases where integers are randomly or evenly distributed one cannot improve on integers being their own hashes, so KISS, and use the available alternatives.
> I found this problem when profiling my code. I guess some people might assume that Sets or Dictionaries are fast and use them with SmallIntegers without much care, and never discover that their code could be orders of magnitude faster. So, another option might be to modify scanFor: to randomize the hash of the argument (for example sending an additional hashMultiply). This is because scanFor: is assuming the arguments of successive calls will have uniformly distributed hashes:
> start := anObject hash \\ array size + 1.
> the algorithm needs 'start' to be uniformly distributed between 1 and array size, and in the case of SmallIntegers current hash this is very unlikely and lookups degenerate in linear searches.
Whether it is likely or unlikely depends on the specific set of integers. Any hash function can suffer from the same issue; and if a hash function produces fewer values than its input set then that hash function may make things worse. Hence it makes sense to keep things simple and be aware of alternatives for specific situations.
> Also, any cryptographic hash will do great in pretty much ALL use cases. Finding a sequence of integers that produces non-uniform hashes is very hard,
Since one takes the result of the hash modulo N I think this statement is false. It depends on the hash table size and the specific set of integers one is hashing.
> and is equivalent to breaking the cryptographic hash and make it unusable for cryptography. This is a sort of silver bullet, I seem to remember that Java does this now, but I think it is overkill and something like hashMultiply is enough (plus, it is cheaper).
> My 2 maos,
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Squeak-dev