[Vm-dev] [Pharo-dev] String hash function

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Fri Apr 14 06:04:47 UTC 2017


2017-04-14 8:03 GMT+02:00 Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com>:

> However, since Symbols are unique, we could have a different hashing policy,
> like for example pinning Symbols and then using the identity hash.
> It would probably complicate the handling of memory, but that's the deal.
>
>
Oops, forget it: the identity hash does not rely on the object's position in
memory, so there is no need for pinning.
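
A minimal sketch of what such a Symbol-specific policy could look like
(hypothetical code, not an actual proposal from this thread; it assumes
SmallInteger>>hashMultiply is available to spread the few identity-hash bits,
and it would break the #foo hash = 'foo' hash equivalence discussed below):

Symbol >> hash
	"Hash by identity instead of by characters. Valid only because
	Symbols are unique: two equal Symbols are always the same object."
	^ self identityHash hashMultiply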


> Currently we can't, because images rely on #foo = 'foo', so we have to use
> the String hash.
> But it does not have to be so. I tried to remove this equivalence long ago
> (in Squeak 3.9, maybe), and it was not such a big deal, if I remember
> correctly...
>
> 2017-04-14 6:05 GMT+02:00 Max Leske <maxleske at gmail.com>:
>
>>
>> That would radically increase the number of collisions from the hash
>> function. I don’t think that is a good idea, *especially* when we are
>> talking about a hash function that will be used for dictionaries. When you
>> test the performance of a hash function, you should also test its
>> properties w.r.t. hashed collections (Andrés Valloud explains that
>> nicely in “Hashing in Smalltalk”).
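>>
>> As a rough illustration of testing a hash function w.r.t. hashed
>> collections (a hypothetical snippet, not from Valloud's book; it assumes
>> Symbol allSymbols as a convenient corpus of strings): count how many
>> distinct hash values the function produces over the corpus, the shortfall
>> being collisions.
>>
>> | strings hashes |
>> strings := Symbol allSymbols collect: [:each | each asString].
>> hashes := (strings collect: [:each | each hash]) asSet.
>> Transcript
>> 	show: strings size printString, ' strings, ',
>> 		hashes size printString, ' distinct hashes';
>> 	cr.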
>>
>> Cheers,
>> Max
>>
>> On 13 Apr 2017, at 23:19, Clément Bera <bera.clement at gmail.com> wrote:
>>
>> Hi all,
>>
>> I am trying to investigate performance overhead in benchmarks to improve
>> VM performance. In one benchmark, strings are used as dictionary keys,
>> which seems like an OK pattern to me and one that may actually be used in
>> real code from time to time. Something struck me: *the hash function of a
>> string actually iterates over the whole string.*
>>
>> As a result, I get something like this:
>>
>> #(
>> 'String size: 0, Hash function performance: 17,817,776 per second'
>> 'String size: 1, Hash function performance: 15,395,388 per second'
>> 'String size: 3, Hash function performance: 14,853,109 per second'
>> 'String size: 4, Hash function performance: 15,151,954 per second'
>> 'String size: 5, Hash function performance: 14,139,881 per second'
>> 'String size: 10, Hash function performance: 11,545,749 per second'
>> 'String size: 65, Hash function performance: 3,431,067 per second'
>> 'String size: 130, Hash function performance: 1,879,047 per second'
>> 'String size: 520, Hash function performance: 511,934 per second'
>> )
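>>
>> (Numbers in that form look like the output of something along these lines;
>> a hypothetical reconstruction using Pharo's BlockClosure>>bench, not
>> necessarily the exact benchmark that was run:)
>>
>> #(0 1 3 4 5 10 65 130 520) collect: [:size |
>> 	| s |
>> 	s := String new: size withAll: $a.
>> 	'String size: ', size printString,
>> 		', Hash function performance: ', [ s hash ] bench ]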
>>
>> This means that computing the hash of a 1 MB String takes a hell of a
>> long time.
>>
>> I wonder why we don't compute the string hash from a small number of
>> inner characters, taken randomly but deterministically from the String. For
>> example, for all strings longer than 5 characters, one could pick only 5
>> characters in the String to compute its hash. That way the hash function
>> remains decent enough and its performance does not drop for large strings.
>>
>> One implementation could be to replace:
>> 1 to: stringSize do: [:pos |
>> by:
>> 1 to: stringSize by: stringSize // 5 do: [:pos |
>> in:
>> ByteString>>stringHash: aString initialHash: speciesHash
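>>
>> A minimal sketch of the changed method under that idea (hypothetical; it
>> assumes the existing hashMultiply-based implementation and clamps the step
>> to at least 1 so strings of 5 characters or fewer are still hashed in full):
>>
>> stringHash: aString initialHash: speciesHash
>> 	| hash step |
>> 	hash := speciesHash bitAnd: 16rFFFFFFF.
>> 	"sample roughly 5 characters instead of every one"
>> 	step := (aString size // 5) max: 1.
>> 	1 to: aString size by: step do: [:pos |
>> 		hash := (hash + (aString basicAt: pos)) hashMultiply].
>> 	^ hash
>>
>> Note that characters between sampled positions no longer influence the
>> hash, so strings that differ only there would all collide.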
>>
>> Then we need to rehash the whole image.
>>
>> What do you think? Is there a hash expert around who can confirm this
>> is a good idea?
>>
>>
>>
>>
>