[Vm-dev] String hash function

Clément Bera bera.clement at gmail.com
Thu Apr 13 21:19:07 UTC 2017


Hi all,

I am trying to investigate performance overhead in benchmarks to improve
the VM performance. In a benchmark strings are used as dictionary keys,
which seems to be an OK pattern to me that may be actually used in real
code from time to time. Something stroke me: *the hash function of a string
is actually iterating over the whole string. *

As a result, I get something like this:

#(
'String size: 0, Hash function performance: 17,817,776 per second'
'String size: 1, Hash function performance: 15,395,388 per second'
'String size: 3, Hash function performance: 14,853,109 per second'
'String size: 4, Hash function performance: 15,151,954 per second'
'String size: 5, Hash function performance: 14,139,881 per second'
'String size: 10, Hash function performance: 11,545,749 per second'
'String size: 65, Hash function performance: 3,431,067 per second'
'String size: 130, Hash function performance: 1,879,047 per second'
'String size: 520, Hash function performance: 511,934 per second'
)

This means that if a hash of a 1MB String is computed it takes a hell of a
time.

I wonder why we don't compute the string hash from a small number of inner
characters taken randomly but deterministically from the String. From
example, for all strings longer than 5 characters, one could pick only 5
characters in the String to compute its hash. The hash function is this way
decent enough and the performance does not drop for large strings.

One implementation could be to replace:
1 to: stringSize do: [:pos |
by:
1 to: stringSize by: stringSize // 5 do: [:pos |
in:
ByteString>>stringHash: aString initialHash: speciesHash

Then we need to rehash the whole image.

What do you think ? Is there a hash expert around that can confirm this is
a good idea ?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20170413/f170f4d6/attachment.html>


More information about the Vm-dev mailing list