[Vm-dev] [Pharo-dev] String hash function

Martin McClure martin at hand2mouse.com
Fri Apr 14 19:05:29 UTC 2017


On 04/13/2017 02:19 PM, Clément Bera wrote:
> I wonder why we don't compute the string hash from a small number of
> inner characters taken randomly but deterministically from the String.
> From example, for all strings longer than 5 characters, one could pick
> only 5 characters in the String to compute its hash. The hash function
> is this way decent enough and the performance does not drop for large
> strings.
> 
> One implementation could be to replace:
> 1 to: stringSize do: [:pos |
> by:
> 1 to: stringSize by: stringSize // 5 do: [:pos |
> in:
> ByteString>>stringHash: aString initialHash: speciesHash
> 
> Then we need to rehash the whole image.
> 
> What do you think ? Is there a hash expert around that can confirm this
> is a good idea ? 

This kind of thing has been tried before, and failed miserably. One of
the biggest breaking incompatibilities, IIRC, between Java 1 and Java 2
was that they had to change their string hash function to hash all
characters, because their scheme of hashing only a subset of characters
had a huge number of collisions in real-world string like URLs, which
have a lot of characters in common.

I think it's worth the time to hash all characters.

Regards,

-Martin


More information about the Vm-dev mailing list