[Vm-dev] String hash function

Levente Uzonyi leves at caesar.elte.hu
Thu Apr 13 22:28:56 UTC 2017


Hi Clément,

On Thu, 13 Apr 2017, Clément Bera wrote:

> What do you think ? Is there a hash expert around that can confirm this 
is a good idea ?

I'm not a hash expert. But while I'm sure it's tempting to make this 
change, I'm also sure it won't be a good idea. Why?

What makes a hash function good?
A good hash function is expected to give different values for similar, but 
not equal inputs. If you omit information, as you proposed, you can't 
satisfy this requirement.

Let's say we have strings of length 10, because those are the shortest 
strings affected by your proposal. Let the strings have the form 
'ProductXXXX' where XXXX are numbers identifying a given product type.

Let's see what would happen with the current implementation (step=1) and 
your suggested change (step=2) to the different hash values (code has 
extra complexity to be cross-dialect):

#(1 2) collect: [ :step |
 	step ->
 		((0 to: 9999)
 			collect: [ :index |
 				| aString  speciesHash stringSize hash low |
 				aString := 'Product', ('0000' first: (4 - (index numberOfDigitsInBase: 10))), index asString.
 				speciesHash := ByteString identityHash.
 				stringSize := aString size.
 				hash := speciesHash bitAnd: 16rFFFFFFF.
 				1 to: stringSize by: step do: [:pos |
 					hash := hash + (aString basicAt: pos).
 					"Begin hashMultiply"
 					low := hash bitAnd: 16383.
 					hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 
16r0FFFFFFF ].
 				hash ]
 			as: Set) size ]

The result is:  {1->10000. 2->100}.

So the number of different hash values went down from the optimal 10000 to 
100 - aka 1%. What's even worse is that the problem exists if you take any 
small consecutive range, only the ratio changes.

So, IMO the implementation should stay as is - the hash should include 
information of all bytes of the string.

The good news is that there is PluggableDictionary, which allows you to 
implement your suggestion on the image side. Actually, if we had a really 
good JIT - the overhead of using blocks were negligible, then we could 
just make Dictionary itself pluggable and collapse large parts of the 
HashedCollection hierarchy.

Levente


More information about the Vm-dev mailing list