[Vm-dev] String hash function
Levente Uzonyi
leves at caesar.elte.hu
Thu Apr 13 22:28:56 UTC 2017
Hi Clément,
On Thu, 13 Apr 2017, Clément Bera wrote:
> What do you think ? Is there a hash expert around that can confirm this
is a good idea ?
I'm not a hash expert. But while I'm sure it's tempting to make this
change, I'm also sure it won't be a good idea. Why?
What makes a hash function good?
A good hash function is expected to give different values for similar, but
not equal inputs. If you omit information, as you proposed, you can't
satisfy this requirement.
Let's say we have strings of length 10, because those are the shortest
strings affected by your proposal. Let the strings have the form
'ProductXXXX' where XXXX are numbers identifying a given product type.
Let's see what would happen with the current implementation (step=1) and
your suggested change (step=2) to the different hash values (code has
extra complexity to be cross-dialect):
#(1 2) collect: [ :step |
step ->
((0 to: 9999)
collect: [ :index |
| aString speciesHash stringSize hash low |
aString := 'Product', ('0000' first: (4 - (index numberOfDigitsInBase: 10))), index asString.
speciesHash := ByteString identityHash.
stringSize := aString size.
hash := speciesHash bitAnd: 16rFFFFFFF.
1 to: stringSize by: step do: [:pos |
hash := hash + (aString basicAt: pos).
"Begin hashMultiply"
low := hash bitAnd: 16383.
hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd:
16r0FFFFFFF ].
hash ]
as: Set) size ]
The result is: {1->10000. 2->100}.
So the number of different hash values went down from the optimal 10000 to
100 - aka 1%. What's even worse is that the problem exists if you take any
small consecutive range, only the ratio changes.
So, IMO the implementation should stay as is - the hash should include
information of all bytes of the string.
The good news is that there is PluggableDictionary, which allows you to
implement your suggestion on the image side. Actually, if we had a really
good JIT - the overhead of using blocks were negligible, then we could
just make Dictionary itself pluggable and collapse large parts of the
HashedCollection hierarchy.
Levente
More information about the Vm-dev
mailing list