[Vm-dev] String hash function

Fri Apr 14 16:09:15 UTC 2017

> So the number of different hash values went down from the optimal 10000 to
> 100 - aka 1%. What's even worse is that the problem exists if you take any
> small consecutive range, only the ratio changes.

The speed tradeoff seems most acute for large chunks of text.

If the original strategy were used for strings shorter than (say) 50 characters and the "sampling" strategy were used for longer strings, with the string length included in the hash, then a large chunk of text with one character added would be unlikely to collide with the original.
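
A rough sketch of that idea, in Python rather than Smalltalk so it is easy to try out. Everything here is an assumption for illustration: the names, the sample count of 32, the 28-bit mask and the multiplier are made up, and this is not the VM's actual hash. Only the threshold of 50 comes from the suggestion above.

```python
SHORT_LIMIT = 50     # threshold suggested above ("say, 50")
SAMPLE_COUNT = 32    # hypothetical: characters examined in a long string
MASK = 0xFFFFFFF     # hypothetical: keep the hash within 28 bits
MULTIPLIER = 1664525 # hypothetical odd multiplier used for mixing


def _mix(h, code):
    """Fold one character code into the running hash."""
    return ((h + code) * MULTIPLIER) & MASK


def full_hash(s):
    """Hash every character; the string length seeds the hash."""
    h = len(s) & MASK
    for ch in s:
        h = _mix(h, ord(ch))
    return h


def sampled_hash(s):
    """Hash SAMPLE_COUNT evenly spaced characters; length seeds the hash."""
    n = len(s)
    h = n & MASK
    for i in range(SAMPLE_COUNT):
        h = _mix(h, ord(s[i * n // SAMPLE_COUNT]))
    return h


def hybrid_hash(s):
    """Full hash for short strings, sampled hash for long ones."""
    if len(s) < SHORT_LIMIT:
        return full_hash(s)
    return sampled_hash(s)
```

Because the length seeds the hash, a long text and the same text with one character appended start from different values even when the sampled characters happen to be identical. The cost is still there: two long strings of equal length that differ only at an unsampled position will collide.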

An application that handles large strings at scale could tune the hash function, as has been suggested, if performance were poor. That tuning would be done by someone likely to be familiar with the problem, while performance would remain good in the general case.

$0.02
-KenD