[Vm-dev] Cog Primitive Performance

Tue Apr 18 19:33:28 UTC 2017

The "technique" of multiplicative hash functions is well publicized, e.g.

https://en.wikipedia.org/wiki/Universal_hashing (under hashing strings)

https://en.wikipedia.org/wiki/Hash_function#Multiplicative_hashing

and IIRC Knuth under hashing.  Part of the problem with multiplicative 
hash functions is that every other person chooses a different number and 
attaches their name to it.  So if you choose 13, 131, 1313 etc you have 
the K&R "digital method" hash, or if you choose an FNV factor then you 
have the FNV hashes, and so on.  But really they are all the same 
algorithm: namely, the evaluation of a polynomial with coefficients 
implied by the data at the factor chosen using Horner's rule.

Note I did measure, and even Knuth's suggestions for factors did not 
substantially improve the outcome.  The entire class of multiplicative 
hash functions, as long as the factor is not really bad like 31 or 255, 
are "good enough".  In particular, the size of the factor doesn't matter 
as much provided it's sufficiently large compared to the quanta being 
hashed (e.g. bytes => factor [significantly] greater than 256).

Overall, I checked 300+ hash functions against 100+ datasets and 
documented the results in the hashing book.  Many of the hash functions 
are multiplicative hash functions, either explicitly or in disguise.

At one point I did find the RNG factor 1664525 on line 26 from Knuth's 
TAOCP vol 2 table of good linear congruence RNG multipliers.  However, 
that the factor is good for an RNG does not necessarily show it has to 
be a good factor for a multiplicative hash function.  In retrospect, 
that implication seems like an argument by osmosis.

Andres.

On 4/18/17 12:08 , Eliot Miranda wrote:
> Hi Andres,
>
> On Tue, Apr 18, 2017 at 11:06 AM, Andres Valloud
> <avalloud at smalltalk.comcastbiz.net
> <mailto:avalloud at smalltalk.comcastbiz.net>> wrote:
>
>
>     The hashing technique is just a multiplicative hash function.  The
>     multiplication technique is equivalent to splitting AX into AH and
>     AL, and using the distributive rule to (AH * 2^8 + AL) * (BH * 2^8 +
>     BL). The choice of 1664525 is due to Mark van der Gulik, he took the
>     constant out of a Knuth table of good linear congruency pseudo-RNG
>     factors.  It's been 17 years since that, I do not currently think
>     the pseudo-RNG relationship is that relevant.  In the mean time, as
>     hashing goes, it's good enough.
>
>
> I'm looking for more of a reference, say to a paper which describes the
> technique, or a reference implementation on the web.  Remember that one
> important proper of 1664525 is that it belongs to a set of integers for
> which n bitCount * 2 between: n highBit and: n highBit + 2, i.e. that
> about half the bits are set.  Further, the bits are reasonably well
> distributed throughout, unlike for example 16rF0F0F0F0F.
>
> I'll look up the table in my Knuth vol 3.
>
>
>
>     On 4/18/17 10:01 , Eliot Miranda wrote:
>
>         Hi Andres,
>
>         On Tue, Apr 18, 2017 at 1:52 AM, Andres Valloud
>         <avalloud at smalltalk.comcastbiz.net
>         <mailto:avalloud at smalltalk.comcastbiz.net>
>         <mailto:avalloud at smalltalk.comcastbiz.net
>         <mailto:avalloud at smalltalk.comcastbiz.net>>> wrote:
>
>
>             Yes, the SqR initials are mine.  That's too bad about the
>         comments,
>             at the very least there should be something like this:
>
>             "This code performs a multiplication by 1664525 mod 2^28 without
>             overflowing into large integers."
>
>
>         Thanks.  I'll update the primitives appropriately.  Could you
>         supply a
>         reference both for the technique and the choice of  1664525 in
>         particular?  I'd like to include that info in the comments too.
>
>
>             On 4/18/17 1:48 , Levente Uzonyi wrote:
>
>                 The methods in SmallInteger and LargePositiveInteger
>         have the
>                 initials
>                 SqR (that's yours IIRC), but they have no comment.
>                 The variant in String is originally from 2001 and has
>         never had that
>                 comment.
>
>                 Levente
>
>                 On Mon, 17 Apr 2017, Andres Valloud wrote:
>
>
>                     On 4/17/17 16:45 , Eliot Miranda wrote:
>
>                         FYI, hash multiply is
>                         hashMultiply
>                         | low |
>                         low := self bitAnd: 16383.
>                         ^(16r260D * low + ((16r260D * (self bitShift: -14) +
>                         (16r0065 * low)
>                         bitAnd: 16383) * 16384))
>                         bitAnd: 16r0FFFFFFF
>
>                         and when written as a primitive is
>                         hashMultiply
>                         | low |
>                         low := self bitAnd: 16383.
>                         ^(16r260D * low + (16r260D * (self bitShift: -14) +
>                         (16r0065 * low) *
>                         16384))
>                         bitAnd: 16r0FFFFFFF
>                         because we don't care about the multiply by 16384
>                         overflowing; when
>                         written as a primitive it won't overflow into a
>                         LargePositiveInteger.
>
>
>                     Hopefully this doesn't mean the primitive got
>         implemented by
>                     actually
>                     doing these operations verbatim.  As you have
>         probably seen,
>                     that
>                     convoluted arithmetic is done this way in Smalltalk
>         only to
>                     simulate a
>                     32x32 multiplication bit-anded down to 28 bits without
>                     overflowing
>                     into large integers (the original code from August
>         2000 had
>                     my initials).
>
>                     At a sufficiently low level such as C, all that
>         complexity
>                     is just an
>                     unsigned multiplication by 1664525.  The image code
>         should
>                     still have
>                     a comment to that effect, did it get lost?
>
>                     Andres.
>
>
>
>
>
>         --
>         _,,,^..^,,,_
>         best, Eliot
>
>
>
>
> --
> _,,,^..^,,,_
> best, Eliot