[Vm-dev] Byte & String collection hash performance; a modest proposal for change.

Eliot Miranda eliot.miranda at gmail.com
Wed Apr 19 02:09:38 UTC 2017


Hi All,

    the hash algorithm used for ByteString in Squeak and Pharo is good for
"small" strings but overkill for large strings.  It is important in many
applications to get well-distributed string hashes, especially over the
range of strings that constitute things like method names, URLs, etc.
Consequently, the current algorithm includes every character in a string.
This works very well for "small" strings but results in very slow hashes
(and hence long latencies, because the hash is an uninterruptible
primitive) for large strings, where large may mean several megabytes.

Let's look at the basic hash algorithm.  The following method is translated
by the compiler machinery in VMMaker from Smalltalk to C.  It creates a
primitive function called primitiveStringHash, so when invoked from normal
Smalltalk code the method below runs its C translation; neat.

ByteArray>>hashBytes: aByteArray startingWith: speciesHash
	"Answer the hash of a byte-indexed collection, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply.

	 The primitive should be renamed at a suitable point in the future"
	<primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
	| byteArraySize hash |
	<var: 'aByteArray' type: #'unsigned char *'>
	<var: 'speciesHash' type: #int>
	byteArraySize := aByteArray size.
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: byteArraySize do:
		[:pos |
		hash := hash + (aByteArray basicAt: pos).
		"Inlined hashMultiply, written this way for translation to C."
		hash := hash * 1664525 bitAnd: 16r0FFFFFFF].
	^hash

This function is invoked via a rather convoluted chain:

String>>hash
	"#hash is implemented, because #= is implemented"
	"ar 4/10/2005: I had to change this to use ByteString hash as initial hash
	 in order to avoid having to rehash everything and yet compute the same
	 hash for ByteString and WideString.
	 md 16/10/2006: use identityHash as initialHash, as behavior hash will
	 use String hash (name) to have a better hash soon.
	 eem 4/17/2017: it's not possible to use String hash (name) for the
	 initial hash because that would be recursive."
	^self class stringHash: self initialHash: ByteString identityHash

ByteString class>>stringHash: aString initialHash: speciesHash
	"Answer the hash of a byte-indexed string, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply."
	<primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
	| hash |
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: aString size do:
		[:pos |
		hash := (hash + (aString basicAt: pos)) hashMultiply].
	^hash

and the generic string implementation is

String class>>stringHash: aString initialHash: speciesHash
	"Answer the hash of a byte-indexed string, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply."
	| hash |
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: aString size do:
		[:pos |
		hash := (hash + (aString basicAt: pos)) hashMultiply].
	^hash

(it simply omits the primitive declaration).
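
Whichever implementation runs, the result should be the same; as a quick
sanity check, one can evaluate the following in a workspace (a sketch; it
exercises the chain above):

| s |
s := 'Hello, world'.
self assert: (ByteString stringHash: s initialHash: ByteString identityHash)
	= (String stringHash: s initialHash: ByteString identityHash).	"primitive vs fallback"
self assert: s hash = (ByteString stringHash: s initialHash: ByteString identityHash)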


As of yesterday the inner loop was written differently by Andres Valoud to
avoid overflow:

	hash := hash + (aByteArray basicAt: pos).
	"Begin hashMultiply"
	low := hash bitAnd: 16383.
	hash := (16r260D * low
		+ ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
			bitAnd: 16r0FFFFFFF.

The problem here is that the Smalltalk-to-C translation machinery is naive
and entirely incapable of transforming

low := hash bitAnd: 16383.
hash := (16r260D * low
	+ ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
		bitAnd: 16r0FFFFFFF.

into
hash := hash * 1664525 bitAnd: 16r0FFFFFFF
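
The two formulations nonetheless compute the same function:
1664525 = 16r19660D = (16r0065 * 16384) + 16r260D, and the cross term
16r0065 * 16384 * 16384 * (hash bitShift: -14) is a multiple of 2^28, so it
vanishes under the final bitAnd:.  A quick workspace check of the identity
on a few sample values (a sketch):

#(0 1 16383 16384 123456789 16r0FFFFFFF) do:
	[:hash | | low decomposed |
	low := hash bitAnd: 16383.
	decomposed := (16r260D * low
		+ ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
			bitAnd: 16r0FFFFFFF.
	"Both forms agree modulo 2^28."
	self assert: decomposed = (hash * 1664525 bitAnd: 16r0FFFFFFF)]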

The reformulation makes the primitive a little quicker, gaining for larger
strings, but it still suffers from the high invocation overhead described in
the Cog Primitive Performance thread.

In looking at this I've added a primitive for hashMultiply; primitive #159
implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for SmallInteger
and LargePositiveInteger receivers, as fast as possible in the Cog JIT.
With this machinery in place it's instructive to compare the cost of the
primitive against the non-primitive Smalltalk code.
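
For reference, the image-side method wrapping the new primitive can be as
simple as the following sketch (the fallback is the same overflow-avoiding
decomposition as above; the actual method in the image may differ slightly):

SmallInteger>>hashMultiply
	"Answer self * 1664525 bitAnd: 16r0FFFFFFF.  Primitive 159 computes this
	 directly in the Cog JIT; the fallback decomposes the multiplication to
	 avoid overflow into LargeInteger arithmetic."
	<primitive: 159>
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low
		+ ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
			bitAnd: 16r0FFFFFFF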

First let me introduce a set of replacement hash functions, newHashN.
These hash all characters in strings up to a certain size, and no more than
that number of characters for larger strings.  Here are newHash64 and
newHash2048, which use pure Smalltalk, including an inlined hashMultiply
written to avoid SmallInteger overflow.  Also measured are the obvious
variants newHash128, newHash256, newHash512 & newHash1024.

String>>newHash64
	"#hash is implemented, because #= is implemented"
	"choice of primes:
	 (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2)
		and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
		[:i| | low |
		hash := hash + (self basicAt: i).
		"hash multiply"
		low := hash bitAnd: 16383.
		hash := (16r260D * low
			+ ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
				bitAnd: 16r0FFFFFFF].
	^hash

String>>newHash2048
	"#hash is implemented, because #= is implemented"
	"choice of primes:
	 (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2)
		and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
		[:i| | low |
		hash := hash + (self basicAt: i).
		"hash multiply"
		low := hash bitAnd: 16383.
		hash := (16r260D * low
			+ ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
				bitAnd: 16r0FFFFFFF].
	^hash

So the idea here is to step through the string by 1 for string sizes up to
N - 1, and by greater than 1 for strings of size >= N, limiting the maximum
number of characters sampled to between N // 2 and N - 1.  A second idea is
to implement the methods on String, so they are invoked directly.  A third
is to discard the speciesHash and use a better value for the empty string
hash: a prime whose bitCount is about half its highBit (i.e. about half of
its bits are set).
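
To make the sampling concrete, here is a quick doit (a sketch) that counts
how many characters the loop 1 to: size by: (size // N max: 1) visits, for
the newHash64 (N = 32) and newHash2048 (N = 1024) variants:

#(100 1000 100000) collect:
	[:size |
	#(32 1024) collect:
		[:n | | step |
		step := size // n max: 1.
		{size. n. (size - 1) // step + 1 "characters visited"}]]

For example, a 100,000-character string is sampled at 32 characters by
newHash64 and 1,031 by newHash2048, while a 1,000-character string is
sampled at 33 characters by newHash64 and hashed in full by newHash2048.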

We can rewrite these more cleanly to use the hashMultiply primitive, so
here are newHashP64 through newHashP2048:

String>>newHashP64
	"#hash is implemented, because #= is implemented"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
		[:i| hash := (hash + (self basicAt: i)) hashMultiply].
	^hash

String>>newHashP2048
	"#hash is implemented, because #= is implemented"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
		[:i| hash := (hash + (self basicAt: i)) hashMultiply].
	^hash

So, e.g., newHash2048 and newHashP2048 sample at most 2047 and at least 1024
characters of strings whose size exceeds 1024 elements, and every character
of strings of 1024 elements or fewer.

Let's compare both the hash spread (the number of distinct hashes produced)
and the time taken to evaluate the three variants of hash function.  We
have the interpreter primitive (hash implemented in terms of
stringHash:initialHash:), newHash64 through newHash2048 (inlined
hashMultiply in pure Smalltalk, written to avoid overflow into LargeInteger
arithmetic), and newHashP64 through newHashP2048, written in pure Smalltalk
but using the hashMultiply primitive (which avoids the need to decompose
the multiplication to avoid overflow).

Here's the test harness.  A few things to note.  First, it computes the
blocks used, rather than inlining them in the method, to eliminate the cost
of block dispatch from the measurements; the block dispatch isn't complex
but introduces a little noise.  Second, garbageCollectMost is used to run
the scavenger before each measurement so that the GC starts in the same
state; again, this reduces noise.

| strs "strings" ns "number of strings" nus "number of unique strings" ass
"average string size" blocks "the blocks that invoke each hash" |
Smalltalk garbageCollect.
strs := ByteString allSubInstances select: [:s| s size <= 32].
ns := strs size. nus := strs asSet size.
ass := ((strs inject: 0 into: [:sum :s| sum + s size]) / strs size) rounded.
blocks := #('hash' 'newHash64' 'newHash128' 'newHash256' 'newHash512'
'newHash1024' 'newHash2048' 'newHashP64' 'newHashP128' 'newHashP256'
'newHashP512' 'newHashP1024' 'newHashP2048' ) collect: [:f| Compiler
evaluate: '[:ea| ea ', f, ']'].
blocks do: [:ea| ea value: ''];  do: [:ea| ea value: ''].
blocks collect:
[:hashBlock| | nh |
Smalltalk garbageCollectMost.
{ ns. nus. nh := (strs collect: hashBlock) asSet size.  nus - nh. 1.0 - (nh
asFloat / nus asFloat). ass. [1 to: 100 do: [:i| strs do: hashBlock]]
timeToRun - [1 to: 100 do: [:i| strs do: [:ea| ea class]]] timeToRun.
(hashBlock sourceString allButFirst: 10) allButLast}]

N Strings  N Unique  N Hashes  N Collisions  Fraction of Collisions  Avg String Size  Time (ms)  Hash Function
#(121162 54439 54435 4 7.347e-5 11 1926 'hash')

#(121162 54439 54435 3 5.510e-5 11 8913 'newHash64')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash128')
#(121162 54439 54435 3 5.510e-5 11 8870 'newHash256')
#(121162 54439 54435 3 5.510e-5 11 8835 'newHash512')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash1024')
#(121162 54439 54435 3 5.510e-5 11 8876 'newHash2048')

#(121162 54439 54435 3 5.510e-5 11 5658 'newHashP64')
#(121162 54439 54435 3 5.510e-5 11 5506 'newHashP128')
#(121162 54439 54435 3 5.510e-5 11 5677 'newHashP256')
#(121162 54439 54435 3 5.510e-5 11 5595 'newHashP512')
#(121162 54439 54435 3 5.510e-5 11 5645 'newHashP1024')
#(121162 54439 54435 3 5.510e-5 11 5571 'newHashP2048')

So for small strings the interpreter primitive wins on speed, considerably,
but has one more collision (I suspect because the seed, ByteString
identityHash, is poor).

Now for byte strings with sizes in the range 33 to 1024; I'll dispense with
the newHash forms; they're essentially half the speed of the newHashP forms
but otherwise identical.

N Strings  N Unique  N Hashes  N Collisions  Fraction of Collisions  Avg String Size  Time (ms)  Hash Function
#(34044 25853 25852 1 3.8680e-5 148 1045 'hash')
#(34044 25853 25790 63 0.00243 148 2918 'newHashP64')
#(34044 25853 25847 6 0.00023 148 4929 'newHashP128')
#(34044 25853 25852 1 3.8680e-5 148 6757 'newHashP256')
#(34044 25853 25851 2 7.7360e-5 148 8959 'newHashP512')
#(34044 25853 25852 1 3.8680e-5 148 10055 'newHashP1024')
#(34044 25853 25852 1 3.8680e-5 148 10382 'newHashP2048')

So here, hashing between 256 and 511 characters gives as good a
distribution of hashes as considering all of the string.  So I think this
shows that the cut off for effectiveness of string hashing is around 256
characters.  At least on the strings in my image not much is to be gained
by hashing more.

So let's look at strings > 1024 in size

N Strings  N Unique  N Hashes  N Collisions  Fraction of Collisions  Avg String Size  Time (ms)  Hash Function
#(732 606 606 0 0.0 50741 5834 'hash')

#(732 606 605 1 0.001650 50741 60 'newHashP64')
#(732 606 605 1 0.001650 50741 106 'newHashP128')
#(732 606 605 1 0.001650 50741 199 'newHashP256')
#(732 606 605 1 0.001650 50741 416 'newHashP512')
#(732 606 606 0 0.0 50741 822 'newHashP1024')
#(732 606 606 0 0.0 50741 1875 'newHashP2048')

By this time the cost of hashing all characters overwhelms the primitive
implementation and the pure Smalltalk code becomes much faster.  And the
hash spread, the number of distinct hashes, is as good.

So that's the data.  My conclusions are that

- the primitive is clearly still a win, especially for small strings.  It
could be written as a primitive that is run on the Smalltalk stack, and
that would boost performance for small strings considerably.  But the
primitive still wins against Cog code up through at least 150-byte
strings.  We could run a different doit to detect the crossover in string
length, but not today :-) (a sketch of such a doit appears after this
list).

- replacing the primitive with one that behaves like newHash1024 or
newHash2048 seems the best to me.  Such a primitive would hash between N
and 2*N-1 characters for strings of length > N, where N would likely be
512, 1024 or 2048.  The primitive should also be written to hash 16-bit,
32-bit and 64-bit non-pointer arrays.
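
For the record, here is a sketch of the crossover doit alluded to above;
the sizes and iteration count are illustrative:

| sizes |
sizes := #(32 64 128 192 256 384 512).
sizes collect:
	[:n | | s |
	s := String new: n withAll: $a.
	Smalltalk garbageCollectMost.
	{n.
	 [1 to: 100000 do: [:i| s hash]] timeToRun.	"interpreter primitive"
	 [1 to: 100000 do: [:i| s newHashP2048]] timeToRun	"pure Smalltalk + hashMultiply primitive"}]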

_,,,^..^,,,_
best, Eliot

