[squeak-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Martin McClure martin at hand2mouse.com
Mon May 1 07:09:29 UTC 2017


I see no replies to this on any of the three lists it was sent to, so I
guess I'll chime in.

tl;dr:
Making a primitive for #hashMultiply, probably a good idea, in some
form, since doing it in Smalltalk is awkward.
Only hashing every N-th character of large strings, probably a very bad
idea. Performance might well get worse, the complexity does not seem
justified, and it would open a sizeable security hole.

More verbiage below for those interested.

Regards,
-Martin

On 04/18/2017 07:09 PM, Eliot Miranda wrote:
> Hi All,
>
>     the hash algorithm used for ByteString in Squeak and Pharo is good
> for "small" strings and overkill for large strings. 

Why do you say it's overkill for large strings? Are there applications
with large strings that are being negatively impacted by the current
algorithm? Which ones, and impacted how?

> It is important in many applications to get well distributed string
> hashes, especially over the range of strings that constitute things
> like method names, URLs, etc.  Consequently, the current algorithm
> includes every character in a string.  This works very well for
> "small" strings and results in very slow hashes (and hence long
> latencies, because the hash is an uninterruptible primitive) for large
> strings, where large may be several megabytes.

A simple solution for the uninterruptible primitive is to not make it a
primitive. Make #hashMultiply a primitive (since that particular kind of
overflow-avoiding modular arithmetic is really painful to write in
Smalltalk), and do the rest in a loop in Smalltalk. It sounds like
you've done the #hashMultiply primitive already.
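
For concreteness, something along these lines -- just a sketch; the
selector and the way the initial hash gets passed in are assumptions on
my part, not the code that's actually in the image:

  hashWithInitialHash: speciesHash
      "Sketch: hash every byte, but send #hashMultiply (the proposed
      primitive) once per byte and keep the loop itself in Smalltalk, so
      the VM can interrupt between sends. basicAt: answers the byte
      value, as the current string-hashing code assumes."
      | hash |
      hash := speciesHash bitAnd: 16rFFFFFFF.
      1 to: self size do:
          [:i | hash := (hash + (self basicAt: i)) hashMultiply].
      ^hash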

If the overhead of calling a primitive for each character proves to be
too much, even with the faster primitive-calling methodologies you
talked about in the "Cog Primitive Performance" thread on the Vm-dev
list, a more complex primitive could take a range of bytes, so large
strings would be hashed in batches, solving the latency problem.
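
Roughly like this, purely as a sketch -- #hashBytes:from:to:startingWith:
is an invented selector for that hypothetical range-taking primitive,
and the batch size is arbitrary:

  hashInBatchesStartingWith: initialHash
      "Sketch: feed the receiver to a hypothetical range-taking hash
      primitive in fixed-size chunks, so no single primitive call runs
      long enough to hurt latency, yet every byte still contributes."
      | hash |
      hash := initialHash bitAnd: 16rFFFFFFF.
      1 to: self size by: 4096 do:
          [:start |
           hash := self class
                      hashBytes: self
                      from: start
                      to: (start + 4095 min: self size)
                      startingWith: hash].
      ^hash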

>
> Let's look at the basic hash algorithm. 
[...]
>
> In looking at this I've added a primitive for hashMultiply; primitive
> #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
> SmallInteger and LargePositiveInteger receivers, as fast as possible
> in the Cog JIT.  With this machinery in place it's instructive to
> compare the cost of the primitive against the non-primitive Smalltalk
> code.
>
> First let me introduce a set of replacement hash functions, newHashN. 
> These hash all characters in strings up to a certain size, and then no
> more than that number for larger strings.  Here are newHash64 and
> newHash2048, which use pure Smalltalk, including an inlined
> hashMultiply written to avoid SmallInteger overflow.  Also measured
> are the obvious variants newHash128, newHash256, newHash512 & newHash1024.
>
>
[...]
> So the idea here is to step through the string by 1 for strings sizes
> up to N - 1, and by greater than 1 for strings of size >= N, limiting
> the maximum number of characters sampled to between N // 2 and N - 1. 

The history of computing is littered with the bones of those who have
tried this kind of thing. It doesn't end well. Yes, you get a faster
hash function. And then you find, for sets of data that you or your
users actually want to use, that you get collisions like crazy, and much
worse overall performance than you started with.

Sure, it works OK for the sets of data that the designer *tested*, and
probably for the sets of data that they *anticipated*. But real-world
data is tricky. It includes data sets where the characters that differ
are the ones that the hash thinks are unimportant, and there goes your
performance, by orders of magnitude. For instance, early versions of
Java used a limited number of characters to hash strings. One of the
biggest compatibility-breaking changes they were forced to make in later
Java versions was to consider *all* characters in hashing. It turned out
that it was very common to hash URLs, and many distinct URLs had most of
their characters in common.

And you don't always get to choose your data -- sometimes you have an
opponent who is actively looking to create collisions as a
denial-of-service attack. There was a fair-sized kerfuffle about this a
few years ago -- most web scripting languages made it too easy to mount
this kind of attack.
https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
"...an attacker can degenerate the hash table by sending lots of
colliding keys. ...making it possible to exhaust hours of CPU time using
a single HTTP request."
To guard against this kind of attack you need a randomized element in
your hash (not a bad idea for Smalltalk, actually, and pretty easy --
mixing in the identity hash of the collection might be sufficient) or a
cryptographic hash (not worth the computational expense for most
purposes). However, even a randomized element would not prevent this
kind of attack if you predictably ignore some characters of the input
string: an attacker just has to vary the ignored characters. That makes
it *so* easy to generate data that validates, is not equal, and yet
collides.
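
To make the "mixing in the identity hash of the collection" idea
concrete -- again only a sketch, with an invented selector; the real
change would go wherever the hashed collection turns a key's hash into
a bucket index:

  perturbedHashOf: key
      "Sketch: perturb the key's hash with this collection's own
      identity hash, so colliding keys can't be precomputed without
      knowing which collection (and which image) they will land in."
      ^(key hash bitXor: self identityHash) hashMultiply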

So really, for general-purpose use (i.e. what's built into the language)
hash *every* character of *all* strings. If someone finds that this is a
performance problem in a real-world situation, it can be addressed in an
application-specific way.


