[Vm-dev] [squeak-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Mon May 1 14:05:05 UTC 2017

Well, I had started to write a reply, but I had to postpone it.
I mostly agree with your suggestions.

One thing that can be done about large strings is to cache the calculated 
hash value in the larger strings. Currently the object representation 
changes when the string contains 255 or more characters. In that case an 
additional 64 bit field is added to the object header to store its length. 
If we were to use the upper 28+1 bits of that field to cache the hash, 
there would still be 35-bits to encode length, which would be enough to 
represent strings up to 8 GiB.
But this would require further VM changes (e.g. at:put: would have to 
flush the cache).

Levente

On Mon, 1 May 2017, Martin McClure wrote:

> I see no replies to this on any of the three lists it was sent to, so I
> guess I'll chime in.
>
> tl;dr:
> Making a primitive for #hashMultiply, probably a good idea, in some
> form, since doing it in Smalltalk is awkward.
> Only hashing every N-th character of large strings, probably a very bad
> idea. Performance might well get worse, the complexity does not seem
> justified, and it would open a sizeable security hole.
>
> More verbiage below for those interested.
>
> Regards,
> -Martin
>
> On 04/18/2017 07:09 PM, Eliot Miranda wrote:
>> Hi All,
>>
>>     the hash algorithm used for ByteString in Squeak and Pharo is good
>> for "small" strings and overkill for large strings. 
>
> Why do you say it's overkill for large strings? Are there applications
> with large strings that are being negatively impacted by the current
> algorithm? Which ones, and impacted how?
>
>> It is important in many applications to get well distributed string
>> hashes, especially over the range of strings that constitute things
>> like method names, URLs, etc.  Consequently, the current algorithm
>> includes every character in a string.  This works very well for
>> "small" strings and results in very slow hashes (and hence long
>> latencies, because the hash is an uninterruptible primitive) for large
>> strings, where large may be several megabytes.
>
> A simple solution for the uninterruptable primitive is to not make it a
> primitive. Make #hashMultiply a primitive (since this particular kind of
> numeric modulo computation is really painful in Smalltalk), and do the
> rest in a loop in Smalltalk. It sounds like you've done the
> #hashMultiply primitive already.
>
> If the overhead of calling a primitive for each character proves to be
> too much, even with the faster primitive calling methodologies you
> talked about in the "Cog Primitive Performance" thread on the Vm-dev
> list, a more complex primitive could take a range of bytes, so large
> strings would be done in batches, solving the latency problem.
>
>>
>> Let's look at the basic hash algorithm. 
> [...]
>>
>> In looking at this I've added a primitive for hashMultiply; primitive
>> #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
>> SmallInteger and LargePositiveInteger receivers, as fast as possible
>> in the Cog JIT.  With this machinery in place it's instructive to
>> compare the cost of the primitive against the non-primitive Smalltalk
>> code.
>>
>> First let me introduce a set of replacement hash functions, newHashN. 
>> These hash all characters in strings up to a certain size, and then no
>> more than that number for larger strings.  Here are newHash64 and
>> newHash2048, which use pure Smalltalk, including an inlined
>> hashMultiply written to avoid SmallInteger overflow.  Also measured
>> are the obvious variants newHash128, newHash256, newHash512 & mewHash1024.
>>
>>
> [...]
>> So the idea here is to step through the string by 1 for strings sizes
>> up to N - 1, and by greater than 1 for strings of size >= N, limiting
>> the maximum number of characters sampled to between N // 2 and N - 1. 
>
> The history of computing is littered with the bones of those who have
> tried this kind of thing. It doesn't end well. Yes, you get a faster
> hash function. And then you find, for sets of data that you or your
> users actually want to use, that you get collisions like crazy, and much
> worse overall performance than you started with.
>
> Sure, it works OK for the sets of data that the designer *tested*, and
> probably for the sets of data that they *anticipated*. But real-world
> data is tricky. It includes data sets where the characters that differ
> are the ones that the hash thinks are unimportant, and there goes your
> performance, by orders of magnitude. For instance, early versions of
> Java used a limited number of characters to hash strings. One of the
> biggest compatibility-breaking changes they were forced to make in later
> Java versions was to consider *all* characters in hashing. It turned out
> that it was very common to hash URLs, and many distinct URLs had most of
> their characters in common.
>
> And you don't always get to choose your data -- sometimes you have an
> opponent who is actively looking to create collisions as a
> denial-of-service attack. There was a fair-sized kerfluffle about this a
> few years ago -- most web scripting languages made it too easy to mount
> this kind of attack.
> https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
> "...an attacker can degenerate the hash table by sending lots of
> colliding keys. ...making it possible to exhaust hours of CPU time using
> a single HTTP request."
> To guard against this kind of attack you need a randomized element in
> your hash (not a bad idea for Smalltalk, actually, and pretty easy --
> mixing in the identity hash of the collection might be sufficient) or a
> cryptographic hash (not worth the computational expense for most
> purposes). However, even adding a randomized element would not prevent
> this kind of attack if you predictably completely ignore some characters
> of the input string. That just makes it *so* easy to generate data that
> validates, and is not equal, but causes collisions.
>
> So really, for general-purpose use (i.e. what's built into the language)
> hash *every* character of *all* strings. If someone finds that this is a
> performance problem in a real-world situation, it can be addressed in an
> application-specific way.