[squeak-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi leves at caesar.elte.hu
Mon May 1 14:13:48 UTC 2017


On Mon, 1 May 2017, Levente Uzonyi wrote:

> Well, I had started to write a reply, but I had to postpone it.
> I mostly agree with your suggestions.
>
> One thing that can be done about large strings is to cache the calculated 
> hash value in the string itself. Currently the object representation 
> changes when the string contains 255 or more characters: in that case an 
> additional 64-bit field is added to the object header to store its length. 
> If we were to use the upper 28+1 bits of that field to cache the hash, 
> there would still be 35 bits left to encode the length, which would be 
> enough to represent strings up to 8 GiB.
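> To illustrate the split (plain bit arithmetic only; the exact field layout 
> here is an assumption, not how the VM lays the word out today):
>
>     | packed hash length |
>     hash := 16rABCDEF1 bitAnd: 16rFFFFFFF.      "a 28-bit string hash"
>     length := 1000000.                           "fits in 35 bits"
>     packed := ((1 bitShift: 63) bitOr: (hash bitShift: 35)) bitOr: length.
>     (packed bitShift: -35) bitAnd: 16rFFFFFFF.   "==> the hash"
>     packed bitAnd: (1 bitShift: 35) - 1.         "==> the length"
>
> The remaining top bit would mark whether the cached hash is valid at all.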

Well, we can keep the whole range minus one bit, but then we can't store 
the hash for strings larger than 8 GiB.

Levente

> But this would require further VM changes (e.g. at:put: would have to 
> flush the cache).
>
> Levente
>
> On Mon, 1 May 2017, Martin McClure wrote:
>
>> I see no replies to this on any of the three lists it was sent to, so I
>> guess I'll chime in.
>>
>> tl;dr:
>> Making a primitive for #hashMultiply, probably a good idea, in some
>> form, since doing it in Smalltalk is awkward.
>> Only hashing every N-th character of large strings, probably a very bad
>> idea. Performance might well get worse, the complexity does not seem
>> justified, and it would open a sizeable security hole.
>>
>> More verbiage below for those interested.
>>
>> Regards,
>> -Martin
>>
>> On 04/18/2017 07:09 PM, Eliot Miranda wrote:
>>> Hi All,
>>>
>>>     the hash algorithm used for ByteString in Squeak and Pharo is good
>>> for "small" strings and overkill for large strings. 
>>
>> Why do you say it's overkill for large strings? Are there applications
>> with large strings that are being negatively impacted by the current
>> algorithm? Which ones, and impacted how?
>>
>>> It is important in many applications to get well distributed string
>>> hashes, especially over the range of strings that constitute things
>>> like method names, URLs, etc.  Consequently, the current algorithm
>>> includes every character in a string.  This works very well for
>>> "small" strings and results in very slow hashes (and hence long
>>> latencies, because the hash is an uninterruptible primitive) for large
>>> strings, where large may be several megabytes.
>>
>> A simple solution for the uninterruptible primitive is not to make it a
>> primitive. Make #hashMultiply a primitive (since this particular kind of
>> numeric modulo computation is really painful in Smalltalk), and do the
>> rest in a loop in Smalltalk. It sounds like you've done the
>> #hashMultiply primitive already.
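>>
>> Roughly this shape (just a sketch, untested; #hashMultiply is assumed to
>> be the new SmallInteger primitive):
>>
>>     hashWithInitialHash: speciesHash
>>         "All-Smalltalk scan; only the per-character mixing step is primitive."
>>         | hash |
>>         hash := speciesHash bitAnd: 16rFFFFFFF.
>>         1 to: self size do: [:i |
>>             hash := (hash + (self basicAt: i)) hashMultiply].
>>         ^hash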
>>
>> If the overhead of calling a primitive for each character proves to be
>> too much, even with the faster primitive calling methodologies you
>> talked about in the "Cog Primitive Performance" thread on the Vm-dev
>> list, a more complex primitive could take a range of bytes, so large
>> strings would be done in batches, solving the latency problem.
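>>
>> For example (the selector, the primitive, and the batch size here are all
>> made up, just to show the shape):
>>
>>     hashFrom: start to: stop startingWith: aHash
>>         "Hypothetical range primitive; the Smalltalk body doubles as fallback."
>>         | hash |
>>         hash := aHash.
>>         start to: stop do: [:i |
>>             hash := (hash + (self basicAt: i)) hashMultiply].
>>         ^hash
>>
>>     hash
>>         "4096-character batches keep each uninterruptible stretch short."
>>         | hash |
>>         hash := self species hash bitAnd: 16rFFFFFFF.
>>         1 to: self size by: 4096 do: [:start |
>>             hash := self hashFrom: start to: (start + 4095 min: self size)
>>                          startingWith: hash].
>>         ^hash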
>>
>>>
>>> Let's look at the basic hash algorithm. 
>> [...]
>>>
>>> In looking at this I've added a primitive for hashMultiply; primitive
>>> #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
>>> SmallInteger and LargePositiveInteger receivers, as fast as possible
>>> in the Cog JIT.  With this machinery in place it's instructive to
>>> compare the cost of the primitive against the non-primitive Smalltalk
>>> code.
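>>>
>>> For reference, the non-primitive Smalltalk equivalent of that step is
>>> along these lines; the multiplier splits as 1664525 = 16r65 * 16384 +
>>> 16r260D, so the intermediate products stay within SmallInteger range:
>>>
>>>     hashMultiply
>>>         | low |
>>>         low := self bitAnd: 16383.
>>>         ^(16r260D * low
>>>             + ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383)
>>>                 * 16384)) bitAnd: 16r0FFFFFFF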
>>>
>>> First let me introduce a set of replacement hash functions, newHashN. 
>>> These hash all characters in strings up to a certain size, and then no
>>> more than that number for larger strings.  Here are newHash64 and
>>> newHash2048, which use pure Smalltalk, including an inlined
>>> hashMultiply written to avoid SmallInteger overflow.  Also measured
>>> are the obvious variants newHash128, newHash256, newHash512 & newHash1024.
>>>
>>>
>> [...]
>>> So the idea here is to step through the string by 1 for string sizes
>>> up to N - 1, and by greater than 1 for strings of size >= N, limiting
>>> the maximum number of characters sampled to between N // 2 and N - 1. 
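>>>
>>> For concreteness, here is the shape of that stepping for N = 64 (a sketch
>>> of the sampling idea only, not the measured newHash64 itself):
>>>
>>>     newHash64: aString initialHash: speciesHash
>>>         "Step by 1 below 64 characters; above that, pick the step so that
>>>          between 32 and 63 characters get sampled."
>>>         | size step hash |
>>>         size := aString size.
>>>         step := size < 64 ifTrue: [1] ifFalse: [size // 32].
>>>         hash := speciesHash bitAnd: 16rFFFFFFF.
>>>         1 to: size by: step do: [:i |
>>>             hash := (hash + (aString basicAt: i)) hashMultiply].
>>>         ^hash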
>>
>> The history of computing is littered with the bones of those who have
>> tried this kind of thing. It doesn't end well. Yes, you get a faster
>> hash function. And then you find, for sets of data that you or your
>> users actually want to use, that you get collisions like crazy, and much
>> worse overall performance than you started with.
>>
>> Sure, it works OK for the sets of data that the designer *tested*, and
>> probably for the sets of data that they *anticipated*. But real-world
>> data is tricky. It includes data sets where the characters that differ
>> are the ones that the hash thinks are unimportant, and there goes your
>> performance, by orders of magnitude. For instance, early versions of
>> Java used a limited number of characters to hash strings. One of the
>> biggest compatibility-breaking changes they were forced to make in later
>> Java versions was to consider *all* characters in hashing. It turned out
>> that it was very common to hash URLs, and many distinct URLs had most of
>> their characters in common.
>>
>> And you don't always get to choose your data -- sometimes you have an
>> opponent who is actively looking to create collisions as a
>> denial-of-service attack. There was a fair-sized kerfuffle about this a
>> few years ago -- most web scripting languages made it too easy to mount
>> this kind of attack.
>> 
>> https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
>> "...an attacker can degenerate the hash table by sending lots of
>> colliding keys. ...making it possible to exhaust hours of CPU time using
>> a single HTTP request."
>> To guard against this kind of attack you need a randomized element in
>> your hash (not a bad idea for Smalltalk, actually, and pretty easy --
>> mixing in the identity hash of the collection might be sufficient) or a
>> cryptographic hash (not worth the computational expense for most
>> purposes). However, even adding a randomized element would not prevent
>> this kind of attack if you predictably completely ignore some characters
>> of the input string. That just makes it *so* easy to generate data that
>> validates, and is not equal, but causes collisions.
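>>
>> To be concrete about the randomized element: even something this small, a
>> per-collection seed mixed into the element hash, would help (purely
>> illustrative, not a patch to the actual HashedCollection code):
>>
>>     scaledHashOf: key
>>         "Mix a per-collection value into the key's hash so an attacker
>>          can't precompute colliding keys offline."
>>         ^(key hash bitXor: self identityHash) hashMultiply
>>
>> But again, that only helps if the hash actually depends on all of the input.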
>>
>> So really, for general-purpose use (i.e. what's built into the language)
>> hash *every* character of *all* strings. If someone finds that this is a
>> performance problem in a real-world situation, it can be addressed in an
>> application-specific way.
>
>

