[squeak-dev] [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Eliot Miranda eliot.miranda at gmail.com
Tue May 2 00:47:32 UTC 2017


On Mon, May 1, 2017 at 5:42 PM, Levente Uzonyi <leves at caesar.elte.hu> wrote:

> On Mon, 1 May 2017, David T. Lewis wrote:
>
> On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
>>
>>> I presume that a general-purpose in-image solution would be more
>>> complex. String already has too many subclasses (6 in Squeak), while at the
>>> same time other kinds of new subclasses would be welcome too, e.g. Strings
>>> with 2-byte characters.
>>> Since these properties are orthogonal, there would be many new
>>> subclasses to cover all cases.
>>> Storing the size of the string in a different header word is a
>>> VM-specific detail, so I think caching the hash could also be hidden
>>> from the image.
>>>
>>> Levente
>>>
>>
>> Actually, I meant something more like this:
>>
>>   Object subclass: #LargeString
>>         instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
>>         classVariableNames: ''
>>         poolDictionaries: ''
>>         category: 'Probably a Bad Idea'
>>
>>
>> I was guessing that hashing very large strings would imply a somewhat
>> specialized problem domain, for which a wrapper class might make sense.
>> Certainly it would not be a general solution.
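>>
>> Concretely, the wrapper's #hash might cache on first use, something like
>> this (a sketch only, using the instance variables above; #= would also
>> have to forward to the wrapped string):
>>
>>   hash
>>         "Compute the wrapped string's hash once and cache it."
>>         ^ cachedHashValueForTheString ifNil:
>>                 [cachedHashValueForTheString := anyKindOfString hash]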
>>
>> I am probably over my quota of bad ideas for today, so I'll stop now ;-)
>>
>
> That's one way to solve the problem. It works, but you have to wrap all
> your strings, or they won't be equal to non-wrapped strings. So, it's not a
> transparent solution.
> If you just want to use a different, presumably quicker hash function, you
> can always use PluggableSet and PluggableDictionary.
> (If we had a really good JIT, using them would have no overhead compared
> to Set and Dictionary.)
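>
> For example (a sketch; the hash block can be any cheaper function you
> like, with all the caveats about sampling discussed further down):
>
>   set := PluggableSet new
>         hashBlock: [:s | (s first: (s size min: 64)) hash];
>         equalBlock: [:a :b | a = b];
>         yourself.
>
> PluggableDictionary works the same way.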


Right.  We have a crap JIT.  Even a "good JIT" wouldn't fix all the places
in the VM source where a string is accessed.



>
> Levente
>
>
>
>> Dave
>>
>>
>>
>>> On Mon, 1 May 2017, David T. Lewis wrote:
>>>
>>> >
>>> >Does it need to be done in the VM? Why not make a class LargeString with
>>> >instance variables aString and myCalculatedHashValueForTheString. That
>>> >way you can cache the hash value and calculate it any way you want.
>>> >
>>> >I know very little about hashing, just wondering if this kind of thing
>>> >can be handled more easily in the image.
>>> >
>>> >Dave
>>> >
>>> >
>>> >>
>>> >>On Mon, 1 May 2017, Levente Uzonyi wrote:
>>> >>
>>> >>>Well, I had started to write a reply, but I had to postpone it.
>>> >>>I mostly agree with your suggestions.
>>> >>>
>>> >>>One thing that can be done about large strings is to cache the
>>> >>>calculated hash value in the larger strings. Currently the object
>>> >>>representation changes when the string contains 255 or more
>>> >>>characters. In that case an additional 64-bit field is added to the
>>> >>>object header to store its length. If we were to use the upper 28+1
>>> >>>bits of that field to cache the hash, there would still be 35 bits to
>>> >>>encode length, which would be enough to represent strings up to 8 GiB.
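>>> >>>
>>> >>>In pseudo-Smalltalk the packing might look like this (a sketch; the
>>> >>>actual field layout is a Spur implementation detail, and the extra
>>> >>>bit is a "hash is cached" flag):
>>> >>>
>>> >>>  lengthWord := ((hashIsCached ifTrue: [1] ifFalse: [0]) bitShift: 63)
>>> >>>        bitOr: (((hash bitAnd: 16rFFFFFFF) bitShift: 35)
>>> >>>                bitOr: (byteLength bitAnd: 16r7FFFFFFFF)).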
>>> >>
>>> >>Well, we can keep the whole range minus one bit, but then we can't
>>> >>store the hash for strings larger than 8 GiB.
>>> >>
>>> >>Levente
>>> >>
>>> >>>But this would require further VM changes (e.g. at:put: would have to
>>> >>>flush the cache).
>>> >>>
>>> >>>Levente
>>> >>>
>>> >>>On Mon, 1 May 2017, Martin McClure wrote:
>>> >>>
>>> >>>>I see no replies to this on any of the three lists it was sent to,
>>> >>>>so I guess I'll chime in.
>>> >>>>
>>> >>>>tl;dr:
>>> >>>>Making a primitive for #hashMultiply -- probably a good idea, in some
>>> >>>>form, since doing it in Smalltalk is awkward.
>>> >>>>Only hashing every N-th character of large strings -- probably a very
>>> >>>>bad idea. Performance might well get worse, the complexity does not
>>> >>>>seem justified, and it would open a sizeable security hole.
>>> >>>>
>>> >>>>More verbiage below for those interested.
>>> >>>>
>>> >>>>Regards,
>>> >>>>-Martin
>>> >>>>
>>> >>>>On 04/18/2017 07:09 PM, Eliot Miranda wrote:
>>> >>>>>Hi All,
>>> >>>>>
>>> >>>>>    the hash algorithm used for ByteString in Squeak and Pharo is
>>> >>>>>good for "small" strings and overkill for large strings.
>>> >>>>
>>> >>>>Why do you say it's overkill for large strings? Are there
>>> >>>>applications with large strings that are being negatively impacted
>>> >>>>by the current algorithm? Which ones, and impacted how?
>>> >>>>
>>> >>>>>It is important in many applications to get well distributed string
>>> >>>>>hashes, especially over the range of strings that constitute things
>>> >>>>>like method names, URLs, etc.  Consequently, the current algorithm
>>> >>>>>includes every character in a string.  This works very well for
>>> >>>>>"small" strings and results in very slow hashes (and hence long
>>> >>>>>latencies, because the hash is an uninterruptible primitive) for
>>> >>>>>large strings, where large may be several megabytes.
>>> >>>>
>>> >>>>A simple solution for the uninterruptible primitive is to not make
>>> >>>>it a primitive. Make #hashMultiply a primitive (since this
>>> >>>>particular kind of numeric modulo computation is really painful in
>>> >>>>Smalltalk), and do the rest in a loop in Smalltalk. It sounds like
>>> >>>>you've done the #hashMultiply primitive already.
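>>> >>>>
>>> >>>>Something of this shape, say (a sketch only, assuming the
>>> >>>>#hashMultiply primitive on SmallInteger; the initial value and the
>>> >>>>28-bit mask follow the current String hash):
>>> >>>>
>>> >>>>  hash
>>> >>>>        "Per-character loop in Smalltalk; only the multiply-and-mask
>>> >>>>        step is primitive."
>>> >>>>        | h |
>>> >>>>        h := self species hash bitAnd: 16rFFFFFFF.
>>> >>>>        1 to: self size do:
>>> >>>>                [:i | h := (h + (self basicAt: i)) hashMultiply].
>>> >>>>        ^h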
>>> >>>>
>>> >>>>If the overhead of calling a primitive for each character proves to
>>> >>>>be too much, even with the faster primitive-calling methodologies you
>>> >>>>talked about in the "Cog Primitive Performance" thread on the Vm-dev
>>> >>>>list, a more complex primitive could take a range of bytes, so large
>>> >>>>strings would be done in batches, solving the latency problem.
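>>> >>>>
>>> >>>>For instance (a sketch; #hashChunkFrom:to:initialHash: is a
>>> >>>>hypothetical batched primitive, not an existing one):
>>> >>>>
>>> >>>>  hash
>>> >>>>        | h |
>>> >>>>        h := self species hash bitAnd: 16rFFFFFFF.
>>> >>>>        1 to: self size by: 4096 do:
>>> >>>>                [:start |
>>> >>>>                h := self hashChunkFrom: start
>>> >>>>                        to: (start + 4095 min: self size)
>>> >>>>                        initialHash: h].
>>> >>>>        ^h
>>> >>>>
>>> >>>>Each 4 KiB chunk bounds the time spent in any single primitive call,
>>> >>>>so the image stays interruptible between chunks.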
>>> >>>>
>>> >>>>>
>>> >>>>>Let's look at the basic hash algorithm.
>>> >>>>[...]
>>> >>>>>
>>> >>>>>In looking at this I've added a primitive for hashMultiply;
>>> >>>>>primitive #159 implements precisely
>>> >>>>>self * 1664525 bitAnd: 16r0FFFFFFF
>>> >>>>>for SmallInteger and LargePositiveInteger receivers, as fast as
>>> >>>>>possible in the Cog JIT.  With this machinery in place it's
>>> >>>>>instructive to compare the cost of the primitive against the
>>> >>>>>non-primitive Smalltalk code.
>>> >>>>>
>>> >>>>>First let me introduce a set of replacement hash functions,
>>> >>>>>newHashN.  These hash all characters in strings up to a certain
>>> >>>>>size, and then no more than that number for larger strings.  Here
>>> >>>>>are newHash64 and newHash2048, which use pure Smalltalk, including
>>> >>>>>an inlined hashMultiply written to avoid SmallInteger overflow.
>>> >>>>>Also measured are the obvious variants newHash128, newHash256,
>>> >>>>>newHash512 & newHash1024.
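>>> >>>>>
>>> >>>>>The inlined hashMultiply exploits 1664525 = (16r0065 bitShift: 14)
>>> >>>>>+ 16r260D, splitting the accumulator at 14 bits so that no
>>> >>>>>intermediate product leaves SmallInteger range.  The inner step is
>>> >>>>>essentially:
>>> >>>>>
>>> >>>>>  low := hash bitAnd: 16r3FFF.
>>> >>>>>  hash := (16r260D * low
>>> >>>>>        + ((16r260D * (hash bitShift: -14) + (16r0065 * low)
>>> >>>>>                bitAnd: 16r3FFF) * 16r4000))
>>> >>>>>        bitAnd: 16r0FFFFFFF.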
>>> >>>>>
>>> >>>>>
>>> >>>>[...]
>>> >>>>>So the idea here is to step through the string by 1 for string
>>> >>>>>sizes up to N - 1, and by greater than 1 for strings of size >= N,
>>> >>>>>limiting the maximum number of characters sampled to between
>>> >>>>>N // 2 and N - 1.
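>>> >>>>>
>>> >>>>>For example, for N = 64 the step could be derived as (a sketch, not
>>> >>>>>necessarily the exact newHash64 code):
>>> >>>>>
>>> >>>>>  step := self size < 64 ifTrue: [1] ifFalse: [self size // 32].
>>> >>>>>  "1 to: self size by: step then samples every character below
>>> >>>>>  size 64, and between 32 and 63 characters at size 64 and above."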
>>> >>>>
>>> >>>>The history of computing is littered with the bones of those who have
>>> >>>>tried this kind of thing. It doesn't end well. Yes, you get a faster
>>> >>>>hash function. And then you find, for sets of data that you or your
>>> >>>>users actually want to use, that you get collisions like crazy, and
>>> >>>>much worse overall performance than you started with.
>>> >>>>
>>> >>>>Sure, it works OK for the sets of data that the designer *tested*,
>>> >>>>and probably for the sets of data that they *anticipated*. But
>>> >>>>real-world data is tricky. It includes data sets where the
>>> >>>>characters that differ are the ones that the hash thinks are
>>> >>>>unimportant, and there goes your performance, by orders of
>>> >>>>magnitude. For instance, early versions of Java used a limited
>>> >>>>number of characters to hash strings. One of the biggest
>>> >>>>compatibility-breaking changes they were forced to make in later
>>> >>>>Java versions was to consider *all* characters in hashing. It turned
>>> >>>>out that it was very common to hash URLs, and many distinct URLs had
>>> >>>>most of their characters in common.
>>> >>>>
>>> >>>>And you don't always get to choose your data -- sometimes you have
>>> >>>>an opponent who is actively looking to create collisions as a
>>> >>>>denial-of-service attack. There was a fair-sized kerfuffle about
>>> >>>>this a few years ago -- most web scripting languages made it too
>>> >>>>easy to mount this kind of attack.
>>> >>>>
>>> >>>>https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
>>> >>>>"...an attacker can degenerate the hash table by sending lots of
>>> >>>>colliding keys. ...making it possible to exhaust hours of CPU time
>>> >>>>using a single HTTP request."
>>> >>>>To guard against this kind of attack you need a randomized element
>>> >>>>in your hash (not a bad idea for Smalltalk, actually, and pretty
>>> >>>>easy -- mixing in the identity hash of the collection might be
>>> >>>>sufficient) or a cryptographic hash (not worth the computational
>>> >>>>expense for most purposes). However, even adding a randomized
>>> >>>>element would not prevent this kind of attack if you predictably
>>> >>>>and completely ignore some characters of the input string. That just
>>> >>>>makes it *so* easy to generate data that validates, and is not
>>> >>>>equal, but causes collisions.
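>>> >>>>
>>> >>>>In the image, the randomized element could be as simple as this
>>> >>>>sketch using the pluggable collections mentioned above (the
>>> >>>>collection's own identityHash serves as the seed):
>>> >>>>
>>> >>>>  set := PluggableSet new.
>>> >>>>  set hashBlock:
>>> >>>>        [:key | (key hash bitXor: set identityHash) hashMultiply].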
>>> >>>>
>>> >>>>So really, for general-purpose use (i.e. what's built into the
>>> >>>>language) hash *every* character of *all* strings. If someone finds
>>> >>>>that this is a performance problem in a real-world situation, it can
>>> >>>>be addressed in an application-specific way.
>>> >>>
>>> >>>
>>> >>
>>>
>>>
>


-- 
_,,,^..^,,,_
best, Eliot

