[squeak-dev] [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi leves at caesar.elte.hu
Tue May 2 00:42:34 UTC 2017


On Mon, 1 May 2017, David T. Lewis wrote:

> On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
>> I presume that a general purpose in-image solution would be more complex. 
>> String already has too many subclasses (6 in Squeak), while at the same 
>> time other kind of new subclasses would be welcome too, e.g. Strings with 
>> 2-byte characters.
>> Since these properties are orthogonal, there would be many new subclasses 
>> to cover all cases.
>> Storing the size of the string in a different header word is a VM specific 
>> detail, so I think caching the hash could also be hidden from the image.
>> 
>> Levente
>
> Actually, I meant something more like this:
>
>   Object subclass: #LargeString
> 	instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
> 	classVariableNames: ''
> 	poolDictionaries: ''
> 	category: 'Probably a Bad Idea'
>
>
> I was guessing that hashing very large strings would imply a somewhat
> specialized problem domain, for which a wrapper class might make sense.
> Certainly it would not be a general solution.
>
> I am probably over my quota of bad ideas for today, so I'll stop now ;-)

That's one way to solve the problem. It works, but you have to wrap all 
your strings, or they won't be equal to non-wrapped strings. So, it's 
not a transparent solution.
If you just want to use a different, expectedly quicker hash function, 
you can always use PluggableSet and PluggableDictionary.
(If we had a really good JIT, using them would have no overhead compared 
to Set and Dictionary.)

Levente

>
> Dave
>
>
>> 
>> On Mon, 1 May 2017, David T. Lewis wrote:
>> 
>> >
>> >Does it need to be done in the VM? Why not make a class LargeString with
>> >instance variables aString and myCalculatedHashValueForTheString. That way
>> >you can cache the hash value and calculate it any way you want.
>> >
>> >I know vey little about hashing, just wondering if this kind of thing can
>> >be handled more easily in the image.
>> >
>> >Dave
>> >
>> >
>> >>
>> >>On Mon, 1 May 2017, Levente Uzonyi wrote:
>> >>
>> >>>Well, I had started to write a reply, but I had to postpone it.
>> >>>I mostly agree with your suggestions.
>> >>>
>> >>>One thing that can be done about large strings is to cache the
>> >>>calculated
>> >>>hash value in the larger strings. Currently the object representation
>> >>>changes when the string contains 255 or more characters. In that case an
>> >>>additional 64 bit field is added to the object header to store its
>> >>>length.
>> >>>If we were to use the upper 28+1 bits of that field to cache the hash,
>> >>>there would still be 35-bits to encode length, which would be enough to
>> >>>represent strings up to 8 GiB.
>> >>
>> >>Well, we can keep the whole range minus one bit, but then we can't store
>> >>the hash for strings larger than 8 GiB.
>> >>
>> >>Levente
>> >>
>> >>>But this would require further VM changes (e.g. at:put: would have to
>> >>>flush the cache).
>> >>>
>> >>>Levente
>> >>>
>> >>>On Mon, 1 May 2017, Martin McClure wrote:
>> >>>
>> >>>>I see no replies to this on any of the three lists it was sent to, so I
>> >>>>guess I'll chime in.
>> >>>>
>> >>>>tl;dr:
>> >>>>Making a primitive for #hashMultiply, probably a good idea, in some
>> >>>>form, since doing it in Smalltalk is awkward.
>> >>>>Only hashing every N-th character of large strings, probably a very bad
>> >>>>idea. Performance might well get worse, the complexity does not seem
>> >>>>justified, and it would open a sizeable security hole.
>> >>>>
>> >>>>More verbiage below for those interested.
>> >>>>
>> >>>>Regards,
>> >>>>-Martin
>> >>>>
>> >>>>On 04/18/2017 07:09 PM, Eliot Miranda wrote:
>> >>>>>Hi All,
>> >>>>>
>> >>>>>    the hash algorithm used for ByteString in Squeak and Pharo is good
>> >>>>>for "small" strings and overkill for large strings.
>> >>>>
>> >>>>Why do you say it's overkill for large strings? Are there applications
>> >>>>with large strings that are being negatively impacted by the current
>> >>>>algorithm? Which ones, and impacted how?
>> >>>>
>> >>>>>It is important in many applications to get well distributed string
>> >>>>>hashes, especially over the range of strings that constitute things
>> >>>>>like method names, URLs, etc.  Consequently, the current algorithm
>> >>>>>includes every character in a string.  This works very well for
>> >>>>>"small" strings and results in very slow hashes (and hence long
>> >>>>>latencies, because the hash is an uninterruptible primitive) for large
>> >>>>>strings, where large may be several megabytes.
>> >>>>
>> >>>>A simple solution for the uninterruptable primitive is to not make it a
>> >>>>primitive. Make #hashMultiply a primitive (since this particular kind
>> >>>>of
>> >>>>numeric modulo computation is really painful in Smalltalk), and do the
>> >>>>rest in a loop in Smalltalk. It sounds like you've done the
>> >>>>#hashMultiply primitive already.
>> >>>>
>> >>>>If the overhead of calling a primitive for each character proves to be
>> >>>>too much, even with the faster primitive calling methodologies you
>> >>>>talked about in the "Cog Primitive Performance" thread on the Vm-dev
>> >>>>list, a more complex primitive could take a range of bytes, so large
>> >>>>strings would be done in batches, solving the latency problem.
>> >>>>
>> >>>>>
>> >>>>>Let's look at the basic hash algorithm.
>> >>>>[...]
>> >>>>>
>> >>>>>In looking at this I've added a primitive for hashMultiply; primitive
>> >>>>>#159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
>> >>>>>SmallInteger and LargePositiveInteger receivers, as fast as possible
>> >>>>>in the Cog JIT.  With this machinery in place it's instructive to
>> >>>>>compare the cost of the primitive against the non-primitive Smalltalk
>> >>>>>code.
>> >>>>>
>> >>>>>First let me introduce a set of replacement hash functions, newHashN.
>> >>>>>These hash all characters in strings up to a certain size, and then no
>> >>>>>more than that number for larger strings.  Here are newHash64 and
>> >>>>>newHash2048, which use pure Smalltalk, including an inlined
>> >>>>>hashMultiply written to avoid SmallInteger overflow.  Also measured
>> >>>>>are the obvious variants newHash128, newHash256, newHash512 &
>> >>>>>mewHash1024.
>> >>>>>
>> >>>>>
>> >>>>[...]
>> >>>>>So the idea here is to step through the string by 1 for strings sizes
>> >>>>>up to N - 1, and by greater than 1 for strings of size >= N, limiting
>> >>>>>the maximum number of characters sampled to between N // 2 and N - 1.
>> >>>>
>> >>>>The history of computing is littered with the bones of those who have
>> >>>>tried this kind of thing. It doesn't end well. Yes, you get a faster
>> >>>>hash function. And then you find, for sets of data that you or your
>> >>>>users actually want to use, that you get collisions like crazy, and
>> >>>>much
>> >>>>worse overall performance than you started with.
>> >>>>
>> >>>>Sure, it works OK for the sets of data that the designer *tested*, and
>> >>>>probably for the sets of data that they *anticipated*. But real-world
>> >>>>data is tricky. It includes data sets where the characters that differ
>> >>>>are the ones that the hash thinks are unimportant, and there goes your
>> >>>>performance, by orders of magnitude. For instance, early versions of
>> >>>>Java used a limited number of characters to hash strings. One of the
>> >>>>biggest compatibility-breaking changes they were forced to make in
>> >>>>later
>> >>>>Java versions was to consider *all* characters in hashing. It turned
>> >>>>out
>> >>>>that it was very common to hash URLs, and many distinct URLs had most
>> >>>>of
>> >>>>their characters in common.
>> >>>>
>> >>>>And you don't always get to choose your data -- sometimes you have an
>> >>>>opponent who is actively looking to create collisions as a
>> >>>>denial-of-service attack. There was a fair-sized kerfluffle about this
>> >>>>a
>> >>>>few years ago -- most web scripting languages made it too easy to mount
>> >>>>this kind of attack.
>> >>>>
>> >>>https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
>> >>>>"...an attacker can degenerate the hash table by sending lots of
>> >>>>colliding keys. ...making it possible to exhaust hours of CPU time
>> >>>>using
>> >>>>a single HTTP request."
>> >>>>To guard against this kind of attack you need a randomized element in
>> >>>>your hash (not a bad idea for Smalltalk, actually, and pretty easy --
>> >>>>mixing in the identity hash of the collection might be sufficient) or a
>> >>>>cryptographic hash (not worth the computational expense for most
>> >>>>purposes). However, even adding a randomized element would not prevent
>> >>>>this kind of attack if you predictably completely ignore some
>> >>>>characters
>> >>>>of the input string. That just makes it *so* easy to generate data that
>> >>>>validates, and is not equal, but causes collisions.
>> >>>>
>> >>>>So really, for general-purpose use (i.e. what's built into the
>> >>>>language)
>> >>>>hash *every* character of *all* strings. If someone finds that this is
>> >>>>a
>> >>>>performance problem in a real-world situation, it can be addressed in
>> >>>>an
>> >>>>application-specific way.
>> >>>
>> >>>
>> >>
>>


More information about the Squeak-dev mailing list