[squeak-dev] [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.
chisvasileandrei at gmail.com
Mon May 1 09:24:57 UTC 2017
On Mon, May 1, 2017 at 9:09 AM, Martin McClure <martin at hand2mouse.com> wrote:
> I see no replies to this on any of the three lists it was sent to, so I
> guess I'll chime in.
> Making a primitive for #hashMultiply, probably a good idea, in some
> form, since doing it in Smalltalk is awkward.
> Only hashing every N-th character of large strings, probably a very bad
> idea. Performance might well get worse, the complexity does not seem
> justified, and it would open a sizeable security hole.
> More verbiage below for those interested.
> On 04/18/2017 07:09 PM, Eliot Miranda wrote:
> > Hi All,
> > the hash algorithm used for ByteString in Squeak and Pharo is good
> > for "small" strings and overkill for large strings.
> Why do you say it's overkill for large strings? Are there applications
> with large strings that are being negatively impacted by the current
> algorithm? Which ones, and impacted how?
> > It is important in many applications to get well distributed string
> > hashes, especially over the range of strings that constitute things
> > like method names, URLs, etc. Consequently, the current algorithm
> > includes every character in a string. This works very well for
> > "small" strings and results in very slow hashes (and hence long
> > latencies, because the hash is an uninterruptible primitive) for large
> > strings, where large may be several megabytes.
> A simple solution for the uninterruptible primitive is to not make it a
> primitive. Make #hashMultiply a primitive (since this particular kind of
> numeric modulo computation is really painful in Smalltalk), and do the
> rest in a loop in Smalltalk. It sounds like you've done the
> #hashMultiply primitive already.
> If the overhead of calling a primitive for each character proves to be
> too much, even with the faster primitive calling methodologies you
> talked about in the "Cog Primitive Performance" thread on the Vm-dev
> list, a more complex primitive could take a range of bytes, so large
> strings would be done in batches, solving the latency problem.
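
For the sake of discussion, a minimal sketch of that split (illustrative selector, not the actual Squeak/Pharo code), where only the multiply-and-mask step is assumed to be the #hashMultiply primitive and the per-character loop stays in ordinary, interruptible Smalltalk:

  stringHash: aString initialHash: speciesHash
      "Hash every character; only #hashMultiply is assumed primitive,
       so the loop itself can be interrupted like any other Smalltalk code."
      | hash |
      hash := speciesHash bitAnd: 16r0FFFFFFF.
      1 to: aString size do: [:index |
          hash := (hash + (aString basicAt: index)) hashMultiply].
      ^hash
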
> > Let's look at the basic hash algorithm.
> > In looking at this I've added a primitive for hashMultiply; primitive
> > #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
> > SmallInteger and LargePositiveInteger receivers, as fast as possible
> > in the Cog JIT. With this machinery in place it's instructive to
> > compare the cost of the primitive against the non-primitive Smalltalk
> > code.
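
For reference, the same multiply-and-mask step can be written in pure Smalltalk without overflowing into LargeIntegers by splitting the multiplier 1664525 (16r19660D) into 14-bit pieces; something along these lines (a sketch of the usual inlined form, not necessarily the exact code Eliot measured):

  hashMultiply
      "Answer (self * 1664525) bitAnd: 16r0FFFFFFF without LargeInteger
       arithmetic: 1664525 = (16r0065 * 16384) + 16r260D, so the product
       is assembled from 14-bit pieces and masked back to 28 bits."
      | low |
      low := self bitAnd: 16383.
      ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
            bitAnd: 16383) * 16384))
          bitAnd: 16r0FFFFFFF
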
> > First let me introduce a set of replacement hash functions, newHashN.
> > These hash all characters in strings up to a certain size, and then no
> > more than that number for larger strings. Here are newHash64 and
> > newHash2048, which use pure Smalltalk, including an inlined
> > hashMultiply written to avoid SmallInteger overflow. Also measured
> > are the obvious variants newHash128, newHash256, newHash512 & newHash1024.
> > So the idea here is to step through the string by 1 for string sizes
> > up to N - 1, and by greater than 1 for strings of size >= N, limiting
> > the maximum number of characters sampled to between N // 2 and N - 1.
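
To make the sampling idea concrete, here is a rough sketch (illustrative selector, not Eliot's actual newHashN code): the step is 1 for strings shorter than n, so small strings are hashed in full, and it grows with the string so that only on the order of n characters of a large string are ever touched:

  newHash: n
      "Illustrative sampling hash: hash every character of strings shorter
       than n, and every step-th character of longer ones, so a large
       string contributes at most about n sampled characters."
      | hash step |
      hash := self size hashMultiply.
      step := self size // n + 1.
      1 to: self size by: step do: [:index |
          hash := (hash + (self basicAt: index)) hashMultiply].
      ^hash
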
> The history of computing is littered with the bones of those who have
> tried this kind of thing. It doesn't end well. Yes, you get a faster
> hash function. And then you find, for sets of data that you or your
> users actually want to use, that you get collisions like crazy, and much
> worse overall performance than you started with.
> Sure, it works OK for the sets of data that the designer *tested*, and
> probably for the sets of data that they *anticipated*. But real-world
> data is tricky. It includes data sets where the characters that differ
> are the ones that the hash thinks are unimportant, and there goes your
> performance, by orders of magnitude. For instance, early versions of
> Java used a limited number of characters to hash strings. One of the
> biggest compatibility-breaking changes they were forced to make in later
> Java versions was to consider *all* characters in hashing. It turned out
> that it was very common to hash URLs, and many distinct URLs had most of
> their characters in common.
Was curious about this and found the actual issues, in case anybody else is interested.
> And you don't always get to choose your data -- sometimes you have an
> opponent who is actively looking to create collisions as a
> denial-of-service attack. There was a fair-sized kerfuffle about this a
> few years ago -- most web scripting languages made it too easy to mount
> this kind of attack.
> "...an attacker can degenerate the hash table by sending lots of
> colliding keys. ...making it possible to exhaust hours of CPU time using
> a single HTTP request."
> To guard against this kind of attack you need a randomized element in
> your hash (not a bad idea for Smalltalk, actually, and pretty easy --
> mixing in the identity hash of the collection might be sufficient) or a
> cryptographic hash (not worth the computational expense for most
> purposes). However, even adding a randomized element would not prevent
> this kind of attack if you predictably and completely ignore some characters
> of the input string. That just makes it *so* easy to generate data that
> validates, and is not equal, but causes collisions.
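
To illustrate the randomization idea (names are made up; this is not existing library code): a seed chosen once per image and mixed into every string hash makes the hash values unpredictable to an outside attacker while staying stable within a session, though, as noted above, it only helps if every character still participates:

  initializeHashSeed
      "Class-side initialization, run once at image startup.
       HashSeed is an assumed class variable."
      HashSeed := 16r0FFFFFFF atRandom

  hash
      "Instance-side hash mixing the unpredictable per-image seed
       into every character of the receiver."
      | h |
      h := HashSeed.
      1 to: self size do: [:index |
          h := (h + (self basicAt: index)) hashMultiply].
      ^h
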
> So really, for general-purpose use (i.e. what's built into the language)
> hash *every* character of *all* strings. If someone finds that this is a
> performance problem in a real-world situation, it can be addressed in an
> application-specific way.