<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, May 1, 2017 at 9:09 AM, Martin McClure <span dir="ltr"><<a href="mailto:martin@hand2mouse.com" target="_blank">martin@hand2mouse.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

I see no replies to this on any of the three lists it was sent to, so I<br>

guess I'll chime in.<br>

<br>

tl;dr:<br>

Making a primitive for #hashMultiply, probably a good idea, in some<br>

form, since doing it in Smalltalk is awkward.<br>

Only hashing every N-th character of large strings, probably a very bad<br>

idea. Performance might well get worse, the complexity does not seem<br>

justified, and it would open a sizeable security hole.<br>

<br>

More verbiage below for those interested.<br>

<br>

Regards,<br>

-Martin<br>

<br>

On 04/18/2017 07:09 PM, Eliot Miranda wrote:<br>

> Hi All,<br>

><br>

>     the hash algorithm used for ByteString in Squeak and Pharo is good<br>

> for "small" strings and overkill for large strings.<br>

<br>

Why do you say it's overkill for large strings? Are there applications<br>

with large strings that are being negatively impacted by the current<br>

algorithm? Which ones, and impacted how?<br>

<br>

> It is important in many applications to get well distributed string<br>

> hashes, especially over the range of strings that constitute things<br>

> like method names, URLs, etc.  Consequently, the current algorithm<br>

> includes every character in a string.  This works very well for<br>

> "small" strings and results in very slow hashes (and hence long<br>

> latencies, because the hash is an uninterruptible primitive) for large<br>

> strings, where large may be several megabytes.<br>

<br>

A simple solution for the uninterruptable primitive is to not make it a<br>

primitive. Make #hashMultiply a primitive (since this particular kind of<br>

numeric modulo computation is really painful in Smalltalk), and do the<br>

rest in a loop in Smalltalk. It sounds like you've done the<br>

#hashMultiply primitive already.<br>

<br>

If the overhead of calling a primitive for each character proves to be<br>

too much, even with the faster primitive calling methodologies you<br>

talked about in the "Cog Primitive Performance" thread on the Vm-dev<br>

list, a more complex primitive could take a range of bytes, so large<br>

strings would be done in batches, solving the latency problem.<br>

<br>

><br>

> Let's look at the basic hash algorithm.<br>

[...]<br>

><br>

> In looking at this I've added a primitive for hashMultiply; primitive<br>

> #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for<br>

> SmallInteger and LargePositiveInteger receivers, as fast as possible<br>

> in the Cog JIT.  With this machinery in place it's instructive to<br>

> compare the cost of the primitive against the non-primitive Smalltalk<br>

> code.<br>

><br>

> First let me introduce a set of replacement hash functions, newHashN.<br>

> These hash all characters in strings up to a certain size, and then no<br>

> more than that number for larger strings.  Here are newHash64 and<br>

> newHash2048, which use pure Smalltalk, including an inlined<br>

> hashMultiply written to avoid SmallInteger overflow.  Also measured<br>

> are the obvious variants newHash128, newHash256, newHash512 & mewHash1024.<br>

><br>

><br>

[...]<br>

> So the idea here is to step through the string by 1 for strings sizes<br>

> up to N - 1, and by greater than 1 for strings of size >= N, limiting<br>

> the maximum number of characters sampled to between N // 2 and N - 1.<br>

<br>

The history of computing is littered with the bones of those who have<br>

tried this kind of thing. It doesn't end well. Yes, you get a faster<br>

hash function. And then you find, for sets of data that you or your<br>

users actually want to use, that you get collisions like crazy, and much<br>

worse overall performance than you started with.<br>

<br>

Sure, it works OK for the sets of data that the designer *tested*, and<br>

probably for the sets of data that they *anticipated*. But real-world<br>

data is tricky. It includes data sets where the characters that differ<br>

are the ones that the hash thinks are unimportant, and there goes your<br>

performance, by orders of magnitude. For instance, early versions of<br>

Java used a limited number of characters to hash strings. One of the<br>

biggest compatibility-breaking changes they were forced to make in later<br>

Java versions was to consider *all* characters in hashing. It turned out<br>

that it was very common to hash URLs, and many distinct URLs had most of<br>

their characters in common.<br></blockquote><div><br></div><div>Was curious about this and found the actual issues [1] if anybody else is interested.</div><div><br></div><div>[1] <a href="http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622">http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622</a><br></div><div><br></div><div>Cheers,</div><div>Andrei</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

And you don't always get to choose your data -- sometimes you have an<br>

opponent who is actively looking to create collisions as a<br>

denial-of-service attack. There was a fair-sized kerfluffle about this a<br>

few years ago -- most web scripting languages made it too easy to mount<br>

this kind of attack.<br>

<a href="https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/" rel="noreferrer" target="_blank">https://arstechnica.com/<wbr>business/2011/12/huge-<wbr>portions-of-web-vulnerable-to-<wbr>hashing-denial-of-service-<wbr>attack/</a><br>

"...an attacker can degenerate the hash table by sending lots of<br>

colliding keys. ...making it possible to exhaust hours of CPU time using<br>

a single HTTP request."<br>

To guard against this kind of attack you need a randomized element in<br>

your hash (not a bad idea for Smalltalk, actually, and pretty easy --<br>

mixing in the identity hash of the collection might be sufficient) or a<br>

cryptographic hash (not worth the computational expense for most<br>

purposes). However, even adding a randomized element would not prevent<br>

this kind of attack if you predictably completely ignore some characters<br>

of the input string. That just makes it *so* easy to generate data that<br>

validates, and is not equal, but causes collisions.<br>

<br>

So really, for general-purpose use (i.e. what's built into the language)<br>

hash *every* character of *all* strings. If someone finds that this is a<br>

performance problem in a real-world situation, it can be addressed in an<br>

application-specific way.<br>

<br>

</blockquote></div><br></div></div>