<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, May 1, 2017 at 5:11 PM, David T. Lewis <span dir="ltr"><<a href="mailto:lewis@mail.msen.com" target="_blank">lewis@mail.msen.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:<br>

> I presume that a general purpose in-image solution would be more complex.<br>

> String already has too many subclasses (6 in Squeak), while at the same<br>

> time other kind of new subclasses would be welcome too, e.g. Strings with<br>

> 2-byte characters.<br>

> Since these properties are orthogonal, there would be many new subclasses<br>

> to cover all cases.<br>

> Storing the size of the string in a different header word is a VM specific<br>

> detail, so I think caching the hash could also be hidden from the image.<br>

><br>

> Levente<br>

<br>

</span>Actually, I meant something more like this:<br>

<br>

   Object subclass: #LargeString<br>

        instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'<br>

        classVariableNames: ''<br>

        poolDictionaries: ''<br>

        category: 'Probably a Bad Idea'<br>

<br>

<br>

I was guessing that hashing very large strings would imply a somewhat<br>

specialized problem domain, for which a wrapper class might make sense.<br>

Certainly it would not be a general solution.<br></blockquote><div><br></div><div>But this implies changing very many places in the VM that access a string as a vector of bytes.  That's why Levente suggests hiding the hash in unused memory in Spur.  It doesn't have the side effect of requiring a major rewrite.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

I am probably over my quota of bad ideas for today, so I'll stop now ;-)<br>

<div class="HOEnZb"><div class="h5"><br>

Dave<br>

<br>

<br>

><br>

> On Mon, 1 May 2017, David T. Lewis wrote:<br>

><br>

> ><br>

> >Does it need to be done in the VM? Why not make a class LargeString with<br>

> >instance variables aString and myCalculatedHashValueForTheStr<wbr>ing. That way<br>

> >you can cache the hash value and calculate it any way you want.<br>

> ><br>

> >I know vey little about hashing, just wondering if this kind of thing can<br>

> >be handled more easily in the image.<br>

> ><br>

> >Dave<br>

> ><br>

> ><br>

> >><br>

> >>On Mon, 1 May 2017, Levente Uzonyi wrote:<br>

> >><br>

> >>>Well, I had started to write a reply, but I had to postpone it.<br>

> >>>I mostly agree with your suggestions.<br>

> >>><br>

> >>>One thing that can be done about large strings is to cache the<br>

> >>>calculated<br>

> >>>hash value in the larger strings. Currently the object representation<br>

> >>>changes when the string contains 255 or more characters. In that case an<br>

> >>>additional 64 bit field is added to the object header to store its<br>

> >>>length.<br>

> >>>If we were to use the upper 28+1 bits of that field to cache the hash,<br>

> >>>there would still be 35-bits to encode length, which would be enough to<br>

> >>>represent strings up to 8 GiB.<br>

> >><br>

> >>Well, we can keep the whole range minus one bit, but then we can't store<br>

> >>the hash for strings larger than 8 GiB.<br>

> >><br>

> >>Levente<br>

> >><br>

> >>>But this would require further VM changes (e.g. at:put: would have to<br>

> >>>flush the cache).<br>

> >>><br>

> >>>Levente<br>

> >>><br>

> >>>On Mon, 1 May 2017, Martin McClure wrote:<br>

> >>><br>

> >>>>I see no replies to this on any of the three lists it was sent to, so I<br>

> >>>>guess I'll chime in.<br>

> >>>><br>

> >>>>tl;dr:<br>

> >>>>Making a primitive for #hashMultiply, probably a good idea, in some<br>

> >>>>form, since doing it in Smalltalk is awkward.<br>

> >>>>Only hashing every N-th character of large strings, probably a very bad<br>

> >>>>idea. Performance might well get worse, the complexity does not seem<br>

> >>>>justified, and it would open a sizeable security hole.<br>

> >>>><br>

> >>>>More verbiage below for those interested.<br>

> >>>><br>

> >>>>Regards,<br>

> >>>>-Martin<br>

> >>>><br>

> >>>>On 04/18/2017 07:09 PM, Eliot Miranda wrote:<br>

> >>>>>Hi All,<br>

> >>>>><br>

> >>>>>    the hash algorithm used for ByteString in Squeak and Pharo is good<br>

> >>>>>for "small" strings and overkill for large strings.<br>

> >>>><br>

> >>>>Why do you say it's overkill for large strings? Are there applications<br>

> >>>>with large strings that are being negatively impacted by the current<br>

> >>>>algorithm? Which ones, and impacted how?<br>

> >>>><br>

> >>>>>It is important in many applications to get well distributed string<br>

> >>>>>hashes, especially over the range of strings that constitute things<br>

> >>>>>like method names, URLs, etc.  Consequently, the current algorithm<br>

> >>>>>includes every character in a string.  This works very well for<br>

> >>>>>"small" strings and results in very slow hashes (and hence long<br>

> >>>>>latencies, because the hash is an uninterruptible primitive) for large<br>

> >>>>>strings, where large may be several megabytes.<br>

> >>>><br>

> >>>>A simple solution for the uninterruptable primitive is to not make it a<br>

> >>>>primitive. Make #hashMultiply a primitive (since this particular kind<br>

> >>>>of<br>

> >>>>numeric modulo computation is really painful in Smalltalk), and do the<br>

> >>>>rest in a loop in Smalltalk. It sounds like you've done the<br>

> >>>>#hashMultiply primitive already.<br>

> >>>><br>

> >>>>If the overhead of calling a primitive for each character proves to be<br>

> >>>>too much, even with the faster primitive calling methodologies you<br>

> >>>>talked about in the "Cog Primitive Performance" thread on the Vm-dev<br>

> >>>>list, a more complex primitive could take a range of bytes, so large<br>

> >>>>strings would be done in batches, solving the latency problem.<br>

> >>>><br>

> >>>>><br>

> >>>>>Let's look at the basic hash algorithm.<br>

> >>>>[...]<br>

> >>>>><br>

> >>>>>In looking at this I've added a primitive for hashMultiply; primitive<br>

> >>>>>#159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for<br>

> >>>>>SmallInteger and LargePositiveInteger receivers, as fast as possible<br>

> >>>>>in the Cog JIT.  With this machinery in place it's instructive to<br>

> >>>>>compare the cost of the primitive against the non-primitive Smalltalk<br>

> >>>>>code.<br>

> >>>>><br>

> >>>>>First let me introduce a set of replacement hash functions, newHashN.<br>

> >>>>>These hash all characters in strings up to a certain size, and then no<br>

> >>>>>more than that number for larger strings.  Here are newHash64 and<br>

> >>>>>newHash2048, which use pure Smalltalk, including an inlined<br>

> >>>>>hashMultiply written to avoid SmallInteger overflow.  Also measured<br>

> >>>>>are the obvious variants newHash128, newHash256, newHash512 &<br>

> >>>>>mewHash1024.<br>

> >>>>><br>

> >>>>><br>

> >>>>[...]<br>

> >>>>>So the idea here is to step through the string by 1 for strings sizes<br>

> >>>>>up to N - 1, and by greater than 1 for strings of size >= N, limiting<br>

> >>>>>the maximum number of characters sampled to between N // 2 and N - 1.<br>

> >>>><br>

> >>>>The history of computing is littered with the bones of those who have<br>

> >>>>tried this kind of thing. It doesn't end well. Yes, you get a faster<br>

> >>>>hash function. And then you find, for sets of data that you or your<br>

> >>>>users actually want to use, that you get collisions like crazy, and<br>

> >>>>much<br>

> >>>>worse overall performance than you started with.<br>

> >>>><br>

> >>>>Sure, it works OK for the sets of data that the designer *tested*, and<br>

> >>>>probably for the sets of data that they *anticipated*. But real-world<br>

> >>>>data is tricky. It includes data sets where the characters that differ<br>

> >>>>are the ones that the hash thinks are unimportant, and there goes your<br>

> >>>>performance, by orders of magnitude. For instance, early versions of<br>

> >>>>Java used a limited number of characters to hash strings. One of the<br>

> >>>>biggest compatibility-breaking changes they were forced to make in<br>

> >>>>later<br>

> >>>>Java versions was to consider *all* characters in hashing. It turned<br>

> >>>>out<br>

> >>>>that it was very common to hash URLs, and many distinct URLs had most<br>

> >>>>of<br>

> >>>>their characters in common.<br>

> >>>><br>

> >>>>And you don't always get to choose your data -- sometimes you have an<br>

> >>>>opponent who is actively looking to create collisions as a<br>

> >>>>denial-of-service attack. There was a fair-sized kerfluffle about this<br>

> >>>>a<br>

> >>>>few years ago -- most web scripting languages made it too easy to mount<br>

> >>>>this kind of attack.<br>

> >>>><br>

> >>><a href="https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/" rel="noreferrer" target="_blank">https://arstechnica.com/<wbr>business/2011/12/huge-<wbr>portions-of-web-vulnerable-to-<wbr>hashing-denial-of-service-<wbr>attack/</a><br>

> >>>>"...an attacker can degenerate the hash table by sending lots of<br>

> >>>>colliding keys. ...making it possible to exhaust hours of CPU time<br>

> >>>>using<br>

> >>>>a single HTTP request."<br>

> >>>>To guard against this kind of attack you need a randomized element in<br>

> >>>>your hash (not a bad idea for Smalltalk, actually, and pretty easy --<br>

> >>>>mixing in the identity hash of the collection might be sufficient) or a<br>

> >>>>cryptographic hash (not worth the computational expense for most<br>

> >>>>purposes). However, even adding a randomized element would not prevent<br>

> >>>>this kind of attack if you predictably completely ignore some<br>

> >>>>characters<br>

> >>>>of the input string. That just makes it *so* easy to generate data that<br>

> >>>>validates, and is not equal, but causes collisions.<br>

> >>>><br>

> >>>>So really, for general-purpose use (i.e. what's built into the<br>

> >>>>language)<br>

> >>>>hash *every* character of *all* strings. If someone finds that this is<br>

> >>>>a<br>

> >>>>performance problem in a real-world situation, it can be addressed in<br>

> >>>>an<br>

> >>>>application-specific way.<br>

> >>><br>

> >>><br>

> >><br>

><br>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><span style="font-size:small;border-collapse:separate"><div>_,,,^..^,,,_<br></div><div>best, Eliot</div></span></div></div></div>

</div></div>