<div dir="ltr">Hi Andres,<br><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, May 27, 2014 at 6:10 PM, Andres Valloud <span dir="ltr">&lt;<a href="mailto:avalloud@smalltalk.comcastbiz.net" target="_blank">avalloud@smalltalk.comcastbiz.net</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">What is going to happen when one compares two general Unicode series of characters that represent the same string but differ in normalization? Wouldn&#39;t the size test result in false negatives?<br>


<br>

<a href="http://unicode.org/reports/tr15/" target="_blank">http://unicode.org/reports/tr15/</a><br>

<br>

I&#39;m asking because I haven&#39;t seen any discussion on the subject, and the decision to change the code as proposed could have side effects.</blockquote><div><br></div><div>The issue is whether String supports variable-sized encodings such as UTF-8, where there is no fixed relationship between the number of bytes in a string and the number of characters in a string.  Right now we have ByteString and WideString.  ByteString has 1 byte per character.  WideString has 4 bytes per character.  So &#39;hello&#39; asByteString contains 5 bytes and has size 5, but &#39;hello&#39; asWideString contains 20 bytes and also has size 5.  Hence the size check is fine, since size answers the number of characters, not the number of bytes.  If we were to add a UTF8String whose size answered the number of bytes, we&#39;d have to delete the size check.  But I think for now we&#39;re not going to do that.</div>
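<div><br></div><div>As a rough illustration of the bytes-versus-characters point, here is a sketch in Python rather than Squeak. The encodings are only stand-ins (an assumption for illustration): latin-1 for ByteString storage, utf-32-le for WideString storage. None of the names below exist in the image.</div>

```python
# Illustration only: a Python str stands in for a Squeak String,
# and the encodings stand in for ByteString / WideString storage.
text = "hello"

byte_storage = text.encode("latin-1")    # 1 byte per character, like ByteString
wide_storage = text.encode("utf-32-le")  # 4 bytes per character, like WideString

char_count = len(text)          # number of characters
byte_count = len(byte_storage)  # number of bytes in the 1-byte form
wide_count = len(wide_storage)  # number of bytes in the 4-byte form

# With a variable-width encoding the byte count no longer tracks
# the character count once a non-ASCII character appears.
accented = "h\u00e9llo"                        # 5 characters
utf8_count = len(accented.encode("utf-8"))     # e-acute takes 2 bytes
```

<div>Both fixed-width forms agree on the character count, which is why comparing sizes is safe for them. Only the UTF-8 byte count drifts away from it.</div>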

<div><br></div><div>A ByteString can contain the bytes of a UTF-8 encoded string (see UTF8TextConverter), but that&#39;s a convention of usage.  If you print a ByteString containing the UTF-8 encoding of a string whose characters take more than one byte to encode, that string won&#39;t print as the input; it&#39;ll print treating each byte as a character, and so will scramble the string.  It is up to the user to handle these ByteStrings that happen by convention to contain UTF-8 correctly.</div>
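<div><br></div><div>The scrambling can be sketched like this, again in Python as a stand-in. Decoding with latin-1 is my assumption for what "treating each byte as a character" amounts to; it is not how Squeak implements printing.</div>

```python
# Illustration only: latin-1 decoding mimics "one byte, one character".
original = "na\u00efve"                  # 5 characters, one outside ASCII
utf8_bytes = original.encode("utf-8")    # the i-diaeresis becomes 2 bytes

# Printing byte by byte: every byte is shown as its own character.
scrambled = utf8_bytes.decode("latin-1")

# Honouring the convention: decode the same bytes as UTF-8.
recovered = utf8_bytes.decode("utf-8")
```

<div>The scrambled form is one character longer than the input and shows two unrelated characters where the accented one was. The bytes themselves are untouched; only the interpretation differs.</div>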

<div><br></div><div>Note that there is nothing to stop us adding a UTF8String *provided* that class implements size to answer the number of characters, not the number of bytes.  My understanding is that VW takes this approach also.  File streams expose the encoding, since position is a byte position, not a character position, and so it is up to the file stream client to cope with the positioning complexities that this introduces, not the stream.</div>
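<div><br></div><div>To make the positioning complexity concrete, here is a small sketch, once more in Python. An in-memory io.BytesIO stands in for a file stream (an assumption for illustration), and the client computes a character-aligned byte offset itself.</div>

```python
import io

# Illustration only: BytesIO stands in for a byte-positioned file stream.
data = "h\u00e9llo".encode("utf-8")   # 6 bytes for 5 characters
stream = io.BytesIO(data)

# Byte position 2 falls in the middle of the 2-byte e-acute.
stream.seek(2)
tail = stream.read()
try:
    tail.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# The client has to translate "skip 2 characters" into a byte offset.
byte_offset = len("h\u00e9".encode("utf-8"))
stream.seek(byte_offset)
rest = stream.read().decode("utf-8")
```

<div>Seeking to an arbitrary byte position can land inside a character, so the client, not the stream, has to know where the character boundaries are.</div>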

<div><br></div><div><br></div><div>OK?</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class=""><br>

<br>

On 5/27/14 11:59 , Eliot Miranda wrote:<br>

</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">

<br>

<br>

<br>

On Tue, May 27, 2014 at 6:54 AM, J. Vuletich (mail lists)<br></div>

&lt;<a href="mailto:juanlists@jvuletich.org" target="_blank">juanlists@jvuletich.org</a> &lt;mailto:<a href="mailto:juanlists@jvuletich.org" target="_blank">juanlists@jvuletich.org</a>&gt;&gt; wrote:<br>

<br>

    __<br>

<br>

    Quoting Eliot Miranda &lt;<a href="mailto:eliot.miranda@gmail.com" target="_blank">eliot.miranda@gmail.com</a><br>

    &lt;mailto:<a href="mailto:eliot.miranda@gmail.com" target="_blank">eliot.miranda@gmail.com</a>&gt;&gt;:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">

    Hi Philippe,<br>

<br>

<br>

    On Mon, May 26, 2014 at 12:51 AM, Philippe Marschall<br>

    &lt;<a href="mailto:philippe.marschall@netcetera.ch" target="_blank">philippe.marschall@netcetera.ch</a><br></div><div><div class="h5">

    &lt;mailto:<a href="mailto:philippe.marschall@netcetera.ch" target="_blank">philippe.marschall@netcetera.ch</a>&gt;&gt; wrote:<br>

<br>

        Hi<br>

<br>

        I have been investigating why Dictionary look up performance<br>

        with String keys is not as good as I would have expected. Something<br>

        I noted is that String &gt;&gt; #= is implemented in terms of<br>

        #compare:with:collated:. There is no short circuit if Strings<br>

        are not the same size. In my case some Strings have the same<br>

        prefix but a different length eg &#39;Content-Type&#39; and<br>

        &#39;Content-Length&#39;. In that case a #compare:with:collated: is<br>

        performed even though we know in advance the answer will be<br>

        false because they have different sizes.<br>

<br>

    Why not rewrite<br>

    String&gt;&gt;= aString<br>

    &quot;Answer whether the receiver sorts equally as aString.<br>

    The collation order is simple ascii (with case differences).&quot;<br>

    aString isString ifFalse: [ ^ false ].<br>

    ^ (self compare: self with: aString collated: AsciiOrder) = 2<br>

    as<br>

    String&gt;&gt;= aString<br>

    &quot;Answer whether the receiver sorts equally as aString.<br>

    The collation order is simple ascii (with case differences).&quot;<br>

    (aString isString<br>

    and: [self size = aString size]) ifFalse: [^false].<br>

    ^ (self compare: self with: aString collated:<br>

    AsciiOrder) = 2<br>

    ?<br>

</div></div></blockquote><div><div class="h5">

<br>

<br>

This makes a huge difference, over 3 times faster:<br>

<br>

| bs t1 t2 |<br>

bs := ByteString allInstances first: 10000.<br>

t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.<br>

(FileStream fileNamed: &#39;/Users/eliot/Squeak/Squeak4.5/String-=.st&#39;) fileIn.<br>

t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.<br>

{ t1. t2 } #(13726 4467)<br>

4467 - 13726 / 137.26 -67.46%<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

    One /could/ add a replacement compare:with:collated:<br>

    primitive primitiveCompareString which took the sizes as arguments<br>

    to avoid asking twice.  But it wouldn&#39;t be safe.  One could abuse<br>

    the primitive and lie about the size.  So I suspect it is best to<br>

    add the size check to String&gt;&gt;#= and accept the duplication of<br>

    the primitive finding the sizes of the two strings.  The cost in<br>

    the primitive is minimal.  A WideString version of the primitive<br>

    might pay its way, but if Spur and Sista arrive soon the primitive<br>

    shouldn&#39;t be faster than the optimised Smalltalk code.<br>

    --<br>

    best,<br>

    Eliot<br>

</blockquote>

<br>

    BTW, any good reason for not prefixing all the implementors of #=<br>

    with this?<br>

<br>

    &quot;Any object is equal to itself&quot;<br>

    self == argument ifTrue: [ ^ true ].<br>

<br>

<br>

It doesn&#39;t make much difference:<br>

<br>

| bs t1 t2 |<br>

bs := ByteString allInstances first: 10000.<br>

t1 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.<br>

(FileStream fileNamed: &#39;/Users/eliot/Squeak/Squeak4.5/String-=.st&#39;) fileIn.<br>

t2 := [bs do: [:a| bs do: [:b| a = b]]] timeToRun.<br>

{ t1. t2 } #(4628 4560)<br>

<br>

4560 - 4628 / 46.28 -1.47%<br>

<br>

So is it worth it?  If you feel it is I&#39;ve no objection other than it<br>

feels a little kludgey for such little benefit.  And there are the<br>

Symbols if one needs quick comparison and can bear the cost of slow<br>

interning.<br>

--<br>

best,<br>

Eliot<br>

</div></div></blockquote>

<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>best,<div>Eliot</div>

</div></div>