<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">2014-05-28 4:50 GMT+02:00 Yoshiki Ohshima <span dir="ltr">&lt;<a href="mailto:Yoshiki.Ohshima@acm.org" target="_blank">Yoshiki.Ohshima@acm.org</a>&gt;</span>:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">At Tue, 27 May 2014 19:23:09 -0700,<br>

<div class="">Andres Valloud wrote:<br>

&gt;<br>

&gt; String encoding is perpendicular to my point.  I&#39;m referring to<br>

&gt; canonical equivalence as defined in section 1.1 of the document<br>

&gt; referenced by the URL I sent.  For instance, the Hangul example in the<br>

&gt; first table shows that a combination of two characters (regardless of<br>

&gt; encoding) is to be considered canonically equivalent to a single<br>

&gt; character.  From the document (which claims to be Unicode Standard Annex<br>

&gt; #15),<br>

&gt;<br>

&gt; &quot;Canonical equivalence is a fundamental equivalency between characters<br>

&gt; or sequences of characters that represent the same abstract character,<br>

&gt; and when correctly displayed should always have the same visual<br>

&gt; appearance and behavior.&quot;<br>

&gt;<br>

&gt; How do you propose that a size check is appropriate in the presence of<br>

&gt; canonical equivalence?  What is string equivalence supposed to mean?  I<br>

&gt; think more attention should be given to those questions.<br>

<br>

</div>I think that the single equal message (=) in the Smalltalk language<br>

should not really worry about canonical equivalence.  For those who<br>

need it, it&#39;d be fine to define a new selector that does the real<br>

stuff, and such a method could track the Unicode standard revisions and<br>

do the right thing.  But something as fundamental as String&gt;&gt;#= does<br>

not have to have a dependency on the external standard.<br>

<span class="HOEnZb"><font color="#888888"><br>

-- Yoshiki<br>

<br>

</font></span></blockquote></div><br>If the internal representation is not canonical, we are heading down a path of maximum complexity.<br></div><div class="gmail_extra">All comparison functions (= &lt; &gt; &lt;= &gt;=) and hash will have to canonicalize first.<br>
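<br>To make the cost concrete, here is a minimal sketch in Python (not Smalltalk, only because its standard library ships Unicode normalization). It uses the Hangul pair from the UAX #15 table; the helper names canonical_equal and canonical_hash are hypothetical, and choosing NFC as the canonical form is an assumption.<br>

```python
import unicodedata

# Hangul syllable GA as a single precomposed code point (U+AC00).
precomposed = "\uAC00"
# The same syllable as two conjoining jamo (U+1100 followed by U+1161).
decomposed = "\u1100\u1161"


def canonical_equal(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def canonical_hash(s):
    """Hypothetical helper: hash the NFC form so that strings which are
    canonical_equal also hash alike."""
    return hash(unicodedata.normalize("NFC", s))


# Plain code point comparison, which is what a "dumb" = does.
plain_equal = precomposed == decomposed
# The sizes differ, so a size check would reject the pair up front.
sizes = (len(precomposed), len(decomposed))
# Comparison that canonicalizes both operands first.
canon_equal = canonical_equal(precomposed, decomposed)
```

Here plain_equal is False and sizes is (1, 2), while canon_equal is True: a size check is only a valid shortcut for the plain comparison, and every equality or hash send pays for a normalization pass unless the strings are stored in canonical form.<br>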

</div><div class="gmail_extra">So I tend to agree with Yoshiki: let these kernel methods perform their dumb task, and push this complexity outside.<br><br></div><div class="gmail_extra">Well, beyond the complexity of Unicode, the cr-lf mess already creates the same problem.<br>

There is no semantic difference between cr and cr-lf.<br>Yet I had to insert a few withSqueakLineEndings sends in Monticello when playing with GitFileTree.<br></div></div>
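<div class="gmail_extra"><br>The line-ending case can be sketched the same way, again in Python. The helper with_cr_line_endings is hypothetical; it assumes that the point of withSqueakLineEndings is to map cr-lf and bare lf to a bare cr, and it is not a port of the Squeak method.<br>

```python
def with_cr_line_endings(text):
    """Hypothetical helper: map CR-LF and bare LF to a bare CR."""
    # Replace CR-LF first so the LF of a pair is not converted a second time.
    return text.replace("\r\n", "\r").replace("\n", "\r")


cr_text = "first\rsecond"
crlf_text = "first\r\nsecond"
lf_text = "first\nsecond"

# A plain comparison sees different character sequences.
plain_line_equal = cr_text == crlf_text

# After normalizing, all three variants compare equal.
normalized = [with_cr_line_endings(t) for t in (cr_text, crlf_text, lf_text)]
normalized_line_equal = normalized[0] == normalized[1] == normalized[2]
```

As with Unicode, the plain comparison stays simple and fails on the mixed inputs, and the caller that cares about the equivalence normalizes before comparing.<br></div>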