<div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Thu, 13 Sep 2018 at 12:00, Chris Muller &lt;<a href="mailto:asqueaker@gmail.com">asqueaker@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I think Levente raises very good points, Squeak should present<br>

a consistent implementation of what a separator is.</blockquote><div><br></div><div>That sounds like a category error. It is the _character set_ (Unicode, ASCII, etc.) that defines what a separator is.</div><div><br></div><div>The question should, IMO at least, be "what character set should Squeak use?" and, again IMO, the answer should be Unicode, specifically in the UTF-8 encoding. (<a href="http://utf8everywhere.org/">http://utf8everywhere.org/</a>)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">  I've always<br>

considered hard space and hard page break, etc. as "Word Processor"<br>

characters, since they have "functionality", not merely "separators".<br>

<br>

I think we should give more time for proper consideration, discussion<br>

and full implementation (with consistent behaviors everywhere), and<br>

testing, too.  IMO, this type of change is low-level enough that it<br>

should not be a last-minute change put in merely minutes before the<br>

5.2 release but we should discuss it for the next release.<br></blockquote><div><br></div><div>+1 to this. Even if everyone decided that UTF-8 is the perfect encoding to use, and that we should proceed with alacrity towards using it, now is not the time to start. My impression was that 5.2 was in a feature-freeze, bugfix-only phase.</div><div><br></div><div>frank</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Best,<br>

  Chris<br>

<br>

On Thu, Sep 13, 2018 at 12:13 PM Levente Uzonyi &lt;<a href="mailto:leves@caesar.elte.hu" target="_blank">leves@caesar.elte.hu</a>&gt; wrote:<br>

><br>

> On Thu, 13 Sep 2018, Tobias Pape wrote:<br>

><br>

> ><br>

> >> On 13.09.2018, at 16:35, Levente Uzonyi &lt;<a href="mailto:leves@caesar.elte.hu" target="_blank">leves@caesar.elte.hu</a>&gt; wrote:<br>

> >><br>

> >> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.<br>

> ><br>

> > Yeah, that's listed below in a comment. I am hesitating to add the others because of WideString, so I just put them in a comment.<br>

><br>

> That list is still incomplete (e.g. zero width space), and you still have<br>

> to deal with the can of worms - aka answering "What is a separator?".<br>

><br>

> ><br>

> >> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.<br>

> ><br>

> > Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?<br>

> > See the discussion with Ron.<br>

> > On a related note, is a very fast #isSeparator important?<br>

><br>

> Yes, it is. It's used extensively by various parsers. For example, see the<br>

> senders of #isSeparator and #skipSeparators.<br>

> Also, consider how the change of behavior affects those methods (along<br>

> with other users, e.g. those methods which use the character sets).<br>

><br>

> > Otherwise I'd just propose<br>

> ><br>

> >       ^ #( 9 10 12 13 32 160 ) includes: self asInteger<br>

> > for now…<br>

><br>

> According to my measurements, that would be 10-15x slower than the<br>

> current implementation. I optimized it for a reason, not just for fun.<br>

><br>

> ><br>

> > All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.<br>

><br>

> That's true, but those are inconsistent now.<br>

><br>

> Levente<br>

><br>

> ><br>

> ><br>

> ><br>

> >><br>

> >> Levente<br>

> >><br>

> >> On Wed, 12 Sep 2018, <a href="mailto:commits@source.squeak.org" target="_blank">commits@source.squeak.org</a> wrote:<br>

> >><br>

> >>> Tobias Pape uploaded a new version of Collections to project The Trunk:<br>

> >>> <a href="http://source.squeak.org/trunk/Collections-topa.806.mcz" rel="noreferrer" target="_blank">http://source.squeak.org/trunk/Collections-topa.806.mcz</a><br>

> >>><br>

> >>> ==================== Summary ====================<br>

> >>><br>

> >>> Name: Collections-topa.806<br>

> >>> Author: topa<br>

> >>> Time: 12 September 2018, 3:28:40.687052 pm<br>

> >>> UUID: 46b95db5-a773-4113-92f0-5ee905404b49<br>

> >>> Ancestors: Collections-cmm.805<br>

> >>><br>

> >>> Fix separators to include U+00A0 (no break space)<br>

> >>><br>

> >>> Thanks Ron!<br>

> >>><br>

> >>> =============== Diff against Collections-cmm.805 ===============<br>

> >>><br>

> >>> Item was changed:<br>

> >>> ----- Method: Character class>>separators (in category 'instance creation') -----<br>

> >>> separators<br>

> >>> +   "Answer a collection of space-like separator characters.<br>

> >>> +   Note that we do not consider spaces in >8bit code points yet.<br>

> >>> +   "<br>

> >>> -   "Answer a collection of the standard ASCII separator characters."<br>

> >>> +   ^ #(9 "tab"<br>

> >>> -   ^ #(32 "space"<br>

> >>> -           13 "cr"<br>

> >>> -           9 "tab"<br>

> >>>             10 "line feed"<br>

> >>> +           12 "form feed"<br>

> >>> +           13 "cr"<br>

> >>> +           32 "space"<br>

> >>> +           160 "non-breaking space, see Unicode Z general category")<br>

> >>> +           collect: [:v | Character value: v] as: String<br>

> >>> + " To be considered:<br>

> >>> + 16r1680 OGHAM SPACE MARK<br>

> >>> + 16r2000 EN QUAD<br>

> >>> + 16r2001 EM QUAD<br>

> >>> + 16r2002 EN SPACE<br>

> >>> + 16r2003 EM SPACE<br>

> >>> + 16r2004 THREE-PER-EM SPACE<br>

> >>> + 16r2005 FOUR-PER-EM SPACE<br>

> >>> + 16r2006 SIX-PER-EM SPACE<br>

> >>> + 16r2007 FIGURE SPACE<br>

> >>> + 16r2008 PUNCTUATION SPACE<br>

> >>> + 16r2009 THIN SPACE<br>

> >>> + 16r200A HAIR SPACE<br>

> >>> + 16r2028 LINE SEPARATOR<br>

> >>> + 16r2029 PARAGRAPH SEPARATOR<br>

> >>> + 16r202F NARROW NO-BREAK SPACE<br>

> >>> + 16r205F MEDIUM MATHEMATICAL SPACE<br>

> >>> + 16r3000 IDEOGRAPHIC SPACE<br>

> >>> + "!<br>

> >>> -           12 "form feed")<br>

> >>> -           collect: [:v | Character value: v] as: String!<br>

> >>><br>

> >>> Item was changed:<br>

> >>> + (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!<br>

> >>> - (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!<br>

> >><br>

<br>

</blockquote></div></div></div>