[squeak-dev] The Trunk: Collections-topa.806.mcz

Levente Uzonyi leves at caesar.elte.hu
Thu Sep 13 22:38:12 UTC 2018


On Thu, 13 Sep 2018, Frank Shearar wrote:

> On Thu, 13 Sep 2018 at 12:00, Chris Muller <asqueaker at gmail.com> wrote:
>       I think Levente raises very good points, Squeak should present
>       a consistent implementation of what a separator is.
> 
> 
> That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.
> 
> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

My impression is that UTF-8 is slightly better in some respects and 
slightly worse in others than the current UTF-32 (+leading char extension) 
representation. So I don't find it very tempting to make a huge change 
just to get something "different".
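
To make that trade-off concrete, here is a rough workspace sketch, not a measurement. It assumes String>>squeakToUtf8 is available in the image; the variable names are made up for illustration.

```smalltalk
"Workspace sketch: select everything and 'do it', then 'print it' on the last three lines."
| wide utf8 |
wide := WideString new: 3.
wide at: 1 put: $a.
wide at: 2 put: (Character value: 16r20AC).  "EURO SIGN"
wide at: 3 put: $b.
utf8 := wide squeakToUtf8.  "assumed selector, see the note above"

wide size.      "counts characters; each one occupies a 32-bit slot"
wide at: 3.     "direct index, no scanning needed"
utf8 size.      "counts bytes; the euro sign alone takes three in UTF-8"
```

Fixed-width storage keeps #at: cheap but spends four bytes on every character of a wide string. UTF-8 is compact for mostly-ASCII text but gives up constant-time indexing by character.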

Levente

>  
>       I've always considered hard space and hard page break, etc. as
>       "Word Processor" characters, since they have "functionality" and
>       are not merely "separators".
>
>       I think we should give more time for proper consideration, discussion
>       and full implementation (with consistent behaviors everywhere), and
>       testing, too.  IMO, this type of change is low-level enough that it
>       should not be a last-minute change put in merely minutes before the
>       5.2 release but we should discuss it for the next release.
> 
> 
> +1 to this. Even if everyone decided that UTF-8 is the perfect encoding to use, and we should proceed with alacrity towards using it, now is not the time to start. My impression was that 5.2 was in a feature freeze, bugfix only phase.
> 
> frank
>
>       Best,
>         Chris
>
>       On Thu, Sep 13, 2018 at 12:13 PM Levente Uzonyi <leves at caesar.elte.hu> wrote:
>       >
>       > On Thu, 13 Sep 2018, Tobias Pape wrote:
>       >
>       > >
>       > >> On 13.09.2018, at 16:35, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>       > >>
>       > >> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
>       > >
>       > > Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.
>       >
>       > That list is still incomplete (e.g. zero width space), and you still have
>       > to deal with the can of worms - aka answering "What is a separator?".
>       >
>       > >
>       > >> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
>       > >
>       > > Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
>       > > See the discussion with Ron.
>       > > On a related note, is a very fast #isSeparator important?
>       >
>       > Yes, it is. It's used extensively by various parsers. For example, see the
>       > senders of #isSeparator and #skipSeparators.
>       > Also, consider how the change of behavior affects those methods (along
>       > with other users, e.g. those methods which use the character sets).
>       >
>       > > Otherwise I'd just propose
>       > >
>       > >       ^ #( 9 10 12 13 32 160 ) includes: self asInteger
>       > > for now…
>       >
>       > According to my measurements, that would be 10-15x slower than the
>       > current implementation. I optimized it for a reason, not just for fun.
>       >
>       > >
>       > > All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.
>       >
>       > That's true, but those are inconsistent now.
>       >
>       > Levente
>       >
>       > >
>       > >
>       > >
>       > >>
>       > >> Levente
>       > >>
>       > >> On Wed, 12 Sep 2018, commits at source.squeak.org wrote:
>       > >>
>       > >>> Tobias Pape uploaded a new version of Collections to project The Trunk:
>       > >>> http://source.squeak.org/trunk/Collections-topa.806.mcz
>       > >>>
>       > >>> ==================== Summary ====================
>       > >>>
>       > >>> Name: Collections-topa.806
>       > >>> Author: topa
>       > >>> Time: 12 September 2018, 3:28:40.687052 pm
>       > >>> UUID: 46b95db5-a773-4113-92f0-5ee905404b49
>       > >>> Ancestors: Collections-cmm.805
>       > >>>
>       > >>> Fix separators to include U+00A0 (no break space)
>       > >>>
>       > >>> Thanks Ron!
>       > >>>
>       > >>> =============== Diff against Collections-cmm.805 ===============
>       > >>>
>       > >>> Item was changed:
>       > >>> ----- Method: Character class>>separators (in category 'instance creation') -----
>       > >>> separators
>       > >>> +   "Answer a collection of space-like separator characters.
>       > >>> +   Note that we do not consider spaces in >8bit code points yet.
>       > >>> +   "
>       > >>> -   "Answer a collection of the standard ASCII separator characters."
>       > >>> +   ^ #(9 "tab"
>       > >>> -   ^ #(32 "space"
>       > >>> -           13 "cr"
>       > >>> -           9 "tab"
>       > >>>             10 "line feed"
>       > >>> +           12 "form feed"
>       > >>> +           13 "cr"
>       > >>> +           32 "space"
>       > >>> +           160 "non-breaking space, see Unicode Z general category")
>       > >>> +           collect: [:v | Character value: v] as: String
>       > >>> + " To be considered:
>       > >>> + 16r1680 OGHAM SPACE MARK
>       > >>> + 16r2000 EN QUAD
>       > >>> + 16r2001 EM QUAD
>       > >>> + 16r2002 EN SPACE
>       > >>> + 16r2003 EM SPACE
>       > >>> + 16r2004 THREE-PER-EM SPACE
>       > >>> + 16r2005 FOUR-PER-EM SPACE
>       > >>> + 16r2006 SIX-PER-EM SPACE
>       > >>> + 16r2007 FIGURE SPACE
>       > >>> + 16r2008 PUNCTUATION SPACE
>       > >>> + 16r2009 THIN SPACE
>       > >>> + 16r200A HAIR SPACE
>       > >>> + 16r2028 LINE SEPARATOR
>       > >>> + 16r2029 PARAGRAPH SEPARATOR
>       > >>> + 16r202F NARROW NO-BREAK SPACE
>       > >>> + 16r205F MEDIUM MATHEMATICAL SPACE
>       > >>> + 16r3000 IDEOGRAPHIC SPACE
>       > >>> + "!
>       > >>> -           12 "form feed")
>       > >>> -           collect: [:v | Character value: v] as: String!
>       > >>>
>       > >>> Item was changed:
>       > >>> + (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
>       > >>> - (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!
>       > >>
> 
> 
>

