[squeak-dev] The Trunk: Collections-topa.806.mcz

Chris Muller asqueaker at gmail.com
Thu Sep 13 19:00:12 UTC 2018


I think Levente raises very good points: Squeak should present
a consistent implementation of what a separator is.  I've always
considered hard space, hard page break, etc. to be "Word Processor"
characters, since they have "functionality" and are not merely "separators".

I think we should allow more time for proper consideration, discussion,
full implementation (with consistent behaviors everywhere), and
testing, too.  IMO, this type of change is low-level enough that it
should not go in as a last-minute change mere minutes before the
5.2 release; we should discuss it for the next release instead.
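
As a rough idea of the kind of consistency check meant here, a workspace
sketch along the following lines could list the 8-bit code points on which
Character>>#isSeparator and Character class>>#separators disagree.  It is
only an illustration under the assumption that both selectors behave as
described in the thread below; it is not part of the commit, and no result
is claimed for it.

```smalltalk
"Workspace sketch (an illustration, not part of the commit):
 collect the 8-bit code points on which Character>>#isSeparator and
 Character class>>#separators disagree."
| listed |
listed := Character separators.
(0 to: 255) select: [:code | | char |
	char := Character value: code.
	"Keep the code point when the two definitions give different answers."
	char isSeparator ~= (listed includes: char)]
```

An empty result would mean the two definitions agree over that range.  If
Levente's description holds, 160 should show up after this change, but that
is an expectation rather than a measured result.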

Best,
  Chris

On Thu, Sep 13, 2018 at 12:13 PM Levente Uzonyi <leves at caesar.elte.hu> wrote:
>
> On Thu, 13 Sep 2018, Tobias Pape wrote:
>
> >
> >> On 13.09.2018, at 16:35, Levente Uzonyi <leves at caesar.elte.hu> wrote:
> >>
> >> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
> >
> > Yeah, that's listed below in a comment. I am hesitating to add the others because of WideString, so I just put them in a comment.
>
> That list is still incomplete (e.g. zero width space), and you still have
> to deal with the can of worms - aka answering "What is a separator?".
>
> >
> >> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
> >
> > Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
> > See the discussion with Ron.
> > On a related note, is a very fast #isSeparator important?
>
> Yes, it is. It's used extensively by various parsers. For example, see the
> senders of #isSeparator and #skipSeparators.
> Also, consider how the change of behavior affects those methods (along
> with other users, e.g. those methods which use the character sets).
>
> > Otherwise I'd just propose
> >
> >       ^ #( 9 10 12 13 32 160 ) includes: self asInteger
> > for now…
>
> According to my measurements, that would be 10-15x slower than the
> current implementation. I optimized it for a reason, not just for fun.
>
> >
> > All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.
>
> That's true, but those are inconsistent now.
>
> Levente
>
> >
> >
> >
> >>
> >> Levente
> >>
> >> On Wed, 12 Sep 2018, commits at source.squeak.org wrote:
> >>
> >>> Tobias Pape uploaded a new version of Collections to project The Trunk:
> >>> http://source.squeak.org/trunk/Collections-topa.806.mcz
> >>>
> >>> ==================== Summary ====================
> >>>
> >>> Name: Collections-topa.806
> >>> Author: topa
> >>> Time: 12 September 2018, 3:28:40.687052 pm
> >>> UUID: 46b95db5-a773-4113-92f0-5ee905404b49
> >>> Ancestors: Collections-cmm.805
> >>>
> >>> Fix separators to include U+00A0 (no break space)
> >>>
> >>> Thanks Ron!
> >>>
> >>> =============== Diff against Collections-cmm.805 ===============
> >>>
> >>> Item was changed:
> >>> ----- Method: Character class>>separators (in category 'instance creation') -----
> >>> separators
> >>> +   "Answer a collection of space-like separator characters.
> >>> +   Note that we do not consider spaces in >8bit code points yet.
> >>> +   "
> >>> -   "Answer a collection of the standard ASCII separator characters."
> >>> +   ^ #(9 "tab"
> >>> -   ^ #(32 "space"
> >>> -           13 "cr"
> >>> -           9 "tab"
> >>>             10 "line feed"
> >>> +           12 "form feed"
> >>> +           13 "cr"
> >>> +           32 "space"
> >>> +           160 "non-breaking space, see Unicode Z general category")
> >>> +           collect: [:v | Character value: v] as: String
> >>> + " To be considered:
> >>> + 16r1680 OGHAM SPACE MARK
> >>> + 16r2000 EN QUAD
> >>> + 16r2001 EM QUAD
> >>> + 16r2002 EN SPACE
> >>> + 16r2003 EM SPACE
> >>> + 16r2004 THREE-PER-EM SPACE
> >>> + 16r2005 FOUR-PER-EM SPACE
> >>> + 16r2006 SIX-PER-EM SPACE
> >>> + 16r2007 FIGURE SPACE
> >>> + 16r2008 PUNCTUATION SPACE
> >>> + 16r2009 THIN SPACE
> >>> + 16r200A HAIR SPACE
> >>> + 16r2028 LINE SEPARATOR
> >>> + 16r2029 PARAGRAPH SEPARATOR
> >>> + 16r202F NARROW NO-BREAK SPACE
> >>> + 16r205F MEDIUM MATHEMATICAL SPACE
> >>> + 16r3000 IDEOGRAPHIC SPACE
> >>> + "!
> >>> -           12 "form feed")
> >>> -           collect: [:v | Character value: v] as: String!
> >>>
> >>> Item was changed:
> >>> + (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
> >>> - (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!
> >>

