Re: [squeak-dev] The Trunk: Collections-topa.806.mcz

13 Sep 2018

I think Levente raises very good points: Squeak should present a
consistent implementation of what a separator is.  I've always
considered hard space, hard page break, etc. to be "Word Processor"
characters, since they carry "functionality" rather than being mere
"separators".  I think we should allow more time for proper
consideration, discussion, full implementation (with consistent
behavior everywhere), and testing.  IMO, this type of change is
low-level enough that it should not go in mere minutes before the
5.2 release; we should discuss it for the next release instead.
Best,
  Chris
On Thu, Sep 13, 2018 at 12:13 PM Levente Uzonyi <leves@caesar.elte.hu> wrote:
...
On Thu, 13 Sep 2018, Tobias Pape wrote:
...
...
On 13.09.2018, at 16:35, Levente Uzonyi <leves@caesar.elte.hu> wrote:
You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.
That list is still incomplete (e.g. zero width space), and you still have
to deal with the can of worms - aka answering "What is a separator?".
...
...
Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
See the discussion with Ron.
On a related note, is a very fast #isSeparator important?
Yes, it is. It's used extensively by various parsers. For example, see the
senders of #isSeparator and #skipSeparators.
Also, consider how the change of behavior affects those methods (along
with other users, e.g. those methods which use the character sets).
...
Otherwise I'd just propose
  ^ #( 9 10 12 13 32 160 ) includes: self asInteger

for now…
According to my measurements, that would be 10-15x slower than the
current implementation. I optimized it for a reason, not just for fun.
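A rough way to check a claim like this in one's own image is sketched below. It is a workspace snippet, not the measurement referred to above: the `proposed` block is a hypothetical stand-in for the suggested method body, and calling it through a block adds overhead a real method would not have, so the numbers are only indicative.

```smalltalk
"Workspace sketch: compare the existing Character>>#isSeparator with the
 proposed literal-array lookup. Results vary with VM, image and machine."
| chars proposed |
chars := (0 to: 255) collect: [:i | Character value: i].

"Hypothetical stand-in for the proposed method body."
proposed := [:c | #(9 10 12 13 32 160) includes: c asInteger].

"Each #bench answers a string describing how often the block ran."
{ [chars do: [:c | c isSeparator]] bench.
  [chars do: [:c | proposed value: c]] bench }
```

Inspecting or printing the resulting array shows both rates side by side.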
...
All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.
That's true, but those are inconsistent now.
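The inconsistency can be made visible with a short workspace check. This is only a sketch, limited to the 8-bit range and to the three selectors named in this thread; the variable names are made up for illustration.

```smalltalk
"Workspace sketch: find 8-bit characters on which the separator definitions disagree."
| all viaMethod viaList viaSet |
all := (0 to: 255) collect: [:i | Character value: i].

viaMethod := (all select: [:c | c isSeparator]) asSet.
viaList := Character separators asSet.
viaSet := (all select: [:c | CharacterSet separators includes: c]) asSet.

"Answers: does the list agree with #isSeparator, does it agree with the
 character set, and which listed characters does #isSeparator reject."
{ viaMethod = viaList.
  viaSet = viaList.
  (viaList reject: [:c | viaMethod includes: c]) asArray }
```

If all three definitions agree, the first two elements are true and the last is an empty array.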
Levente
...
...
Levente
On Wed, 12 Sep 2018, commits@source.squeak.org wrote:
...
Tobias Pape uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-topa.806.mcz
==================== Summary ====================
Name: Collections-topa.806
Author: topa
Time: 12 September 2018, 3:28:40.687052 pm
UUID: 46b95db5-a773-4113-92f0-5ee905404b49
Ancestors: Collections-cmm.805
Fix separators to include U+00A0 (no break space)
Thanks Ron!
=============== Diff against Collections-cmm.805 ===============
Item was changed:
  ----- Method: Character class>>separators (in category 'instance creation') -----
  separators
+     "Answer a collection of space-like separator characters.
+     Note that we do not consider spaces in >8bit code points yet.
+     "
-     "Answer a collection of the standard ASCII separator characters."

+     ^ #(9 "tab"
-     ^ #(32 "space"
-         13 "cr"
-         9 "tab"
          10 "line feed"
+         12 "form feed"
+         13 "cr"
+         32 "space"
+         160 "non-breaking space, see Unicode Z general category")
+             collect: [:v | Character value: v] as: String
+ "   To be considered:
+     16r1680 OGHAM SPACE MARK
+     16r2000 EN QUAD
+     16r2001 EM QUAD
+     16r2002 EN SPACE
+     16r2003 EM SPACE
+     16r2004 THREE-PER-EM SPACE
+     16r2005 FOUR-PER-EM SPACE
+     16r2006 SIX-PER-EM SPACE
+     16r2007 FIGURE SPACE
+     16r2008 PUNCTUATION SPACE
+     16r2009 THIN SPACE
+     16r200A HAIR SPACE
+     16r2028 LINE SEPARATOR
+     16r2029 PARAGRAPH SEPARATOR
+     16r202F NARROW NO-BREAK SPACE
+     16r205F MEDIUM MATHEMATICAL SPACE
+     16r3000 IDEOGRAPHIC SPACE
+ "!
-         12 "form feed")
-             collect: [:v | Character value: v] as: String!

Item was changed:
+ (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
- (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!