[squeak-dev] The Trunk: Collections-topa.806.mcz

Levente Uzonyi leves at caesar.elte.hu
Thu Sep 13 17:13:45 UTC 2018


On Thu, 13 Sep 2018, Tobias Pape wrote:

>
>> On 13.09.2018, at 16:35, Levente Uzonyi <leves at caesar.elte.hu> wrote:
>> 
>> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
>
> Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.

That list is still incomplete (e.g. zero width space), and you still have 
to deal with the can of worms - aka answering "What is a separator?".

>
>> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
>
> Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
> See the discussion with Ron.
> On a related note, is a very fast #isSeparator important?

Yes, it is. It's used extensively by various parsers. For example, see the 
senders of #isSeparator and #skipSeparators.
Also, consider how the change of behavior affects those methods (along 
with other users, e.g. those methods which use the character sets).
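
To make the parser concern concrete, here is a minimal workspace sketch. It is only an illustration, assuming a stream whose #skipSeparators relies on Character>>#isSeparator; the sample string and the temporary names are made up. What remains on the stream depends on whether #isSeparator treats U+00A0 as a separator in your image.

```smalltalk
"Workspace sketch (hypothetical input): a space, a no-break space, then a token."
| nbsp stream |
nbsp := Character value: 160.
stream := ReadStream on: (String with: Character space with: nbsp), 'token'.

"Skips leading characters for as long as they answer true to #isSeparator."
stream skipSeparators.

"Inspect what is left: either 'token' alone, or the nbsp followed by 'token'."
stream upToEnd
```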

> Otherwise I'd just propose 
>
> 	^ #( 9 10 12 13 32 160 ) includes: self asInteger
> for now…

According to my measurements, that would be 10-15x slower than the 
current implementation. I optimized it for a reason, not just for fun.
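
A comparison of this kind can be sketched in a workspace roughly as follows. This is not the measurement referred to above, just an assumed setup: it inlines the proposed literal-array lookup instead of installing it as a method, and uses BlockClosure>>#bench over all 8-bit characters. Numbers will vary by VM and image.

```smalltalk
"Workspace sketch: compare the existing #isSeparator with the proposed
 literal-array lookup over all 8-bit characters. Illustrative only."
| chars candidates |
chars := (0 to: 255) collect: [:v | Character value: v].
candidates := #(9 10 12 13 32 160).
{
	"Current implementation."
	[chars do: [:c | c isSeparator]] bench.
	"Proposed one-liner, inlined here rather than installed as a method."
	[chars do: [:c | candidates includes: c asInteger]] bench
}
```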

>
> All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.

That's true, but those are inconsistent now.
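
One way to check this in a given image is to list the code points on which the three definitions disagree. The sketch below assumes only Character>>#isSeparator, Character class>>#separators and CharacterSet class>>#separators; the temporary names are made up. An empty result means the definitions agree for all 8-bit characters.

```smalltalk
"Workspace sketch: answer the 8-bit code points on which the different
 separator definitions disagree."
(0 to: 255) select: [:v |
	| c viaTest viaList viaSet |
	c := Character value: v.
	viaTest := c isSeparator.
	viaList := Character separators includes: c.
	viaSet := CharacterSet separators includes: c.
	(viaTest = viaList and: [viaList = viaSet]) not]
```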

Levente

>
>
>
>> 
>> Levente
>> 
>> On Wed, 12 Sep 2018, commits at source.squeak.org wrote:
>> 
>>> Tobias Pape uploaded a new version of Collections to project The Trunk:
>>> http://source.squeak.org/trunk/Collections-topa.806.mcz
>>> 
>>> ==================== Summary ====================
>>> 
>>> Name: Collections-topa.806
>>> Author: topa
>>> Time: 12 September 2018, 3:28:40.687052 pm
>>> UUID: 46b95db5-a773-4113-92f0-5ee905404b49
>>> Ancestors: Collections-cmm.805
>>> 
>>> Fix separators to include U+00A0 (no break space)
>>> 
>>> Thanks Ron!
>>> 
>>> =============== Diff against Collections-cmm.805 ===============
>>> 
>>> Item was changed:
>>> ----- Method: Character class>>separators (in category 'instance creation') -----
>>> separators
>>> + 	"Answer a collection of space-like separator characters.
>>> + 	Note that we do not consider spaces in >8bit code points yet.
>>> + 	"
>>> - 	"Answer a collection of the standard ASCII separator characters."
>>> + 	^ #(9 "tab"
>>> - 	^ #(32 "space"
>>> - 		13 "cr"
>>> - 		9 "tab"
>>> 		10 "line feed"
>>> + 		12 "form feed"
>>> + 		13 "cr"
>>> + 		32 "space"
>>> + 		160 "non-breaking space, see Unicode Z general category")
>>> + 		collect: [:v | Character value: v] as: String
>>> + " To be considered:
>>> + 16r1680 OGHAM SPACE MARK
>>> + 16r2000 EN QUAD
>>> + 16r2001 EM QUAD
>>> + 16r2002 EN SPACE
>>> + 16r2003 EM SPACE
>>> + 16r2004 THREE-PER-EM SPACE
>>> + 16r2005 FOUR-PER-EM SPACE
>>> + 16r2006 SIX-PER-EM SPACE
>>> + 16r2007 FIGURE SPACE
>>> + 16r2008 PUNCTUATION SPACE
>>> + 16r2009 THIN SPACE
>>> + 16r200A HAIR SPACE
>>> + 16r2028 LINE SEPARATOR
>>> + 16r2029 PARAGRAPH SEPARATOR
>>> + 16r202F NARROW NO-BREAK SPACE
>>> + 16r205F MEDIUM MATHEMATICAL SPACE
>>> + 16r3000 IDEOGRAPHIC SPACE
>>> + "!
>>> - 		12 "form feed")
>>> - 		collect: [:v | Character value: v] as: String!
>>> 
>>> Item was changed:
>>> + (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
>>> - (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!
>>

