[squeak-dev] The Trunk: Collections-topa.807.mcz

Tobias Pape Das.Linux at gmx.de
Fri Sep 14 08:15:58 UTC 2018


Hi,

I reverted my change.
I understand Levente's point, and as long as we don't consider Unicode's separator categories properly
	(https://www.fileformat.info/info/unicode/category/Zs/list.htm, and maybe 
	https://www.fileformat.info/info/unicode/category/Zl/list.htm
	https://www.fileformat.info/info/unicode/category/Zp/list.htm)
it is preposterous to make an exception for NBSP.
Ron raised a good point, and I thought the fix would be swift; I was wrong tho.
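
For reference, the general categories behind those links can be inspected with a few lines of code. This is only a sketch in Python (not Squeak), using the standard unicodedata module; the helper names are made up for illustration and do not correspond to anything in the Collections package.

```python
import unicodedata

# Code points from the reverted method; the list name is hypothetical.
ASCII_SEPARATORS = [0x20, 0x0D, 0x09, 0x0A, 0x0C]  # space, cr, tab, line feed, form feed
NBSP = 0xA0


def general_category(code_point: int) -> str:
    """Return the two-letter Unicode general category, e.g. 'Zs'."""
    return unicodedata.category(chr(code_point))


def is_unicode_separator(code_point: int) -> bool:
    """True if the code point is in one of the Z categories (Zs, Zl, Zp)."""
    return general_category(code_point) in {"Zs", "Zl", "Zp"}


for cp in ASCII_SEPARATORS + [NBSP, 0x2028, 0x2029, 0x3000]:
    print(f"U+{cp:04X} {general_category(cp)} {is_unicode_separator(cp)}")
```

Note that cr, tab, line feed and form feed are control characters (category Cc), so the Z categories alone would not reproduce the ASCII separator list either.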


(the following does NOT apply to the 5.2 release)

Regarding what others have written, eg, about UTF-8 and such, here is my reasoning.

1. Encoding conversion should not be done from string to string, but rather only
	Encoding: String => ByteArray
	Decoding: ByteArray => String
   (In theory, we could make a class, eg UTF8, that inherits from ByteArray to make some things clear)
2. UTF-8 is a very good idea; the site http://utf8everywhere.org/ raises very good points.
   It is not important for Squeak to internally encode Strings as UTF8, I think, tho it wouldn't hurt.
   The current Byte/Wide distinction, with the property that all values in a string correspond to Unicode
   code points, is nice and even clever. However, sometimes that bites, eg, when you write things on a Stream[1].
3. Regarding the often mentioned importance of constant time access to characters and easy computation 
   of string length:
   This depends heavily on the notion of what a Character is.
   This is an easy thing for ASCII chars, so there's that.
   Also, one could say that "a character is any instance of Character" which is technically correct,
   however, the questions you can ask with that, namely
     - Where is the instance qurxs of Character in this string and
     - How many instances of Character are in this string
   _are_ easy to answer with a 'direct' encoding (eg, ByteString for ASCII or Latin-1, UTF-32/WideString for Unicode, etc.)
   but actually less meaningful than one might think.
   The UTF-8 everywhere page hints in that direction:
	'A programmer might count characters as code units, code points, or grapheme clusters, according to the level of the programmer’s Unicode expertise.'
   A more in-depth discussion can be found at: https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
   (please read it, and if you have time, the follow up https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/)

   I think we need a distinction in String between 
	(a) size aka number of storage entities (eg, number of bytes/words) and
	(b) displaySize/length aka number of Extended Grapheme Clusters (EGC) [2], what users will see when they print the string.
   Also, we need to have a distinction between
	(a) what value is at memory position x in the string and
	(b) what is the x-th grapheme cluster.
   Only (a) can be answered in constant time, anyhow.
   So embracing this, we could also go UTF-8 internally.


4. Yes, we need better font support.
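
The size/length distinction from point 3 can be made concrete with a small sketch. It is written in Python rather than Smalltalk, all function names are hypothetical, and the cluster count is only a crude approximation (it skips combining marks). Real Extended Grapheme Cluster segmentation follows Unicode's UAX #29 and covers many more cases.

```python
import unicodedata


def storage_size(text: str, encoding: str = "utf-8") -> int:
    """(a) Number of storage entities: bytes after encoding."""
    return len(text.encode(encoding))


def code_point_count(text: str) -> int:
    """Number of Unicode code points (what a 'direct' encoding indexes)."""
    return len(text)


def approx_cluster_count(text: str) -> int:
    """Crude stand-in for (b): count code points that are not combining marks.

    This is an assumption-laden approximation, not UAX #29 segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


sample = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(storage_size(sample), code_point_count(sample), approx_cluster_count(sample))
```

For the sample, the three numbers differ: three UTF-8 bytes, two code points, one user-visible unit. For U+FDFD the same functions give three bytes, one code point and one approximate cluster.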


But, yes, not for 5.2

Best regards
	-Tobias

PS: how many characters is this?:  ﷽
	(fun fact: one. one code point, one grapheme cluster…)
PPS: Boy this got long. Sorry.


[1]: I was bitten here: I wrote the same string on a Socket stream and on a File stream, the former retained the 
   internal encoding, which for byte strings happens to be Latin-1, a subset of Unicode; the latter encoded 
   to UTF-8, and I wondered why the network endpoint rejected my string as not-utf-8.
[2]:    Swift and Perl 6 apparently use EGCs
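
The mismatch described in [1], and the String => ByteArray direction from point 1, can be reproduced outside Squeak. The following is a Python sketch only. The function names are invented for illustration and the sample text is arbitrary.

```python
def encode(text: str, encoding: str) -> bytes:
    """Encoding: String => ByteArray."""
    return text.encode(encoding)


def decode(data: bytes, encoding: str) -> str:
    """Decoding: ByteArray => String."""
    return data.decode(encoding)


text = "Grüße"
as_latin1 = encode(text, "latin-1")  # one byte per character
as_utf8 = encode(text, "utf-8")      # 'ü' and 'ß' take two bytes each

try:
    # A UTF-8 endpoint receiving the Latin-1 bytes cannot decode them.
    decode(as_latin1, "utf-8")
except UnicodeDecodeError as err:
    print("rejected as not UTF-8:", err.reason)
```

Keeping the two directions as separate operations on distinct types makes it harder to hand raw internal bytes to a consumer that expects a specific encoding.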


> On 14.09.2018, at 09:38, commits at source.squeak.org wrote:
> 
> Tobias Pape uploaded a new version of Collections to project The Trunk:
> http://source.squeak.org/trunk/Collections-topa.807.mcz
> 
> ==================== Summary ====================
> 
> Name: Collections-topa.807
> Author: topa
> Time: 14 September 2018, 9:37:43.484317 am
> UUID: fae1c8b3-8396-4790-a491-4e51b047bc49
> Ancestors: Collections-topa.806
> 
> Revert for consistency and, subsequently, speed.
> 
> The correct fix is not as trivial and not fit in the beta phase.
> 
> Sorry, Ron.
> 
> =============== Diff against Collections-topa.806 ===============
> 
> Item was changed:
>  ----- Method: Character class>>separators (in category 'instance creation') -----
>  separators
> + 	"Answer a collection of the standard ASCII separator characters."
> - 	"Answer a collection of space-like separator characters.
> - 	Note that we do not consider spaces in >8bit code points yet.
> - 	"
> 
> + 	^ #(32 "space"
> - 	^ #(9 "tab"
> - 		10 "line feed"
> - 		12 "form feed"
>  		13 "cr"
> + 		9 "tab"
> + 		10 "line feed"
> + 		12 "form feed")
> + 		collect: [:v | Character value: v] as: String!
> - 		32 "space"
> - 		160 "non-breaking space, see Unicode Z general category")
> - 		collect: [:v | Character value: v] as: String
> - " To be considered:
> - 16r1680 OGHAM SPACE MARK
> - 16r2000 EN QUAD
> - 16r2001 EM QUAD
> - 16r2002 EN SPACE
> - 16r2003 EM SPACE
> - 16r2004 THREE-PER-EM SPACE
> - 16r2005 FOUR-PER-EM SPACE
> - 16r2006 SIX-PER-EM SPACE
> - 16r2007 FIGURE SPACE
> - 16r2008 PUNCTUATION SPACE
> - 16r2009 THIN SPACE
> - 16r200A HAIR SPACE
> - 16r2028 LINE SEPARATOR
> - 16r2029 PARAGRAPH SEPARATOR
> - 16r202F NARROW NO-BREAK SPACE
> - 16r205F MEDIUM MATHEMATICAL SPACE
> - 16r3000 IDEOGRAPHIC SPACE
> - "!
> 
> 


