[squeak-dev] String >> utf8Encoded, ByteArray >> utf8Decoded

Tobias Pape Das.Linux at gmx.de
Sun Jan 28 18:40:18 UTC 2018


Hi all.

First, I think Tony's idea is good in terms of usability.

Second:

> On 28.01.2018, at 19:06, tim Rowledge <tim at rowledge.org> wrote:
> 
> What would be so much better is a proper UTF8String class.
> 
> One of the problems of course is that doing almost anything to a utf encoded pseudostring requires complex faffing around to decode some or all of it. This makes them pretty much useless for anything outside passing to external libraries, at least so far as I have found. However, that turns out to be a quite important thing, and right now we have a horrible mess.

I think our ByteString/WideString is already pretty good (we have complete Unicode coverage and whatnot). If we have to improve, let's first have a look at the conceptual things.
Please please have a look at:

	https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

When we have an improved String, it should

 - Report size in terms of extended grapheme clusters.
 - Never Ever Expose UTF-8 bytes to users
 - Always have UTF-8 encodings result in ByteArrays.
   (because, well, a UTF-8-encoded thing is no longer a string, it's just encoded byte data)
 - Do normalization correctly…
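
To make the list above concrete, here is a rough sketch. It is in Python rather than Squeak, only because Python's str/bytes split is close to the String/ByteArray split I am arguing for; the helper name describe is made up for illustration. Note that Python's len counts code points, not extended grapheme clusters, so it shows the problem rather than the solution.

```python
import unicodedata

# Two spellings of the same user-perceived character "é".
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE


def describe(text):
    """Hypothetical helper: contrast code point count with UTF-8 byte count."""
    encoded = text.encode("utf-8")  # encoding yields bytes, not another string
    return {
        "code_points": len(text),
        "utf8_bytes": len(encoded),
        "encoded_type": type(encoded).__name__,
    }


# One grapheme for the reader, but two code points and three bytes.
print(describe(decomposed))
# The precomposed form: one code point, two bytes.
print(describe(composed))

# Without normalization the two spellings compare as different strings.
print(decomposed == composed)
# After NFC normalization they are identical.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Neither count matches what a user would call "one character" in the decomposed case, which is why size should be reported in grapheme clusters and why comparison needs normalization.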

that's my 2ct
Best regards
	-Tobias

> 
> We do, however, have a sorta-kinda model of a way to handle it in FilePath, which actually has nothing much to do with file paths at all. FilePath is a bit more general than a UTF-* string and maybe that is still a valuable option.
> 
> I see two basic options for making improvements
> 
> a) a simplistic UTF8String that is a byte array of the UTF-8 bytes and does nothing much except exist as an encoded string to pass to primitives. #size returns the number of bytes, #at: & #at:put: are not for general consumption, and to do any sort of editing you have to convert it to a real String.
> 
> b) something like FilePath, with both the original and encoded version kept, automagic conversions and some interesting hand-waving to deal with #size (is it the number of characters, or the number of bytes?) etc.
> 
> tim
> --
> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
> Useful random insult:- His future is behind schedule.
> 
> 
> 


