[squeak-dev] When did Scratch diverge?

tim Rowledge tim at rowledge.org
Wed Aug 7 00:19:12 UTC 2013


On 06-08-2013, at 1:25 PM, Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com> wrote:

> Yes, WideString contains Unicode (iso-10646) code points encoded in 32-bit words, so it is like UTF32.

Well that's good news…

> But no, ByteString contains only the first 256 code points of Unicode, that is something like iso-8859-1 or latin 1.

Got it; I was thinking (foolishly) that it could be (ab)used for utf8 encoding
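To make the distinction concrete, here is a small sketch in Python (not Squeak code; Python's codecs merely stand in for the two string classes, which is my assumption for illustration). It shows one byte per code point for Latin-1, variable width for UTF-8, and four bytes per code point for UTF-32:

```python
# Illustration only: Python codecs standing in for ByteString / WideString storage.
text = "caf\u00e9"  # "café", every code point is below 256

latin1 = text.encode("latin-1")    # one byte per code point (ByteString-like)
utf8 = text.encode("utf-8")        # variable width: the accented e needs two bytes
utf32 = text.encode("utf-32-be")   # fixed four bytes per code point (WideString-like)

sizes = (len(latin1), len(utf8), len(utf32))

# A code point above 255 cannot be stored one byte per character.
try:
    "\u20ac".encode("latin-1")     # EURO SIGN, U+20AC
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
```

The same text takes 4, 5 and 16 bytes respectively, and the euro sign does not fit the one-byte form at all.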

> 
> So ByteStrings do not contain UTF8 sequences... Well, except when they temporarily contain such an encoding (see squeakToUtf8 and utf8ToSqueak).

Ah, so somebody else had that idea too, even though temporarily
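A rough model of that temporary state, again in Python rather than Squeak: the two functions below are hypothetical stand-ins named after the selectors, not the actual Squeak implementations. The idea sketched is that the UTF-8 bytes are held in a string whose characters are all below 256, one character per byte:

```python
# Hypothetical stand-ins for the Squeak selectors; the real methods may differ.

def squeak_to_utf8(text: str) -> str:
    """Return a byte-per-character string holding the UTF-8 bytes of text."""
    return text.encode("utf-8").decode("latin-1")

def utf8_to_squeak(byte_string: str) -> str:
    """Reverse step: reinterpret the byte-per-character string as UTF-8."""
    return byte_string.encode("latin-1").decode("utf-8")

original = "caf\u00e9"
as_utf8 = squeak_to_utf8(original)     # now 5 characters, each below 256
round_trip = utf8_to_squeak(as_utf8)   # back to the 4 original code points
```

While in the intermediate form the string is not meaningful text (the accented letter shows up as two Latin-1 characters), which is why it only makes sense as a temporary step.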

> 
> An alternative would be to have the encoding carried by the String itself, either by class (what else would be the encoding of a UTF8String), or through an encoding instance variable. This is what VW did for example. The drawback is that it is necessary to add some VM support for this zoo of Strings, because String speed is vital.

Yes. Though I can handle a single case since, being single, we know what is intended. Scratch needs to have a utf8 form of string since that is how the project files store non-ascii strings. UTF32 only seems to be used as a way of doing a few odd jobs on the way to making utf8 or macRoman strings, though I'm a long way from certain of that. It gets even more mixed up because the Pi doesn't have a 'renderplugin' set, lacking a UnicodePlugin, I think because it has no Pango library or at least not one that gets used to build the unicode plugin. Maybe it should?


> 
> I said canonical unicode, but if you dig a bit, you'll see that this is not something obvious: for example the same accented latin character can be encoded with a single codePoint, or with two codePoints (a compound letter with a code for the accent and another one for the naked letter).

Now you're just saying things to scare me. 
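For the record, the two spellings look like this in a quick Python sketch (using the standard unicodedata module; nothing here is Squeak-specific). The precomposed and the decomposed forms compare unequal until they are normalized to the same form:

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # plain "e" followed by COMBINING ACUTE ACCENT

equal_raw = precomposed == decomposed  # False: different code point sequences

# NFC composes to the single code point, NFD decomposes to base + accent.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

equal_after_nfc = nfc == precomposed
equal_after_nfd = nfd == decomposed
```

So any comparison or lookup on such strings has to pick one normalization form first.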

> 
> Last thing, we have our squeakism: the #leadingChar. I'll let you dig into its usage, but it should be restricted to East Asian language support since Squeak 4.x at least.

Oh boy. More scary stories.

Thanks for explaining...

tim
--
tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
29A, the hexadecimal of the Beast.



