[squeak-dev] When did Scratch diverge?

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Tue Aug 6 20:25:17 UTC 2013


Yes, WideString contains Unicode (ISO 10646) code points encoded in 32-bit
words, so it is like UTF-32.
But no, ByteString contains only the first 256 code points of Unicode, that
is, something like ISO-8859-1 (Latin-1).
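
To make the size difference concrete, here is a small Python sketch. It is
only an analogy: Python's "latin-1" and "utf-32-be" codecs stand in for the
ByteString and WideString storage layouts, and fits_in_byte_string is a
hypothetical helper name, not a Squeak selector.

```python
# Python stand-ins for the two storage layouts (an analogy, not Squeak code).
text = "été"  # every code point here is below 256

latin1 = text.encode("latin-1")   # one byte per code point, like ByteString
utf32 = text.encode("utf-32-be")  # four bytes per code point, like WideString
utf8 = text.encode("utf-8")       # variable width: each 'é' takes two bytes

latin1_len = len(latin1)
utf32_len = len(utf32)
utf8_len = len(utf8)

# In Latin-1 every byte value is exactly the Unicode code point.
bytes_equal_code_points = list(latin1) == [ord(c) for c in text]


def fits_in_byte_string(s):
    """Hypothetical check: can s be stored one byte per character?"""
    return all(ord(c) < 256 for c in s)
```

The UTF-8 form is longer than the Latin-1 form as soon as a code point
above 127 appears, which is why the two must not be confused.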

So a ByteString does not contain UTF-8 sequences... Well, except when it
temporarily holds such an encoding (see squeakToUtf8 and utf8ToSqueak).
It is not a good thing that the correct interpretation of a String depends
on some state held somewhere in the image...
If we don't know for sure how to interpret the codes composing a String,
this just makes the String useless: we can't compare it, display it, etc.
In other words, it has no more value than a raw sequence of bytes,
like a ByteArray.
For this reason we would prefer to have encoded strings (other than
canonical Unicode - see further) explicitly represented as a ByteArray (I
very much like the UninterpretedBytes variant from VW, a very telling name).
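
The ambiguity is easy to reproduce outside Squeak. The Python sketch below
(an analogy only, with illustrative variable names) decodes the same two
bytes under two assumptions and gets two different strings:

```python
# The same raw bytes, read under two different encoding assumptions.
raw = "é".encode("utf-8")          # two bytes: 0xC3 0xA9

as_utf8 = raw.decode("utf-8")      # one character: 'é'
as_latin1 = raw.decode("latin-1")  # two characters: 'Ã' and '©'

# Without knowing the encoding, comparing the results is meaningless.
same_bytes_same_text = (as_utf8 == as_latin1)
```

Nothing in the bytes themselves says which reading is the right one, which
is the argument for keeping them in something explicitly uninterpreted.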

An alternative would be to have the encoding carried by the String itself,
either by class (what else would the encoding of a UTF8String be?), or
through an encoding instance variable. This is what VW did, for example. The
drawback is that it requires adding some VM support for this zoo of
String classes, because String speed is vital.
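
As a rough illustration of the "encoding instance variable" idea, here is a
minimal Python sketch. EncodedString and its methods are hypothetical names
invented for this example; they are not the VW or Squeak API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedString:
    """Hypothetical: raw bytes plus the name of the encoding they use."""
    data: bytes
    encoding: str

    def code_points(self):
        # Decode with the carried encoding, then expose code points.
        return [ord(c) for c in self.data.decode(self.encoding)]

    def same_text_as(self, other):
        # Compare by code points, not by raw bytes.
        return self.code_points() == other.code_points()


a = EncodedString("é".encode("utf-8"), "utf-8")
b = EncodedString("é".encode("latin-1"), "latin-1")

bytes_differ = a.data != b.data
text_matches = a.same_text_as(b)
```

Every comparison has to decode first, which hints at why such a design
wants VM support to stay fast.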

I said canonical Unicode, but if you dig a bit, you'll see that this is not
obvious: for example, the same accented Latin character can be
encoded with a single code point, or with two code points (a composite letter
with a code for the accent and another one for the bare letter).
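
This one can be shown with Python's standard unicodedata module (a sketch
for illustration; the variable names are mine). In Unicode the base letter
comes first and the combining accent follows it:

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# A plain code point comparison treats the two forms as different.
naive_equal = (composed == decomposed)

# After normalizing both to the same form, they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
```

So "canonical" only means something once a normalization form (NFC, NFD,
...) has been chosen.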

Last thing, we have our Squeakism: the #leadingChar. I'll let you dig into its
usage, but it should be restricted to East Asian language support since
Squeak 4.x at least.



2013/8/6 tim Rowledge <tim at rowledge.org>

>
> On 06-08-2013, at 1:57 AM, Bert Freudenberg <bert at freudenbergs.de> wrote:
> > You missed that I make a distinction between "i18n" (how to translate
> > between English and Other Human Languages) and the rather technical aspect
> > of how to represent strings with more than 8 bits per character.
>
> Fair enough; it's all unfamiliar enough to me that it looks like one big
> hairy ball of nastiness. The Scratch translation system is a fairly simple
> dictionary lookup, so at least that part makes sense!
>
> > For both of these Scratch has a solution different from main Squeak, but
> > I'm saying the best way forward is to use Squeak's strings with Scratch's
> > translation framework.
>
> OK, I can see virtue in that. I don't currently have a clue how
> non-english/ascii characters get handled in the Squeak system but I suppose
> we'll crash into that bridge when we come to it…
>
> Squeak has ByteString and WideString. I'm going to make a wild guess that
> WideString is for use as UTF32 encoding of unicode, and that ByteString is
> usable for 'plain old ascii' and UTF8 encoded unicode?
>
>
> > A third part is displaying the translated strings for which I'd continue
> > to use Scratch's way, at least for the time being.
>
> I *think* that one advantage of using the Squeak string classes should be
> that StringMorph already handles them properly, rather than having to fudge
> in the rather ugly Scratch modifications. I'm not sure about right-to-left
> languages though - are they supposed to be handled? There's a fair bit of
> if-this draw one way, if the-other draw differently, unless the
> magic-unicode-direction-char says otherwise and it's a blue moon on
> Thursday.
>
>
> tim
> --
> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
> Base 8 is just like base 10, if you are missing two fingers.
>
>
>
>