[squeak-dev] Re: WideString UTF-8, UTF-32, UCS2

Sun Apr 6 19:12:10 UTC 2008

Vladimir Pogorelenko wrote:
> FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
> WideString
> 1: 1087.
> 2: 1088.
> ...
> What is the format of that String, I guess it's exactly UTF-8.

It's UTF-32/UCS-4.

> SECOND FORMAT: comes from FileStream, FileIn, etc.
> WideString
> 1: 1069548607.
> 2: 1069548608.
> ...
> The same question what is it? Is it UTF-32 or UCS2?

It's UTF-32/UCS-4, too.

*Except* that it has a particular Squeak-idiosyncratic bit in it, the 
"leading char". If you look at the hex values, you can see that:

1069548607 hex '16r3FC0043F'
       1087 hex      '16r43F'

So the base values are the same, only some high bits are different. Now 
let's check this out:

(Character value: 1069548607) asUnicode
=> 1087
(Character value: 1087) asUnicode
=> 1087

(Character value: 1069548607) leadingChar
=> 255
(Character value: 1087) leadingChar
=> 0

So the difference is that one character is created with a different 
"leadingChar" than the other one.
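
To see where the 255 and the 1087 come from, the arithmetic can be checked 
outside Squeak. The following is only a sketch in Python. It assumes the 
layout that the two hex values above suggest (low 22 bits for the code 
point, the next 8 bits for the leadingChar); the names 
LEADING_CHAR_SHIFT, CODE_POINT_MASK and split_squeak_value are made up 
for illustration and are not Squeak API.

```python
# Assumed layout, inferred from 16r3FC0043F vs. 16r43F (not from a spec):
#   bits 0..21  -> Unicode code point
#   bits 22..29 -> leadingChar
LEADING_CHAR_SHIFT = 22
CODE_POINT_MASK = (1 << LEADING_CHAR_SHIFT) - 1   # 0x3FFFFF
LEADING_CHAR_MASK = 0xFF


def split_squeak_value(value):
    """Return (leading_char, code_point) for a raw character value."""
    leading_char = (value >> LEADING_CHAR_SHIFT) & LEADING_CHAR_MASK
    code_point = value & CODE_POINT_MASK
    return leading_char, code_point


print(hex(1069548607))                 # 0x3fc0043f
print(split_squeak_value(1069548607))  # (255, 1087)
print(split_squeak_value(1087))        # (0, 1087)
```

Under that assumption both values carry the same code point (U+043F), and 
only the upper byte differs.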

And therein lies the problem: the "leadingChar" is not part of the 
Unicode standard (and not exactly well-defined inside Squeak either). So 
all conversions must be aware that they either have to strip off the 
leadingChar or substitute it properly.
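
What "strip" and "substitute" amount to can be sketched in a few lines of 
Python. This is an illustration, not how Squeak's converters are written: 
it assumes the code point sits in the low 22 bits with an 8-bit 
leadingChar directly above it, and strip_leading_char and 
with_leading_char are hypothetical helper names.

```python
# Assumption: code point in bits 0..21, leadingChar in bits 22..29.
LEADING_CHAR_SHIFT = 22
CODE_POINT_MASK = (1 << LEADING_CHAR_SHIFT) - 1


def strip_leading_char(value):
    """Drop the leadingChar and keep only the code point."""
    return value & CODE_POINT_MASK


def with_leading_char(value, leading_char):
    """Replace whatever leadingChar the value has with the given one."""
    if not 0 <= leading_char <= 0xFF:
        raise ValueError("leading_char must fit in 8 bits")
    return (leading_char << LEADING_CHAR_SHIFT) | strip_leading_char(value)


print(strip_leading_char(1069548607))   # 1087
print(with_leading_char(1087, 255))     # 1069548607
print(with_leading_char(1069548607, 0)) # 1087
```

Which leadingChar is the "proper" one to substitute is exactly the part 
that is not well-defined, so the sketch takes it as a parameter.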

> 1. How to load data from files (e.g. FileStream) in first format 
> (UTF-8?). I also need to do that for loading source code which contains 
> unicode String's. May be I need to subclass UTF8TextConverter and call 
> it UTF8ToUTF8TextConverter.

First of all, get your terminology straight. If you think that the first 
example was in UTF-8 you're missing something big-time. Your first 
example cannot possibly be UTF-8 because it contains values outside the 
byte range: UTF-8 is a sequence of bytes, so no element could exceed 255. 
In fact, if we compare your examples:

(Character value: 1069548607) asString squeakToUtf8 asByteArray
=> a ByteArray(208 191)
(Character value: 1087) asString squeakToUtf8 asByteArray
=> a ByteArray(208 191)

You will find that both will end up with the identical encoding (not 
surprisingly since there is simply no room to stick the leadingChar 
anywhere into UTF-8).
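
The same result can be reproduced outside Squeak. The Python sketch below 
assumes the code point lives in the low 22 bits of each value; 
squeak_values_to_utf8 is a hypothetical helper, not a Squeak or Python 
library function.

```python
# Assumption: the low 22 bits of each value are the Unicode code point.
CODE_POINT_MASK = (1 << 22) - 1


def squeak_values_to_utf8(values):
    """Strip the leadingChar from each value and encode the text as UTF-8."""
    # chr() raises ValueError for anything above 0x10FFFF, which a masked
    # value could still be if the input was not a valid character.
    text = "".join(chr(v & CODE_POINT_MASK) for v in values)
    return text.encode("utf-8")


print(list(squeak_values_to_utf8([1069548607])))  # [208, 191]
print(list(squeak_values_to_utf8([1087])))        # [208, 191]
```

Both inputs encode to the two bytes 208 191 (16rD0 16rBF), the UTF-8 form 
of U+043F.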

> I'm using squeak-dev 3.9 image with installed UnicodeSupport 
> (http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html) 
> to input unicode chars from keyboard. I'm on Mac. Don't even know what 
> would be when I try to run in under Windows.

You would need a 3.10.x VM for that and also a few fixes so 3.9 is 
probably a no-go.

Cheers,
   - Andreas