[squeak-dev] Re: WideString UTF-8, UTF-32, UCS2
Andreas Raab
andreas.raab at gmx.de
Sun Apr 6 19:12:10 UTC 2008
Vladimir Pogorelenko wrote:
> FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
> WideString
> 1: 1087.
> 2: 1088.
> ...
> What is the format of that String, I guess it's exactly UTF-8.
It's UTF-32/UCS-4.
> SECOND FORMAT: comes from FileStream, FileIn, etc.
> WideString
> 1: 1069548607.
> 2: 1069548608.
> ...
> The same question what is it? Is it UTF-32 or UCS2?
It's UTF-32/UCS-4, too.
*Except* that it has a particular Squeak-idiosyncratic bit in it, the
"leading char". If you look at the hex values, you can see that:
1069548607 hex
=> '16r3FC0043F'
1087 hex
=> '16r43F'
So the base values are the same, only some high bits are different. Now
let's check this out:
(Character value: 1069548607) asUnicode
=> 1087
(Character value: 1087) asUnicode
=> 1087
(Character value: 1069548607) leadingChar
=> 255
(Character value: 1087) leadingChar
=> 0
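The same split can be reproduced with plain integer arithmetic. This is only a sketch: it assumes the leadingChar sits in the bits directly above a 22-bit character code. That layout matches the two values in this thread, but it is an assumption about the image, not something guaranteed here. The variable names are made up for illustration.

```smalltalk
| fromFile fromKeyboard |
fromFile := 1069548607.        "16r3FC0043F, the value seen via FileStream"
fromKeyboard := 1087.          "16r43F, the value seen from keyboard input"

"Low 22 bits: the character code (assumed layout)"
fromFile bitAnd: 16r3FFFFF.        "=> 1087"
fromKeyboard bitAnd: 16r3FFFFF.    "=> 1087"

"Everything above bit 21: the leadingChar (assumed layout)"
fromFile bitShift: -22.            "=> 255"
fromKeyboard bitShift: -22.        "=> 0"
```

Evaluate each expression with "print it" in a workspace. In real code, prefer asUnicode and leadingChar over masking by hand, since they do not depend on the bit layout.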
So the difference is that one character is created with a different
"leadingChar" than the other one.
And therein lies the problem: the "leadingChar" is not part of the
Unicode standard (and not exactly well-defined inside Squeak either).
So all conversions must be aware that they have to either strip off
the leadingChar or substitute it properly.
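A minimal sketch of the "strip it off" option, using only the asUnicode and leadingChar selectors shown above. The names tagged and stripped are hypothetical, and this assumes that collect: on a WideString answers another string of the same kind; I have not checked that on every image version.

```smalltalk
| tagged stripped |
"A one-character string carrying the leadingChar 255 value from the file example"
tagged := WideString with: (Character value: 1069548607).

"Rebuild every character from its plain Unicode value, dropping the leadingChar"
stripped := tagged collect: [:ch | Character value: ch asUnicode].

(stripped at: 1) asInteger.      "=> 1087"
(stripped at: 1) leadingChar.    "=> 0"
```

Substituting a proper leadingChar instead of stripping it depends on the language environment in use, so it is not sketched here.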
> 1. How to load data from files (e.g. FileStream) in first format
> (UTF-8?). I also need to do that for loading source code which contains
> unicode String's. May be I need to subclass UTF8TextConverter and call
> it UTF8ToUTF8TextConverter.
First of all, get your terminology straight. If you think that the first
example was in UTF-8 you're missing something big-time. UTF-8 is a
sequence of bytes, so every element of a UTF-8 string is in the range
0 to 255. Your first example cannot possibly be UTF-8 because it
contains values such as 1087, far outside the byte range. In fact, if
we compare your examples:
(Character value: 1069548607) asString squeakToUtf8 asByteArray
=> a ByteArray(208 191)
(Character value: 1087) asString squeakToUtf8 asByteArray
=> a ByteArray(208 191)
You will find that both end up with the identical encoding (not
surprisingly, since there is simply no room to stick the leadingChar
anywhere into UTF-8).
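To see where 208 and 191 come from, here is the two-byte UTF-8 form worked out by hand for the code point 1087. This is just the arithmetic from the UTF-8 definition, not how squeakToUtf8 is implemented, and the variable names are for illustration only.

```smalltalk
| codePoint lead trail |
codePoint := 1087.    "16r43F, what both example characters reduce to"

"Two-byte form, used for code points 16r80 through 16r7FF: 110xxxxx 10xxxxxx"
lead := 16rC0 bitOr: (codePoint bitShift: -6).     "upper 5 bits of the code point"
trail := 16r80 bitOr: (codePoint bitAnd: 16r3F).   "lower 6 bits of the code point"

ByteArray with: lead with: trail.    "=> a ByteArray(208 191)"
```

Only the 11 bits of the code point go into those two bytes, which is why the leadingChar cannot survive the conversion.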
> I'm using squeak-dev 3.9 image with installed UnicodeSupport
> (http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html)
> to input unicode chars from keyboard. I'm on Mac. Don't even know what
> would be when I try to run in under Windows.
You would need a 3.10.x VM for that and also a few fixes so 3.9 is
probably a no-go.
Cheers,
- Andreas