[squeak-dev] WideString UTF-8, UTF-32, UCS2

Philippe Marschall philippe.marschall at gmail.com
Sun Apr 6 19:27:21 UTC 2008


2008/4/6, Vladimir Pogorelenko <vladimir at livesystems.ru>:
> I'm trying to deal with different string encodings in my image.
>
> I've read some related posts but didn't find direct answers.
>
> For the test I took the Unicode word 'привет'. Entering this string from the
> keyboard, a Seaside web form and a file stream, I got 2 different formats:
>
>
> FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
>  WideString
>  1: 1087.
>  2: 1088.
>  ...
> What is the format of that String? I guess it's exactly UTF-8.

Nope, UTF-8 would result in a ByteString. What you have here are plain
Unicode code points (1087 is U+043F, 'п') with no language tag.
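
To make the difference concrete, here is a small sketch outside Squeak
(Python is used only to show the numbers; it is not how Squeak stores
Strings internally). It prints the Unicode code points of 'привет' next to
its UTF-8 bytes: the code points match the values in the first format,
while every UTF-8 value fits into a byte.

```python
# The test word from the original post.
word = "привет"

# Unicode code points, one per character.
code_points = [ord(ch) for ch in word]

# UTF-8 encoding: two bytes per Cyrillic letter, every value below 256.
utf8_bytes = list(word.encode("utf-8"))

print(code_points)
print(utf8_bytes)
```

Six characters give six code points but twelve UTF-8 bytes, which is why
UTF-8 data would fit into a ByteString.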

> SECOND FORMAT: comes from FileStream, FileIn, etc.
>  WideString
>  1: 1069548607.
>  2: 1069548608.
>  ...
> The same question: what is it? Is it UTF-32 or UCS2?

The same thing, just with a language tag in the upper bits of each
character value.
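
A rough sketch of how the two numbers relate, assuming the layout used by
Character in Squeak 3.8/3.9, where the language tag (leadingChar) sits
above the low 22 bits of the value. The 22-bit split is that assumption,
and the helper name split_squeak_char is made up for this example; Python
is used only for the arithmetic.

```python
# Assumption: the low 22 bits hold the code point, the bits above hold the tag.
TAG_SHIFT = 22
CODE_MASK = (1 << TAG_SHIFT) - 1


def split_squeak_char(value):
    """Split a character value into (language tag, code point)."""
    return value >> TAG_SHIFT, value & CODE_MASK


# Values taken from the two formats in the original post.
for value in (1087, 1069548607, 1069548608):
    tag, code = split_squeak_char(value)
    print(value, "-> tag", tag, "code point", code, repr(chr(code)))
```

Under that assumption 1069548607 splits into tag 255 and code point 1087,
so both formats describe the same character.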

> Both strings are displayed correctly, but comparing them fails.

Sure, because they're in a different language.

> So the questions are,
>  1. How to load data from files (e.g. FileStream) in the first format (UTF-8?).
> I also need to do that for loading source code which contains Unicode
> Strings. Maybe I need to subclass UTF8TextConverter and call it
> UTF8ToUTF8TextConverter.
>  2. How to set up WAKomEncoded and chars from the keyboard to come in the
> second format.

WAKomEncoded never sets the language tag because there is no way of
knowing the language of a user. Chars from the keyboard in general set a
language tag. Because String comparison takes the language tag into
account, tagged and untagged Strings in general don't compare equal.
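
As an illustration of why the comparison fails, the sketch below compares
the first two character values of each format as raw integers, then again
after masking off everything above the low 22 bits. The 22-bit mask is an
assumption about Squeak's character layout and strip_tags is a hypothetical
helper, not a Squeak method.

```python
# Assumption: the code point lives in the low 22 bits of each value.
CODE_MASK = (1 << 22) - 1

# First two characters of each format, as listed in the original post.
untagged = [1087, 1088]
tagged = [1069548607, 1069548608]


def strip_tags(values):
    """Drop the assumed language tag, keeping only the code points."""
    return [v & CODE_MASK for v in values]


print(untagged == tagged)
print(strip_tags(tagged) == untagged)
```

The raw values differ, while the code points alone are identical.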

Note that when setting up WAKomEncoded:
- make sure your Strings are Squeak encoded (language tags are ignored)
- either use a current version of Kom in Squeak 3.9 or use an old
version of Kom in Squeak 3.8 (the semantics of #unescapePercents have
changed)

>  3. What encoding should I choose as the base? What is the blueprint for it? I
> guess I just need to learn how to load data in the FIRST FORMAT and all will be ok.
>  4. How to convert a WideString in the image from one format to another.

#convertToEncoding: / #convertFromEncoding:

> The Unicode problem is still alive here in Squeak :-) I'm confused how some great
> products like CMSBox fight against it. Maybe they don't even need to load
> data from external streams.

I was not aware of CMSBox fighting Unicode especially since it uses
utf-8 just like DabbleDB. Maybe you could elaborate a bit.

Cheers
Philippe

> I'm using a squeak-dev 3.9 image with UnicodeSupport installed
> (http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html)
> to input Unicode chars from the keyboard. I'm on a Mac. I don't even know what
> will happen when I try to run it under Windows.