[squeak-dev] WideString UTF-8, UTF-32, UCS2
Philippe Marschall
philippe.marschall at gmail.com
Sun Apr 6 19:27:21 UTC 2008
2008/4/6, Vladimir Pogorelenko <vladimir at livesystems.ru>:
> I'm trying to deal with different string encodings in my image.
>
> I've read some related posts but didn't find direct answers.
>
> For the test I took the Unicode word 'привет'. Entering this string from the
> keyboard, a Seaside web form and a file stream, I got 2 different formats:
>
>
> FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
> WideString
> 1: 1087.
> 2: 1088.
> ...
> What is the format of that String? I guess it's exactly UTF-8.
Nope, UTF-8 would result in a ByteString. These values are plain Unicode
code points (1087 is U+043F, 'п').
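To make that concrete, here is a small Python sketch (Python only because
the arithmetic is easy to check there; nothing in it is Squeak API). It
shows that 1087 and 1088 are the Unicode code points of the first two
letters of 'привет', while UTF-8 would give a sequence of byte values
below 256:

```python
word = "привет"

# Unicode code points, one integer per character.
code_points = [ord(ch) for ch in word]

# UTF-8 encodes each of these Cyrillic letters as two bytes.
utf8_bytes = list(word.encode("utf-8"))

print(code_points[:2])   # [1087, 1088]
print(utf8_bytes[:4])    # [208, 191, 209, 128]
print(len(utf8_bytes))   # 12
```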
> SECOND FORMAT: comes from FileStream, FileIn, etc.
> WideString
> 1: 1069548607.
> 2: 1069548608.
> ...
> The same question: what is it? Is it UTF-32 or UCS2?
The same thing, just with a language tag in the high bits of each value.
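The numbers in the question bear this out. A minimal Python sketch,
assuming the tag sits in the bits above the low 22 bits of each character
value (CODE_BITS, CODE_MASK and split_tagged are names made up for this
illustration, not Squeak API):

```python
CODE_BITS = 22                      # assumption: low 22 bits hold the code point
CODE_MASK = (1 << CODE_BITS) - 1

def split_tagged(value):
    """Return (language_tag, code_point) for one tagged character value."""
    return value >> CODE_BITS, value & CODE_MASK

from_file = [1069548607, 1069548608]   # SECOND FORMAT values from the question
from_keyboard = [1087, 1088]           # FIRST FORMAT values from the question

for value in from_file:
    print(split_tagged(value))         # (255, 1087) then (255, 1088)

# The raw values differ, but the code points match once the tag is masked off.
print(from_file == from_keyboard)                            # False
print([v & CODE_MASK for v in from_file] == from_keyboard)   # True
```

Under that assumption both strings hold the same code points; only the
tag (255 here) differs.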
> Both strings are displayed correctly, but I failed to compare them.
Sure, because they carry different language tags.
> So the questions are,
> 1. How to load data from files (e.g. FileStream) in the first format (UTF-8?).
> I also need to do that for loading source code which contains Unicode
> Strings. Maybe I need to subclass UTF8TextConverter and call it
> UTF8ToUTF8TextConverter.
> 2. How to set up WAKomEncoded and chars from the keyboard to come in the
> second format.
WAKomEncoded never sets the language tag because there is no way of
knowing the language of a user. Chars from the keyboard in general do
set a language tag. Because String comparison takes the language tag
into account, the two in general don't compare equal.
Note that when setting up WAKomEncoded:
- make sure your Strings are Squeak encoded (language tags are ignored)
- either use a current version of Kom in Squeak 3.9 or use an old
version of Kom in Squeak 3.8 (the semantics of #unescapePercents have
changed)
> 3. Which encoding should I choose as the base? What is the blueprint for it? I
> guess I just need to learn how to load data in the FIRST FORMAT and all will be OK.
> 4. How to convert a WideString in the image from one format to another.
#convertToEncoding: / #convertFromEncoding:
> The Unicode problem is still alive here in Squeak :-) I'm confused how some great
> products like CMSBox fight against it. Maybe they don't even need to load
> data from external streams.
I was not aware of CMSBox fighting Unicode, especially since it uses
UTF-8 just like DabbleDB. Maybe you could elaborate a bit.
Cheers
Philippe
> I'm using a squeak-dev 3.9 image with UnicodeSupport installed
> (http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html)
> to input Unicode chars from the keyboard. I'm on a Mac. I don't even know what
> would happen when I try to run it under Windows.
>
>
>