[squeak-dev] WideString UTF-8, UTF-32, UCS2

Vladimir Pogorelenko vladimirpogorelenko at gmail.com
Sun Apr 6 12:49:53 UTC 2008


I'm trying to deal with different string encodings in my image.

I've read some related posts but didn't find direct answers.

As a test I took the Unicode word 'привет'. Entering this string
from the keyboard, from a Seaside web form, and from a file stream, I got
two different formats:


FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
	WideString
		1: 1087.
		2: 1088.
		...
What is the format of that String? My guess is that it's UTF-8.
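
One way to check that guess is to compare the Unicode code points of the test word with its UTF-8 bytes and see which list matches the values shown above. This is a rough sketch in Python rather than Squeak, so nothing in it is a Squeak API.

```python
# Compare the code points of the test word with its UTF-8 bytes.
word = 'привет'

code_points = [ord(ch) for ch in word]      # one integer per character
utf8_bytes = list(word.encode('utf-8'))     # two bytes per Cyrillic letter

print(code_points[:2])   # values to compare with elements 1 and 2 above
print(utf8_bytes[:4])    # the UTF-8 bytes for the same two letters
```

If the first list matches the 1087, 1088 values, the FIRST FORMAT would be plain Unicode code points, one per character, rather than UTF-8 bytes.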


SECOND FORMAT: comes from FileStream, FileIn, etc.
	WideString
		1: 1069548607.
		2: 1069548608.
		...
Same question: what is it? Is it UTF-32 or UCS-2?

Both strings are displayed correctly, but comparing them for equality fails.
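
The two sets of values can also be compared arithmetically. The sketch below (again Python, not Squeak) subtracts them and splits each larger value into a high part and a low 22-bit part. The 22-bit split is an assumption about how the image might pack a language tag above the code point. The names `split_value`, `LOW_BITS` and `LOW_MASK` are made up for this illustration.

```python
LOW_BITS = 22                     # assumed width of the code point field
LOW_MASK = (1 << LOW_BITS) - 1    # 0x3FFFFF

def split_value(value):
    """Return (high_part, low_part) of a character value."""
    return value >> LOW_BITS, value & LOW_MASK

first = [1087, 1088]                  # values from the FIRST FORMAT
second = [1069548607, 1069548608]     # values from the SECOND FORMAT

# Difference between corresponding elements of the two strings.
offsets = [b - a for a, b in zip(first, second)]
print([hex(o) for o in offsets])

# High and low parts of each SECOND FORMAT value.
print([split_value(v) for v in second])
```

If the offset is constant and the low parts equal the FIRST FORMAT values, then both strings carry the same code points and differ only in the high bits. That would explain why they display identically but do not compare equal.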

So my questions are:
	1. How do I load data from files (e.g. FileStream) in the first format
(UTF-8?)? I also need this for loading source code that contains
Unicode Strings. Maybe I need to subclass UTF8TextConverter
and call it UTF8ToUTF8TextConverter.
	2. How do I set up WAKomEncoded and keyboard input so that characters
arrive in the second format?
	3. Which encoding should I choose as the base? Is there a blueprint for
it? I guess I just need to learn how to load data in the FIRST FORMAT and
all will be OK.
	4. How do I convert a WideString in the image from one format to the other?

The Unicode problem is still alive here in Squeak :-) I wonder how
great products like CMSBox cope with it. Maybe they don't even
need to load data from external streams.

I'm using the squeak-dev 3.9 image with UnicodeSupport installed
(http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html)
to input Unicode characters from the keyboard. I'm on a Mac. I don't even
know what will happen when I try to run it under Windows.

