Is there a reason (other than history) why Strings are not collections of Unicode characters (at least as viewed from outside) rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out of the image)? Or is it already like that?
On Tue, Sep 23, 2008 at 7:49 PM, Philippe Marschall philippe.marschall@gmail.com wrote:
2008/9/23 Bert Freudenberg bert@freudenbergs.de:
On 23.09.2008 at 01:46, stephane ducasse wrote:
Hi all
I would like to know how I can create an UTF-* character composed for example of two bytes
16rC3 and 16rBC
I tried
WideString fromByteArray: { 16rC3 . 16rBC }
Stef
There is no such thing as a "UTF-*" character. There are Unicode Characters, and Unicode Strings, and there are UTF-encoded strings (UTF means Unicode Transformation Format).
All characters in Squeak use Unicode now. For example, the Cyrillic Б is
char := Character value: 16r0411.
this can be made into a String:
wideString := String with: char.
which of course has the same Unicode code points:
wideString asArray collect: [:each | each hex]
gives
#('16r411')
The string can be encoded as UTF-8:
utf8String := wideString squeakToUtf8.
and to see the values there
utf8String asArray collect: [:each | each hex]
yields
#('16rD0' '16r91')
which is the UTF-8 representation of the character we began with (but if you try to print utf8String directly you get nonsense, because Squeak does not know it is UTF-8 encoded).
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
which returns the String 'ü' and probably is what you wanted in the first place - but please try to understand and use the Unicode terms correctly to minimize confusion.
Anyway, to convert between a String in UTF-8 and a regular Squeak String, it's simplest to use utf8ToSqueak and squeakToUtf8.
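As a rough workspace sketch of that round trip, using only the selectors already shown in this message (the variable names are made up for illustration, and the results noted in the comments are what the explanation above leads one to expect, not verified output):

```smalltalk
"Round-trip one Unicode character through UTF-8 and back."
| original utf8 decoded |
original := String with: (Character value: 16r0411).   "the Cyrillic Б from above"
utf8 := original squeakToUtf8.        "a String holding one character per UTF-8 byte"
utf8 asArray collect: [:each | each hex].   "the thread reports 16rD0 and 16r91 here"
decoded := utf8 utf8ToSqueak.         "decode the UTF-8 bytes back into code points"
decoded = original.                   "should be true if the round trip is lossless"
```

Evaluate each line with "print it" to inspect the intermediate values.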
Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?
Convert from "Squeak" to UTF-8:
aString convertToEncoding: 'utf-8'
Convert from UTF-8 to "Squeak":
aString convertFromEncoding: 'utf-8'
To check all the encodings your image supports:
TextConverter allEncodingNames
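A small workspace sketch of the generic route, assuming the image lists 'utf-8' among its encoding names (variable names are illustrative only; the comparison with squeakToUtf8 is an expectation, not something confirmed in this thread):

```smalltalk
"Encode and decode through the generic encoding API."
| original encoded decoded |
original := String with: (Character value: 16r0411).   "Cyrillic Б"
encoded := original convertToEncoding: 'utf-8'.
decoded := encoded convertFromEncoding: 'utf-8'.
decoded = original.                   "should be true if the round trip is lossless"
encoded = original squeakToUtf8.      "expected to agree with the dedicated selector"
TextConverter allEncodingNames.       "inspect this to see which names are accepted"
```

The appeal of this form is that the encoding is an ordinary argument, so the same two lines work for any name the image reports.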
Cheers Philippe