[squeak-dev] how to create an UTF-8 character

Philippe Marschall philippe.marschall at gmail.com
Tue Sep 23 17:49:54 UTC 2008


2008/9/23 Bert Freudenberg <bert at freudenbergs.de>:
> Am 23.09.2008 um 01:46 schrieb stephane ducasse:
>
>> Hi all
>>
>> I would like to know how I can create an UTF-* character composed for
>> example of two bytes
>>
>> 16rC3 and 16rBC
>>
>> I tried
>>
>>        WideString fromByteArray: { 16rC3 . 16rBC }
>>
>> Stef
>
> There is no such thing as a "UTF-*" character. There are Unicode Characters,
> and Unicode Strings, and there are UTF-encoded strings (UTF means Unicode
> Transformation Format).
>
> All characters in Squeak use Unicode now. For example, the cyrillic Б is
>
>        char := Character value: 16r0411.
>
> this can be made into a String:
>
>        wideString := String with: char.
>
> which of course has the same Unicode code points:
>
>        wideString asArray collect: [:each | each hex]
>
> gives
>
>         #('16r411')
>
> The string can be encoded as UTF-8:
>
>        utf8String := wideString squeakToUtf8.
>
> and to see the values there
>
>        utf8String asArray collect: [:each | each hex]
>
> yields
>
>         #('16rD0' '16r91')
>
> which is the UTF-8 representation of the character we began with (but if you
> try to print utf8String directly you get nonsense, because Squeak does not
> know it is UTF-8 encoded).
>
> The decoding of UTF-8 to a String is similar:
>
>        #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>
> which returns the String 'ü' and probably is what you wanted in the first
> place - but please try to understand and use the Unicode terms correctly to
> minimize confusion.
>
> Anyway, to convert between a String in UTF-8 and a regular Squeak String,
> it's simplest to use utf8ToSqueak and squeakToUtf8.

Am I the only one using the generic en/decoding functionality in
Squeak in the form of #convertToEncoding: and #convertFromEncoding:?

Convert from "Squeak" to UTF-8
aString convertToEncoding: 'utf-8'

Convert from UTF-8 to "Squeak"
aString convertFromEncoding: 'utf-8'

For checking out all the encodings your image supports:
TextConverter allEncodingNames
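
As a rough workspace sketch of the round trip (assuming an image where
String responds to both selectors above; the variable names are only
illustrative), something like this could be evaluated:

```smalltalk
| original utf8 roundTrip |
"A one-character String holding U+00FC (ü)."
original := String with: (Character value: 16rFC).
"Encode: each element of the result holds one UTF-8 byte."
utf8 := original convertToEncoding: 'utf-8'.
"Decode back to a regular Squeak String."
roundTrip := utf8 convertFromEncoding: 'utf-8'.
"Show the encoded byte values, then whether the round trip preserved the String."
Transcript
	show: (utf8 asArray collect: [:each | each asInteger]) printString; cr;
	show: (roundTrip = original) printString; cr.
```

The byte values shown should correspond to the 16rC3 and 16rBC from the
original question, since that pair is the UTF-8 form of ü.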

Cheers
Philippe

