[squeak-dev] how to create an UTF-8 character

Damien Pollet damien.pollet at gmail.com
Wed Sep 24 08:49:18 UTC 2008


Is there a reason (other than history) why Strings are not collections
of Unicode characters (at least as viewed from outside) rather than
bytes in some unknown encoding (which should be encapsulated and only
appear when text goes in and out of the image)? Or is it already like
that?

On Tue, Sep 23, 2008 at 7:49 PM, Philippe Marschall
<philippe.marschall at gmail.com> wrote:
> 2008/9/23 Bert Freudenberg <bert at freudenbergs.de>:
>> Am 23.09.2008 um 01:46 schrieb stephane ducasse:
>>
>>> Hi all
>>>
>>> I would like to know how I can create an UTF-* character composed for
>>> example of two bytes
>>>
>>> 16rC3 and 16rBC
>>>
>>> I tried
>>>
>>>        WideString fromByteArray: { 16rC3 . 16rBC }
>>>
>>> Stef
>>
>> There is no such thing as a "UTF-*" character. There are Unicode Characters,
>> and Unicode Strings, and there are UTF-encoded strings (UTF means Unicode
>> Transformation Format).
>>
>> All characters in Squeak use Unicode now. For example, the Cyrillic Б is
>>
>>        char := Character value: 16r0411.
>>
>> this can be made into a String:
>>
>>        wideString := String with: char.
>>
>> which of course has the same Unicode code points:
>>
>>        wideString asArray collect: [:each | each hex]
>>
>> gives
>>
>>         #('16r411')
>>
>> The string can be encoded as UTF-8:
>>
>>        utf8String := wideString squeakToUtf8.
>>
>> and to see the values there
>>
>>        utf8String asArray collect: [:each | each hex]
>>
>> yields
>>
>>         #('16rD0' '16r91')
>>
>> which is the UTF-8 representation of the character we began with (but if you
>> try to print utf8String directly you get nonsense, because Squeak does not
>> know it is UTF-8 encoded).
>>
>> The decoding of UTF-8 to a String is similar:
>>
>>        #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>>
>> which returns the String 'ü' and probably is what you wanted in the first
>> place - but please try to understand and use the Unicode terms correctly to
>> minimize confusion.
>>
>> Anyway, to convert between a String in UTF-8 and a regular Squeak String,
>> it's simplest to use utf8ToSqueak and squeakToUtf8.
>
> Am I the only one using the generic en/decoding functionality in
> Squeak in the form of #convertTo/FromEncoding?
>
> Convert from "Squeak" to UTF-8
> aString convertToEncoding: 'utf-8'
>
> Convert from UTF-8 to "Squeak"
> aString convertFromEncoding: 'utf-8'
>
> For checking out all the encodings your image supports:
> TextConverter allEncodingNames
>
> Cheers
> Philippe
>
>
>
>
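
For reference, a minimal workspace sketch of the round trip discussed
above. It uses only selectors already quoted in this thread
(#convertToEncoding: and #convertFromEncoding:); the variable names are
illustrative, and behaviour may differ between Squeak versions:

```smalltalk
"Workspace sketch; variable names are illustrative only."
| original utf8 decoded |
original := String with: (Character value: 16r0411).  "one Unicode character"
utf8 := original convertToEncoding: 'utf-8'.           "a String holding the UTF-8 bytes"
decoded := utf8 convertFromEncoding: 'utf-8'.          "back to Unicode characters"
decoded = original.  "expected to answer true if the round trip is lossless"
```

Inspecting utf8 asArray should show the two byte values quoted above
(16rD0 and 16r91), while original holds a single character.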



-- 
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet

