[squeak-dev] how to create an UTF-8 character

Bert Freudenberg bert at freudenbergs.de
Tue Sep 23 13:48:41 UTC 2008


On 23.09.2008, at 01:46, stephane ducasse wrote:

> Hi all
>
> I would like to know how I can create an UTF-* character composed  
> for example of two bytes
>
> 16rC3 and 16rBC
>
> I tried
>
> 	WideString fromByteArray: { 16rC3 . 16rBC }
>
> Stef

There is no such thing as a "UTF-*" character. There are Unicode
characters and Unicode strings, and there are UTF-encoded strings (UTF
means Unicode Transformation Format).

All characters in Squeak use Unicode now. For example, the Cyrillic Б
is

	char := Character value: 16r0411.

This can be made into a String:

	wideString := String with: char.

which of course has the same Unicode code points:

	wideString asArray collect: [:each | each hex]

gives

	 #('16r411')

The string can be encoded as UTF-8:

	utf8String := wideString squeakToUtf8.

and to see the values there

	utf8String asArray collect: [:each | each hex]

yields

	 #('16rD0' '16r91')

which is the UTF-8 representation of the character we began with (but
if you try to print utf8String directly you get nonsense, because
Squeak does not know it is UTF-8 encoded).
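
To see where those two bytes come from, the two-byte UTF-8 form can be
computed by hand. This is only a sketch for code points between 16r80
and 16r7FF; the variable names (codePoint, lead, trail) are made up
for illustration, and squeakToUtf8 remains the thing to use in
practice:

	"Two-byte UTF-8 layout: 110xxxxx 10xxxxxx"
	codePoint := 16r0411.
	"lead byte: marker bits 110 plus the upper 5 payload bits"
	lead := 2r11000000 bitOr: (codePoint bitShift: -6).
	"trail byte: marker bits 10 plus the lower 6 payload bits"
	trail := 2r10000000 bitOr: (codePoint bitAnd: 2r00111111).
	lead = 16rD0 and: [trail = 16r91]

The last line evaluates to true, which matches the bytes shown above.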

The decoding of UTF-8 to a String is similar:

	#(16rC3 16rBC) asByteArray asString utf8ToSqueak

which returns the String 'ü' and probably is what you wanted in the  
first place - but please try to understand and use the Unicode terms  
correctly to minimize confusion.
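
For illustration, the same two bytes can also be decoded by hand. This
is a sketch that assumes a well-formed two-byte sequence and does no
validation; the names lead, trail and codePoint are hypothetical:

	lead := 16rC3.
	trail := 16rBC.
	"drop the 110 and 10 marker bits, then join the payload bits"
	codePoint := ((lead bitAnd: 2r00011111) bitShift: 6)
		bitOr: (trail bitAnd: 2r00111111).
	Character value: codePoint

Here codePoint comes out as 16rFC, which is the Unicode code point of
ü.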

Anyway, to convert between a String in UTF-8 and a regular Squeak  
String, it's simplest to use utf8ToSqueak and squeakToUtf8.
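
A quick round-trip check, using only the two messages named above (the
variable name original is just for this example), would be something
like:

	original := String with: (Character value: 16r0411).
	"encode to UTF-8 and decode again; this should give back an equal String"
	original squeakToUtf8 utf8ToSqueak = original

I would expect this to answer true.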

- Bert -