[squeak-dev] how to create an UTF-8 character

stephane ducasse stephane.ducasse at free.fr
Sat Sep 27 06:15:38 UTC 2008


>> There is no such thing as a "UTF-*" character. There are Unicode
>> Characters, and Unicode Strings, and there are UTF-encoded strings
>> (UTF means Unicode Transformation Format).

Yes, I was sloppy.
Thanks for the answer.

> All characters in Squeak use Unicode now.

Do you mean that all characters are now represented by their code point values?

Can you tell me what the "now" refers to?
OLPC? 3.8?
I wanted to check the changes made in OLPC and harvest them in Pharo.
Do you know if there are some tests somewhere?

> For example, the cyrillic Б is
>
> 	char := Character value: 16r0411.
>
> this can be made into a String:
>
> 	wideString := String with: char.

When I do char printString,
Squeak 3.9 freezes. :(
>
>
> which of course has the same Unicode code points:
>
> 	wideString asArray collect: [:each | each hex]
>
> gives
>
> 	 #('16r411')

Here you are talking about code points.
How do I get the corresponding glyph? Using an encoding, I imagine.

> The string can be encoded as UTF-8:
>
> 	utf8String := wideString squeakToUtf8.
>
> and to see the values there
>
> 	utf8String asArray collect: [:each | each hex]
>
> yields
>
> 	 #('16rD0' '16r91')
>
> which is the UTF-8 representation of the character we began with
> (but if you try to print utf8String directly you get nonsense,
> because Squeak does not know it is UTF-8 encoded).

ok
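
To check that I follow where those two bytes come from, here is a minimal workspace sketch that encodes the code point by hand. It assumes the two-byte UTF-8 form only (code points 16r80 to 16r7FF) and uses nothing beyond plain integer bit operations; the variable names are mine, and this is not how squeakToUtf8 is implemented as far as I know.

```smalltalk
"Encode U+0411 as UTF-8 by hand (two-byte form only, an assumption of this sketch)."
| codePoint lead trail |
codePoint := 16r0411.
"Lead byte: marker bits 110xxxxx plus the top 5 bits of the code point."
lead := 16rC0 bitOr: (codePoint bitShift: -6).
"Continuation byte: marker bits 10xxxxxx plus the low 6 bits."
trail := 16r80 bitOr: (codePoint bitAnd: 16r3F).
"The two integers are 16rD0 and 16r91, i.e. 208 and 145."
Array with: lead with: trail.
```

So the code point 16r0411 and the byte sequence 16rD0 16r91 are two views of the same character: the first is the Unicode value, the second is its UTF-8 encoding.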
>
>
> The decoding of UTF-8 to a String is similar:
>
> 	#(16rC3 16rBC) asByteArray asString utf8ToSqueak
>
> which returns the String 'ü' and probably is what you wanted in the  
> first place
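
The reverse direction can be sketched the same way. This is only an illustration of the bit arithmetic for a two-byte sequence, with names of my own choosing, and not a claim about what utf8ToSqueak does internally.

```smalltalk
"Decode the two UTF-8 bytes 16rC3 16rBC by hand (two-byte form only)."
| lead trail codePoint |
lead := 16rC3.
trail := 16rBC.
"Strip the 110 marker from the lead byte and the 10 marker from the continuation byte,
then put the remaining 5 and 6 bits back together."
codePoint := ((lead bitAnd: 16r1F) bitShift: 6) bitOr: (trail bitAnd: 16r3F).
"codePoint is 16rFC (252), the Unicode code point of ü."
Character value: codePoint.
```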

Why do I get a visual representation? How is the mapping done from the
Unicode code point to the glyph?
Should we always go through a transformation?
How does an encoding scheme (UTF-*) associate a code point with its glyph?

> - but please try to understand and use the Unicode terms correctly  
> to minimize confusion.

I learned that over the last few weeks, reading a lot of docs.

character sets ~= character encodings

>
> Anyway, to convert between a String in UTF-8 and a regular Squeak  
> String, it's simplest to use utf8ToSqueak and squeakToUtf8.

UTF-8 was just an example. I would like to know what a *ToSqueak conversion is in general.
I understand that characters are code points in the Unicode system; now,
how do I get to see their visual representation?
