Hi all
I would like to know how I can create a UTF-* character composed, for example, of the two bytes
16rC3 and 16rBC
I tried
WideString fromByteArray: { 16rC3 . 16rBC }
Stef
On Tue, 2008-09-23 at 10:46 +0200, stephane ducasse wrote:
Hmm, I'm not sure what you mean by a UTF-* character, but this way it works:
( ( String fromByteArray: ( ByteArray with: 16rC3 with: 16rBC ) ) convertFromEncoding: #utf8 ) at: 1
And it is not a two-byte character, because it is a character that is contained in Latin-1.
I thought there would be an easier/better way to do! Bert? :)
Norbert
On 23.09.2008 at 01:46, stephane ducasse wrote:
There is no such thing as a "UTF-*" character. There are Unicode characters and Unicode strings, and there are UTF-encoded strings (UTF means Unicode Transformation Format).
All characters in Squeak use Unicode now. For example, the Cyrillic Б is:
char := Character value: 16r0411.
this can be made into a String:
wideString := String with: char.
which of course has the same Unicode code points:
wideString asArray collect: [:each | each hex]
gives
#('16r411')
The string can be encoded as UTF-8:
utf8String := wideString squeakToUtf8.
and to see the values there
utf8String asArray collect: [:each | each hex]
yields
#('16rD0' '16r91')
which is the UTF-8 representation of the character we began with (but if you try to print utf8String directly you get nonsense, because Squeak does not know it is UTF-8 encoded).
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
which returns the String 'ü' and probably is what you wanted in the first place - but please try to understand and use the Unicode terms correctly to minimize confusion.
Anyway, to convert between a String in UTF-8 and a regular Squeak String, it's simplest to use utf8ToSqueak and squeakToUtf8.
- Bert -
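[The steps above can be pulled together into one round trip. A minimal sketch using only the selectors and results quoted in this message:]

```smalltalk
"Create a character from its Unicode code point (Cyrillic Б)"
char := Character value: 16r0411.
wideString := String with: char.

"Encode for the outside world: UTF-8 bytes"
utf8String := wideString squeakToUtf8.
utf8String asByteArray.       "#(16rD0 16r91)"

"Decode incoming UTF-8 bytes back into a Squeak string"
#(16rC3 16rBC) asByteArray asString utf8ToSqueak.   "'ü'"
```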
On Tue, 2008-09-23 at 06:48 -0700, Bert Freudenberg wrote:
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
Hmmm, I knew it :) That is the same I did just readable and in one line (and more of this "strange method stuff"[tm]).
Norbert
P.S.: My only hope is that with my knowledge getting bigger and pharo's getting smaller that we meet somewhere in between!!!
2008/9/23 Bert Freudenberg bert@freudenbergs.de:
Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?
Convert from "Squeak" to UTF-8: aString convertToEncoding: 'utf-8'
Convert from UTF-8 to "Squeak": aString convertFromEncoding: 'utf-8'
For checking out all the encodings your image supports: TextConverter allEncodingNames
Cheers Philippe
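[A sketch of the generic converter API above, end to end. The shown byte values follow the behavior described elsewhere in this thread, where the encoded result is a String whose bytes are the UTF-8 encoding:]

```smalltalk
"Encode: the result is a String whose bytes are the UTF-8 encoding"
encoded := 'ü' convertToEncoding: 'utf-8'.
encoded asByteArray.          "#(16rC3 16rBC)"

"Decode: interpret those bytes as UTF-8 again"
decoded := encoded convertFromEncoding: 'utf-8'.
decoded = 'ü'.                "true"
```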
Is there a reason (other than history) why Strings are not collections of Unicode characters (at least as viewed from outside), rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out of the image)? Or is it already like that?
On Tue, Sep 23, 2008 at 7:49 PM, Philippe Marschall philippe.marschall@gmail.com wrote:
At Wed, 24 Sep 2008 10:49:18 +0200, Damien Pollet wrote:
Is there a reason (other than history) why Strings are not collections of unicode characters (at least as viewed from outside) rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out the image) ? Or is it already like that ?
I think the answer is that it is already *like that*, although I can't tell what you mean by "from outside".
In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21 bits.) If all the code points in a string fit within 8 bits, we use ByteString. If they don't, it uses WideString, but the distinction is more or less hidden from a casual user. The conversion is only needed when the String is interfacing with the outside of the image.
A Unicode code point doesn't really correspond to the concept of a character, if you consider an accented character a "character". The original concept of Unicode was that such a "character" should always be represented as a sequence of code points: one base character, and one or more accent marks. It was at least pure and fair.
But they got the "Latin-1 compatibility" idea around 1990, in a retrofitted way; so the original idea of "let us make a universal character set for everybody in the world" turned into "let us make a universal character set for everybody in the world, but let's treat Westerners nicer." Of course, this turn created the situation where a simple accented character has two (precomposed and decomposed) representations. Squeak is still way behind and prefers the precomposed "normalization", but the normalization is really lax there.
To me, Han unification is more evidence of the "Westerners first" idea. If tracing characters back to their origins is the concept, then i and j should perhaps be unified as well (just kidding).
But Unicode is the standard now, and it does solve a lot of problems. So using it as the base, but putting the necessary information around it, is a good way in principle.
If so, one could argue that we could just hold every string in decomposed UTF-8 in the image, and have a couple of variants of at: and at:put:. The requirement of O(1) random access is not that important. I might go that direction if I were redoing it now.
-- Yoshiki
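[To illustrate the trade-off mentioned above: in a UTF-8 representation, character boundaries must be found by scanning, because continuation bytes all match the pattern 10xxxxxx, so at: is no longer O(1). A minimal sketch, counting characters rather than indexing to keep it short:]

```smalltalk
"UTF-8 bytes for 'übc': lead bytes start a character, 10xxxxxx bytes continue one"
bytes := #(16rC3 16rBC 16r62 16r63).
charCount := (bytes reject: [:b | (b bitAnd: 16rC0) = 16r80]) size.
charCount = 3.                "true"
```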
On 24-Sep-08, at 2:26 AM, Yoshiki Ohshima wrote:
At Wed, 24 Sep 2008 10:49:18 +0200, Damien Pollet wrote:
Is there a reason (other than history) why Strings are not collections of unicode characters (at least as viewed from outside) rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out the image) ? Or is it already like that ?
I think the answer is that it is already *like that*, although I can't tell what you mean by "from outside".
I think Damien's confusion comes from the fact that the abstractions are a bit leaky. For example, if you do something like this:
'ábc' convertToEncoding: 'utf-8'
the result is 'Ã¡bc'. It's a string where the internal, "encapsulated" state is such that writing it to a socket or file will produce the desired bytes, but all in-image behavior is totally broken.
VisualWorks tends to do a better job of maintaining the abstractions, I think. The equivalent of the above example would produce a ByteArray.
If so, one could argue that we can just hold every string in decomposed UTF-8 in the image, and have a couple of variants of at: and at:put:. The requirement of O(1) random access is not that important. I might go that direction if I redo it now.
A UTF8String would be really handy for web applications, where strings come in from the net as UTF-8, live in the image for a while, then get sent out as UTF-8. O(1) random access isn't very useful, because strings are (mostly) uninterpreted, but converting to Squeak's internal representation is expensive.
The thing is, as long as the "sequence of characters" abstraction is maintained, it doesn't matter (for purposes of correct behavior) what the internal representation is. So it's perfectly reasonable to have multiple encodings with different performance profiles. UTF8String when you need it, WideString when that makes sense.
Colin
At Wed, 24 Sep 2008 07:45:38 -0700, Colin Putney wrote:
A UTF8String would be really handy for web applications, where strings come in from the net as UTF-8, live in the image for a while, then get sent out as UTF-8. O(1) random access isn't very useful, because strings are (mostly) uninterpreted, but converting to Squeak's internal representation is expensive.
The thing is, as long as the "sequence of characters" abstraction is maintained, it doesn't matter (for purposes of correct behavior) what the internal representation is. So it's perfectly reasonable to have multiple encodings with different performance profiles. UTF8String when you need it, WideString when that makes sense.
The thing is, though, that even on the net UTF-8 is not that dominant. There are a bunch of other encodings in use.
And having both UTF8String and WideString makes comparison etc. more complicated than it should be. Having a single internal representation is cleaner.
Keeping the encoded data in a ByteArray is a sensible thing to do. That would have been a much bigger redesign of Squeak, though.
-- Yoshiki
On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21-bit.) if all the code in a string fits within 8-bit, we use ByteString. if it doesn't it uses WideString
You mean a sequence of code points? Instances of Character hold only one code point (value), while some characters need more than one code point (e.g. ksha in Devanagari needs three).
Subbu
At Wed, 24 Sep 2008 20:38:18 +0530, K. K. Subramaniam wrote:
On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21-bit.) if all the code in a string fits within 8-bit, we use ByteString. if it doesn't it uses WideString
You mean a sequence of code points? Instances of Character hold only one code point (value), while some characters need more than one code point (e.g. ksha in Devanagari needs three).
Yes, a sequence of code points, as rephrased below in the email.
-- Yoshiki
Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?
Convert from "Squeak" to UTF-8 aString convertToEncoding: 'utf-8'
do I understand correctly that such an aString is a sequence of Unicode code points?
On Sat, 2008-09-27 at 08:18 +0200, stephane ducasse wrote:
Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?
Convert from "Squeak" to UTF-8 aString convertToEncoding: 'utf-8'
do I understand correctly that such an aString is a sequence of Unicode code points?
First of all, the UTF-8 form is a sequence of bytes. These bytes are a space-optimized encoding of code points (UTF-8). If you decode those bytes you get your code points (Unicode). From a sequence of code points you can derive a character. In most cases (for us Westerners) it will be a single code point, AFAIK.
Norbert
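[The byte-to-code-point step Norbert describes can be sketched with the two-byte UTF-8 pattern (110xxxxx 10xxxxxx), using the bytes from Stef's original question:]

```smalltalk
"Low 5 bits of the lead byte, shifted left by 6, plus low 6 bits of the continuation byte"
codePoint := ((16rC3 bitAnd: 16r1F) bitShift: 6) + (16rBC bitAnd: 16r3F).
codePoint = 16rFC.            "true -- U+00FC"
Character value: codePoint.   "$ü"
```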
First of all, the UTF-8 form is a sequence of bytes. These bytes are a space-optimized encoding of code points (UTF-8). If you decode those bytes you get your code points (Unicode). From a sequence of code points you can derive a character. In most cases (for us Westerners) it will be a single code point, AFAIK.
I'm really trying to understand this in Squeak. :) What we call a character is what, then? Is it a code point? Or the looked-up glyph in a font table?
Stef
On Mon, 2008-09-29 at 18:53 +0200, stephane ducasse wrote:
I'm really trying to understand this in Squeak. :) What we call a character is what, then? Is it a code point? Or the looked-up glyph in a font table?
I don't know. I've never dealt with how Squeak does those things.
Norbert
On 29.09.2008 at 11:11, Norbert Hartl wrote:
I'm really trying to understand this in Squeak. :) What we call a character is what, then? Is it a code point? Or the looked-up glyph in a font table?
A character represents a single code point. A font maps code points to glyphs.
A character also encodes a language tag (a.k.a. leading char), but we all seem to agree that's a bad idea; it was done to allow easier migration of old code (for many Eastern languages, a code point and a font are not enough for rendering; you also need to know the language).
- Bert -
Bert Freudenberg wrote:
A character also encodes a language tag (a.k.a. leading char), but we all seem to agree that's a bad idea; it was done to allow easier migration of old code (for many Eastern languages, a code point and a font are not enough for rendering; you also need to know the language).
I wouldn't necessarily call it a bad idea. It is incomplete, for sure, but it is one of the ways one can deal with this problem. Even though I prefer having language information in text attributes, the language tag per se wouldn't cause problems if the code were able to deal with its absence. E.g., if one could use strings with "just Unicode", I wouldn't mind having the ability to add the language tag for disambiguation where necessary (issues of equality etc. notwithstanding, which is why I think using text attributes is the better way to go).
The problem is that too much code relies on both the presence of, as well as particular values for, certain code points, and simply breaks if the tag isn't filled in "correctly". As such, the language tag seems to be mostly redundant with certain code points. I guess one way to get over this is to add a preference that leaves out the language tag, and just try running that way to see what breaks, and where.
Cheers, - Andreas
At Mon, 29 Sep 2008 11:24:36 -0700, Bert Freudenberg wrote:
A character represents a single code point.
I would like this to be philosophically false, but Unicode decided that this is the way it is. We use Unicode for part of the representation, but we can have a different philosophy there.
A font maps code points to glyphs.
And the trouble is that "a font" alone cannot really map code points to the glyphs the users want; we need additional information.
IOW, if we followed the philosophy of "a character is a code point and a font maps it to a glyph", we should not be able to print-it "a code point" in a workspace. I am not sure the Squeak community would like to go all the way like that.
-- Yoshiki
2008/9/27 stephane ducasse stephane.ducasse@free.fr:
do I understand correctly that such an aString is a sequence of Unicode code points?
Plus the leading char. If you look at UTF8TextConverter, it will give every incoming character with an index higher than 255 the language of the image. I don't need to explain why this is problematic in the context of a web application, do I?
Cheers Philippe
Philippe Marschall wrote:
Plus leading char. If you look at UTF8TextConverter it will give every incoming character with an index higher than 255 the language of the image. I don't need to explain why this is problematic in the context of a web application, do I?
Actually, it *is* worthwhile to explain this. The problem is that since UTF-8 doesn't have the notion of a leading char there is no way to tag incoming data correctly. The leading char will be taken from the running image, so an image running in the US (like our servers) will tag input coming from Chinese browsers as Latin1. In these situations the leading char isn't just useless, it is actively misleading.
Cheers, - Andreas
At Sat, 27 Sep 2008 10:14:39 -0700, Andreas Raab wrote:
Actually, it *is* worthwhile to explain this. The problem is that since UTF-8 doesn't have the notion of a leading char there is no way to tag incoming data correctly. The leading char will be taken from the running image, so an image running in the US (like our servers) will tag input coming from Chinese browsers as Latin1. In these situations the leading char isn't just useless, it is actively misleading.
For that kind of web application and server that deals with stuff outside of Squeak, it doesn't serve a good purpose, because editing, displaying, etc. are out of scope. Needless to say, the original idea was to make Squeak a dynamic, interactive, multilingualized environment, so there is a mismatch. Web applications etc. historically came after that goal.
If you need to retain this extra information, sending the strings without going through UTF-8 conversion makes more sense.
-- Yoshiki
Yoshiki Ohshima wrote:
For that kind of web application and server that deals with stuff outside of Squeak, it doesn't serve a good purpose, because editing, displaying, etc. are out of scope. Needless to say, the original idea was to make Squeak a dynamic, interactive, multilingualized environment, so there is a mismatch. Web applications etc. historically came after that goal.
Which wouldn't be a problem if the code were able to handle the data properly. Unfortunately, the effects of an "invalid" leading char are very, very strange (everything from crashing the scanner to raising weird errors in comparisons, character access, etc.). As it stands, an application that uses non-Latin characters off the web is best off keeping everything in UTF-8.
BTW, one way to deal with this properly is by providing a leading char upon input conversion (i.e., utf8ToSqueak would then insert the proper leading chars for each character). As a matter of fact, I think this is what Unicode class>>value: should do (instead of substituting the environmental leading char).
If you need to retain this extra information, sending the strings without going through UTF-8 conversion makes more sense.
Or provide it via additional attributes. I still think that language information would best be modeled by a text attribute - in which case we have a plain Unicode implementation for strings as well as the ability to provide the disambiguation in text where required.
Cheers, - Andreas
2008/9/28, Andreas Raab andreas.raab@gmx.de:
Or provide it via additional attributes. I still think that language information would best be modeled by a text attribute - in which case we have a plain Unicode implementation for strings as well as the ability to provide the disambiguation in text where required.
+1
Cheers Philippe
At Sun, 28 Sep 2008 10:45:00 -0700, Andreas Raab wrote:
If you need to retain this extra information, sending the strings without going through UTF-8 conversion makes more sense.
Or provide it via additional attributes. I still think that language information would best be modeled by a text attribute - in which case we have a plain Unicode implementation for strings as well as the ability to provide the disambiguation in text where required.
Well, sure, that is the more-work but cleaner approach. That is what I've been mentioning from time to time. The consequence would be that a bare character object or string object won't show up in the proper way; but that is not a big problem.
-- Yoshiki
On Sat, Sep 27, 2008 at 7:05 PM, Philippe Marschall philippe.marschall@gmail.com wrote:
Plus leading char.
You mean the BOM (byte order mark) or something else ?
2008/9/28 Damien Pollet damien.pollet@gmail.com:
You mean the BOM (byte order mark) or something else ?
No, I mean the language of the image encoded into every single character with an index bigger than 255. Check the class comment of Character for more information.
Cheers Philippe
There is no such thing as a "UTF-*" character. There are Unicode characters and Unicode strings, and there are UTF-encoded strings (UTF means Unicode Transformation Format).
Yes I was sloppy. Thanks for the answer
All characters in Squeak use Unicode now.
Do you mean that the characters are all encoded using code point values?
can you tell me what the "now" refers to? OLPC? 3.8? I wanted to check the changes made in OLPC and harvest them in Pharo. Also, do you know if there are some tests somewhere?
For example, the cyrillic Б is
char := Character value: 16r0411.
this can be made into a String:
wideString := String with: char.
when I do char printString, it blocks my Squeak 3.9. :(
which of course has the same Unicode code points:
wideString asArray collect: [:each | each hex]
gives
#('16r411')
Here you are talking about code points. How do I get the corresponding glyph? Using an encoding, I imagine.
The string can be encoded as UTF-8:
utf8String := wideString squeakToUtf8.
and to see the values there
utf8String asArray collect: [:each | each hex]
yields
#('16rD0' '16r91')
which is the UTF-8 representation of the character we began with (but if you try to print utf8String directly you get nonsense, because Squeak does not know it is UTF-8 encoded).
ok
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
which returns the String 'ü' and probably is what you wanted in the first place
Why do I get a visual representation? How is the mapping done from Unicode to the glyph? Should we always pass via a transformation? How does an encoding scheme (UTF-*) associate a code point with its glyph?
- but please try to understand and use the Unicode terms correctly
to minimize confusion.
I learned that over the last few weeks, reading a lot of docs.
character sets ~= character encodings
Anyway, to convert between a String in UTF-8 and a regular Squeak String, it's simplest to use utf8ToSqueak and squeakToUtf8.
Now, utf-8 was just an example. I would like to know what *ToSqueak means. I understand that characters are code points in the Unicode system; now, how do I get to see their visual representation?
On Saturday 27 Sep 2008 11:45:38 am stephane ducasse wrote:
Why do I get a visual representation? How the mapping is done from the unicode to the glyph.
Unicode code points are processed by a shaping engine to generate a graphic. The term 'glyph' (carving, in Greek) is historical, since typefaces were carved from metal. The shaping engine is trivial in the case of the Latin-1 character set: the first 256 code points are the same as Extended ASCII, and the graphic can be looked up in a font table. Rendering "hello" on the screen involves extracting the box dimensions and graphics of h, e, l, o from a font table, laying out five boxes, and then rendering appropriately into the five boxes. Other languages have thousands of such graphics (pictals?), and the rendering algorithms are complex enough to require a shaping engine with pluggable rendering algorithms. Google Dr. Yannis Haralambous's works for details.
Should we always passed via a transformation?
UTF-8 is recommended when passing Unicode strings across programs and machines, for the sake of backward compatibility. Within a program, the choice of encoding depends on the string-handling requirements. For instance, if a program deals with palindromes, then a decomposed encoding of "rés" like <r> <e> <combining acute> <s> will break current algorithms that just reverse the string of code points.
How the encodings schema (UTF-*) associates a code point to its glyph?
The Unicode sequence "hello world" transformed into UTF-8 is the same as its Extended ASCII encoding. The process is more involved for Asian languages, so a separate shaping engine is required. Examples are Pango, the Qt shaping engine, Uniscribe, etc.
Regards .. Subbu
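[Subbu's palindrome point can be sketched directly. With a decomposed representation, naively reversing the code points detaches the combining accent from its base; U+0301 is the combining acute accent. This is a hypothetical illustration, not code from the thread:]

```smalltalk
"'rés' with a decomposed é: r, e, combining acute (U+0301), s"
res := String with: $r with: $e with: (Character value: 16r0301) with: $s.
res reversed.
"code point order is now s, U+0301, e, r -- the accent would render on the s"
```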
On Tuesday 23 Sep 2008 2:16:43 pm stephane ducasse wrote:
I would like to know how I can create an UTF-* character composed for example of two bytes
16rC3 and 16rBC
I tried
WideString fromByteArray: { 16rC3 . 16rBC }
alphaBeta := WideString from: #(945 946).
gives me a Squeak wide string containing Greek alpha and beta. The numbers are from the Unicode BMP for Greek.
alphaBeta squeakToUtf8 asByteArray
yields the UTF-8 sequence #(206 177 206 178)
and #(206 177 206 178) asString utf8ToSqueak
gives me back the original string.
Of course, you should turn on the "usePangoRenderer" preference to see characters other than Latin-1 rendered correctly.
HTH .. Subbu