Hi all
I would like to know how I can create a UTF-* character composed, for example, of the two bytes
16rC3 and 16rBC
I tried
WideString fromByteArray: { 16rC3 . 16rBC }
Stef
On Tue, 2008-09-23 at 10:46 +0200, stephane ducasse wrote:
Hmm, I'm not sure what you mean by a UTF-* character, but this way it works:
( ( String fromByteArray: ( ByteArray with: 16rC3 with: 16rBC ) ) convertFromEncoding: #utf8 ) at: 1
And it is not a two-byte character, because it is a character that is contained in Latin-1.
I thought there would be an easier/better way to do! Bert? :)
Norbert
On 23.09.2008 at 01:46, stephane ducasse wrote:
There is no such thing as a "UTF-*" character. There are Unicode characters and Unicode strings, and there are UTF-encoded strings (UTF means Unicode Transformation Format).
All characters in Squeak use Unicode now. For example, the Cyrillic Б is:
char := Character value: 16r0411.
this can be made into a String:
wideString := String with: char.
which of course has the same Unicode code points:
wideString asArray collect: [:each | each hex]
gives
#('16r411')
The string can be encoded as UTF-8:
utf8String := wideString squeakToUtf8.
and to see the values there
utf8String asArray collect: [:each | each hex]
yields
#('16rD0' '16r91')
which is the UTF-8 representation of the character we began with (but if you try to print utf8String directly you get nonsense, because Squeak does not know it is UTF-8 encoded).
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
which returns the String 'ü' and probably is what you wanted in the first place - but please try to understand and use the Unicode terms correctly to minimize confusion.
Anyway, to convert between a String in UTF-8 and a regular Squeak String, it's simplest to use utf8ToSqueak and squeakToUtf8.
- Bert -
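[The steps above can be pulled together into one round trip. A minimal sketch using only the selectors and results quoted in this message:]

```smalltalk
"Create a character from its Unicode code point (Cyrillic Б)"
char := Character value: 16r0411.
wideString := String with: char.

"Encode for the outside world: UTF-8 bytes"
utf8String := wideString squeakToUtf8.
utf8String asByteArray.       "#(16rD0 16r91)"

"Decode incoming UTF-8 bytes back into a Squeak string"
#(16rC3 16rBC) asByteArray asString utf8ToSqueak.   "'ü'"
```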
On Tue, 2008-09-23 at 06:48 -0700, Bert Freudenberg wrote:
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
Hmmm, I knew it :) That is the same I did just readable and in one line (and more of this "strange method stuff"[tm]).
Norbert
P.S.: My only hope is that with my knowledge getting bigger and pharo's getting smaller that we meet somewhere in between!!!
2008/9/23 Bert Freudenberg bert@freudenbergs.de:
Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?
Convert from "Squeak" to UTF-8: aString convertToEncoding: 'utf-8'
Convert from UTF-8 to "Squeak": aString convertFromEncoding: 'utf-8'
For checking out all the encodings your image supports: TextConverter allEncodingNames
Cheers Philippe
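[A sketch of the generic converter API above, end to end. The shown byte values follow the behavior described elsewhere in this thread, where the encoded result is a String whose bytes are the UTF-8 encoding:]

```smalltalk
"Encode: the result is a String whose bytes are the UTF-8 encoding"
encoded := 'ü' convertToEncoding: 'utf-8'.
encoded asByteArray.          "#(16rC3 16rBC)"

"Decode: interpret those bytes as UTF-8 again"
decoded := encoded convertFromEncoding: 'utf-8'.
decoded = 'ü'.                "true"
```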
Is there a reason (other than history) why Strings are not collections of Unicode characters (at least as viewed from outside), rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out of the image)? Or is it already like that?
On Tue, Sep 23, 2008 at 7:49 PM, Philippe Marschall philippe.marschall@gmail.com wrote:
At Wed, 24 Sep 2008 10:49:18 +0200, Damien Pollet wrote:
Is there a reason (other than history) why Strings are not collections of unicode characters (at least as viewed from outside) rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out the image) ? Or is it already like that ?
I think the answer is that it is already *like that*, although I can't tell what you mean by "from outside".
In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21 bits.) If all the code points in a string fit within 8 bits, we use ByteString. If they don't, it uses WideString, but the distinction is more or less hidden from a casual user. The conversion is only needed when the String is interfacing with the outside of the image.
A Unicode code point doesn't really correspond to the concept of a character, if you consider an accented character a "character". The original concept of Unicode was that such a "character" should always be represented as a sequence of code points: one base character, and one or more accent marks. It was at least pure and fair.
But they got the "Latin-1 compatibility" idea around 1990, in a retrofitted way; so the original idea of "let us make a universal character set for everybody in the world" turned into "let us make a universal character set for everybody in the world, but let's treat Westerners nicer." Of course, this turn created the situation where a simple accented character has two (precomposed and decomposed) representations. Squeak is still way behind and prefers the precomposed "normalization", but the normalization is really lax there.
To me, Han unification is more evidence of the "Westerners first" idea. If tracing characters back to their origins is the concept, then i and j should perhaps be unified as well (just kidding).
But Unicode is the standard now, and it does solve a lot of problems. So using it as the base, but putting the necessary information around it, is a good way in principle.
If so, one could argue that we could just hold every string in decomposed UTF-8 in the image, and have a couple of variants of at: and at:put:. The requirement of O(1) random access is not that important. I might go that direction if I were redoing it now.
-- Yoshiki
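[To illustrate the trade-off mentioned above: in a UTF-8 representation, character boundaries must be found by scanning, because continuation bytes all match the pattern 10xxxxxx, so at: is no longer O(1). A minimal sketch, counting characters rather than indexing to keep it short:]

```smalltalk
"UTF-8 bytes for 'übc': lead bytes start a character, 10xxxxxx bytes continue one"
bytes := #(16rC3 16rBC 16r62 16r63).
charCount := (bytes reject: [:b | (b bitAnd: 16rC0) = 16r80]) size.
charCount = 3.                "true"
```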
On 24-Sep-08, at 2:26 AM, Yoshiki Ohshima wrote:
At Wed, 24 Sep 2008 10:49:18 +0200, Damien Pollet wrote:
Is there a reason (other than history) why Strings are not collections of unicode characters (at least as viewed from outside) rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out the image) ? Or is it already like that ?
I think the answer is that it is already *like that*, although I can't tell what you mean by "from outside".
I think Damien's confusion comes from the fact that the abstractions are a bit leaky. For example, if you do something like this:
'ábc' convertToEncoding: 'utf-8'
the result is 'Ã¡bc'. It's a string where the internal, "encapsulated" state is such that writing it to a socket or file will produce the desired bytes, but all in-image behavior is totally broken.
VisualWorks tends to do a better job of maintaining the abstractions, I think. The equivalent of the above example would produce a ByteArray.
If so, one could argue that we can just hold every string in decomposed UTF-8 in the image, and have a couple of variants of at: and at:put:. The requirement of O(1) random access is not that important. I might go that direction if I redo it now.
A UTF8String would be really handy for web applications, where strings come in from the net as UTF-8, live in the image for a while, then get sent out as UTF-8. O(1) random access isn't very useful, because strings are (mostly) uninterpreted, but converting to Squeak's internal representation is expensive.
The thing is, as long as the "sequence of characters" abstraction is maintained, it doesn't matter (for purposes of correct behavior) what the internal representation is. So it's perfectly reasonable to have multiple encodings with different performance profiles. UTF8String when you need it, WideString when that makes sense.
Colin
At Wed, 24 Sep 2008 07:45:38 -0700, Colin Putney wrote:
A UTF8String would be really handy for web applications, where strings come in from the net as UTF-8, live in the image for a while, then get sent out as UTF-8. O(1) random access isn't very useful, because strings are (mostly) uninterpreted, but converting to Squeak's internal representation is expensive.
The thing is, as long as the "sequence of characters" abstraction is maintained, it doesn't matter (for purposes of correct behavior) what the internal representation is. So it's perfectly reasonable to have multiple encodings with different performance profiles. UTF8String when you need it, WideString when that makes sense.
The thing is, though, that even on the net UTF-8 is not that dominant. There are a bunch of other encodings in use.
And having both UTF8String and WideString makes comparison etc. more complicated than it should be. Having a single internal representation is cleaner.
Keeping the encoded data in a ByteArray is a sensible thing to do. That would have been a much bigger redesign of Squeak, though.
-- Yoshiki
On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21-bit.) if all the code in a string fits within 8-bit, we use ByteString. if it doesn't it uses WideString
You mean a sequence of code points? Instances of Character hold only one code point (value), while some characters need more than one code point (e.g. ksha in Devanagari needs three).
Subbu
At Wed, 24 Sep 2008 20:38:18 +0530, K. K. Subramaniam wrote:
On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21-bit.) if all the code in a string fits within 8-bit, we use ByteString. if it doesn't it uses WideString
You mean a sequence of code points? Instances of Character hold only one code point (value), while some characters need more than one code point (e.g. ksha in Devanagari needs three).
Yes, a sequence of code points, as rephrased below in the email.
-- Yoshiki
Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?
Convert from "Squeak" to UTF-8 aString convertToEncoding: 'utf-8'
do I understand correctly that such an aString is a sequence of Unicode code points?
On Sat, 2008-09-27 at 08:18 +0200, stephane ducasse wrote:
Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?
Convert from "Squeak" to UTF-8 aString convertToEncoding: 'utf-8'
do I understand correctly that such an aString is a sequence of Unicode code points?
First of all, the UTF-8 form is a sequence of bytes. These bytes are a space-optimized encoding of code points (UTF-8). If you decode those bytes you get your code points (Unicode). From a sequence of code points you can derive a character. In most cases (for us Westerners) it will be a single code point, AFAIK.
Norbert
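[The byte-to-code-point step Norbert describes can be sketched with the two-byte UTF-8 pattern (110xxxxx 10xxxxxx), using the bytes from Stef's original question:]

```smalltalk
"Low 5 bits of the lead byte, shifted left by 6, plus low 6 bits of the continuation byte"
codePoint := ((16rC3 bitAnd: 16r1F) bitShift: 6) + (16rBC bitAnd: 16r3F).
codePoint = 16rFC.            "true -- U+00FC"
Character value: codePoint.   "$ü"
```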
First of all, the UTF-8 form is a sequence of bytes. These bytes are a space-optimized encoding of code points (UTF-8). If you decode those bytes you get your code points (Unicode). From a sequence of code points you can derive a character. In most cases (for us Westerners) it will be a single code point, AFAIK.
I'm really trying to understand this in Squeak. :) What we call a character is what, then? Is it a code point? Or the looked-up glyph in a font table?
Stef
On Mon, 2008-09-29 at 18:53 +0200, stephane ducasse wrote:
I'm really trying to understand this in Squeak. :) What we call a character is what, then? Is it a code point? Or the looked-up glyph in a font table?
I don't know. I've never dealt with how Squeak does those things.
Norbert
On 29.09.2008 at 11:11, Norbert Hartl wrote:
I'm really trying to understand this in Squeak. :) What we call a character is what, then? Is it a code point? Or the looked-up glyph in a font table?
A character represents a single code point. A font maps code points to glyphs.
A character also encodes a language tag (a.k.a. leading char), but we all seem to agree that's a bad idea; it was done to allow easier migration of old code (for many Eastern languages, a code point and a font are not enough for rendering; you also need to know the language).
- Bert -
Bert Freudenberg wrote:
A character also encodes a language tag (a.k.a. leading char), but we all seem to agree that's a bad idea; it was done to allow easier migration of old code (for many Eastern languages, a code point and a font are not enough for rendering; you also need to know the language).
I wouldn't necessarily call it a bad idea. It is incomplete, for sure, but it is one of the ways one can deal with this problem. Even though I prefer having language information in text attributes, the language tag per se wouldn't cause problems if the code were able to deal with its absence. E.g., if one could use strings with "just Unicode", I wouldn't mind having the ability to add the language tag for disambiguation where necessary (issues of equality etc. notwithstanding, which is why I think using text attributes is the better way to go).
The problem is that too much code relies on both the presence of, as well as particular values for, certain code points, and simply breaks if the tag isn't filled in "correctly". As such, the language tag seems to be mostly redundant with certain code points. I guess one way to get over this is to add a preference that leaves out the language tag, and just try running that way to see what breaks, and where.
Cheers, - Andreas
At Mon, 29 Sep 2008 11:24:36 -0700, Bert Freudenberg wrote:
A character represents a single code point.
I would like this to be philosophically false, but Unicode decided that this is the way it is. We use Unicode for part of the representation, but we can have a different philosophy there.
A font maps code points to glyphs.
And the trouble is that "a font" alone cannot really map code points to the glyphs the users want; we need additional information.
IOW, if we followed the philosophy of "a character is a code point and a font maps it to a glyph", we should not be able to print-it "a code point" in a workspace. I am not sure the Squeak community would like to go all the way like that.
-- Yoshiki
2008/9/27 stephane ducasse stephane.ducasse@free.fr:
do I understand correctly that such an aString is a sequence of Unicode code points?
Plus the leading char. If you look at UTF8TextConverter, it will give every incoming character with an index higher than 255 the language of the image. I don't need to explain why this is problematic in the context of a web application, do I?
Cheers Philippe
Philippe Marschall wrote:
Plus leading char. If you look at UTF8TextConverter it will give every incoming character with an index higher than 255 the language of the image. I don't need to explain why this is problematic in the context of a web application, do I?
Actually, it *is* worthwhile to explain this. The problem is that since UTF-8 doesn't have the notion of a leading char there is no way to tag incoming data correctly. The leading char will be taken from the running image, so an image running in the US (like our servers) will tag input coming from Chinese browsers as Latin1. In these situations the leading char isn't just useless, it is actively misleading.
Cheers, - Andreas
At Sat, 27 Sep 2008 10:14:39 -0700, Andreas Raab wrote:
Actually, it *is* worthwhile to explain this. The problem is that since UTF-8 doesn't have the notion of a leading char there is no way to tag incoming data correctly. The leading char will be taken from the running image, so an image running in the US (like our servers) will tag input coming from Chinese browsers as Latin1. In these situations the leading char isn't just useless, it is actively misleading.
For that kind of web application and server that deals with stuff outside of Squeak, it doesn't serve a good purpose, because editing, displaying, etc. are out of scope. Needless to say, the original idea was to make Squeak a dynamic, interactive, multilingualized environment, so there is a mismatch. Web applications etc. historically came after that goal.
If you need to retain this extra information, sending the strings without going through UTF-8 conversion makes more sense.
-- Yoshiki
Yoshiki Ohshima wrote:
For that kind of web application and server that deals with stuff outside of Squeak, it doesn't serve a good purpose, because editing, displaying, etc. are out of scope. Needless to say, the original idea was to make Squeak a dynamic, interactive, multilingualized environment, so there is a mismatch. Web applications etc. historically came after that goal.
Which wouldn't be a problem if the code were able to handle the data properly. Unfortunately, the effects of an "invalid" leading char are very, very strange (everything from crashing the scanner to raising weird errors in comparisons, character access, etc.). As it stands, an application that uses non-Latin characters off the web is best off keeping everything in UTF-8.
BTW, one way to deal with this properly is by providing a leading char upon input conversion (i.e., utf8ToSqueak would then insert the proper leading chars for each character). As a matter of fact, I think this is what Unicode class>>value: should do (instead of substituting the environmental leading char).
If you need to retain this extra information, sending the strings without going through UTF-8 conversion makes more sense.
Or provide it via additional attributes. I still think that language information would best be modeled by a text attribute - in which case we have a plain Unicode implementation for strings as well as the ability to provide the disambiguation in text where required.
Cheers, - Andreas
2008/9/28, Andreas Raab andreas.raab@gmx.de:
Or provide it via additional attributes. I still think that language information would best be modeled by a text attribute - in which case we have a plain Unicode implementation for strings as well as the ability to provide the disambiguation in text where required.
+1
Cheers Philippe
At Sun, 28 Sep 2008 10:45:00 -0700, Andreas Raab wrote:
If you need to retain this extra information, sending the strings without going through UTF-8 conversion makes more sense.
Or provide it via additional attributes. I still think that language information would best be modeled by a text attribute - in which case we have a plain Unicode implementation for strings as well as the ability to provide the disambiguation in text where required.
Well, sure, that is the more-work but cleaner approach. That is what I've been mentioning from time to time. The consequence would be that a bare character object or string object won't show up in the proper way; but that is not a big problem.
-- Yoshiki
On Sat, Sep 27, 2008 at 7:05 PM, Philippe Marschall philippe.marschall@gmail.com wrote:
Plus leading char.
You mean the BOM (byte order mark) or something else ?
2008/9/28 Damien Pollet damien.pollet@gmail.com:
You mean the BOM (byte order mark) or something else ?
No, I mean the language of the image encoded into every single character with an index bigger than 255. Check the class comment of Character for more information.
Cheers Philippe
There is no such thing as a "UTF-*" character. There are Unicode characters and Unicode strings, and there are UTF-encoded strings (UTF means Unicode Transformation Format).
Yes I was sloppy. Thanks for the answer
All characters in Squeak use Unicode now.
Do you mean that the characters are all encoded using code point values?
can you tell me what the "now" refers to? OLPC? 3.8? I wanted to check the changes made in OLPC and harvest them in Pharo. Also, do you know if there are some tests somewhere?
For example, the cyrillic Б is
char := Character value: 16r0411.
this can be made into a String:
wideString := String with: char.
when I do char printString, it blocks my Squeak 3.9. :(
which of course has the same Unicode code points:
wideString asArray collect: [:each | each hex]
gives
#('16r411')
Here you are talking about code points. How do I get the corresponding glyph? Using an encoding, I imagine.
The string can be encoded as UTF-8:
utf8String := wideString squeakToUtf8.
and to see the values there
utf8String asArray collect: [:each | each hex]
yields
#('16rD0' '16r91')
which is the UTF-8 representation of the character we began with (but if you try to print utf8String directly you get nonsense, because Squeak does not know it is UTF-8 encoded).
ok
The decoding of UTF-8 to a String is similar:
#(16rC3 16rBC) asByteArray asString utf8ToSqueak
which returns the String 'ü' and probably is what you wanted in the first place
Why do I get a visual representation? How is the mapping done from Unicode to the glyph? Should we always pass via a transformation? How does an encoding scheme (UTF-*) associate a code point with its glyph?
- but please try to understand and use the Unicode terms correctly
to minimize confusion.
I learned that over the last few weeks, reading a lot of docs.
character sets ~= character encodings
Anyway, to convert between a String in UTF-8 and a regular Squeak String, it's simplest to use utf8ToSqueak and squeakToUtf8.
Now, utf-8 was just an example. I would like to know what *ToSqueak means. I understand that characters are code points in the Unicode system; now, how do I get to see their visual representation?
On Saturday 27 Sep 2008 11:45:38 am stephane ducasse wrote:
Why do I get a visual representation? How the mapping is done from the unicode to the glyph.
Unicode code points are processed by a shaping engine to generate a graphic. The term 'glyph' (carving, in Greek) is historical, since typefaces were carved from metal. The shaping engine is trivial in the case of the Latin-1 character set: the first 256 code points are the same as Extended ASCII, and the graphic can be looked up in a font table. Rendering "hello" on the screen involves extracting the box dimensions and graphics of h, e, l, o from a font table, laying out five boxes, and then rendering appropriately into the five boxes. Other languages have thousands of such graphics (pictals?), and the rendering algorithms are complex enough to require a shaping engine with pluggable rendering algorithms. Google Dr. Yannis Haralambous's works for details.
Should we always passed via a transformation?
UTF-8 is recommended when passing Unicode strings across programs and machines, for the sake of backward compatibility. Within a program, the choice of encoding depends on the string-handling requirements. For instance, if a program deals with palindromes, then a decomposed encoding of "rés" like <r> <e> <combining acute> <s> will break current algorithms that just reverse the string of code points.
How the encodings schema (UTF-*) associates a code point to its glyph?
The Unicode sequence "hello world" transformed into UTF-8 is the same as its Extended ASCII encoding. The process is more involved for Asian languages, so a separate shaping engine is required. Examples are Pango, the Qt shaping engine, Uniscribe, etc.
Regards .. Subbu
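[Subbu's palindrome point can be sketched directly. With a decomposed representation, naively reversing the code points detaches the combining accent from its base; U+0301 is the combining acute accent. This is a hypothetical illustration, not code from the thread:]

```smalltalk
"'rés' with a decomposed é: r, e, combining acute (U+0301), s"
res := String with: $r with: $e with: (Character value: 16r0301) with: $s.
res reversed.
"code point order is now s, U+0301, e, r -- the accent would render on the s"
```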
On Tuesday 23 Sep 2008 2:16:43 pm stephane ducasse wrote:
I would like to know how I can create an UTF-* character composed for example of two bytes
16rC3 and 16rBC
I tried
WideString fromByteArray: { 16rC3 . 16rBC }
alphaBeta := WideString from: #(945 946).
gives me a Squeak wide string containing Greek alpha and beta. The numbers are from the Unicode BMP for Greek.
alphaBeta squeakToUtf8 asByteArray
yields the UTF-8 sequence #(206 177 206 178)
and #(206 177 206 178) asString utf8ToSqueak
gives me back the original string.
Of course, you should turn on the "usePangoRenderer" preference to see characters other than Latin-1 rendered correctly.
HTH .. Subbu