Hi ! I am in search of up to date links (or tips) on how to work with UTF-8 (instead of the default latin1) inside Squeak. I thank you in advance for your help. Pierre-Edouard
2009/3/27 Pierre-Edouard PORTIER pierre-edouard.portier@insa-lyon.fr:
> Hi ! I am in search of up to date links (or tips) on how to work with UTF-8 (instead of the default latin1) inside Squeak. I thank you in advance for your help. Pierre-Edouard
aString convertToEncoding: 'utf-8'
aString convertFromEncoding: 'utf-8'
Cheers Philippe
Thank you Philippe,
I was aware of:
aString squeakToUtf8
aString utf8ToSqueak
But I would like to be able to *see* utf-8 characters inside the squeak environment.
Cheers Pierre-Edouard
2009/3/28 Pierre-Edouard PORTIER pierre.edouard.portier@gmail.com:
> Thank you Philippe,
> I was aware of:
> aString squeakToUtf8
> aString utf8ToSqueak
> But I would like to be able to *see* utf-8 characters inside the squeak environment.
What do you mean by that? What do you understand a UTF-8 character to be?
Cheers Philippe
On Sat, Mar 28, 2009 at 11:55 AM, Pierre-Edouard PORTIER pierre.edouard.portier@gmail.com wrote:
> But I would like to be able to *see* utf-8 characters inside the squeak environment.
Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is just one way of encoding unicode (characters). You can import utf-8 encoded characters/strings, but once inside Squeak they are kept as unicode characters.
Michael
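Michael's distinction between characters and their encoding can be made concrete with a small sketch. It is written in Python rather than Squeak purely as a neutral illustration (that choice, the sample text, and all variable names are assumptions of this example, not something from the thread): inside the program the text is a sequence of Unicode code points, and UTF-8 only appears when the text crosses the boundary as bytes.

```python
# Illustration only: Python stands in for Squeak here.
text = "\u03b1\u03b2\u03b3"  # three Greek letters: alpha, beta, gamma

# Inside the program: a sequence of code points, independent of any encoding.
code_points = [ord(ch) for ch in text]

# At the boundary: UTF-8 turns each code point into one or more bytes.
utf8_bytes = text.encode("utf-8")

# Decoding reverses the step and gives back the same characters.
round_trip = utf8_bytes.decode("utf-8")

print(code_points)         # [945, 946, 947]
print(len(utf8_bytes))     # 6, because each of these letters needs two bytes
print(round_trip == text)  # True
```

The same three characters would produce different bytes under UTF-16, which is why "seeing UTF-8 characters" is really a question about Unicode display and input, not about the encoding.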
2009/3/29 Michael Rueger m.rueger@acm.org:
> On Sat, Mar 28, 2009 at 11:55 AM, Pierre-Edouard PORTIER pierre.edouard.portier@gmail.com wrote:
>> But I would like to be able to *see* utf-8 characters inside the squeak environment.
> Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is just one way of encoding unicode (characters). You can import utf-8 encoded characters/strings, but once inside Squeak they are kept as unicode characters.
Plus leadingChar, which causes a lot of problems for web applications.
Cheers Philippe
Philippe Marschall wrote:
> Michael Rueger:
>> Pierre-Edouard PORTIER wrote:
>>> But I would like to be able to *see* utf-8 characters inside the squeak environment.
>> Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is just one way of encoding unicode (characters). You can import utf-8 encoded characters/strings, but once inside Squeak they are kept as unicode characters.
> Plus leadingChar, which causes a lot of problems for web applications.
We don't have any problems with Squeak Unicode in Aida/Web apps, probably because we strictly use Unicode internally, not the UTF-8 encoded strings. All such strings are then encoded/decoded to the UTF-8 "at the edge" of image by Aida web framework.
Best regards Janko
2009/3/29 Janko Mivšek janko.mivsek@eranova.si:
> Philippe Marschall wrote:
>> Michael Rueger:
>>> Pierre-Edouard PORTIER wrote:
>>>> But I would like to be able to *see* utf-8 characters inside the squeak environment.
>>> Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is just one way of encoding unicode (characters). You can import utf-8 encoded characters/strings, but once inside Squeak they are kept as unicode characters.
>> Plus leadingChar, which causes a lot of problems for web applications.
> We don't have any problems with Squeak Unicode in Aida/Web apps, probably because we strictly use Unicode internally, not the UTF-8 encoded strings. All such strings are then encoded/decoded to the UTF-8 "at the edge" of image by Aida web framework.
What leadingChar do you use? The one of the image?
Cheers Philippe
2009/3/29 Janko Mivšek janko.mivsek@eranova.si:
> Philippe Marschall wrote:
>> Michael Rueger:
>>> Pierre-Edouard PORTIER wrote:
>>>> But I would like to be able to *see* utf-8 characters inside the squeak environment.
>>> Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is just one way of encoding unicode (characters). You can import utf-8 encoded characters/strings, but once inside Squeak they are kept as unicode characters.
>> Plus leadingChar, which causes a lot of problems for web applications.
> We don't have any problems with Squeak Unicode in Aida/Web apps, probably because we strictly use Unicode internally,
You can not do that. Squeak stores the language of a character in every character. In a web application you don't know the language of the input and utf-8 certainly doesn't contain it. You could take the language of the image but that is random and has no relation to the input. You could also set the language of a character to unicode (255) but that only works for non-Latin-1 characters, these are interned and all have leadingChar 0. Did I already mention that the leadingChar is used for #=? So no, I don't believe you.
Cheers Philippe
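Philippe's point about #= can be sketched as follows. This is Python, not Squeak, and the bit layout used here (a leadingChar stored in the bits above a 22-bit code point) as well as the helper names are assumptions made for the illustration, not a statement of how Squeak actually stores characters. The only facts taken from the thread are that every character carries a leadingChar, that 255 stands for "unicode", and that equality takes the leadingChar into account.

```python
# Hypothetical model: a character value packs a language tag (leadingChar)
# above the code point. The 22-bit width is an assumption of this sketch.
CODE_BITS = 22
CODE_MASK = (1 << CODE_BITS) - 1


def make_char(leading_char: int, code_point: int) -> int:
    """Combine a language tag and a code point into one integer value."""
    return (leading_char << CODE_BITS) | code_point


def code_point_of(value: int) -> int:
    """Strip the language tag and return only the code point."""
    return value & CODE_MASK


def leading_char_of(value: int) -> int:
    """Return only the language tag."""
    return value >> CODE_BITS


# The same Greek alpha (U+03B1), once untagged and once tagged as "unicode".
alpha_untagged = make_char(0, 0x03B1)
alpha_unicode = make_char(255, 0x03B1)

# Comparing the packed values, as an equality test that includes the tag would:
same_value = alpha_untagged == alpha_unicode

# Comparing only the code points, which is what a reader would expect:
same_code_point = code_point_of(alpha_untagged) == code_point_of(alpha_unicode)

print(same_value)       # False
print(same_code_point)  # True
```

Under this model, two strings that display identically compare as different whenever their tags differ, which is the problem a web application runs into when it cannot know the language of its input.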
> You can not do that. Squeak stores the language of a character in every character. In a web application you don't know the language of the input and utf-8 certainly doesn't contain it. You could take the language of the image but that is random and has no relation to the input. You could also set the language of a character to unicode (255) but that only works for non-Latin-1 characters, these are interned and all have leadingChar 0. Did I already mention that the leadingChar is used for #=? So no, I don't believe you.
> Cheers Philippe
It seems most reasonable to me to switch unicode leadingChar to 0. Why couldn't we just do that?
Of course, all this does not really answer Pierre-Edouard's questions... Pierre, what do you want unicode for?
- displaying any arbitrary character inside Squeak
- inputting any character with the keyboard in Squeak
- exchanging files made of arbitrary characters with the external world (utf-8, utf-16 or other formats)
- reading and writing filenames containing arbitrary characters
- anything else?
Nicolas
Hi Nicolas !
Thank you for this nice synthesis. I want to:
- display any arbitrary character inside Squeak (for example Greek characters)
- input any character with keyboard inside Squeak
- exchange utf-8 encoded data with external world
Pierre-Edouard
Philippe Marschall wrote:
> Janko Mivšek:
>> We don't have any problems with Squeak Unicode in Aida/Web apps, probably because we strictly use Unicode internally,
> You can not do that. Squeak stores the language of a character in every character. In a web application you don't know the language of the input and utf-8 certainly doesn't contain it. You could take the language of the image but that is random and has no relation to the input. You could also set the language of a character to unicode (255) but that only works for non-Latin-1 characters, these are interned and all have leadingChar 0. Did I already mention that the leadingChar is used for #=? So no, I don't believe you.
Well, you should believe me, I have a proof!
Look at this Aida/Scribo multilingual demo served from Squeak image: http://demo.bioskop.fr/wiki/wiki.html, see especially Japanese and Russian text. Even Japanese urls are working correctly: http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html
About leading character, I even don't know what is that, except in theory. That is, I never encounter this character as a problem when porting Aida and its i18n support to Squeak.
Best regards Janko
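For readers wondering what the percent-encoded part of that demo URL contains, here is a minimal sketch in Python (used only as an illustration; it is not Squeak or Aida code, and the variable names are made up for the example). It shows that the path segment is simply the UTF-8 bytes of three Japanese characters written as %XX escapes.

```python
from urllib.parse import unquote, unquote_to_bytes

# Path segment copied from the demo URL in the message above.
path_segment = "%E3%83%86%E3%82%B9%E3%83%88"

# Step 1: undo the percent-escapes to get the raw bytes sent over the wire.
raw_bytes = unquote_to_bytes(path_segment)

# Step 2: interpret those bytes as UTF-8 to get the Unicode characters.
title = unquote(path_segment, encoding="utf-8")

print(len(raw_bytes))                  # 9 bytes on the wire
print(len(title))                      # 3 characters once decoded
print([hex(ord(ch)) for ch in title])  # ['0x30c6', '0x30b9', '0x30c8']
```

A server that echoes the same nine bytes back produces a working URL whether or not it ever decodes them, which is the distinction the following replies argue about.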
2009/3/29 Janko Mivšek janko.mivsek@eranova.si:
> Philippe Marschall wrote:
>> Janko Mivšek:
>>> We don't have any problems with Squeak Unicode in Aida/Web apps, probably because we strictly use Unicode internally,
>> You can not do that. Squeak stores the language of a character in every character. In a web application you don't know the language of the input and utf-8 certainly doesn't contain it. You could take the language of the image but that is random and has no relation to the input. You could also set the language of a character to unicode (255) but that only works for non-Latin-1 characters, these are interned and all have leadingChar 0. Did I already mention that the leadingChar is used for #=? So no, I don't believe you.
> Well, you should believe me, I have a proof!
> Look at this Aida/Scribo multilingual demo served from Squeak image: http://demo.bioskop.fr/wiki/wiki.html, see especially Japanese and Russian text. Even Japanese urls are working correctly: http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html
That's just external representation, that tells absolutely nothing about internal representation and the implementation. I could easily get the same result on a Squeak 3.7.
> About leading character, I even don't know what is that, except in theory. That is, I never encounter this character as a problem when porting Aida and its i18n support to Squeak.
How can you seriously say everything is working fine when in practice you can't say what is happening and don't know how Strings and Characters work in Squeak? I find that quite dubious hyping.
Cheers Philippe
Philippe Marschall wrote:
>> Look at this Aida/Scribo multilingual demo served from Squeak image: http://demo.bioskop.fr/wiki/wiki.html, see especially Japanese and Russian text. Even Japanese urls are working correctly: http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html
> That's just external representation, that tells absolutely nothing about internal representation and the implementation. I could easily get the same result on a Squeak 3.7.
For this you need WideStrings and a proper UTF-8 converter. Does Squeak 3.7 have that?
>> About leading character, I even don't know what is that, except in theory. That is, I never encounter this character as a problem when porting Aida and its i18n support to Squeak.
> How can you seriously say everything is working fine when in practice you can't say what is happening and don't know how Strings and Characters work in Squeak? I find that quite dubious hyping.
Not hype at all but pure reality. And coming from a country where we already need Unicode characters above 256, you can be sure that I know what I'm talking about. If there would be some problem, I would be the first encountering it. But there are no problems with Unicode strings prepared by Aida, so why should I bother? This is like a premature optimization for me.
Note also that Masashi Umezawa, a Japanese guy, made a preview and few modifications to Aida to work well with Japanese writing, in all aspects from Urls to the content. Because of his work I'm therefore even more sure that we did the Unicode support right!
Janko
Janko Mivšek janko.mivsek@eranova.si writes:
> Note also that Masashi Umezawa, a Japanese guy, made a preview and few modifications to Aida to work well with Japanese writing, in all aspects from Urls to the content. Because of his work I'm therefore even more sure that we did the Unicode support right!
Hello Janko
I guess Philippe talks about in-image Japanese/Arabic/whatever. This probably needs changes to the VM. Here on Mac OS X it doesn't work. I just get empty block glyphs. It's not possible to copy non-Latin characters into the workspace. Linux VMs might handle this better.
ciao Enno
On 29.03.2009, at 16:32, Enrico Schwass wrote:
> Janko Mivšek janko.mivsek@eranova.si writes:
> Hello Janko
> I guess Philippe talks about in-image Japanese/Arabic/whatever. This probably needs changes to the VM. Here on Mac OS X it doesn't work.
The VMs can provide full unicode input now, but not all images have been adapted to make use of it. And that is completely separate from unicode font rendering support in the image.
> I just get empty block glyphs.
Your image needs to use the UTF-32 unicode character that recent VMs produce along with the old byte-sized character.
Check that "ActiveHand keyboardInterpreter" is in fact a UTF32InputInterpreter.
> It's not possible to copy non-Latin characters into the workspace.
Your image needs to make use of the ClipboardExtendedPlugin which does ship in current Mac VMs.
- Bert -
2009/3/29 Bert Freudenberg bert@freudenbergs.de:
> On 29.03.2009, at 16:32, Enrico Schwass wrote:
>> Janko Mivšek janko.mivsek@eranova.si writes:
>> Hello Janko
>> I guess Philippe talks about in-image Japanese/Arabic/whatever. This probably needs changes to the VM. Here on Mac OS X it doesn't work.
> The VMs can provide full unicode input now, but not all images have been adapted to make use of it. And that is completely separate from unicode font rendering support in the image.
I presume that a good Font or FontSet with unicode support should be in image for rendering correctly. Any link to a good Howto?
>> I just get empty block glyphs.
> Your image needs to use the UTF-32 unicode character that recent VMs produce along with the old byte-sized character.
> Check that "ActiveHand keyboardInterpreter" is in fact a UTF32InputInterpreter.
For images which do not have UTF32InputInterpreter, let me remind you that Bert's and Yoshiki's work is pending at http://bugs.squeak.org/view.php?id=7071 ...
>> It's not possible to copy non-Latin characters into the workspace.
> Your image needs to make use of the ClipboardExtendedPlugin which does ship in current Mac VMs.
> - Bert -
Bert Freudenberg wrote:
>> It's not possible to copy non-Latin characters into the workspace.
> Your image needs to make use of the ClipboardExtendedPlugin which does ship in current Mac VMs.
What's that? First time I've heard of it. It certainly doesn't ship in the Windows VM but there is really no need to - as long as the clipboard prims return utf-8, the image can do the conversions it needs.
Cheers, - Andreas
On 29.03.2009, at 20:57, Andreas Raab wrote:
> Bert Freudenberg wrote:
>>> It's not possible to copy non-Latin characters into the workspace.
>> Your image needs to make use of the ClipboardExtendedPlugin which does ship in current Mac VMs.
> What's that? First time I've heard of it. It certainly doesn't ship in the Windows VM but there is really no need to - as long as the clipboard prims return utf-8, the image can do the conversions it needs.
Ah, maybe I mixed it up. It comes from the Sophie folks and allows to not only copy/paste plain text but also rich text, images, and other stuff. I only know it works on Mac and Linux, no idea about Windows.
- Bert -
On Sun, Mar 29, 2009 at 8:54 PM, Bert Freudenberg bert@freudenbergs.de wrote:
> Ah, maybe I mixed it up. It comes from the Sophie folks and allows to not only copy/paste plain text but also rich text, images, and other stuff. I only know it works on Mac and Linux, no idea about Windows.
On Windows and Linux the VM clipboard primitives return UTF-8; on the Mac you need the extended plugin for that. On the Mac you also need to convert the clipboard contents to pre-composed Unicode. This is all part of the Unicode work currently only available for Pharo, although the code should work in mainstream Squeak as well.
Michael
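The conversion to pre-composed Unicode that Michael mentions corresponds to what the Unicode standard calls normalization form NFC. The sketch below uses Python's standard `unicodedata` module purely to illustrate the idea; it is not the Pharo or Squeak code, and the sample string is an assumption chosen for the example.

```python
import unicodedata

# Decomposed form: the letter "e" followed by a combining acute accent.
decomposed = "e\u0301"

# NFC composes the pair into the single pre-composed character U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))            # 2 code points
print(len(precomposed))           # 1 code point
print(decomposed == precomposed)  # False, although both display as the same glyph
```

Without this step, pasted text can look right and still fail comparisons or lookups against text typed inside the image.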
2009/3/29 Janko Mivšek janko.mivsek@eranova.si:
> Philippe Marschall wrote:
>>> Look at this Aida/Scribo multilingual demo served from Squeak image: http://demo.bioskop.fr/wiki/wiki.html, see especially Japanese and Russian text. Even Japanese urls are working correctly: http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html
>> That's just external representation, that tells absolutely nothing about internal representation and the implementation. I could easily get the same result on a Squeak 3.7.
> For this you need WideStrings and a proper UTF-8 converter.
No you don't. You just need to emit the right bytes. The simplest way to achieve this is to return 1:1 what was inserted. This works well as long as you don't need any String semantics. This is for example what DabbleDB does.
> Does Squeak 3.7 have that?
>>> About leading character, I even don't know what is that, except in theory. That is, I never encounter this character as a problem when porting Aida and its i18n support to Squeak.
>> How can you seriously say everything is working fine when in practice you can't say what is happening and don't know how Strings and Characters work in Squeak? I find that quite dubious hyping.
> Not hype at all but pure reality. And coming from a country where we already need Unicode characters above 256, you can be sure that I know what I'm talking about.
Then tell us what leadingChar you use. And tell us how you address the issue that #= takes the leadingChar into account.
> If there would be some problem, I would be the first encountering it.
No, as I said, as long as you're just outputting the input, you won't.
> But there are no problems with Unicode strings prepared by Aida, so why should I bother? This is like a premature optimization for me.
What, getting semantics of #= right is premature optimization? Having a working String protocol is premature optimization?
> Note also that Masashi Umezawa, a Japanese guy, made a preview and few modifications to Aida to work well with Japanese writing, in all aspects from Urls to the content. Because of his work I'm therefore even more sure that we did the Unicode support right!
Then tell us how it works and how it addresses the leadingChar issues outlined in this thread.
Cheers Philippe
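Philippe's argument that returning the input bytes 1:1 works only "as long as you don't need any String semantics" can be illustrated with a short sketch. It is Python rather than Squeak, and the sample name and variable names are assumptions for the example; nothing here describes how Aida or DabbleDB is actually implemented.

```python
# Bytes as they would arrive from a browser form, UTF-8 encoded.
incoming = "Miv\u0161ek".encode("utf-8")

# Pass-through: echoing the bytes unchanged always yields correct output.
outgoing = incoming

# String semantics need decoded characters, not raw bytes.
as_text = incoming.decode("utf-8")

print(outgoing == incoming)  # True, the page displays correctly
print(len(incoming))         # 7, a byte count
print(len(as_text))          # 6, the character count a user would expect

# Case conversion on raw bytes only touches ASCII and skips the non-ASCII letter.
upper_from_bytes = incoming.upper().decode("utf-8")
upper_from_text = as_text.upper()
print(upper_from_bytes == upper_from_text)  # False
```

So a page that renders Japanese or Russian correctly shows only that the bytes survived the round trip; it says nothing about whether length, comparison, or case operations inside the image behave correctly.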
On Fri, Mar 27, 2009 at 8:56 PM, Pierre-Edouard PORTIER pierre-edouard.portier@insa-lyon.fr wrote:
> I am in search of up to date links (or tips) on how to work with UTF-8 (instead of the default latin1) inside Squeak. I thank you in advance for your help.
http://article.gmane.org/gmane.comp.lang.smalltalk.pharo.devel/5065/match=lo...
Thank you Damien, I will be a tester of this fork. Pierre-Edouard
On Sat, Mar 28, 2009 at 11:03 AM, Damien Cassou damien.cassou@gmail.com wrote:
> On Fri, Mar 27, 2009 at 8:56 PM, Pierre-Edouard PORTIER pierre-edouard.portier@insa-lyon.fr wrote:
>> I am in search of up to date links (or tips) on how to work with UTF-8 (instead of the default latin1) inside Squeak. I thank you in advance for your help.
> http://article.gmane.org/gmane.comp.lang.smalltalk.pharo.devel/5065/match=lo...
> -- Damien Cassou http://damiencassou.seasidehosting.st
squeak-dev@lists.squeakfoundation.org