On Thursday 30 Sep 2010 2:10:27 pm Bert Freudenberg (JIRA) wrote:
> The Squeak VM needs to be started with the -compositioninput option (this can be done via the etoys script in /usr/bin). However, even after that, Chinese characters appear correctly only after the locale (LANG, LC_ALL env variables) is set to ja_JP.UTF-8.
I noticed something strange in HandMorph>>generateKeyboardEvent. The keyboardInterpreter for the Japanese locale decodes a stream of UTF-8 bytes into a Character 'char', but the keyValue passed to the event generator is 'char asciiValue'. This doesn't seem right. But when I try to pass char directly, I get a failure in keyString, which tries to print keyValue using String.
Is it necessary that keyValue be ASCII all the time?
Subbu
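To make the concern above concrete, here is a minimal Python sketch (not Squeak code; the function name `decode_key` is hypothetical). It shows that decoding a UTF-8 byte sequence from an input method into a single character yields a value far outside the ASCII range, so anything downstream that treats the key value as ASCII would be working on a false assumption.

```python
def decode_key(utf8_bytes: bytes) -> int:
    """Decode exactly one UTF-8 encoded character and return its code point."""
    text = utf8_bytes.decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


# HIRAGANA LETTER A (U+3042) is three bytes in UTF-8.
key_value = decode_key(b"\xe3\x81\x82")
print(hex(key_value), key_value < 128)  # prints: 0x3042 False
```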
On 01.10.2010, at 09:04, K. K. Subramaniam wrote:
> On Thursday 30 Sep 2010 2:10:27 pm Bert Freudenberg (JIRA) wrote:
>> The Squeak VM needs to be started with the -compositioninput option (this can be done via the etoys script in /usr/bin). However, even after that, Chinese characters appear correctly only after the locale (LANG, LC_ALL env variables) is set to ja_JP.UTF-8.
> I noticed something strange in HandMorph>>generateKeyboardEvent. The keyboardInterpreter for the Japanese locale decodes a stream of UTF-8 bytes into a Character 'char', but the keyValue passed to the event generator is 'char asciiValue'. This doesn't seem right. But when I try to pass char directly, I get a failure in keyString, which tries to print keyValue using String.
> Is it necessary that keyValue be ASCII all the time?
asciiValue is misnamed, it is actually the raw value of the character (including the language tag), not limited to ASCII.
- Bert -
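As an illustration of "raw value including the language tag", the following Python sketch packs a tag and a character code into one integer. The 22-bit split is an assumption about how the multilingual Character stored its value in images of that era and should be checked against Character>>leadingChar and Character>>charCode in the actual image. The function names and the tag value 5 are hypothetical.

```python
CODE_BITS = 22                    # assumed width of the character-code field
CODE_MASK = (1 << CODE_BITS) - 1  # 0x3FFFFF


def raw_value(leading_char: int, char_code: int) -> int:
    """Combine a language tag and a character code into one raw integer."""
    return (leading_char << CODE_BITS) | (char_code & CODE_MASK)


def split_raw_value(value: int):
    """Return (language tag, character code) for a raw integer value."""
    return value >> CODE_BITS, value & CODE_MASK


plain = raw_value(0, ord("A"))   # tag 0: raw value equals the ASCII code
tagged = raw_value(5, 0x3042)    # hypothetical tag 5: raw value is far above 255
```

With a tag of 0 and a code below 128 the raw value coincides with ASCII, which is why ASCII-only assumptions go unnoticed until a tagged character arrives.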
On Friday 01 Oct 2010 3:26:39 pm Bert Freudenberg wrote:
> asciiValue is misnamed, it is actually the raw value of the character (including the language tag), not limited to ASCII.
That may be the intention but there are many senders of asciiValue that assume ASCII encoding.
A better idea would be to introduce two methods: 'value' to return the raw value (i.e. lang and code) and 'asAscii' to return the value converted to ASCII. Retain asciiValue only for the case where value < 256, and raise an exception if this is violated. This should help catch all assumptions about char codes in the image (including in comments; see KeyboardEvent).
Would you like a changeset for this?
Subbu
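The proposal above could be sketched roughly as follows. This is Python rather than Smalltalk, and the class and method names are hypothetical stand-ins for the selectors mentioned in the message. It only shows the intended behaviour: a raw accessor that always succeeds, and a strict accessor that fails loudly when the value does not fit the assumed range.

```python
class Char:
    """Hypothetical stand-in for a character holding a raw integer value."""

    def __init__(self, raw: int):
        self.raw = raw

    def value(self) -> int:
        # Raw value (language tag and code), never range-checked.
        return self.raw

    def ascii_value(self) -> int:
        # Strict accessor: only valid when the raw value is below 256.
        if self.raw >= 256:
            raise ValueError(f"raw value {self.raw:#x} does not fit in 8 bits")
        return self.raw
```

Calling `Char(0x3042).ascii_value()` raises, which is the point: each sender that silently assumed a small code would surface as an error instead of producing a wrong key value.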
On 01.10.2010, at 14:09, K. K. Subramaniam wrote:
> On Friday 01 Oct 2010 3:26:39 pm Bert Freudenberg wrote:
>> asciiValue is misnamed, it is actually the raw value of the character (including the language tag), not limited to ASCII.
> That may be the intention but there are many senders of asciiValue that assume ASCII encoding.
> A better idea would be to introduce two methods: 'value' to return the raw value (i.e. lang and code) and 'asAscii' to return the value converted to ASCII. Retain asciiValue only for the case where value < 256, and raise an exception if this is violated. This should help catch all assumptions about char codes in the image (including in comments; see KeyboardEvent).
> Would you like a changeset for this?
> Subbu
No, I don't think that's a good idea. There is already #charCode and #asUnicode and #asInteger. Overriding #value is bad.
Senders should be fixed to use the correct one. Besides, ASCII is only a 7-bit code anyway, so in the strictest sense it should raise an error if > 127. I'd just leave it as it is for now. So any sender of #asciiValue should be removed and replaced with the appropriate method call.
This needs to be fixed in Squeak, too. When we port Etoys to the squeak.org version, we could pick it up from there.
- Bert -
On Friday 01 Oct 2010 6:45:48 pm Bert Freudenberg wrote:
> No, I don't think that's a good idea. There is already #charCode and #asUnicode and #asInteger. Overriding #value is bad.
That's true.
> Senders should be fixed to use the correct one. Besides, ASCII is only a 7-bit code anyway, so in the strictest sense it should raise an error if > 127. I'd just leave it as it is for now. So any sender of #asciiValue should be removed and replaced with the appropriate method call.
There are about 163 senders in the Etoys image alone, and some of them are quite valid (i.e. they do apply to the lang=0, value<128 case). HandMorph>>generateKeyboardEvent is not one of them, so I will go ahead and fix these places.
> This needs to be fixed in Squeak, too. When we port Etoys to the squeak.org version, we could pick it up from there.
Yes, the issue is common but the fix is non-trivial. Latin1 languages can have either UTF32InputInterpreter or UTF8InputInterpreter depending on the codeset being used. LocalePlugin has to be extended to return the codeset (nil, UTF-8, ...) and modifier on Unix.
To solve the current issue, I will patch the immediate asciiValue misuse and create a changeset for UTF8Environment that should cover most of the multilingual issues for non-Latin languages. For en, I will add an m17n preference that will switch between classic codesets and the UTF-8 codeset.
Subbu
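For reference, the codeset and modifier mentioned above are parts of the POSIX locale name, which has the form language[_territory][.codeset][@modifier]. The Python sketch below (the function name `parse_locale` is hypothetical, and this is not how LocalePlugin is implemented) shows the kind of information such an extension would need to expose, with a missing codeset reported as None, analogous to nil.

```python
def parse_locale(name: str):
    """Split a POSIX locale name into (language, territory, codeset, modifier).

    Parts that are absent from the name are returned as None.
    """
    rest, _, modifier = name.partition("@")
    rest, _, codeset = rest.partition(".")
    language, _, territory = rest.partition("_")
    return language, territory or None, codeset or None, modifier or None


print(parse_locale("ja_JP.UTF-8"))  # prints: ('ja', 'JP', 'UTF-8', None)
```

An image could then pick its input interpreter from the codeset field rather than from the language alone.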
On 01.10.2010, at 16:56, K. K. Subramaniam wrote:
> On Friday 01 Oct 2010 6:45:48 pm Bert Freudenberg wrote:
>> No, I don't think that's a good idea. There is already #charCode and #asUnicode and #asInteger. Overriding #value is bad.
> That's true.
>> Senders should be fixed to use the correct one. Besides, ASCII is only a 7-bit code anyway, so in the strictest sense it should raise an error if > 127. I'd just leave it as it is for now. So any sender of #asciiValue should be removed and replaced with the appropriate method call.
> There are about 163 senders in the Etoys image alone, and some of them are quite valid (i.e. they do apply to the lang=0, value<128 case). HandMorph>>generateKeyboardEvent is not one of them, so I will go ahead and fix these places.
>> This needs to be fixed in Squeak, too. When we port Etoys to the squeak.org version, we could pick it up from there.
> Yes, the issue is common but the fix is non-trivial. Latin1 languages can have either UTF32InputInterpreter or UTF8InputInterpreter depending on the codeset being used. LocalePlugin has to be extended to return the codeset (nil, UTF-8, ...) and modifier on Unix.
> To solve the current issue, I will patch the immediate asciiValue misuse and create a changeset for UTF8Environment that should cover most of the multilingual issues for non-Latin languages. For en, I will add an m17n preference that will switch between classic codesets and the UTF-8 codeset.
> Subbu
Awesome!
- Bert -
etoys-dev@lists.squeakfoundation.org