I'm curious about what version of Squeak was used to implement Scratch. It might be nice to attempt to transplant it from one image to a similar one... I usually learn a lot about what outside of an application has changed in its host image that way.
On 18-07-2013, at 1:25 PM, Casey Ransberger casey.obrien.r@gmail.com wrote:
I'm curious about what version of Squeak was used to implement Scratch. It might be nice to attempt to transplant it from one image to a similar one... I usually learn a lot about what outside of an application has changed in its host image that way.
Round about 2.8. Lots of stuff hacked out, some odd stuff put in, a load of i18n & translation (including handling RTL languages) that needs reworking to current unicode classes, then Scratch on top of all that. I've made modest changes to the Scratch execution machinery (it's a sort of vm within squeak) that have provided considerable speedups. We have added the fast blt stuff, which has modestly sped up some parts (makes the normal morphic dev image tolerable, for example) and I'm currently working on moving it all to the current image so it can run on Stack/Cog VMs.
There are several other projects doing similar port-forward work; Phratch for Pharo is probably the most complete. An earlier one was called 'Scat', which was a very unfortunate name. BYOB etc. are *extensions* to Scratch and I'm not interested in any of that *yet*. The mission is to make a Scratch that runs on StackVM/Cog where nobody would notice anything different except the speed.
The i18n stuff is my biggest issue right now. Anyone that remembers the old days of UTF8 & UTF32 and also understands the current world of the Multilingual category classes and who can spare some time to educate me would be very welcomed.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Oxymorons: Act naturally
Great work, but why not join your effort with Phratch? https://code.google.com/p/phratch/
What is the licence of your modifications?
Regards,
On 18-07-2013, at 11:43 PM, SergeStinckwich Serge.Stinckwich@gmail.com wrote:
Great work, but why not join your effort with Phratch? https://code.google.com/p/phratch/
I looked at both Pharo and Phratch and concluded that I simply didn't like them much. Taste is a very personal thing…
What is the licence of your modifications?
It will all be MIT, same as the original.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Gotta run, the cat's caught in the printer.
What are your questions about Unicode and the Multilingual classes? What facilities did you have in mind to provide?
It would be really nice if we had the equivalent of the Linux libraries for ibus for input and pango or even graphite for display and printing, so that we would be able to support every modern national language other than Mongolian in its traditional alphabets, and a substantial number of others. (Long story, but Mongols can make do with Cyrillic for now. They will eventually tell us how they want their alphabets supported.) Plus the Mac and Windows equivalents for IMEs and rendering. That also means that we will need OpenType support, not just TrueType.
OLPC has deployments in Cambodia, Mongolia, Thailand, and a number of countries using extensions of the Arabic alphabet to write a variety of languages. Hebrew would be helpful for the Gaza deployment. Right now, the Multilingual-Languages category shows European, Greek, Simplified Chinese, Korean, Japanese, Russian, and Nepalese out of about 30 writing systems in modern use. The Set Language menu has about 30 languages on it, including seven whose names it cannot display by default.
I have not gone into the Multilingual categories in depth, but most of the code is straightforward except for classes in Multilingual-Scanning, which has the additional problems of lacking almost all comments, and reusing common method names for quite different operations. It appears to be used to implement text editing functions.
The hard part will be expanding text-handling primitives.
I apologize if this seems like piling on. There is a lot to I18n.
Hi Edward, thanks for taking an interest!
I apologize if this seems like piling on. There is a lot to I18n.
Dang right it's complex! Especially if one has never had reason to take much interest in it before…
My major requirement is to support exactly as much as Scratch needs. I'd love to be able to be more precise but I don't yet understand enough to do that. Sure, I could hack in UTF8 & UTF32 classes from the old Scratch code but that isn't how to take advantage of what is (I hope) the more modern and better developed code in the new images.
To start with I guess I need to be pointed to some basic info about how Squeak now supports i18n so that I can at least learn some of the names.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Strange OpCodes: VDP: Violate Design Parameters
On 2013-07-19, at 20:41, tim Rowledge tim@rowledge.org wrote:
Hi Edward, thanks for taking an interest!
I apologize if this seems like piling on. There is a lot to I18n.
Dang right it's complex! Especially if one has never had reason to take much interest in it before…
My major requirement is to support exactly as much as Scratch needs. I'd love to be able to be more precise but I don't yet understand enough to do that. Sure, I could hack in UTF8 & UTF32 classes from the old Scratch code but that isn't how to take advantage of what is (I hope) the more modern and better developed code in the new images.
To start with I guess I need to be pointed to some basic info about how Squeak now supports i18n so that I can at least learn some of the names.
Switching to Squeak's i18n wouldn't make sense, you should continue to use Scratch's "home-grown". To make that work, I think you only need to throw out John's explicit UTF8/UTF32 stuff and let Squeak's automatic ByteString/WideString take over. You just need to find the "edges" of the system where you need to explicitly convert to/from utf8.
- Bert -
On 05-08-2013, at 3:34 PM, Bert Freudenberg bert@freudenbergs.de wrote:
Switching to Squeak's i18n wouldn't make sense, you should continue to use Scratch's "home-grown". To make that work, I think you only need to throw out John's explicit UTF8/UTF32 stuff and let Squeak's automatic ByteString/WideString take over. You just need to find the "edges" of the system where you need to explicitly convert to/from utf8.
Err, maybe I've misunderstood but the explicit utf8/32 stuff *is* Scratch's homegrown. So I don't see how I could continue to use that and throw it away at the same time… What did I miss?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim If at first you don't succeed, call it version 1.0
On 6 August 2013 00:13, tim Rowledge tim@rowledge.org wrote:
On 05-08-2013, at 3:34 PM, Bert Freudenberg bert@freudenbergs.de wrote:
Switching to Squeak's i18n wouldn't make sense, you should continue to use Scratch's "home-grown". To make that work, I think you only need to throw out John's explicit UTF8/UTF32 stuff and let Squeak's automatic ByteString/WideString take over. You just need to find the "edges" of the system where you need to explicitly convert to/from utf8.
Err, maybe I've misunderstood but the explicit utf8/32 stuff *is* Scratch's homegrown. So I don't see how I could continue to use that and throw it away at the same time… What did I miss?
Just that Bert probably the "n't". [sic the whole sentence]
frank
On 2013-08-06, at 01:13, tim Rowledge tim@rowledge.org wrote:
On 05-08-2013, at 3:34 PM, Bert Freudenberg bert@freudenbergs.de wrote:
Switching to Squeak's i18n wouldn't make sense, you should continue to use Scratch's "home-grown". To make that work, I think you only need to throw out John's explicit UTF8/UTF32 stuff and let Squeak's automatic ByteString/WideString take over. You just need to find the "edges" of the system where you need to explicitly convert to/from utf8.
Err, maybe I've misunderstood but the explicit utf8/32 stuff *is* Scratch's homegrown. So I don't see how I could continue to use that and throw it away at the same time… What did I miss?
You missed that I make a distinction between "i18n" (how to translate between English and Other Human Languages) and the rather technical aspect of how to represent strings with more than 8 bits per character. For both of these Scratch has a solution different from main Squeak, but I'm saying the best way forward is to use Squeak's strings with Scratch's translation framework. A third part is displaying the translated strings for which I'd continue to use Scratch's way, at least for the time being.
Hope that's more clear?
- Bert -
On 06-08-2013, at 1:57 AM, Bert Freudenberg bert@freudenbergs.de wrote:
You missed that I make a distinction between "i18n" (how to translate between English and Other Human Languages) and the rather technical aspect of how to represent strings with more than 8 bits per character.
Fair enough; it's all unfamiliar enough to me that it looks like one big hairy ball of nastiness. The Scratch translation system is a fairly simple dictionary lookup, so at least that part makes sense!
For both of these Scratch has a solution different from main Squeak, but I'm saying the best way forward is to use Squeak's strings with Scratch's translation framework.
OK, I can see virtue in that. I don't currently have a clue how non-english/ascii characters get handled in the Squeak system but I suppose we'll crash into that bridge when we come to it…
Squeak has ByteString and WideString. I'm going to make a wild guess that WideString is for use as UTF32 encoding of unicode, and that ByteString is usable for 'plain old ascii' and UTF8 encoded unicode?
A third part is displaying the translated strings for which I'd continue to use Scratch's way, at least for the time being.
I *think* that one advantage of using the Squeak string classes should be that StringMorph already handles them properly, rather than having to fudge in the rather ugly Scratch modifications. I'm not sure about right-to-left languages though - are they supposed to be handled? There's a fair bit of if-this draw one way, if the-other draw differently, unless the magic-unicode-direction-char says otherwise and it's a blue moon on Thursday.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Base 8 is just like base 10, if you are missing two fingers.
Yes, WideString contains Unicode (ISO 10646) code points encoded in 32-bit words, so it is like UTF32. But no, ByteString contains only the first 256 code points of Unicode, that is, something like ISO-8859-1 (Latin-1).
So ByteString does not contain UTF8 sequences... Well, except when it temporarily contains such an encoding (see squeakToUtf8 and utf8ToSqueak). It is not a good thing that the correct interpretation of a String depends on some state held somewhere in the image... If we don't know for sure how to interpret the codes composing a String, this just makes the String useless: we can't compare them, display them, etc. In other words, they have no more value than a raw sequence of bytes, like a ByteArray. For this reason we would prefer to have encoded strings (other than canonical unicode - see further) explicitly represented in a ByteArray (I very much like the UninterpretedBytes variant from VW, a very telling name).
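To make the interpretation problem concrete, here is a small sketch in Python rather than Smalltalk. Python is used only for illustration; its `bytes`/`str` split is assumed to play roughly the role of ByteArray versus String, and none of these names exist in Squeak. The same text has different byte sequences in Latin-1 and UTF-8, and reading UTF-8 bytes as if they were Latin-1 gives the wrong characters without raising any error.

```python
# Illustration only: Python stands in for Squeak here.
text = "caf\u00e9"                       # 'café', with U+00E9 as the last character

latin1_bytes = text.encode("latin-1")    # one byte per character (4 bytes)
utf8_bytes = text.encode("utf-8")        # U+00E9 becomes two bytes (5 bytes total)

# Interpreting the UTF-8 bytes as Latin-1 "works" but yields the wrong
# characters: the two bytes of U+00E9 are read as two separate characters.
misread = utf8_bytes.decode("latin-1")

# Decoding with the encoding that was actually used recovers the text.
recovered = utf8_bytes.decode("utf-8")

print(latin1_bytes, utf8_bytes)
print(ascii(misread), ascii(recovered))
```

Nothing in the byte sequence itself says which reading is intended, which is the argument for keeping encoded data in a byte container and decoding it exactly once.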
An alternative would be to have the encoding carried by the String itself, either by class (what else would the encoding of a UTF8String be?), or through an encoding instance variable. This is what VW did, for example. The drawback is that it is necessary to add some VM support for this zoo of String classes, because String speed is vital.
I said canonical unicode, but if you dig a bit, you'll see that this is not something obvious: for example the same accented latin character can be encoded with a single codePoint, or with two codePoints (a compound letter with a code for the accent and another one for the naked letter).
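The composed versus decomposed case can be shown with a short sketch, again in Python purely for illustration (the `unicodedata` module is part of Python's standard library, not something available in Squeak). It compares the single code point U+00E9 with the pair "e" plus U+0301, and then applies Unicode normalization so the two forms compare equal.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT (two code points)

# The two strings render the same but are different code point sequences.
same_raw = composed == decomposed

# Normalizing both sides to the same form makes the comparison meaningful.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

print(len(composed), len(decomposed), same_raw, nfc_equal, nfd_equal)
```

Which normalization form to treat as "canonical" is a design decision; the sketch only shows that some choice has to be made before strings can be compared reliably.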
Last thing, we have our squeakism: the #leadingChar. I let you dig into its usage, but it should be restricted for east asian languages support since squeak 4.x at least.
On 06-08-2013, at 1:25 PM, Nicolas Cellier nicolas.cellier.aka.nice@gmail.com wrote:
Yes, WideString contains Unicode (ISO 10646) code points encoded in 32-bit words, so it is like UTF32.
Well that's good news…
But no, ByteString contains only the first 256 code points of Unicode, that is, something like ISO-8859-1 (Latin-1).
Got it; I was thinking (foolishly) that it could be (ab)used for utf8 encoding
So ByteString does not contain UTF8 sequences... Well, except when it temporarily contains such an encoding (see squeakToUtf8 and utf8ToSqueak).
Ah, so somebody else had that idea too, even though temporarily
An alternative would be to have the encoding carried by the String itself, either by class (what else would the encoding of a UTF8String be?), or through an encoding instance variable. This is what VW did, for example. The drawback is that it is necessary to add some VM support for this zoo of String classes, because String speed is vital.
Yes. Though I can handle a single case since, being single, we know what is intended. Scratch needs to have a utf8 form of string since that is how the project files store non-ascii strings. UTF32 only seems to be used as a way of doing a few odd jobs on the way to making utf8 or macRoman strings, though I'm a long way from certain of that. It gets even more mixed up because the Pi doesn't have a 'renderplugin' set, lacking a UnicodePlugin, I think because it has no Pango library or at least not one that gets used to build the unicode plugin. Maybe it should?
I said canonical unicode, but if you dig a bit, you'll see that this is not something obvious: for example the same accented latin character can be encoded with a single codePoint, or with two codePoints (a compound letter with a code for the accent and another one for the naked letter).
Now you're just saying things to scare me.
Last thing, we have our squeakism: the #leadingChar. I let you dig into its usage, but it should be restricted for east asian languages support since squeak 4.x at least.
Oh boy. More scary stories.
Thanks for explaining...
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim 29A, the hexadecimal of the Beast.
After more peering at dark corners of code, some possible light glimmers appear. I hope they're not an oncoming train…
So far as I can work out, Squeak converts incoming characters from the keyboard and clipboard according to the locale setting returned from the locale plugin (or goes for 'en' if that goes wrong). For now I'll assume that having a locale that needs non-ascii characters will result in at least some strings being WideString objects and that they will get displayed nicely, according to the fonts installed. So we have either latin-1 ascii ByteStrings or unicode WideStrings. Correct so far?
Currently Scratch expects the translation dictionary files it uses to have utf8 encoded strings for the not-english translations, so a utf8 class of some sort will be required in order to stuff them and translate to WideString. Similarly the project file format expects either Strings (as in the old Squeak latin-1 ascii) or UTF8 strings, so a conversion either way is still needed. I think that utf8 can be restricted to only existing as an intermediary buffer object.
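The "utf8 only as an intermediary buffer" idea can be sketched as follows. This is Python used as a stand-in, not Squeak or Scratch code; the function names are hypothetical, and the "fits in a byte string" check is only an analogy for the ByteString/WideString split (all code points below 256 versus anything wider).

```python
def decode_project_string(raw: bytes) -> str:
    """Decode UTF-8 bytes read from a file into text, once, at the boundary."""
    return raw.decode("utf-8")


def fits_in_byte_string(text: str) -> bool:
    """True if every code point is below 256 (the Latin-1 range)."""
    return all(ord(ch) < 256 for ch in text)


def encode_project_string(text: str) -> bytes:
    """Encode text back to UTF-8 bytes just before writing it out."""
    return text.encode("utf-8")


# Hypothetical stored values: one Latin-1 compatible, one needing wide characters.
narrow = decode_project_string(b"caf\xc3\xa9")
wide = decode_project_string("\u4f60\u597d".encode("utf-8"))

print(fits_in_byte_string(narrow), fits_in_byte_string(wide))
```

Inside the program only decoded text is passed around; the UTF-8 bytes exist just long enough to be read or written.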
I don't suppose the code currently in Squeak handles right-to-left languages?
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Hardware: The parts of a computer system that can be kicked.
On 2013-08-07, at 04:54, tim Rowledge tim@rowledge.org wrote:
After more peering at dark corners of code, some possible light glimmers appear. I hope they're not an oncoming train…
So far as I can work out, Squeak converts incoming characters from the keyboard and clipboard according to the locale setting returned from the locale plugin (or goes for 'en' if that goes wrong). For now I'll assume that having a locale that needs non-ascii characters will result in at least some strings being WideString objects and that they will get displayed nicely, according to the fonts installed. So we have either latin-1 ascii ByteStrings or unicode WideStrings. Correct so far?
Yes, although we do not usually have fonts installed that provide more than Latin-1. The mechanism is there but largely unused (there is only a handful of WideString instances in our image). Installing wide fonts isn't exactly trivial either, IIRC for TTFs all glyphs outside Latin-1 get stripped unless you know which line in which method to change.
Currently Scratch expects the translation dictionary files it uses to have utf8 encoded strings for the not-english translations, so a utf8 class of some sort will be required in order to stuff them and translate to WideString.
This is one of the system/squeak boundaries. The PO files are utf8 encoded, you should change that reader to return proper WideStrings (just by a strategically placed utf8ToSqueak send).
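As a rough sketch of converting at the boundary, the Python below (illustration only, not the Scratch reader) decodes the raw file bytes from UTF-8 exactly once and then builds a lookup table of already-decoded strings. It assumes a heavily simplified PO layout with single-line `msgid`/`msgstr` pairs and no escape handling; the function name and the sample entry are hypothetical.

```python
import re

# Matches simplified single-line entries such as: msgid "hello"
ENTRY = re.compile(r'^(msgid|msgstr)\s+"(.*)"\s*$')


def read_translations(raw: bytes) -> dict:
    """Decode UTF-8 once at the edge, then return a msgid -> msgstr table."""
    text = raw.decode("utf-8")
    table = {}
    current_id = None
    for line in text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # comments, blank lines, anything this sketch ignores
        kind, value = match.groups()
        if kind == "msgid":
            current_id = value
        elif current_id is not None:
            # Skip the header entry (empty msgid) and untranslated entries.
            if current_id and value:
                table[current_id] = value
            current_id = None
    return table


# Hypothetical sample file content, encoded the way it would sit on disk.
sample = 'msgid ""\nmsgstr ""\n\nmsgid "hello"\nmsgstr "\u3053\u3093\u306b\u3061\u306f"\n'.encode("utf-8")
translations = read_translations(sample)
print(len(translations))
```

Callers of such a reader never see UTF-8 bytes, which is the role the thread suggests for a strategically placed utf8ToSqueak send.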
Similarly the project file format expects either Strings (as in the old Squeak latin-1 ascii) or UTF8 strings, so a conversion either way is still needed. I think that utf8 can be restricted to only existing as an intermediary buffer object.
Sounds about right.
I don't suppose the code currently in Squeak handles right-to-left languages?
It does not, which is why I suggested to keep using Scratch's text rendering.
- Bert -
On 07-08-2013, at 5:28 AM, Bert Freudenberg bert@freudenbergs.de wrote:
[snip] Yes, although we do not usually have fonts installed that provide more than Latin-1. The mechanism is there but largely unused (there is only a handful of WideString instances in our image). Installing wide fonts isn't exactly trivial either, IIRC for TTFs all glyphs outside Latin-1 get stripped unless you know which line in which method to change.
I know there's been work done on using pango/cairo/something to improve on this and of course Scratch on Windows/OSX uses a bit of Pango in the UnicodePlugin. The Scratch code explicitly excludes unix at the moment and anyway there is no UnicodePlugin for Pi, so in a way it's all moot for now. Arabic locale in Scratch on a Pi looks like the CIA has been at the document…
Currently Scratch expects the translation dictionary files it uses to have utf8 encoded strings for the not-english translations, so a utf8 class of some sort will be required in order to stuff them and translate to WideString.
This is one of the system/squeak boundaries. The PO files are utf8 encoded, you should change that reader to return proper WideStrings (just by a strategically placed utf8ToSqueak send).
Ah, yes. I had seen that pair of methods en passant and wondered about them. Useful.
I don't suppose the code currently in Squeak handles right-to-left languages?
It does not, which is why I suggested to keep using Scratch's text rendering.
As above, for right now there effectively isn't any. Some interesting work to do to fix that. ;-)
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- Thinks E=MC^2 is a rap star.
On Tue, Aug 6, 2013 at 1:25 PM, Nicolas Cellier nicolas.cellier.aka.nice@gmail.com wrote:
Last thing, we have our squeakism: the #leadingChar. I let you dig into its usage, but it should be restricted for east asian languages support since squeak 4.x at least.
I am not on top of things (anything, really) but what has changed since Squeak 4.x in this regard?
Just a historical note, but the concept of leadingChar was borrowed from the multilingual Emacs effort, which eventually folded into the mainstream Emacs.
Well, that's already some time ago, but from memory the main things were:
- set leadingChar 0 as synonym of unicode
- set leadingChar for several language environments to 0 (unicode) (Greek, Russian, ...)
And this happened between http://source.squeak.org/trunk/Multilingual-nice.91.mcz http://source.squeak.org/trunk/Multilingual-ul.141.mcz http://source.squeak.org/trunk/Multilingual-nice.142.mcz
On 2013-08-08, at 23:26, Nicolas Cellier nicolas.cellier.aka.nice@gmail.com wrote:
And this happened between http://source.squeak.org/trunk/Multilingual-nice.91.mcz http://source.squeak.org/trunk/Multilingual-ul.141.mcz http://source.squeak.org/trunk/Multilingual-nice.142.mcz
For simpler access:
http://source.squeak.org/trunk/Multilingual-nice.91.diff http://source.squeak.org/trunk/Multilingual-ul.141.diff http://source.squeak.org/trunk/Multilingual-nice.142.diff
- Bert -
squeak-dev@lists.squeakfoundation.org