Unicode support (File names was Re: Warning: Large Babel translation)

Lex Spoon lex at cc.gatech.edu
Thu Nov 20 14:06:20 UTC 2003


Yoshiki Ohshima <Yoshiki.Ohshima at acm.org> wrote:
> > I don't understand, then.  What happens if the VM says "This VM uses
> > encoding timbuktu-3-rot11" and my image doesn't know about this
> > encoding?  I thought the proposal was that the VM specifies a character
> > encoding to use, and the image is expected to support that encoding.
> 
>   Well, this is such a hypothetical question that I found it hard to
> answer...  What is your assumption here?  "Your image" means an image
> before version 3.6, 3.7 or whatever?  Or a future image based on
> someone's hypothetical multilingualization extension?

Right.  Either way.  Either it's a current image, or it's a future
image that is still not new enough to have the particular converter
loaded.


>   If it is the latter case, I would imagine that the existence of
> such a VM means that someone has already written a converter for the
> conversion between UTF-32 (or whatever the internal representation
> uses) and timbuktu-3-rot11.  The people who need such a VM will
> definitely write the converter (a subclass of TextConverter, if they
> want) and it will be available from somewhere when such a VM is
> available.  Does this sound reasonable to you?

It is reasonable but it sounds like it can be better.  Do you have a
plan for getting these converters loaded into the older image?  Picture
yourself as the guy in Timbuktu; how do you load a changeset when your
file list cannot report the names of the files?  If it's done
automatically, how can an image request a file be loaded when it cannot
specify the name of that file?

The alternative approach of fixing the encoding -- presumably to UTF-8
-- solves the problem very nicely as far as I can see.  The VM must
have a converter from timbuktu-3-rot11 to UTF-8, it is true, but as you
say such a converter has to exist anyway.  Putting it in the VM
solves the problem in one blow, and afterwards all images will load and
function under that VM.



>   In the former case, you're already lost, Lex.  If we *declare* UTF-8
> to be the only encoding to use for the communication between image and
> VM, your image, that already contains MacRoman characters, won't
> load/run on such VM.  From old images' standpoint, timbuktu-3-rot11 is
> not any worse than UTF-8, an encoding that is variable length and
> incompatible with MacRoman.

I was mainly talking about post-declaration images.  Frankly, if we had
to lose compatibility with MacRoman images I could live with that.  But
actually we don't have to under the "declaration" approach.  We simply
declare that both UTF-8 and MacRoman are available.  By default the VM
speaks to the image using MacRoman, but if the image asks, the VM will
switch over to UTF-8.  And that's that.  All images work with all VM's.
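
A minimal sketch of that negotiation, written in Python only for
illustration.  The class name SketchVM and the methods request_encoding
and directory_entries are hypothetical; they are not existing Squeak VM
primitives.  The assumption is that the VM holds file names as text and
encodes them on the way out to the image.

```python
class SketchVM:
    """Hypothetical VM side of the image/VM file-name interface."""

    # Python codec names, used here only as stand-ins for the two encodings.
    SUPPORTED = ("mac_roman", "utf-8")

    def __init__(self, file_names):
        self._file_names = list(file_names)
        # Default to MacRoman so images that never ask keep working.
        self.encoding = "mac_roman"

    def request_encoding(self, name):
        """Called by a newer image; returns True if the VM switched."""
        if name in self.SUPPORTED:
            self.encoding = name
            return True
        return False

    def directory_entries(self):
        """File names as bytes in the currently selected encoding."""
        # Characters the encoding cannot represent are replaced with "?".
        return [n.encode(self.encoding, errors="replace")
                for n in self._file_names]


vm = SketchVM(["café.cs", "日本語.cs"])

# An old image never asks, so it receives MacRoman bytes.
old_image_view = vm.directory_entries()

# A new image opts in and receives UTF-8 bytes.
switched = vm.request_encoding("utf-8")
new_image_view = [b.decode("utf-8") for b in vm.directory_entries()]
```

In this sketch the old image sees the names it always saw (with
unrepresentable characters degraded), while the new image gets every
name intact after one request.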



>  again, my suggestion allows, or tries to allow, you
> to use any of these combinations: 
>   old image/old VM
>   new image/old VM  (well, this doesn't work in some cases)
>   old image/new VM
>   new image/new VM
>
>   On the other hand, if we would create a VM that *only* handles
> UTF-8, only the 'new/new' and 'old/old' combinations would work.  Is
> this really what you want?  

I'm glad we agree this is worthwhile.  But I don't understand your
assessment.  As I described above, the fixed-encoding scheme allows
every image to load under every VM.

To contrast, the in-image-translation scheme means that some images do
*not* work with some VM's; specifically, it is entirely possible for a
VM to request an encoding that the image doesn't know about.  That image
will not function properly with that VM until the image has been
modified by adding a new converter to it.

Where's the error in this analysis?


>  If you agree with
> this, is my design that bad so you have to blame as if someone who
> doesn't care about Squeak is trying to kill your baby?  If you have
> some real alternative idea, I'm all ears.  That's why I keep asking
> your idea of the internal encoding.  But if you don't... that isn't
> nice...

Let's be polite.




Overall, I believe the heart of the matter is about "virtualization"
versus "reporting" in the VM design.  When a VM "reports", it simply
tells the image about the underlying machine; the image then talks
directly to the underlying machine in whatever way the underlying
machine prefers.  When a VM virtualizes, it translates between the
underlying machine and some canonical machine; the image then talks to
the canonical machine.
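
To make the distinction concrete for file names, here is a rough sketch
in Python.  Everything in it is hypothetical: the two VM classes, the
image-side functions, and the use of KOI8-R as a stand-in for the
fictional timbuktu-3-rot11.  It is not how any existing VM is written.

```python
# Stand-in for the platform's native encoding (assumption for this sketch).
PLATFORM_ENCODING = "koi8_r"

# What the host OS hands the VM: file names as platform-encoded bytes.
RAW_NAMES = ["файл.cs".encode(PLATFORM_ENCODING)]


class ReportingVM:
    """Passes platform bytes through and tells the image what they are."""

    def encoding_name(self):
        return PLATFORM_ENCODING

    def file_names(self):
        return list(RAW_NAMES)


class VirtualizingVM:
    """Translates to one canonical encoding (UTF-8) before the image sees it."""

    def file_names(self):
        return [raw.decode(PLATFORM_ENCODING).encode("utf-8")
                for raw in RAW_NAMES]


def image_list_files_reporting(vm, known_converters):
    """Image side under reporting: works only if a converter is loaded."""
    enc = vm.encoding_name()
    if enc not in known_converters:
        raise LookupError("image has no converter for " + enc)
    return [raw.decode(enc) for raw in vm.file_names()]


def image_list_files_virtualized(vm):
    """Image side under virtualization: one fixed decoder, always present."""
    return [raw.decode("utf-8") for raw in vm.file_names()]
```

With the reporting VM, an image whose known_converters lacks the
platform encoding cannot even list the file that would deliver the
missing converter.  With the virtualizing VM, the image needs only its
UTF-8 decoder.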

Squeak usually goes with a virtualization approach.  For example, stack
frames have the exact same format on all platforms, and Forms come
close.  There are partial exceptions, though.  Sound is mostly
virtualized, but the underlying hardware may refuse certain parameter
settings and report back the actual setting it will agree to use.  The
main reason to go with reporting instead of virtualization is to improve
performance; in the sound example, it would be impractical to support
13999 Hz audio as opposed to letting the underlying machine round this
off to 14400 Hz.  Aside from performance, however, virtualization seems
better.  It allows the image code to remain simple, and it aids in
portability.
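
The sound case can be sketched the same way.  The list of supported
rates below is an assumption made up for the example, as is the
function name negotiate_rate; real hardware and the real sound
primitives will differ.

```python
# Hypothetical rates the host hardware accepts; real devices differ.
SUPPORTED_RATES_HZ = (8000, 11025, 22050, 44100)


def negotiate_rate(requested_hz, supported=SUPPORTED_RATES_HZ):
    """Return the supported rate closest to the request.

    Ties go to the lower rate.  This is the "reporting" half of the
    interface: the image asks for a rate, and the VM reports back the
    rate the hardware will actually use.
    """
    return min(supported, key=lambda rate: (abs(rate - requested_hz), rate))


# The image must continue with the reported rate, not the requested one.
actual_hz = negotiate_rate(13999)
```

The image pays for this with extra code to cope with a rate it did not
ask for, which is the cost of reporting that virtualization avoids.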

Do you agree that the central difference in the proposals is reporting
versus virtualization?  If so, I am wondering why people would prefer
the reporting approach in this case.  You have posted to the list that
performance does not seem to be an issue with live translation between
encodings.  But if performance is not an issue, why not virtualize away
the character encoding?


Lex Spoon


