Unicode support (File names was Re: Warning: Large Babel translation)

Yoshiki Ohshima Yoshiki.Ohshima at acm.org
Thu Nov 20 07:33:57 UTC 2003


  Lex,

> >   This may require a bit more clarification...  I don't know what is
> > in that 90MB image, but if you rewrite your code a bit, probably
> > changing the test against "String" to "AbstractString" your code may
> > well work fine (or maybe not.)  Even for that kind of meta-level
> > program, the change should be pretty small.
> > 
> >   I actually don't understand what you meant by "stops loading,"
> > though.  Are you worrying about the image won't run on a (new) VM?
> > Then, you really don't have to worry about it.
> > 
> 
> I don't understand, then.  What happens if the VM says "This VM uses
> encoding timbuktu-3-rot11" and my image doesn't know about this
> encoding?  I thought the proposal was that the VM specifies a character
> encoding to use, and the image is expected to support that encoding.

  Well, this is such a hypothetical question that I find it hard to
answer...  What is your assumption here?  Does "your image" mean an
image before version 3.6, 3.7 or whatever?  Or a future image based on
someone's hypothetical multilingualization extension?

  In the latter case, I would imagine that the existence of such a VM
means that someone has already written a converter between UTF-32 (or
whatever the internal representation uses) and timbuktu-3-rot11.  The
people who need such a VM will definitely write the converter (as a
subclass of TextConverter, if they want) and it will be available
from somewhere by the time such a VM is available.  Does this sound
reasonable to you?

  In the former case, you're already lost, Lex.  If we *declare* UTF-8
to be the only encoding to use for the communication between image and
VM, your image, which already contains MacRoman characters, won't
load/run on such a VM.  From an old image's standpoint,
timbuktu-3-rot11 is not any worse than UTF-8, an encoding that is
variable length and incompatible with MacRoman.
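
To make the incompatibility concrete, here is a minimal sketch in Python (used only for illustration; it is not Squeak code, and the helper names are made up for this example). It encodes the same text as MacRoman and as UTF-8 and checks whether the MacRoman bytes would be accepted by a strict UTF-8 decoder.

```python
def encode_both(text):
    """Return the MacRoman and UTF-8 byte sequences for the same text."""
    return text.encode("mac_roman"), text.encode("utf-8")


def is_valid_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


mac_bytes, utf8_bytes = encode_both("café")

# MacRoman uses one byte per character; UTF-8 needs two bytes for "é".
print(len(mac_bytes), len(utf8_bytes))

# The MacRoman byte for "é" (0x8E) is not a legal UTF-8 sequence on its own.
print(is_valid_utf8(mac_bytes))
```

Pure ASCII text is identical in both encodings, so the problem only shows up once an image contains characters above 127.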

#  Actually, your timbuktu-3-rot11 doesn't have to be convertible
# from/to Unicode.  My internal representation should allow such a
# thing, if you use a different leading char.
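
The "leading char" idea can be sketched as a tag packed into the high bits of a 32-bit character value, so that codes from a character set with no Unicode mapping stay distinguishable from Unicode code points. The sketch below is in Python for illustration only; the 22-bit code width, the 8-bit tag width, and all function names are assumptions of this example, not details stated in this message.

```python
CODE_BITS = 22                      # assumed width of the per-charset code
CODE_MASK = (1 << CODE_BITS) - 1    # low 22 bits hold the code
MAX_LEADING = 0xFF                  # assumed 8-bit leading char (charset tag)


def make_char(leading_char, code):
    """Pack a charset tag and a code within that charset into one integer."""
    if not 0 <= leading_char <= MAX_LEADING:
        raise ValueError("leading char out of range")
    if not 0 <= code <= CODE_MASK:
        raise ValueError("code out of range")
    return (leading_char << CODE_BITS) | code


def leading_char_of(value):
    """Extract the charset tag from a packed character value."""
    return value >> CODE_BITS


def code_of(value):
    """Extract the code within its charset from a packed character value."""
    return value & CODE_MASK


# Tag 0 is taken here to mean "plain Unicode"; tag 5 is an arbitrary
# stand-in for some other character set such as timbuktu-3-rot11.
unicode_a = make_char(0, 0x41)
other_a = make_char(5, 0x41)
print(unicode_a != other_a, leading_char_of(other_a), code_of(other_a))
```

With such a layout, two characters that share a numeric code but come from different character sets never compare equal, which is what lets an unconvertible encoding coexist with Unicode text.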

  Also, remember that what I really suggest is that a VM is something
that just deals with sequences of bytes.  It may seem a bit different
from Andreas' suggestion, but my idea, which has more or less been
implemented, is that the VM lends a little help to the image in
deciding what encoding the image should use.  In the current
implementation, this is done by accessing a bunch of system attributes
in a clumsy switch-case-like method.

#  In fact, Hayashi-san and I are talking about adding a primitive to
# switch the VM encoding from the image, which seems to be needed on a
# system that wants to be Unicode based, such as Mac OS X.  For the
# transition stage, it seems that we need to have such a thing.

  If the VM is more or less transparent to the bytes coming from the
image and the platform (the reversible conversion that some of the
current VMs do is ok), such a VM should be able to run both the 'new'
image and the 'old' image.  Again, my suggestion allows, or tries to
allow, you to use any of these combinations:
  old image/old VM
  new image/old VM  (well, this doesn't work in some cases)
  old image/new VM
  new image/new VM

  On the other hand, if we created a VM that *only* handles UTF-8,
only the 'new/new' and 'old/old' combinations would work.  Is this
really what you want?  Isn't it nice that you can actually load the
m17n SAR file into a 3.6 image running on an existing VM, even though
the internal encoding and the external (fileout/.changes) encoding
change during the install process?
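
The difference between a byte-transparent VM and a UTF-8-only VM can be shown with a toy model. This is a Python sketch for illustration only: both "VM" functions and the `works` helper are hypothetical, an 'old' image is modelled simply as one that speaks MacRoman and a 'new' image as one that speaks UTF-8, and the cases where a new image fails on an old VM are not modelled at all.

```python
def transparent_vm(data):
    """Hypothetical VM that passes bytes through unchanged."""
    return bytes(data)


def utf8_only_vm(data):
    """Hypothetical VM that refuses anything that is not valid UTF-8."""
    data.decode("utf-8")  # raises UnicodeDecodeError for other encodings
    return bytes(data)


def works(text, image_encoding, vm):
    """The image encodes text, the VM carries it, the image decodes it."""
    try:
        return vm(text.encode(image_encoding)).decode(image_encoding) == text
    except UnicodeDecodeError:
        return False


for image_name, encoding in (("old image", "mac_roman"), ("new image", "utf-8")):
    for vm in (transparent_vm, utf8_only_vm):
        print(image_name, vm.__name__, works("café", encoding, vm))
```

In this model the transparent VM serves both images because all conversion stays inside the image, while the UTF-8-only VM rejects the old image as soon as it sends a non-ASCII MacRoman byte.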

  Lex, I know you're smarter than I am and you can come up with as
many hypothetical questions on this as you want...  But I don't
suppose you strongly disagree with going with an internal UTF-32-like
encoding and external UTF-8 fileouts.  If you agree with this and you
develop such a system on current VMs, then you'll find that some kind
of in-image conversion flexibility is unavoidable.  If you agree with
that, is my design so bad that you have to blame it as if someone who
doesn't care about Squeak were trying to kill your baby?  If you have
some real alternative idea, I'm all ears.  That's why I keep asking
for your idea of the internal encoding.  But if you don't... that
isn't nice...

-- Yoshiki


