Unicode support (File names was Re: Warning: Large Babel translation)

Thu Nov 20 17:53:55 UTC 2003

  Lex,

> >   If it is the latter case, I would imagine that the existence of
> > such a VM means that someone has already written a converter for the
> > conversion between UTF-32 (or whatever the internal representation
> > uses) and timbuktu-3-rot11.  The people who need such a VM will
> > definitely write the converter (a subclass of TextConverter, if they
> > want) and it will be available from somewhere when such a VM is
> > available.  Does this sound reasonable to you?
> 
> It is reasonable but it sounds like it can be better.  Do you have a
> plan for getting these convertors loaded into the older image?

  In short, no.  If the image doesn't support a *wide* internal
representation, it isn't worth loading such converters.

  If your application is running in an old image and you're having a
hard time porting it to the 3.6 release image, it will be just as hard
to port it to an m17n image.  If you can port it to 3.6, porting it to
an m17n image is not hard.

> Picture
> yourself as the guy in Timbuktu; how do you load a changeset when your
> file list cannot report the names of the files?  If it's done
> automatically, how can an image request a file be loaded when it cannot
> specify the name of that file?

  This is why I said that was so hypothetical.  Should I assume that
timbuktu-3-rot11 is incompatible with 7-bit ASCII?  If so, I would not
be able to compile the current Squeak VM anyway.  If not, since they
should name the class and the fileout of the converter in ASCII, it
won't be a problem.  Remember, the encoding of a file's content and the
encoding of its file name are different things.  You'll always need to
convert the file name to the platform encoding, but you may or may not
convert the content of the file.  Would you put the 'binary/ascii'
distinction into the VM's file read/write primitives?  Or would the
image first read the file content as a sequence of bytes, and then ask
the VM to convert it through some primitive?
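
  To make the file name versus content distinction concrete, here is a
rough sketch, in Python rather than Squeak.  The names
platform_file_name and read_content are hypothetical, and Shift-JIS
merely stands in for whatever encoding a platform happens to use;
none of this is an actual VM primitive.

```python
import os
import tempfile

def platform_file_name(name, platform_encoding):
    # The file name always has to be converted to the platform encoding.
    return name.encode(platform_encoding)

def read_content(path, content_encoding=None):
    # The content is first read as a plain sequence of bytes...
    with open(path, "rb") as f:
        raw = f.read()
    # ...and is converted only if the caller asks for a conversion.
    if content_encoding is None:
        return raw
    return raw.decode(content_encoding)

# Hypothetical demo: an ASCII file name whose content is UTF-8 text.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "Converter.cs")
with open(demo_path, "wb") as f:
    f.write("日本語".encode("utf-8"))

name_bytes = platform_file_name("Converter.cs", "shift_jis")
as_bytes = read_content(demo_path)           # left unconverted
as_text = read_content(demo_path, "utf-8")   # converted on request
```

  Because the file name here is plain ASCII, its platform bytes come
out the same under any ASCII-compatible encoding, whatever the content
of the file is encoded in.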

> The alternative approach of fixing the encoding -- presumably to UTF-8
> -- solves the problem very nicely as far as I can see.  The  VM must
> have a convertor from timbuktu-3-rot11 to UTF-8, it is true, but as you
> say such a convertor has to exist anyway.  Putting it in the VM
> solves the problem in one blow and afterwards all images will load and
> function under that VM.

  Very nicely...  I don't know whether you have thought through all
the cases.

> >   In the former case, you're already lost, Lex.  If we *declare* UTF-8
> > to be the only encoding to use for the communication between image and
> > VM, your image, that already contains MacRoman characters, won't
> > load/run on such VM.  From old images' standpoint, timbuktu-3-rot11 is
> > not any worse than UTF-8, an encoding that is variable length and
> > incompatible with MacRoman.
> 
> I was mainly talking about post-declaration images.  Frankly, if we had
> to lose compatibility with MacRoman images I could live with that.  But
> actually we don't have to under the "declaration" approach.  We simply
> declare that both UTF-8 and MacRoman are available.  By default the VM
> speaks to the image using MacRoman, but if the image asks, the VM will
> switch over to UTF-8.  And that's that.  All images work with all
> VM's.

  Really?  Then where did all that 90MB image talk of yours come from?

  So, you agree that declaring UTF-8 to be *the* only encoding is not
going to work?

  And you still don't want to allow the new image to work on an old
VM?  I did the m17n work on a vanilla (old) Windows VM that uses
Shift-JIS...  Are you sure I should have compiled a new VM first?
How would I have asked other people to test the m17n work?

  So, when your version of the 'new' image has got a path name in
UTF-8 from the VM, what would the internal representation be?  Again,
do you think that using UTF-8 internally is going to work?
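
  The two properties of UTF-8 at issue here (variable length, and
incompatibility with MacRoman) can be illustrated with a small sketch.
It is Python, not Squeak, and UTF-32 is only an assumed stand-in for a
*wide* internal representation; the helper names are hypothetical.

```python
text = "é日"  # two characters

utf8 = text.encode("utf-8")      # variable width: 2 bytes, then 3 bytes
wide = text.encode("utf-32-be")  # fixed width: 4 bytes per character

def nth_char_wide(data, n):
    # Fixed width: the n-th character sits at a known byte offset.
    return data[4 * n:4 * n + 4].decode("utf-32-be")

def nth_char_utf8(data, n):
    # Variable width: scan from the start to find where characters
    # begin, skipping continuation bytes (bit pattern 10xxxxxx).
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[starts[n]:end].decode("utf-8")

def macroman_bytes_are_valid_utf8(s):
    # Encode as MacRoman, then try to read the same bytes as UTF-8.
    try:
        s.encode("mac_roman").decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

  Indexing the wide form is simple arithmetic, while the UTF-8 form has
to be scanned; and a MacRoman string containing a non-ASCII character
such as "é" fails the last check, while a pure ASCII string passes it.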

> I'm glad we agree this is worthwhile.  But I don't understand your
> assessment.  As I described above, the fixed-encoding scheme allows
> every image to load under every VM.

  No, they won't.  See above.

> To contrast, the in-image-translation scheme means that some images do
> *not* work with some VM's; specifically, it is entirely possible for a
> VM to request an encoding that the image doesn't know about.  That image
> will not function properly with that VM until the image has been
> modified by adding a new convertor to it.
> 
> Where's the error in this analysis?

  It seems to be the same error as in the timbuktu-3-rot11 discussion.

> >  If you agree with
> > this, is my design so bad that you have to cast blame as if someone
> > who doesn't care about Squeak were trying to kill your baby?  If you
> > have some real alternative idea, I'm all ears.  That's why I keep
> > asking about your idea of the internal encoding.  But if you don't...
> > that isn't nice...
> 
> Let's be polite.

  Oh, well, if you say so.  

> Overall, I believe the heart of the matter is about "virtualization"
> versus "reporting"  in the VM design.  When a VM "reports", it simply
> tells the image about the underlying machine; the image then talks
> directly to the underlying machine in whatever way the underlying
> machine prefers.  When a VM virtualizes, it translates between the
> underlying machine and some canonical machine; the image then talks to
> the canonical machine.

  Ok.

> Squeak usually goes with a virtualization approach.  For example, stack
> frames have the exact same format on all platforms, and Forms come
> close.  There are partial exceptions, though.  Sound is mostly
> virtualized, but the underlying hardware may refuse certain parameter
> settings and report back the actual setting it will agree to use.  The
> main reason to go with reporting instead of virtualization is to improve
> performance; in the sound example, it would be impractical to support
> 13999 Hz audio as opposed to letting the underlying machine round this
> off to 14400 Hz.  Aside from performance, however, virtualization seems
> better.  It allows the image code to remain simple, and it aids in
> portability.

  This sound example is... too hypothetical.  Probably the
FileDirectory family would be a *relatively* good analogy.

# Of course, an analogy doesn't always help with the details.

> Do you agree that the central difference in the proposals is reporting
> versus virtualization?

  I think so.

> If so, I am wondering why people would prefer
> the reporting approach in this case.

  You seem to be.

> You have posted to the list that
> performance does not seem to be an issue with live translation between
> encodings.  But if performance is not an issue, why not virtualize away
> the character encoding?

  Because it let me write the m17n code on an unmodified VM and ask
people to test it, for one reason.  And it will let us embrace future
changes.
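
  As a rough illustration of the in-image approach, here is a sketch in
Python rather than Squeak.  ReportingVM, CONVERTERS and
image_file_names are hypothetical names, and the sketch assumes the
image can learn the name of the platform encoding somehow; it is not
the actual m17n code.

```python
# Hypothetical in-image registry: encoding name -> codec to decode with.
CONVERTERS = {
    "shift_jis": "shift_jis",
    "mac_roman": "mac_roman",
    "utf-8": "utf-8",
}

class ReportingVM:
    """Stand-in for an unmodified VM: it hands back raw platform bytes."""

    def __init__(self, platform_encoding, raw_names):
        self.platform_encoding = platform_encoding
        self.raw_names = raw_names

    def directory_entries(self):
        # No conversion happens on the VM side.
        return list(self.raw_names)

def image_file_names(vm):
    # The image picks the converter; the VM itself stays unchanged.
    codec = CONVERTERS.get(vm.platform_encoding)
    if codec is None:
        # The case raised in the quoted text: the image has no
        # converter for this encoding until one is loaded.
        raise LookupError("no converter for " + vm.platform_encoding)
    return [raw.decode(codec) for raw in vm.directory_entries()]

old_vm = ReportingVM("shift_jis", ["日本語.txt".encode("shift_jis")])
names = image_file_names(old_vm)
```

  Supporting another platform encoding then means adding an entry on
the image side, with no new VM build; the cost is that an encoding the
image does not know about fails until its converter is loaded.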

-- Yoshiki