Unicode support (File names was Re: Warning:Large Babel translation)

Thu Nov 20 21:42:06 UTC 2003

Hi Guys,

I think this discussion is getting a little bit out of hand. Let's be
realistic here. It seems clear that no current VM maintainer who assumes
that their VMs may be used in any serious environment, will switch to UTF-8
with no fallback position whatsoever (or at least, considering John, Ian,
Tim and me I cannot imagine this). Therefore it seems clear that even if we
are aiming for a cross-platform abstraction there will be an intermediate
time during which the image -and not the VM- will have to support (at least)
two encodings. That makes it very clear to me that pretty much the only way
to deal with the situation is to have the VM report its encoding and have
the _image_ deal flexibly with the encoding it encounters.
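
To make the "VM reports, image adapts" idea concrete, here is a minimal sketch in Python rather than Squeak. Everything in it is an assumption made for illustration: the `path_encoding` key stands in for the proposed reporting primitive, `mac_roman` stands in for whatever an old VM would implicitly use, and the Latin-1 fallback is just one possible way for an image to cope with an encoding it has no converter for.

```python
import codecs

# Hypothetical legacy default, used when the VM cannot report anything.
DEFAULT_ENCODING = "mac_roman"

def vm_reported_encoding(vm_info):
    """Stand-in for the proposed 'report the encoding' primitive.

    An old VM without the primitive is modelled as a dict lacking the key.
    """
    return vm_info.get("path_encoding", DEFAULT_ENCODING)

def decode_path(raw_bytes, vm_info):
    """Image-side decoding of a file name handed over by the VM."""
    name = vm_reported_encoding(vm_info)
    try:
        codecs.lookup(name)
    except LookupError:
        # The image has no converter for this encoding. Latin-1 maps every
        # byte to one character, so the name stays usable and reversible.
        name = "latin_1"
    return raw_bytes.decode(name)
```

With this shape, a VM that reports Shift-JIS, one that reports UTF-8, and an old VM that reports nothing are all handled by the same image code, and a later move to UTF-8 everywhere only changes what the VM reports.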

Now, at some point it may well be the case that there is a common
understanding that some encoding (such as UTF-8) is the way to go, simply
because it's supported by all of the major Squeak platforms. At that point I
would expect that certain assumptions _will_ be made in the image (just
because it works on all of the platforms involved), and at that point we may
very well decide to drop the "have the platform report the encoding"
primitive.

So it seems that the _only_ realistic way of making progress in this area is
to go with what I proposed all along. Namely: Have the VM report the
encoding it wants to use but shoot for (and keep in mind) a common encoding
such as UTF-8.

All other discussions, regardless of how well-meant they are and regardless
of how good the intentions are, seem just pointless to me. You can't just
declare that things are going "that way" unless you have the commitment
from all of the maintainers to do it that way. Which you don't.

Cheers,
  - Andreas

> -----Original Message-----
> From: squeak-dev-bounces at lists.squeakfoundation.org 
> [mailto:squeak-dev-bounces at lists.squeakfoundation.org] On 
> Behalf Of Yoshiki Ohshima
> Sent: Thursday, November 20, 2003 6:54 PM
> To: The general-purpose Squeak developers list
> Subject: Re: Unicode support (File names was Re: Warning:Large Babel translation)
> 
> 
>   Lex,
> 
> > >   If it is in the latter case, I would imagine that the existence of
> > > such a VM means that someone has already written a converter between
> > > UTF-32 (or whatever the internal representation is) and
> > > timbuktu-3-rot11.  The people who need such a VM will definitely
> > > write the converter (a subclass of TextConverter, if they want) and
> > > it will be available from somewhere when such a VM is available.
> > > Does this sound reasonable to you?
> > 
> > It is reasonable but it sounds like it can be better.  Do you have a
> > plan for getting these convertors loaded into the older image?
> 
>   In short, no.  If the image doesn't support a *wide* internal
> representation, it isn't worth loading such converters.
> 
>   If your application is running in an old image and you're having a
> hard time porting it to the 3.6 release image, it will be hard to port
> it to an m17n image.  If you can port it to 3.6, porting it to m17n is
> not hard.
> 
> > Picture yourself as the guy in Timbuktu; how do you load a changeset
> > when your file list cannot report the names of the files?  If it's
> > done automatically, how can an image request a file be loaded when it
> > cannot specify the name of that file?
> 
>   This is why I said that was so hypothetical.  Should I assume that
> timbuktu-3-rot11 is incompatible with 7 bit ASCII?  If so, I would not
> be able to compile the current Squeak VM anyway.  If not, since they
> should name the class and fileout of the converter in ASCII, it won't
> be a problem.  Remember, the encoding of the content of a file and the
> file name encoding are different.  You'll always need to convert the
> file name to the platform encoding, but you may or may not convert the
> content of the file.  Would you put the 'binary/ascii' distinction into
> the VM's file read/write primitives?  Or should the image first read
> the file content as a sequence of bytes, and then ask the VM to convert
> it through some primitive?
> 
> > The alternative approach of fixing the encoding -- presumably to UTF-8
> > -- solves the problem very nicely as far as I can see.  The VM must
> > have a convertor from timbuktu-3-rot11 to UTF-8, it is true, but as
> > you say such a convertor has to exist anyway.  Putting it in the VM
> > solves the problem in one blow and afterwards all images will load and
> > function under that VM.
> 
>   Very nicely...  I don't know if you have thought through the cases.
> 
> > >   In the former case, you're already lost, Lex.  If we *declare* UTF-8
> > > to be the only encoding to use for the communication between image
> > > and VM, your image, that already contains MacRoman characters, won't
> > > load/run on such a VM.  From old images' standpoint, timbuktu-3-rot11
> > > is not any worse than UTF-8, an encoding that is variable length and
> > > incompatible with MacRoman.
> > 
> > I was mainly talking about post-declaration images.  Frankly, if we
> > had to lose compatibility with MacRoman images I could live with that.
> > But actually we don't have to under the "declaration" approach.  We
> > simply declare that both UTF-8 and MacRoman are available.  By default
> > the VM speaks to the image using MacRoman, but if the image asks, the
> > VM will switch over to UTF-8.  And that's that.  All images work with
> > all VM's.
> 
>   Really?  Where did all that 90MB image thing of yours come from?
> 
>   So, you agree that declaring UTF-8 to be *the* only encoding is not
> going to work?
> 
>   And, you still don't want to allow the new image to work on an old VM?
> I did the m17n work on a vanilla (old) Windows VM that uses
> Shift-JIS... Are you sure I should have compiled a new VM first?
> How would I have asked other people to test the m17n work?
> 
>   So, when your version of the 'new' image has got the UTF-8 from the VM
> as a path name, what would be the internal representation?  Again,
> do you think that using UTF-8 internally is going to work?
> 
> > I'm glad we agree this is worthwhile.  But I don't understand your
> > assessment.  As I described above, the fixed-encoding scheme allows
> > every image to load under every VM.
> 
>   No, they won't.  See above.
> 
> > To contrast, the in-image-translation scheme means that some images do
> > *not* work with some VM's; specifically, it is entirely possible for a
> > VM to request an encoding that the image doesn't know about.  That
> > image will not function properly with that VM until the image has been
> > modified by adding a new convertor to it.
> > 
> > Where's the error in this analysis?
> 
>   It seems to be the same error as in the timbuktu-3-rot11 discussion.
> 
> > >  If you agree with
> > > this, is my design so bad that you have to blame me as if I were
> > > someone who doesn't care about Squeak and is trying to kill your
> > > baby?  If you have some real alternative idea, I'm all ears.  That's
> > > why I keep asking your idea of the internal encoding.  But if you
> > > don't... that isn't nice...
> > 
> > Let's be polite.
> 
>   Oh, well, if you say so.  
> 
> > Overall, I believe the heart of the matter is about "virtualization"
> > versus "reporting" in the VM design.  When a VM "reports", it simply
> > tells the image about the underlying machine; the image then talks
> > directly to the underlying machine in whatever way the underlying
> > machine prefers.  When a VM virtualizes, it translates between the
> > underlying machine and some canonical machine; the image then talks to
> > the canonical machine.
> 
>   Ok.
> 
> > Squeak usually goes with a virtualization approach.  For example,
> > stack frames have the exact same format on all platforms, and Forms
> > come close.  There are partial exceptions, though.  Sound is mostly
> > virtualized, but the underlying hardware may refuse certain parameter
> > settings and report back the actual setting it will agree to use.  The
> > main reason to go with reporting instead of virtualization is to
> > improve performance; in the sound example, it would be impractical to
> > support 13999 Hz audio as opposed to letting the underlying machine
> > round this off to 14400 Hz.  Aside from performance, however,
> > virtualization seems better.  It allows the image code to remain
> > simple, and it aids in portability.
> 
>   This sound example is... too hypothetical.  Probably, the
> FileDirectory family would be a *relatively* good analogy.
> 
> # Of course, analogy doesn't always help for details.
> 
> > Do you agree that the central difference in the proposals is reporting
> > versus virtualization?
> 
>   I think so.
> 
> > If so, I am wondering why people would prefer
> > the reporting approach in this case.
> 
>   You seem to be.
> 
> > You have posted to the list that performance does not seem to be an
> > issue with live translation between encodings.  But if performance is
> > not an issue, why not virtualize away the character encoding?
> 
>   Because it let me write the m17n code on an unmodified VM and ask
> people to test it, for one reason.  And it will let us embrace future
> changes.
> 
> -- Yoshiki
>