Unicode support (File names was Re: Warning: Large Babeltranslation)

Sun Nov 16 20:22:19 UTC 2003

"Andreas Raab" <andreas.raab at gmx.de> wrote:
> I don't see any problem here. We need a single primitive that reports the
> encoding to be used - how about extending the definition of
> getSystemAttribute: to report the VMs string encoding? Then all we need is a
> VM which actually implements a "different" encoding and that is simple
> enough.

That would work.  However, it seems contrary to the spirit of having a
VM that translates image requests into whatever the current platform
desires.  To give an obnoxiously extreme example :), we do not have a
"what architecture are you" primitive and then expect the compiler to
produce machine code for that architecture.  We have bytecodes, and the
VM works hard to support this with reasonable efficiency.  When you
support a new platform, you must supply platform-specific support for
mapping bytecodes to machine code.  This is a good design principle and
it has served Squeak very well.  Why change now?  Wouldn't it be great
if images could just pass Strings into the VM instead of having to
convert them at all?

What is gained by doing the translation in the image?  It actually seems
to be *more* difficult, as well as violating the above design principle.

Yoshiki Ohshima <Yoshiki.Ohshima at acm.org> wrote:
> > What encodings would be available?  Wouldn't every image have to know
> > about every low-level encoding that is possible?
> 
>   My idea is that this is much better than every VM have to know about
> every low-level encoding.

Why would this occur?  Each VM only needs to know about the current
platform's encoding(s).

>   The today's VMs treat *more or less* the bytes passed from image as
> mere sequences of bytes.  Which is closer to what I would want to
> have, so 'new VM' + 'old images' or 'new image' + 'old VM'
> compatibility issue wouldn't be too bad.  'pr file from new image' +
> 'old image' will be a problem, though.  (But it is always a
> problem...)

This compatibility problem happens no matter what choice we make.  In
fact, at least some of the VMs are careful to swap between MacRoman and
whatever the underlying encoding is.

> Speaking of today's VM, are you aware of the fact that the Windows
> VM's keymap[] table and X11/unix VM's X_to_Squeak[] table are
> incompatible?  Even for 256 characters, the VM writers cannot agree on
> a single table^^;  It will be more problem if we have a crystalized
> bigger table in VMs.  

I don't understand.  It's very easy to fix these tables once we decide
what the desired behavior is.  The fix will instantly apply to any image
you run in the future.  If these tables were in images, then we'd need
to patch up existing images somehow, which is bound to be more effort
than zero.

This leads to a general question: what is this talk of crystalization? 
I was thinking that the conversion routines would be in the
platform-dependent portion of the VM.  The only crystalization being
proposed is of the interchange encoding.  How you translate to that
encoding is up to the VM, and it can be improved over time without
messing with the existing body of images.

Overall, it seems very simple to me to just declare "Squeak uses UTF-8
for strings" and then for each VM to have a function
squeak_to_platform() and platform_to_squeak().  The Unix port is an
existence proof that this is a straightforward design.  The main
argument I can think of for translating inside of Squeak would be if it
were much easier to translate in Squeak than in C.  I've seen no one
really arguing that, however.  A second argument might be that we don't
want to force all images to support UTF-8; in that case, however, I
would argue that there should be 2 or maybe 3 encodings, and the *image*
should get to choose which one it wants to use.
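
To make the squeak_to_platform() / platform_to_squeak() idea concrete,
here is a minimal sketch in C.  It is only an illustration under stated
assumptions, not code from any actual VM: the signatures, the return
convention, and the choice of ISO-8859-1 (Latin-1) as the "platform"
encoding are all hypothetical.  A real port would substitute whatever
its platform uses.

```c
#include <stddef.h>

/* Hypothetical sketch: image side is UTF-8, platform side is assumed to
   be ISO-8859-1.  Both functions return the number of bytes written
   (excluding the terminating NUL), or -1 on malformed input or if dst
   is too small. */

/* UTF-8 -> Latin-1.  Code points above U+00FF become '?'.
   Overlong UTF-8 forms are not rejected in this sketch. */
long squeak_to_platform(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    while (*s) {
        unsigned long cp;
        int extra;                       /* continuation bytes expected */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;                  /* invalid lead byte */
        s++;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80) return -1;  /* truncated or bad */
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }
        if (n + 1 >= dst_size) return -1;        /* keep room for NUL */
        dst[n++] = (cp <= 0xFF) ? (char)(unsigned char)cp : '?';
    }
    if (dst_size == 0) return -1;
    dst[n] = '\0';
    return (long)n;
}

/* Latin-1 -> UTF-8.  Bytes >= 0x80 expand to two UTF-8 bytes. */
long platform_to_squeak(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    for (; *s; s++) {
        if (*s < 0x80) {
            if (n + 1 >= dst_size) return -1;
            dst[n++] = (char)*s;
        } else {
            if (n + 2 >= dst_size) return -1;
            dst[n++] = (char)(0xC0 | (*s >> 6));
            dst[n++] = (char)(0x80 | (*s & 0x3F));
        }
    }
    if (dst_size == 0) return -1;
    dst[n] = '\0';
    return (long)n;
}
```

With something like this in the platform-dependent part of the VM, a
file-name primitive would call squeak_to_platform() on the way out and
platform_to_squeak() on the way back in, and the image would never see
anything but UTF-8.  Improving the mapping later (say, a better
fallback than '?') would only mean shipping a new VM.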

-Lex