Unicode support (File names was Re: Warning: Large Babeltranslation)

Yoshiki Ohshima Yoshiki.Ohshima at acm.org
Sun Nov 16 23:05:16 UTC 2003


  Hello,

> "Andreas Raab" <andreas.raab at gmx.de> wrote:
> > I don't see any problem here. We need a single primitive that reports the
> > encoding to be used - how about extending the definition of
> > getSystemAttribute: to report the VMs string encoding? Then all we need is a
> > VM which actually implements a "different" encoding and that is simple
> > enough.
> 
> That would work.  However, it seems contrary to the spirit of having a
> VM that translates image requests into whatever the current platform
> desires.  To give an obnoxiously extreme example :), we do not have a
> "what architecture are you" primitive and then expect the compiler to
> produce machine code for that architecture.  We have bytecodes, and the
> VM works hard to support this with reasonable efficiency.  When you
> support a new platform, you must supply platform-specific support for
> mapping bytecodes to machine code.  This is a good design principle and
> it has served Squeak very well.  Why change now?  Wouldn't it be great
> if images can just pass String's into the VM instead of having to
> convert them at all?

  Well, I would imagine that your example is one implementation of the
bigger idea of Squeak; something like "expect unexpected change",
"embrace the change", or in short, "late-binding".

  Remember that the vast majority of the image-level code wouldn't
have to bother with the low-level encoding stuff at all.

> Yoshiki Ohshima <Yoshiki.Ohshima at acm.org> wrote:
> > > What encodings would be available?  Wouldn't every image have to know
> > > about every low-level encoding that is possible?
> > 
> >   My idea is that this is much better than having every VM know about
> > every low-level encoding.
> 
> Why would this occur?  Each VM only needs to know about the current
> platform's encoding(s).

  For example, imagine the Unix VM.  The one running on Japanese Unices
(which typically use EUC-JP) has to do a different conversion from the
one on Korean Unices (which typically use EUC-KR).
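
  To make the point above concrete, here is a small sketch in Python
(not Squeak code; Python's built-in codecs only stand in for whatever
a VM or an image would do).  It shows that the same text becomes
different bytes depending on the platform encoding, and that bytes read
with the wrong assumption do not come back as the original text.  The
sample strings are my own choice.

```python
# Illustration only: Python codecs stand in for platform conversions.
japanese = "日本語"
korean = "한국어"

# The same Japanese text under two common platform encodings.
euc_jp_bytes = japanese.encode("euc_jp")        # typical on Japanese Unices
shift_jis_bytes = japanese.encode("shift_jis")  # typical on Japanese Windows

# Korean text under the encoding typical on Korean Unices.
euc_kr_bytes = korean.encode("euc_kr")

print("EUC-JP   :", euc_jp_bytes.hex())
print("Shift-JIS:", shift_jis_bytes.hex())
print("EUC-KR   :", euc_kr_bytes.hex())

# Reading EUC-JP bytes as if they were Shift-JIS gives garbage,
# so whoever converts must know which encoding is in use.
misread = euc_jp_bytes.decode("shift_jis", errors="replace")
print("misread  :", misread)
```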

  The same goes for the Windows VM.  Currently it uses Shift-JIS on
Japanese Windows, etc.  If the VM implementors have to take care of
supporting all of those encodings, it would be too much of a burden for
them.  If you're, say, from Vietnam, and want to add VISCII support,
you have to wait for the maintainer to compile the VM, and how can the
maintainer be sure that what he is doing is right?

> This compatibility problem happens no matter what choice we make.  In
> fact, at least some of the VM's are careful to swap between MacRoman and
> whatever the underlying encoding is.

  The good part of this is that the MacRoman-Latin1 conversion that
some of the current VMs do is reversible so far.  At least that is true
for the Windows and (old) Unix VMs.  Currently, the m17n image for
Japanese treats the bytes coming from the Windows VM as "Shift-JIS in
which each byte has been swapped by the VM".
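
  Why reversibility matters can be sketched as follows (Python, with
made-up tables; these are NOT the actual MacRoman-Latin1 tables of any
VM).  A 256-entry byte table can be undone exactly when it is a
permutation, which is what lets an image recover the original bytes
after the VM has swapped them.

```python
def invert_table(table):
    """Return the inverse of a 256-entry byte table, or None if the
    table is not a permutation (and therefore not reversible)."""
    if len(table) != 256 or sorted(table) != list(range(256)):
        return None
    inverse = [0] * 256
    for src, dst in enumerate(table):
        inverse[dst] = src
    return inverse


def apply_table(table, data):
    """Map every byte of data through the table."""
    return bytes(table[b] for b in data)


# Hypothetical reversible table: identity, except two high bytes swapped.
swap_table = list(range(256))
swap_table[0x80], swap_table[0xC4] = swap_table[0xC4], swap_table[0x80]

# Hypothetical irreversible table: every high byte collapses to '?'.
lossy_table = [b if b < 0x80 else 0x3F for b in range(256)]

original = bytes([0x93, 0xFA, 0x80, 0xC4])   # arbitrary sample bytes
swapped = apply_table(swap_table, original)  # what the "VM" hands over
restored = apply_table(invert_table(swap_table), swapped)

print(restored == original)              # the swap can be undone
print(invert_table(lossy_table) is None) # the lossy table cannot
```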

  The bad part is that if a VM starts doing its own things, possibly
irreversible things, it will hurt.

> > Speaking of today's VM, are you aware of the fact that the Windows
> > VM's keymap[] table and X11/unix VM's X_to_Squeak[] table are
> > incompatible?  Even for 256 characters, the VM writers cannot agree on
> > a single table^^;  It will be more of a problem if we have a crystallized
> > bigger table in VMs.
> 
> I don't understand.  It's very easy to fix these tables once we decide
> what the desired behavior is.  The fix will instantly apply to any image
> you run in the future.  If these tables were in images, then we'd need
> to patch up existing images somehow, which is bound to be more effort than
> zero.

  Oh, well, the existing images already assume the current table...

> This leads to a general question: what is this talk of crystallization?
> I was thinking that the conversion routines would be in the
> platform-dependent portion of the VM.  The only crystalization being
> proposed is of the interchange encoding.  How you translate to that
> encoding is up to the VM, and it can be improved over time without
> messing with the existing body of images.

  Imagine a (current) Windows VM.  The image could pass UTF-8, but
then the VM has to convert it to Shift-JIS if it is running in
Japanese mode, or GB2312 if it is running in Simplified Chinese mode,
etc.

  Furthermore, since we wouldn't want to use UTF-8 as the internal
representation (or do you say we do?) because it is not an easily
indexable string, we will need some conversion at the image level
anyway to feed the VM UTF-8.
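
  The indexing problem can be illustrated like this (a Python sketch;
the helper name char_at_utf8 and the sample text are made up for the
example).  With a fixed-width representation the n-th character is one
lookup away, while with UTF-8 bytes you have to scan from the start,
because characters take a varying number of bytes.

```python
text = "Squeak 日本語"          # internal, directly indexable string
utf8 = text.encode("utf-8")     # the same text as UTF-8 bytes

print(len(text), len(utf8))     # character count vs. byte count


def char_at_utf8(data, index):
    """Return the index-th character of UTF-8 data by scanning from
    the beginning.  Cost grows with the position: O(n), not O(1)."""
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:        # not a continuation byte,
            count += 1                   # so a new character starts here
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:                # requested the last character
        return data[start:].decode("utf-8")
    raise IndexError(index)


print(text[7])                  # direct indexing
print(char_at_utf8(utf8, 7))    # same character, found by scanning
```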

> Overall, it seems very simple to me to just declare "Squeak uses UTF-8
> for strings" and then for each VM to have a function
> squeak_to_platform() and platform_to_squeak().

  Oh, no.

> The Unix port is an existence proof that this is a straightforward
> design.

  Remember that the native encoding for X11 is c-text (Compound Text),
which is based on the ISO-2022 style of encoding.  To make it work for
Japanese, it'll require much work.  (And a guy called Hiroshima-san did
it.)

  One thing we could do is pick a table, say the one iconv uses, and
declare that this is the official Squeak table.  If it is ok with
people that the core part of the Squeak VM depends on a standard
library, it may be ok.  But it'll still require the image-level
conversion from the internal representation to UTF-8 anyway.

> The main
> argument I can think of for translating inside of Squeak would be if it
> were much easier to translate in Squeak than in C.  I've seen no one
> really arguing that, however.  A second argument might be that we don't
> want to force all images to support UTF-8; in that case, however, I
> would argue that there should be 2 or maybe 3 encodings, and the *image*
> should get to choose which one it wants to use.

  I don't know what you'd expect, but the Squeak internal string
representation can be just one.  All we have to do is convert it to
another form of data before it is passed to a primitive.
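
  A rough sketch of that arrangement, in Python rather than Squeak and
with every name invented for the illustration (converters,
platform_encoding, fake_primitive and call_primitive are not existing
Squeak or VM APIs): there is one internal string representation, a
query stands in for the VM reporting its encoding, and the conversion
happens in a table on the image side just before the primitive call.

```python
# Hypothetical image-side registry: encoding name -> conversion function.
converters = {}


def register_converter(name, encode):
    """Add or replace a converter without touching the 'VM'."""
    converters[name] = encode


register_converter("shift_jis", lambda s: s.encode("shift_jis"))
register_converter("euc_jp", lambda s: s.encode("euc_jp"))


def platform_encoding():
    """Stand-in for a primitive that reports the VM's string encoding.
    Hard-coded here; an assumption made only for this example."""
    return "shift_jis"


def fake_primitive(raw):
    """Stand-in for a primitive: it accepts only raw bytes and here
    merely reports how many bytes it received."""
    if not isinstance(raw, bytes):
        raise TypeError("primitive expects bytes")
    return len(raw)


def call_primitive(internal_string, encoding=None):
    """Convert the single internal representation to the platform's
    bytes right before calling the primitive."""
    encode = converters[encoding or platform_encoding()]
    return fake_primitive(encode(internal_string))


print(call_primitive("日本語"))

# Supporting another encoding later is one more registry entry.
register_converter("euc_kr", lambda s: s.encode("euc_kr"))
print(call_primitive("한국어", "euc_kr"))
```

  The design choice being illustrated is only where the table lives:
adding an encoding here means adding an entry at the image level, not
rebuilding the VM.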

-- Yoshiki



