Unicode support (File names was Re: Warning: Large Babel translation)

Sun Nov 16 18:16:23 UTC 2003

  Lex,

> What encodings would be available?  Wouldn't every image have to know
> about every low-level encoding that is possible?

  My idea is that this is much better than having every VM know about
every low-level encoding.

> It would simplify things in the image if the VM always expected things
> in a single encoding.  Further, I don't see the advantage of having a
> converter written within Squeak.  There are already C libraries around
> for conversions between Unicode and most other encodings, and we can dig
> some up and link them in.

  The VM can always expect 'sequence of bytes'.  The VM passes it
from/to image and underlying platform.

> First, Andreas is suggesting that some platforms may not want to have a
> full Unicode table in them.  However, I don't understand why that would
> actually be necessary.  A barebones VM would have the option of only
> generating 7-bit strings, and of rejecting any strings that are not
> 7-bit.  Or, it could support the subset of UTF-8 that matches one
> particular code page (or whatever Unicode calls it).  Just because UTF-8
> is the encoding, doesn't mean that the full character space of Unicode
> needs to be supported in any individual VM.

  The image doesn't have to support 'full Unicode'.  The image-level
solution allows us to load/save the tables/fonts dynamically; if you
need only a part of them, you can build an image with just that part.

> A second issue was tossed up by Yoshiki, and involves a difficulty of
> translating between UTF-8 and the encoding used in certain underlying
> environments.

  I raised this as a reason why we wouldn't want to have a crystallized
table in the VM.

> I ask, however, whether there is *any* universal
> encoding where we can translate more conveniently both with UTF-8 and
> with these encodings?

  I don't fully understand this question, but a possible approach is
to assign an announcer byte to UTF-8 or UTF-7 and do ISO-2022 style
switching.
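
  A minimal sketch of that kind of switching, written in Python rather
than Squeak, might look like the following.  The announcer bytes ('U',
'J', 'L'), the table name and the function name are all made up for
illustration; they are not taken from ISO-2022 or from any existing
Squeak code.  The sketch also assumes the payload never contains the
ESC byte followed by one of those announcers.

```python
ESC = 0x1B

# Hypothetical announcer bytes, chosen only for this illustration.
ANNOUNCERS = {
    ord("U"): "utf-8",
    ord("J"): "euc_jp",
    ord("L"): "latin-1",
}


def decode_switched(data: bytes, default: str = "ascii") -> str:
    """Decode a byte sequence whose encoding changes at ESC + announcer."""
    pieces = []
    encoding = default
    start = 0
    i = 0
    while i < len(data):
        is_switch = (
            data[i] == ESC
            and i + 1 < len(data)
            and data[i + 1] in ANNOUNCERS
        )
        if is_switch:
            # Flush the segment collected so far in the current encoding.
            pieces.append(data[start:i].decode(encoding))
            encoding = ANNOUNCERS[data[i + 1]]
            i += 2
            start = i
        else:
            i += 1
    # Flush the trailing segment.
    pieces.append(data[start:].decode(encoding))
    return "".join(pieces)
```

  The point of the sketch is only that the byte stream itself says
which encoding each segment uses, so whoever passes the bytes along
does not need to know any of the encodings.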

> It seems like we will need a big translation
> table somewhere or another.  Should every single image really carry
> around this table just in case it runs on a VM that uses such an
> encoding?

  Yes, and no.

> Or do we only put it in some images, and break portability of
> images?

  If we write a simple dynamic loading mechanism, this can be *mostly*
solved.
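
  As a rough sketch of what such a loading mechanism could look like
(in Python rather than Squeak), the registry below keeps only a loader
per encoding and builds the table the first time it is asked for.  The
class name, its methods and the example loader are hypothetical; in a
real image the loader would presumably read the table from a file.

```python
class TableRegistry:
    """Hold conversion tables and load each one only on first use."""

    def __init__(self):
        self._loaders = {}  # encoding name -> function that builds the table
        self._tables = {}   # encoding name -> table already loaded

    def register(self, name, loader):
        self._loaders[name] = loader

    def is_loaded(self, name):
        return name in self._tables

    def table(self, name):
        if name not in self._tables:
            if name not in self._loaders:
                raise LookupError("no table available for " + name)
            self._tables[name] = self._loaders[name]()
        return self._tables[name]

    def unload(self, name):
        # Drop a table, for example before saving a small image.
        self._tables.pop(name, None)


def decode_with(registry, name, data):
    """Decode single-byte data using the table registered under name."""
    table = registry.table(name)
    return "".join(table[byte] for byte in data)


def load_latin1_table():
    # Latin-1 maps each byte to the code point with the same value.
    return {byte: chr(byte) for byte in range(256)}
```

  An image that never touches a given encoding never pays for its
table, and an image that needs one only has to find the table when the
conversion is first requested.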

> The solution seems worse than the problem; the awkward
> translation has to happen somewhere, and it seems better to put it in
> the VM if it's simply going to be table lookups.  C is wonderful for
> such things, and the libraries are likely to already exist.

  One alternative is to generate the tables in Slang and make the
primitive optional.  However, I have been living in an m17n image and
have not found the conversion to be that slow in Squeak.  We may want
to move some of this to the VM for performance reasons, but it doesn't
have to happen now.

> Finally, there is the issue of backwards compatibility.  That's a real
> issue, but one reasonable way around it is to make the switch at Squeak
> 4.0 instead of during 3.7.  Or, one can simply not worry about it, and
> live with the fact that old images will have trouble with accented
> characters.

  Today's VMs treat the bytes passed from the image, *more or less*, as
mere sequences of bytes.  That is close to what I would want to have,
so the 'new VM' + 'old image' or 'new image' + 'old VM' compatibility
issues wouldn't be too bad.  'pr file from a new image' + 'old image'
will be a problem, though.  (But it is always a problem...)

  Speaking of today's VMs, are you aware that the Windows VM's
keymap[] table and the X11/unix VM's X_to_Squeak[] table are
incompatible?  Even for 256 characters, the VM writers cannot agree on
a single table^^;  It will be a bigger problem if we have a larger
crystallized table in the VMs.

-- Yoshiki