File names was Re: Warning: Large Babel translation

Lex Spoon lex at cc.gatech.edu
Sun Nov 16 05:39:33 UTC 2003


>  Depending on the
> VM we may have varying encodings - for example, a bare hardware platform may
> want to keep things as simple as possible whereas something like Windows
> which is used in lots of different settings may give you something more
> general (such as UTF-8) and take the burden of translating it appropriately.
>

What encodings would be available?  Wouldn't every image have to know
about every low-level encoding that is possible?

It would simplify things in the image if the VM always expected things
in a single encoding.  Further, I don't see the advantage of having a
converter written within Squeak.  There are already C libraries around
for conversions between Unicode and most other encodings, and we can dig
some up and link them in.

Three major objections have been raised against always using UTF-8.
Let's consider them.

First, Andreas is suggesting that some platforms may not want to have a
full Unicode table in them.  However, I don't understand why that would
actually be necessary.  A barebones VM would have the option of only
generating 7-bit strings, and of rejecting any strings that are not
7-bit.  Or, it could support the subset of UTF-8 that matches one
particular code page (or whatever Unicode calls it).  The fact that
UTF-8 is the encoding doesn't mean that the full character space of
Unicode needs to be supported in any individual VM.
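
As a rough sketch of the first option, here is how a barebones VM might
reject anything that is not 7-bit.  The function name is made up for
illustration and is not part of any existing Squeak VM; it relies only
on the fact that every 7-bit ASCII string is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper, not an existing VM function.
 * Returns 1 if every byte in buf has its high bit clear (7-bit ASCII,
 * which is also valid UTF-8), and 0 as soon as any byte does not. */
int is_seven_bit(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80) {
            return 0;   /* part of a multi-byte UTF-8 sequence: reject */
        }
    }
    return 1;
}
```

A VM built this way needs no Unicode tables at all; it simply fails the
primitive when handed a string it cannot represent.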

A second issue was raised by Yoshiki, and involves the difficulty of
translating between UTF-8 and the encoding used in certain underlying
environments.  I ask, however: is there *any* universal encoding that
translates more conveniently both to UTF-8 and to these encodings?
It seems like we will need a big translation
table somewhere or another.  Should every single image really carry
around this table just in case it runs on a VM that uses such an
encoding?  Or do we only put it in some images, and break portability of
images?  The solution seems worse than the problem; the awkward
translation has to happen somewhere, and it seems better to put it in
the VM if it's simply going to be table lookups.  C is wonderful for
such things, and the libraries are likely to already exist.
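
To show what "simply table lookups" could look like on the VM side,
here is a minimal sketch that converts an 8-bit code page to UTF-8.
The function names and the layout of the table (128 entries covering
bytes 0x80 to 0xFF, each holding a code point below U+10000) are
assumptions made for this example; a real VM would more likely link in
an existing conversion library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point below U+10000 as UTF-8.
 * Returns the number of bytes written (1 to 3). */
static size_t utf8_put(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical converter: bytes below 0x80 pass through unchanged,
 * bytes 0x80..0xFF are looked up in high_half.  Returns the number of
 * UTF-8 bytes written, or (size_t)-1 if fewer than 3 bytes of room
 * remain before writing a character (a deliberately conservative
 * check, so the buffer may be rejected with a byte or two to spare). */
size_t codepage_to_utf8(const unsigned char *src, size_t len,
                        const uint16_t high_half[128],
                        unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = (src[i] < 0x80) ? src[i] : high_half[src[i] - 0x80];
        if (cap - n < 3) {
            return (size_t)-1;
        }
        n += utf8_put(cp, out + n);
    }
    return n;
}
```

The table is 256 bytes per code page, which is small for a VM that only
needs the one encoding of its own platform, but adds up quickly if every
image has to carry the tables for every platform.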

Finally, there is the issue of backwards compatibility.  That's a real
issue, but one reasonable way around it is to make the switch at Squeak
4.0 instead of during 3.7.  Or, one can simply not worry about it, and
live with the fact that old images will have trouble with accented
characters.


Lex



More information about the Squeak-dev mailing list