Unicode support (File names was Re: Warning: LargeBabeltranslation)

Sun Nov 16 23:15:35 UTC 2003

Lex,

A couple of things to keep in mind here. For one thing, no matter what
happens, you will probably not be affected by any such change. Others,
including Yoshiki and me, will. So if we screw up something in the
transition towards a more flexible interface, it is important that the
people affected by it can still use a reliable system which does not break
(for example by being able to switch the image back and forth between UTF-8
and the old encoding). IOW, sure it is simple
to "declare" that Squeak uses UTF-8. Implementing that, however, is quite a
different matter. For that reason alone I would want to have the flexibility
that comes with the image adapting itself to the needs of the VM rather than
the contrary.
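
To illustrate the kind of switch meant here, a minimal sketch follows. Python stands in for in-image Smalltalk code, and the class and method names (FileNameConverter, switch_to, to_vm, from_vm) are hypothetical, not anything that exists in Squeak. It only shows that if the encoding used at the VM boundary is a setting inside the image, falling back to the old encoding is a single assignment.

```python
import codecs


class FileNameConverter:
    """Hypothetical in-image converter between image strings and the
    byte strings handed to the VM's file primitives."""

    def __init__(self, vm_encoding: str = "mac_roman"):
        # Assumption: the old encoding is Mac Roman, as stated in the post.
        self.vm_encoding = vm_encoding

    def switch_to(self, vm_encoding: str) -> None:
        # Raises LookupError for an unknown codec, leaving the old
        # setting untouched so the image stays usable.
        codecs.lookup(vm_encoding)
        self.vm_encoding = vm_encoding

    def to_vm(self, name: str) -> bytes:
        return name.encode(self.vm_encoding)

    def from_vm(self, raw: bytes) -> str:
        return raw.decode(self.vm_encoding)


conv = FileNameConverter()      # starts with the old encoding
old_bytes = conv.to_vm("café")
conv.switch_to("utf-8")         # try the new interface
new_bytes = conv.to_vm("café")
conv.switch_to("mac_roman")     # and back again if something breaks
```

With the setting held in the image, an image can be moved to UTF-8 and back without needing a different VM.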

Secondly, the "proof" of it being a "simple design" applies only to Unix. I
don't know how much experience you have on Windows, Mac, or any other weird
platforms, but the approach being simple on Unix does not mean it's simple
everywhere. The fact of the matter is that WE DON'T KNOW YET. If - at some
point down the road - we come to the conclusion that it's in fact simple to
do it everywhere we care about and that the tradeoffs are acceptable, then
we can always just do the declaration. So why would we want to do it now?

Thirdly, when you talk about "translation in the image" vs. "translation in
the VM" then please keep in mind that even if the VM uses UTF-8 there HAS to
be in-image translation; namely that from (currently) Mac Roman to UTF-8.
And again, if you say that this doesn't matter, re-read the first argument.
For people like me, or Yoshiki, it DOES matter. Therefore, your argument of
"doing translation in the VM" being better/faster/whatever is pointless.
There will be in-image translation even if we fixed the VM interface to
be UTF-8. Unless the in-image representation matches the VM representation
precisely (which it does not, not even today) there will always be some
translation necessary. Given the tradeoffs in flexibility and support, I
would much prefer to do this in the image rather than the VM.
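
As a rough illustration of the translation step described above, here is a small sketch of converting a Mac Roman byte string to UTF-8 and back. Python stands in for in-image code, and the function names image_to_vm and vm_to_image are hypothetical. The only library feature relied on is Python's standard "mac_roman" codec.

```python
def image_to_vm(name_bytes: bytes) -> bytes:
    """Translate Mac Roman bytes (the image's current encoding, per the
    post) into the UTF-8 bytes a UTF-8-only VM interface would expect."""
    return name_bytes.decode("mac_roman").encode("utf-8")


def vm_to_image(utf8_bytes: bytes) -> bytes:
    """Translate UTF-8 bytes coming from the VM back to Mac Roman.

    Raises UnicodeEncodeError for characters Mac Roman cannot represent,
    which is exactly the lossy case a real design has to decide about.
    """
    return utf8_bytes.decode("utf-8").encode("mac_roman")


# "café" as Mac Roman bytes: 0x8E is e-acute in that encoding.
mac_name = bytes([0x63, 0x61, 0x66, 0x8E])
utf8_name = image_to_vm(mac_name)
round_trip = vm_to_image(utf8_name)
```

The conversion itself is small. The open question is where it lives and what happens to names that do not survive the round trip.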

Having said that, a few specific points:
> What is gained by doing the translation in the image?

I think I've answered that question above. It's not that it would gain
anything - it's simply unavoidable.

> >   My idea is that this is much better than every VM have to
> >   know about every low-level encoding.
> 
> Why would this occur?  Each VM only needs to know about the current
> platform's encoding(s).

Again, the two of you are in violent agreement here. Both of you want the
VM to have to deal with only a single encoding.

> This leads to a general question: what is this talk of 
> crystalization? 
> I was thinking that the conversion routines would be in the
> platform-dependent portion of the VM.  The only crystalization being
> proposed is of the interchange encoding.  How you translate to that
> encoding is up to the VM, and it can be improved over time without
> messing with the existing body of images.

In practice, the opposite is true. If anything, people expect stable VMs.

> The main argument I can think of for translating inside of Squeak 
> would be if it were much easier to translate in Squeak than in C.

No. The main point is to see if it actually works. To buy some flexibility.
IOW, as long as we have no experience in this area, it would be pretty
stupid to require all VMs to support it. Your unique experience on Unix can
NOT be generalized without having further data points. And if you would stop
arguing I could start looking at providing such a data point ;-)

Cheers,
  - Andreas