Unicode support (File names was Re: Warning: Large Babeltranslation)

Andreas Raab andreas.raab at gmx.de
Sun Nov 16 19:06:45 UTC 2003


Hi Guys,

Methinks the two of you are in violent agreement. None of what you say is
going to be any issue whatsoever given two fundamental prerequisites:
a) The VM uses a SINGLE encoding throughout its interface
b) The VM is able to REPORT that encoding to the image
With those two rather simple requirements you have the full range of
options. If a VM wants to provide a "common abstraction" (such as UTF-8) it
can do so on its own. If the VM doesn't want to bother to do anything in
terms of translation it merely reports whatever the underlying assumptions
are.

Given the widespread adoption of UTF-8 I find it likely that UTF-8 will
become the de-facto encoding for most major platforms and ports (Lex's
argument). However, given that other encodings are allowed besides UTF-8,
the decision is left to the VM maintainer, who isn't _forced_ to provide a
full UTF-8 representation if it turns out to be too hard (Yoshiki's
argument).

Lex, think about it this way: There will likely be a de-facto standard
(UTF-8), but until it is established there will be varying encodings anyway,
and so we need to have a way of dealing with them. And if we deal with them,
we may as well leave the mechanism in for the added benefit of not having to
implement full support for something like UTF-8 unless we want to.

I don't see any problem here. We need a single primitive that reports the
encoding to be used - how about extending the definition of
getSystemAttribute: to report the VM's string encoding? Then all we need is a
VM which actually implements a "different" encoding and that is simple
enough.
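
To make the two prerequisites concrete, here is a minimal sketch in Python
rather than Slang or Smalltalk. It is an illustration under stated
assumptions, not actual VM code: the class SketchVM, the function
image_file_names and the attribute index 9999 are all made up for this
example, and no real getSystemAttribute: slot is implied.

```python
# Hypothetical sketch: a VM that uses ONE encoding at its interface and
# REPORTS it to the image. All names and the attribute index are assumptions.

ENCODING_ATTRIBUTE = 9999  # made-up index, not a real getSystemAttribute: slot


class SketchVM:
    """Stand-in for a VM; every string crossing its interface is bytes."""

    def __init__(self, encoding, directory_entries):
        self._encoding = encoding
        # File names as the platform would hand them over, already encoded.
        self._entries = [name.encode(encoding) for name in directory_entries]

    def get_system_attribute(self, index):
        # Models extending getSystemAttribute: with one extra slot.
        if index == ENCODING_ATTRIBUTE:
            return self._encoding
        return None

    def directory_entries(self):
        return list(self._entries)


def image_file_names(vm):
    """Image side: ask the VM for its encoding, then decode with it."""
    encoding = vm.get_system_attribute(ENCODING_ATTRIBUTE)
    return [raw.decode(encoding) for raw in vm.directory_entries()]


utf8_vm = SketchVM("utf-8", ["résumé.txt"])
latin1_vm = SketchVM("latin-1", ["résumé.txt"])

# The raw bytes differ between the two VMs, the decoded names do not.
assert utf8_vm.directory_entries() != latin1_vm.directory_entries()
assert image_file_names(utf8_vm) == image_file_names(latin1_vm) == ["résumé.txt"]
```

The image never guesses: it decodes with whatever the VM reports, so a VM
that translates everything to UTF-8 and a VM that passes the platform
encoding through unchanged look the same from the image's point of view.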

One small note on Yoshiki's message:
>   Speaking of today's VM, are you aware of the fact that the Windows
> VM's keymap[] table and X11/unix VM's X_to_Squeak[] table are
> incompatible?  Even for 256 characters, the VM writers cannot agree on
> a single table^^;  It will be more of a problem if we have a bigger
> crystallized table in VMs.

The reason the VM writers cannot "agree" on a single table is that they
do not have enough range within those 256 characters. So really,
that's an argument both for allowing the image to deal with the encoding AS
WELL AS for providing a common abstraction at the VM level (e.g., using a
standard that trivially subsumes the mapping problems we have).
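
A small Python sketch of the range problem, with invented data: these are
NOT the real keymap[] or X_to_Squeak[] tables, and the ports, free slots and
required characters below are assumptions chosen only to show how two
byte-sized tables end up disagreeing while a larger code space does not.

```python
# Hypothetical illustration: NOT the real keymap[] or X_to_Squeak[] tables.
# Two made-up ports each need a few characters beyond ASCII but have only
# a handful of free byte values to put them in.

FREE_SLOTS = [0x80, 0x81]          # assumed free positions in a 256-entry table
PORT_A_NEEDS = ["€", "é"]          # invented requirements
PORT_B_NEEDS = ["Ω", "é"]


def build_byte_table(needed):
    """Give each needed character the next free byte slot."""
    return dict(zip(FREE_SLOTS, needed))


def conflicts(table_a, table_b):
    """Values that the two tables map to different characters."""
    shared = table_a.keys() & table_b.keys()
    return sorted(b for b in shared if table_a[b] != table_b[b])


def build_code_point_table(needed):
    """With a larger code space, each character keeps its own Unicode code point."""
    return {ord(ch): ch for ch in needed}


table_a = build_byte_table(PORT_A_NEEDS)
table_b = build_byte_table(PORT_B_NEEDS)
assert conflicts(table_a, table_b) == [0x80]   # same byte, different meaning

wide_a = build_code_point_table(PORT_A_NEEDS)
wide_b = build_code_point_table(PORT_B_NEEDS)
assert conflicts(wide_a, wide_b) == []         # no value means two things
```

Neither byte table is "wrong"; the two ports simply compete for the same
slot. Once the value space is large enough, the disagreement disappears
without anyone having to negotiate a table.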

Cheers,
  - Andreas



> -----Original Message-----
> From: squeak-dev-bounces at lists.squeakfoundation.org 
> [mailto:squeak-dev-bounces at lists.squeakfoundation.org] On 
> Behalf Of Yoshiki Ohshima
> Sent: Sunday, November 16, 2003 7:16 PM
> To: The general-purpose Squeak developers list
> Subject: Re: Unicode support (File names was Re: Warning: 
> Large Babeltranslation)
> 
> 
>   Lex,
> 
> > What encodings would be available?  Wouldn't every image have to know
> > about every low-level encoding that is possible?
> 
>   My idea is that this is much better than every VM having to know about
> every low-level encoding.
> 
> > It would simplify things in the image if the VM always expected things
> > in a single encoding.  Further, I don't see the advantage of having a
> > converter written within Squeak.  There are already C libraries around
> > for conversions between Unicode and most other encodings, and we can dig
> > some up and link them in.
> 
>   The VM can always expect 'sequence of bytes'.  The VM passes it
> from/to image and underlying platform.
> 
> > First, Andreas is suggesting that some platforms may not want to have a
> > full Unicode table in them.  However, I don't understand why that would
> > actually be necessary.  A barebones VM would have the option of only
> > generating 7-bit strings, and of rejecting any strings that are not
> > 7-bit.  Or, it could support the subset of UTF-8 that matches one
> > particular code page (or whatever Unicode calls it).  Just because UTF-8
> > is the encoding, doesn't mean that the full character space of Unicode
> > needs to be supported in any individual VM.
> 
>   The image doesn't have to support 'full Unicode'.  The image-level
> solution allows us to load/save the tables/fonts dynamically; if you
> need only a part of them, you can make such image.
> 
> > A second issue was tossed up by Yoshiki, and involves a difficulty of
> > translating between UTF-8 and the encoding used in certain underlying
> > environments.
> 
>   I wrote this because we wouldn't want to have a crystallized
> table in the VM.
> 
> > I ask, however, whether there is *any* universal
> > encoding where we can translate more conveniently both with UTF-8 and
> > with these encodings?
> 
>   I don't fully understand this question, but a possible approach is
> to assign an announcer byte to UTF-8 or UTF-7 and do ISO-2022 style
> switching.
> 
> > It seems like we will need a big translation
> > table somewhere or another.  Should every single image really carry
> > around this table just in case it runs on a VM that uses such an
> > encoding?
> 
>   Yes, and no.
> 
> > Or do we only put it in some images, and break portability of
> > images?
> 
>   If we write a simple dynamic loading mechanism, this can be *mostly*
> solved.
> 
> > The solution seems worse than the problem; the awkward
> > translation has to happen somewhere, and it seems better to put it in
> > the VM if it's simply going to be table lookups.  C is wonderful for
> > such things, and the libraries are likely to already exist.
> 
>   One alternative is to generate tables in Slang and make the
> primitive optional.  However, I have been living in an m17n image, and
> have not found that operation to be that slow in Squeak.  We may want to
> move some stuff to the VM for performance reasons, but it doesn't have to
> happen now.
> 
> > Finally, there is the issue of backwards compatibility.  That's a real
> > issue, but one reasonable way around it is to make the switch at Squeak
> > 4.0 instead of during 3.7.  Or, one can simply not worry about it, and
> > live with the fact that old images will have trouble with accented
> > characters.
> 
>   Today's VMs treat the bytes passed from the image *more or less* as
> mere sequences of bytes, which is closer to what I would want to
> have, so the 'new VM' + 'old images' or 'new image' + 'old VM'
> compatibility issue wouldn't be too bad.  'pr file from new image' +
> 'old image' will be a problem, though.  (But it is always a
> problem...)
> 
>   Speaking of today's VM, are you aware of the fact that the Windows
> VM's keymap[] table and X11/unix VM's X_to_Squeak[] table are
> incompatible?  Even for 256 characters, the VM writers cannot agree on
> a single table^^;  It will be more of a problem if we have a bigger
> crystallized table in VMs.
> 
> -- Yoshiki
> 



