Unicode support (File names was Re: Warning: Large Babeltranslation)

Lex Spoon lex at cc.gatech.edu
Mon Nov 17 20:58:40 UTC 2003



Yoshiki Ohshima <Yoshiki.Ohshima at acm.org> wrote:
>   It is not that whether iconv supports those encodings or not.  It is
> the burden who has to do the implementation and testing.  I don't
> think the maintainers feel great if they don't know if it is *really*
> working or not.  I'd rather let someone knows the matter and who cares
> about do the language specific implementation and testing.

This consideration is irrelevant.  Both the image and the VM are open
source, so anyone can fix the translation code no matter where it is
located.


>   Of course, if we start depending on a third party library to this
> deep level, the portability will be affected.  The VMMaker has to
> specify the iconv version and configure option, the table may disagree
> with the one the OS has, and if the platform happens to have a data
> structure called iconv_t, etc.


I'm not suggesting that.  I agree that it sounds complicated to try to
bake iconv into the portable part of the VM.  iconv is just a tool that
each VM can use or not, as it desires.

iconv is very suggestive, however.  It looks like it will work for any
platform that has been mentioned so far.  Further, it appears to be the
library the open source community has settled on for solving the exact
problems being discussed, and so we may as well join the gravy train. 
As you know, it is good to use existing libraries instead of rewriting
things ourselves.


>   Another important point is that we'll need the in image conversion
> anyway.  Again, we don't want to use UTF-8 for the internal
> representation, the internal string has to be converted before passed
> to primitives.  (So, what kind of data structure do you imagine to use
> as the internal representation?)

Ack!  We do *not* need in-image conversion.   Doesn't it disturb you
that a minimal language like Smalltalk might end up being *required* to
carry around translation tables for any encoding a VM might request?  It
bothers me deeply and is the crux of my disturbance with this idea.  I
would very much like to have simple images be possible which are not
fully multinationalized.  Even more, I would like images to not be
required to dynamically load code because they are running on a new VM.

Using UTF-8 as an interchange format solves these problems nicely and
has no clear downsides.
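
To illustrate why UTF-8 needs no translation tables on the image side,
here is a minimal sketch in Python.  It assumes the internal
representation is a sequence of Unicode code points, which has not been
settled in this thread; the name encode_utf8 is made up for the example.
The whole encoder is shifts and masks:

```python
def encode_utf8(code_points):
    """Encode an iterable of Unicode code points as UTF-8 bytes.

    Assumption: the caller's internal strings are sequences of integer
    code points. No lookup tables are needed, only bit arithmetic.
    """
    out = bytearray()
    for cp in code_points:
        # Reject values outside Unicode and the UTF-16 surrogate range.
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %r" % (cp,))
        if cp < 0x80:
            # One byte: plain ASCII.
            out.append(cp)
        elif cp < 0x800:
            # Two bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```

An ASCII-only image takes just the first branch, so its strings pass
through unchanged.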

To contrast, we certainly do need translation *in the VM* on some
platforms.  For example, different filesystems can use different
encodings for the filename, and so the problem can't simply be ducked to
the image.
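
A rough sketch of what that VM-side step could look like, written in
Python for brevity.  Python's built-in codecs stand in for iconv here,
and the function names and the fs_encoding parameter are hypothetical;
a real VM would find out the filesystem's encoding in some
platform-specific way:

```python
def filename_for_platform(utf8_name, fs_encoding):
    """Convert a UTF-8 file name from the image to the platform encoding.

    Assumption: the image always hands the VM UTF-8 bytes. A name that
    the target encoding cannot represent raises UnicodeEncodeError.
    """
    text = utf8_name.decode("utf-8")   # strict: malformed UTF-8 raises
    return text.encode(fs_encoding)


def filename_for_image(native_name, fs_encoding):
    """Convert a platform-encoded file name back to UTF-8 for the image."""
    return native_name.decode(fs_encoding).encode("utf-8")
```

The image only ever sees UTF-8 on both sides of the call; whatever
tables the conversion needs stay on the VM side.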

At best, I can imagine allowing the image and VM to negotiate a
different encoding under some circumstances, as a performance
improvement.  But it would be nice if there is a simple interface
available for images that don't care.
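
One possible shape for such a negotiation, as a sketch only.  Nothing
like this exists in the VM today; the function name, its parameters and
the encoding names are all invented for the example:

```python
DEFAULT_ENCODING = "utf-8"  # the interchange format every image may assume


def negotiate_encoding(vm_preferred, image_supported=()):
    """Pick the encoding used between image and VM.

    Hypothetical protocol: the VM names the encoding it would prefer,
    and the image lists the extra encodings it can produce. An image
    that does not care passes nothing and always gets UTF-8.
    """
    wanted = vm_preferred.lower()
    for name in image_supported:
        if name.lower() == wanted:
            return wanted
    return DEFAULT_ENCODING
```

A simple image calls negotiate_encoding("shift_jis") and gets "utf-8"
back; only an image that explicitly lists "shift_jis" would be asked to
produce it.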


>   Also, if you write a program that access a web server (hehe, you
> did, actually), the code that the server returns can be anything.  You
> need to convert the response from the server to the internal
> representation before render it.

Yes.  But we are talking about the interface between the image and the
VM, not the image and the web.  Not every image needs to have a web
browser that understands arbitrary encodings.

Incidentally, HTTP makes the choice that you and Andreas are rejecting. 
When you make an HTTP request, you have to specify it using a standard
encoding.  Neither the server nor the client can decide to say "GET"
using UTF-16.  Not that everyone should emulate HTTP, but I happen to
think this is a sound decision for HTTP just like it would be for us.



>   Well, don't worry about it.  Your code won't be affected by the m17n
> stuff too much.  The ASCII world in Squeak will more or less stays the
> same.

It will affect me if I write a primitive that accepts strings as
arguments.  It will also affect me if my 90 MB type inference image
stops loading.

And anyway, I care a LOT about Squeak.  I want it to be the best system
it can be.


-Lex


