Unicode support (File names was Re: Warning: Large Babeltranslation)

Yoshiki Ohshima Yoshiki.Ohshima at acm.org
Mon Nov 17 21:38:05 UTC 2003


  Hannes,

> In the previous mail I had been advocating going for UTF-8. I feel
> that you don't like it and that your m17n solution is much more general.
> It is worked out and running. This is a strong argument to go for it.

  Keep in mind that we should distinguish between the internal
representation and the external representation.  What exactly do you
mean by "going for UTF-8"?

  My m17n stuff is basically Unicode based.  The internal encoding is
more or less UTF-32, but it uses the higher bits (unused in UTF-32)
mainly for optimizations and to identify the unified CJK
characters.  This is important because Squeak has to know the final
glyph to render, unlike other software that gives up this level of deep
commitment.
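
  As a rough illustration of the idea (not the actual Squeak code), a
32-bit value can carry a Unicode code point in its low bits and a
language tag in the otherwise unused high bits.  The 22-bit split, the
tag numbers, and the names pack/unpack below are assumptions made for
this sketch only.

```python
# Sketch: a code point in the low bits, a language tag in the high bits.
# The bit layout and the tag numbers are hypothetical, chosen for illustration.
CODE_POINT_BITS = 22
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1
MAX_TAG = (1 << (32 - CODE_POINT_BITS)) - 1

TAG_JAPANESE = 5             # hypothetical tag value
TAG_SIMPLIFIED_CHINESE = 6   # hypothetical tag value


def pack(tag, code_point):
    """Combine a language tag and a Unicode code point into one 32-bit value."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if not 0 <= tag <= MAX_TAG:
        raise ValueError("tag does not fit in the high bits")
    return (tag << CODE_POINT_BITS) | code_point


def unpack(value):
    """Split a packed value back into (tag, code_point)."""
    return value >> CODE_POINT_BITS, value & CODE_POINT_MASK


# U+76F4 is a unified CJK character whose preferred glyph differs by language.
ja = pack(TAG_JAPANESE, 0x76F4)
zh = pack(TAG_SIMPLIFIED_CHINESE, 0x76F4)
```

  Here ja and zh share the same code point but differ in the tag, so a
renderer could pick a language-appropriate glyph for each.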

  Of course, for the Latin-1 characters, the current String and
Character are used.  This is important for saving space and for an easy
transition.
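
  The space-saving point can be sketched as follows: text whose
characters all fit in Latin-1 stays in a byte-per-character store, and
only other text pays for a wide representation.  The function name
store and the choice of container types are assumptions for this
sketch, not Squeak's classes.

```python
from array import array


def store(text):
    """Pick a compact store for Latin-1 text and a wide one otherwise."""
    code_points = [ord(ch) for ch in text]
    if all(cp <= 0xFF for cp in code_points):
        # One byte per character, like a traditional byte string.
        return bytes(code_points)
    # Wide store: one unsigned machine integer per character.
    return array("I", code_points)


narrow = store("café")   # all characters are within Latin-1
wide = store("日本")      # needs the wide store
```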

  The external representation (used for communication between the
image and the VM) *can be* in UTF-8.  It can be in any of the supported
encodings, actually.  This is important, too.
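
  A minimal sketch of keeping the two representations separate: the
internal form is a list of code points, and conversion to bytes happens
only at the boundary, with the encoding as a parameter.  The names
to_external and from_external are hypothetical, and Python's codecs
stand in for whatever set of encodings is actually supported.

```python
def to_external(code_points, encoding="utf-8"):
    """Convert internal code points to bytes in the chosen external encoding."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


def from_external(data, encoding="utf-8"):
    """Convert external bytes back to the internal list of code points."""
    return [ord(ch) for ch in data.decode(encoding)]


internal = [0x65E5, 0x672C]                 # the two characters of "日本"
as_utf8 = to_external(internal)             # UTF-8 by default
as_sjis = to_external(internal, "shift_jis")
```

  Both byte strings decode back to the same internal list, which is the
point: the internal form does not depend on which external encoding is
in use.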

  The default file-out format is UTF-8.  For CJK characters, it adds
a new "]lang[" tag to a chunk if necessary.  This way, other existing
software can still read and write the chunk format.  This is important
for communication with non-Squeak programs.

> I'd like to emphasize that this is fine for me. Basically any solution
> which allows me to work with more than 255 chars is fine for me.

  Well, are you sure? ^^;  I don't think you are going to be happy with
bad things!

-- Yoshiki


