Unicode support

ohshima at is.titech.ac.jp
Fri Sep 24 03:37:05 UTC 1999


  Hi John,

  I changed the order of your email for my convenience,
sorry.

John Duncan wrote:
> So, go ahead, come up with prototypes. I didn't say that is wrong. But
> what I think would be bad would be for the Multilingual project to
> just come out with an implementation, without seriously considering
> all the work that has been done before us.

  Thank you for mentioning my implementation.  Please note
that the design of my implementation is heavily influenced by
"Mule" (the Multilingual enhancement to GNU Emacs), whose
development began more than 12 years ago and which is still
being aggressively improved.

  Roughly speaking, the character representation in my
implementation is somewhat similar to the
SmallInteger/LargePositiveInteger integration.  The
ISO-8859-1 characters are represented in the same way as the
current Character, and the others are represented as objects
with a 30-bit value field.  Currently there is no assistance
from the VM, so you can test it with a vanilla VM.
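
  To give a rough idea, here is a sketch of the instance
creation dispatch.  This is only an illustration, not the
actual code from my changeset; the setValue: accessor and the
error check are made up for the example:

    MultiCharacter class >> value: anInteger
        "Sketch: ISO-8859-1 code points stay as plain Characters;
         larger ones become a MultiCharacter holding a 30-bit value."
        anInteger < 256
            ifTrue: [^ Character value: anInteger].
        anInteger highBit > 30
            ifTrue: [^ self error: 'value does not fit in 30 bits'].
        ^ self basicNew setValue: anInteger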

  Because I have found many glitches in the implementation,
please don't think that it is "final".  (I also think that
"MultiString" and "MultiCharacter" may not be good names :-)

> 2. For six years, a short while compared to almost 50 years of text
> processing, we have had an international standard character set that
> actually expresses virtually all communication on the planet. This
> character set was made possible by community research and a consortium
> of advocates. It comes complete with a large number of ancillary
> standards, such as encoding formats for strings, canonical forms,
> normalization algorithms, etc. These are all the result of someone
> else "doing it", usually many people "doing it", and then a
> precipitation of the available technology into a solid standard.

  I suppose you know the early history of the Unicode
standard.  I believe that the committee was about to decide to
go with a 4-byte representation, but a few companies with loud
voices, including MS and Apple, overturned that conscientious
decision.  IMHO, six years of work cannot straighten out a
failure at the start if the starting point was wrong.

  I found that many people discussed the definitions of
string and String in this thread, but I didn't find any
article which discussed what the definition of a character
should be.

  On a system like Squeak, where the glyphs of characters
should be controlled by the system itself, a character should
know how to display itself.  This is why I think the
representation should carry more information than a 16-bit
representation can.

  One more thing I'd like to say is that Unicode could be a
"local" encoding in my framework.  There is so much software
which assumes Unicode that Squeak should be able to support
it.  However, this local encoding would not have glyphs,
because it's Unicode :-)

  Thank you for your patience with my English skills :-).

                                             OHSHIMA Yoshiki
                Dept. of Mathematical and Computing Sciences
                               Tokyo Institute of Technology 




