Unicode support

ohshima at is.titech.ac.jp
Mon Sep 27 00:25:32 UTC 1999


  Hi John,

> I think that's grand, but I wonder if better efficiency and
> flexibility can be gained from using UCS-2 as the base character
> representation, and then breaking out to UCS-4, which can represent
> everything, if necessary. Or, since it is improbable that planes
> 16384-65535 will ever be used, breaking out to the similar 30-bit
> representation that you use.

  The latter case is close to what I'm thinking, if the
things not suitable for Squeak are removed.

  The choice of base character representation is arguable,
but I'd keep the current 8-bit representation.  As far as I
can tell from the Java and Mule-Lisp programs written by
people here (I admit the number of samples is quite small :-),
they rarely use characters other than ASCII for symbols, and
symbol handling affects the speed of Squeak.  So, if I follow
the rule "make the common case fast," the base character
representation should be 8-bit.

  Or, 3-level (1-2-4) might be possible.
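
  As an illustration of the 3-level (1-2-4) idea, here is a
minimal sketch in Python rather than Squeak.  The function
names (narrowest_width, pack, unpack) are hypothetical and do
not come from any implementation discussed in this thread; the
only assumption is that each string is stored at the narrowest
of 1, 2, or 4 bytes per character that fits its widest
character, so that 8-bit text keeps the compact fast path.

```python
def narrowest_width(code_points):
    """Return 1, 2, or 4: the bytes per character needed for the widest code point."""
    widest = max(code_points, default=0)
    if widest <= 0xFF:
        return 1          # fits the 8-bit base representation
    if widest <= 0xFFFF:
        return 2          # fits a 16-bit (UCS-2 sized) unit
    return 4              # needs the full 32-bit (UCS-4 sized) unit


def pack(text):
    """Store text as fixed-width big-endian units; return (width, bytes)."""
    code_points = [ord(ch) for ch in text]
    width = narrowest_width(code_points)
    data = b"".join(cp.to_bytes(width, "big") for cp in code_points)
    return width, data


def unpack(width, data):
    """Inverse of pack: rebuild the text from fixed-width units."""
    return "".join(
        chr(int.from_bytes(data[i:i + width], "big"))
        for i in range(0, len(data), width)
    )
```

  With this scheme an ASCII-only symbol such as "Squeak" stays
at one byte per character, and only strings that actually
contain wider characters pay for the wider units.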

> UCS-2 would provide character objects for
> all of the modern communication languages standard in every model of
> Squeak. By using ISO-8859-1 as the base representation, it would be
> more likely to have the strange situation that MULE has of being able
> to represent all languages but not being able to use them.

  I don't understand this.  Could you elaborate on this?

> I don't think that the characters would be all that
> costly, considering having one universal font costs only
> about 10M, those Unicode characters can be directly
> transmitted from any operating system that supports
> Unicode, and those characters can be cached. I think it's
> much more likely that Unicode fonts will exist on users'
> machines than other fonts.

  If this means sending text data from one place to
another, it isn't related to the internal representation.
The rich internal representation could be converted
(lossily) into Unicode (UTF).
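
  To make "lossy" concrete, here is a small Python sketch.  It
assumes, purely for illustration, that the richer internal
character carries a language tag next to its code point; the
TaggedChar type and to_utf8 function are hypothetical and are
not taken from the actual implementation.  Exporting to UTF-8
keeps only the code point, so two internally distinct
characters can become identical on the wire.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedChar:
    """Hypothetical rich character: a code point plus a language tag."""
    lang: str
    code_point: int


def to_utf8(chars):
    """Export to UTF-8; the language tag is dropped, so this is lossy."""
    return "".join(chr(c.code_point) for c in chars).encode("utf-8")


# U+76F4 is a unified Han character whose preferred glyph differs by locale.
japanese = [TaggedChar("ja", 0x76F4)]
chinese = [TaggedChar("zh", 0x76F4)]

same_bytes = to_utf8(japanese) == to_utf8(chinese)   # tag is gone after export
still_distinct = japanese != chinese                 # but distinct internally
```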

  If this means transmitting a Squeak image, Squeak itself
has to carry the font so that "dot-identicalness (?)" is
kept.  Because there is no single "universal font" that
satisfies all of Chinese, Japanese, Korean, etc., the fonts
should be borrowed from the existing ones for local
encodings.

> I don't know if users really want 10M of one font in the image. I'm
> not sure how your implementation handled the "vast amount of
> information" problem.

  My implementation doesn't handle it yet.

                                             OHSHIMA Yoshiki
                Dept. of Mathematical and Computing Sciences
                               Tokyo Institute of Technology 




