3.7 moving to beta tomorrowish

Yoshiki Ohshima Yoshiki.Ohshima at acm.org
Wed Mar 31 00:18:59 UTC 2004


  Bill,

> An overly blunt way to look at unicode is that it offers us an
> opportunity to double the storage requirements for all of our text.
> In fact, one device that I have encountered uses "unicode" (it
> likely predates the standards), and ends up doing precisely that -
> each character it sends is followed by a gratuitous zero, in a world
> where every byte truly counts thanks to bandwidth restrictions.

  Ah, this is...  I don't know what to say...

  First, Unicode is not a 16-bit character set.  It is a 21-bit
character set.
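
  A quick Workspace check, if you like: the code space tops out at
U+10FFFF, and

    16r10FFFF highBit.   "=> 21"

so 21 bits cover every code point.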

  Second, let's clarify that the external representation and the
internal representation are different beasts.  My stuff mainly uses
UTF-8 for the external representation.  As long as your code is mostly
written in pure ASCII plus a few other characters such as up arrow and
left arrow, the difference in size is negligible.
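
  Here is a minimal Workspace sketch (not the converter code itself,
just the UTF-8 size rule) of why mostly-ASCII source barely grows:

    | utf8ByteCount |
    utf8ByteCount := [:codePoint |
        codePoint < 16r80
            ifTrue: [1]
            ifFalse: [codePoint < 16r800
                ifTrue: [2]
                ifFalse: [codePoint < 16r10000
                    ifTrue: [3]
                    ifFalse: [4]]]].
    utf8ByteCount value: $A asInteger.   "=> 1 byte for plain ASCII"
    utf8ByteCount value: 16r2191.        "=> 3 bytes for the up arrow"

  Only the occasional non-ASCII character costs anything extra.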

  For the internal representation, my stuff uses an 8-bit-per-character
representation for a String that contains only latin1 characters.  For
those Strings that contain non-latin1 characters, it uses a 32-bit
representation.  It is similar to the implicit SmallInteger and
LargeInteger conversion, and the user shouldn't have to care about the
difference.  The typical memory footprint will surely go up a bit, but
as you can imagine, not by much.
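
  A rough Workspace sketch of that implicit switch (I'm assuming the
ByteString / WideString class names from the m17n code here; the last
line shows the analogous integer conversion):

    'Hello' class.
        "=> ByteString, 8 bits per character"
    (String with: (Character value: 16r3042)) class.
        "=> WideString, 32 bits per character (16r3042 is Hiragana A)"
    (SmallInteger maxVal + 1) class.
        "=> LargePositiveInteger, the same kind of silent promotion"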

> I understand the value of unicode, and want Squeak to embrace it.
> However, is unicode something that many of us would want to disable
> most of the time?  I ask because, if true, we might want another
> solution to the underscore/:= collision.

  Does this explanation add some perspective to your observation?

-- Yoshiki


