UTC-8 (was Re: Celeste encoding (was: Duplicate messages in Celeste))

Lex Spoon lex at cc.gatech.edu
Thu Mar 16 13:08:08 UTC 2000

Bijan Parsia <bparsia at email.unc.edu> wrote:

> Is there a plan for dealing with Unicode, at least externally? I'm really
> just talking about enough to satisfy a parser...I'm happy to punt on
> display and font issues. Though I would like ASCIIish text to be
> intelligible ;)

It's been batted around before, and I thought a project group had
started up to discuss it further.

In any case, most people would like it, but it's not trivial.  You don't
want to just use a 16-bit encoding all the time, because every character
would then take two bytes instead of one.  In a Squeak 2.7 image, the
increase in bytes would be:

>>	String allInstances inject: 0 into: [ :sum :string | sum + string size ] -->  1044616

So it would take about an extra meg to do the simplistic thing.  And so,
you almost certainly need to use some compressed scheme internally.
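
A rough sketch of that arithmetic, in Python rather than Smalltalk. It
compares a fixed two-bytes-per-character encoding with UTF-8, which is
only one candidate for a "compressed scheme" and is an assumption here,
not something settled in this thread. The helper name encoded_sizes and
the sample strings are made up for illustration.

```python
# Sketch: storage cost of a fixed 16-bit encoding versus UTF-8.
# encoded_sizes and the sample strings are illustrative, not from Squeak.

def encoded_sizes(text):
    """Return (bytes at 2 bytes per character, bytes as UTF-8) for text."""
    # UTF-16-LE without a BOM uses exactly 2 bytes for every character in
    # the Basic Multilingual Plane, which covers both samples below.
    return len(text.encode("utf-16-le")), len(text.encode("utf-8"))

# Pure ASCII text doubles in size under the 16-bit scheme.
ascii_sizes = encoded_sizes("Transcript show: 'hello'")

# Mixed text: UTF-8 spends 1 byte on ASCII, 2 on the accented letter,
# and 3 on each CJK character.
mixed_sizes = encoded_sizes("na\u00efve \u65e5\u672c")

print(ascii_sizes)  # (48, 24)
print(mixed_sizes)  # (16, 13)
```

For an image whose strings are almost all ASCII, the variable-width form
costs nothing extra, at the price of losing constant-time indexing by
character position.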

Furthermore, there is the issue of displaying your nice Unicode stuff.
I suppose people in practice will load fonts into an image for the
portions of Unicode they see a lot, and other characters will show up as
little squares or something.

Then there is string comparison.  Probably the letter A appears in
oodles of places throughout Unicode, and all of them should be
considered equal in a case-insensitive comparison.
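
One way to sketch such a comparison, again in Python: normalize both
strings to NFKC, which maps compatibility variants such as fullwidth or
mathematical letters onto the plain letter, and then case-fold. This is
an approximation under stated assumptions, not the full caseless-matching
algorithm from the Unicode standard, and the names loose_key and
loosely_equal are hypothetical. Note that look-alike letters from other
scripts, such as Cyrillic A, are distinct letters and stay unequal.

```python
import unicodedata

def loose_key(text):
    """Comparison key: compatibility-normalized, then case-folded."""
    # NFKC replaces compatibility characters (fullwidth, mathematical
    # bold, ...) with their plain equivalents; casefold() removes case
    # distinctions.
    return unicodedata.normalize("NFKC", text).casefold()

def loosely_equal(a, b):
    """True if a and b match ignoring case and compatibility variants."""
    return loose_key(a) == loose_key(b)

fullwidth = loosely_equal("A", "\uff21")      # FULLWIDTH LATIN CAPITAL LETTER A
math_bold = loosely_equal("a", "\U0001d400")  # MATHEMATICAL BOLD CAPITAL A
cyrillic = loosely_equal("A", "\u0410")       # CYRILLIC CAPITAL LETTER A

print(fullwidth, math_bold, cyrillic)  # True True False
```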

I dunno, these are just some of the issues that spring to mind.  It
would be awesome to do the switch while Squeak is still relatively
young, but it ain't trivial.


