3.7 moving to beta tomorrowish

Bill Schwab BSchwab at anest.ufl.edu
Wed Mar 31 01:32:47 UTC 2004


Ned,

======================
Anyone who sends UCS-2 (2 bytes per character) or UTF-16 over the wire is probably doing something wrong if most of the bytes are zero.
======================

Again, the device in question is quite old by computer standards, and was (arguably still is in some ways) ahead of its time, with a few staggering inefficiencies in places.


======================
UTF-8 is the preferred character encoding for this. If your characters are all in the ASCII set, you don't spend any extra bytes; the remaining Latin-1 characters take two bytes each.
======================

Sounds great.
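
To make the byte counts concrete, here is a minimal sketch in Python (used only for illustration; the sample strings and the helper names encoded_sizes and zero_byte_fraction are my own assumptions, not anything from the device or from Squeak). It compares the encoded size of a string under UTF-8, UTF-16 and Latin-1, and measures how much of a UTF-16 stream is zero bytes. Note that UTF-8 stays at one byte per character only for ASCII; accented Latin-1 characters take two.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies under a few encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # Big-endian variant chosen so no byte-order mark is prepended.
        "utf-16": len(text.encode("utf-16-be")),
        # Raises UnicodeEncodeError if text has characters outside Latin-1.
        "latin-1": len(text.encode("latin-1")),
    }


def zero_byte_fraction(text):
    """Fraction of bytes that are zero when text is sent as UTF-16."""
    data = text.encode("utf-16-be")
    if not data:
        return 0.0
    return data.count(0) / len(data)


if __name__ == "__main__":
    ascii_text = "x := 3 + 4"   # hypothetical sample: pure ASCII
    accented = "café"           # hypothetical sample: one non-ASCII Latin-1 character

    print(encoded_sizes(ascii_text))
    print(encoded_sizes(accented))
    print(zero_byte_fraction(ascii_text))
```

For the pure ASCII sample, UTF-16 doubles the size and half of the bytes on the wire are zero, which is the waste described above; UTF-8 costs nothing extra there, and one additional byte for the accented character.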


======================
You should look at Yoshiki's work. He adds new kinds of String and Character (as I recall); these carry their encoding with them. Since he's also concerned about translation and about other potential problems with Unicode and Asian languages (look up "Han Unification" for some pointers), these Strings (and Characters?) also can carry more information about desired rendering, language of origin, etc. But most of the Strings and Characters in the Squeak image should remain untouched.
======================

This is very encouraging, and suggests that Squeak will get this right.  My reason for raising the concern is simply that I didn't want to back out of the _/:= hack too soon.  Given Yoshiki's estimate of 3.8 or 4.0 for integration of his work, it seems wise to step back, let him give us proper back arrows, and then sort out what, if anything, remains to be done.  If, however, Squeak's Unicode support turned out to bloat images, it would go unused in many cases, and there would still be a need for a hack.

Bill



Wilhelm K. Schwab, Ph.D.
University of Florida
Department of Anesthesiology
PO Box 100254
Gainesville, FL 32610-0254

Email: bills at anest4.anest.ufl.edu
Tel: (352) 846-1285
FAX: (352) 392-7029



