Unicode support

Peter William Lount peter at smalltalk.org
Wed Sep 15 19:22:44 UTC 1999


Hi,

I've never liked the fact that strings are made up of bytes. This is not
an object-oriented approach to strings. It was a "space optimization" and
is a throwback to the days of limited-memory systems.

What about going to the most "general internal character" representation:
each character in a string is a REAL object instance.
GeneralCharacterStrings (for lack of a better name at the moment) are then
made up of 32-bit pointers to character objects. Each character object is
configured with information about how to represent it in different character
encodings. These encodings allow for conversion to and from the "general
internal character" representation. This would also allow conversions
between any two encodings, by going through the internal representation.
A GeneralCharacterString could then contain a mixture of characters from
any language or any special characters from any encoding.

Yes, this would take up more space for characters (32 bits vs. 8 or 16 or
21 bits), but it would be simpler and faster for string operations such as
indexing, since every element has the same width. Each GeneralCharacter
would be a unique instance, just like the way the existing 256 byte-valued
Character instances have been done in Smalltalk.
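
The trade-off can be put in numbers with a small sketch. Again this is
Python used only for illustration; storage_bytes and utf8_char_at are
hypothetical helper names, and UTF-8 is my choice of a variable-width
encoding to compare against.

```python
def storage_bytes(text, bits_per_slot=32):
    """Bytes needed when every character takes one fixed-width slot."""
    return len(text) * bits_per_slot // 8


def utf8_char_at(data, index):
    """Find the index-th character in UTF-8 bytes by scanning for lead bytes.

    Continuation bytes have the bit pattern 10xxxxxx, so every other byte
    starts a character. A fixed-width string needs no such scan.
    """
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    start = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(data)
    return data[start:end].decode("utf-8")


text = "na\u00efve"                    # five characters, one outside ASCII
fixed_size = storage_bytes(text)       # 5 slots * 4 bytes each
utf8_size = len(text.encode("utf-8"))  # 4 one-byte chars + 1 two-byte char
```

Here the fixed-width form costs 20 bytes against 6 for UTF-8, but fetching
the third character is a direct slot lookup instead of a scan over the
bytes.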

Peter William Lount
peter at smalltalk.org
http://www.smalltalk.org




