Unicode support

agree at carltonfields.com
Thu Sep 23 14:49:20 UTC 1999


> I don't think one single, all-purpose,
> FinalUltimateSuperString class which can handle all the possible special
> cases is desirable or doable.  But until someone actually sits down and
> starts cutting code it's just talk anyways.

True, just as a single, all-purpose abstract Number class does not, by itself, do everything that is necessary for crunching data.  Nevertheless, look at the power of that abstraction!  It gives us a seamless concept of number that is slammingly fast and efficient in 98% of the cases (SmallInteger), and that automatically coerces itself, whenever necessary, into slower, less space- and speed-efficient, but more general objects.

Why couldn't we do for GeneralizedString/String the same thing we do for Number/SmallInteger?

> This means that if we switch from a single-byte character encoding to
> Unicode, in the form where Unicode 'characters' are 16 bits wide, we add
> roughly 4/5 of a meg to the image size.  If we convert to something where
> each 'character' takes up 4 bytes we add about 2.5 megs to the image size.

Sure, but why would we do that?  Why wouldn't GeneralizedString have a subclass, ASCIIString or LatinUnicodeString, which stores its data as bytes until a Unicode char with a non-zero high byte is assigned to it, and only then coerces into the more bloated monster?  Perhaps it might even take intermediate steps through the quasi-ASCII representations until an indexing operation occurs?
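
To make the idea concrete, here is a rough sketch of the coercion in Python rather than Squeak.  The class name AdaptiveString and all of its methods are made up for illustration; nothing like it exists in the image.  It assumes the simplest scheme: one byte per character until a character with a code above 0xFF is stored, then a one-way switch to a wide array.  The intermediate quasi-ASCII steps are left out.

```python
from array import array


class AdaptiveString:
    """Hypothetical sketch: compact byte storage that widens on demand."""

    def __init__(self, text=""):
        self._wide = False
        self._data = bytearray()  # compact form: one byte per character
        for ch in text:
            self.append(ch)

    def _widen(self):
        # One-way coercion to the general form, much as a SmallInteger
        # result overflows into a large integer.  Typecode 'L' is an
        # unsigned long, which is at least 4 bytes per item.
        if not self._wide:
            # Go through a list: array() would otherwise reinterpret the
            # bytearray as raw machine values instead of one item per byte.
            self._data = array("L", list(self._data))
            self._wide = True

    def _store_code(self, ch):
        # Return the character code, widening first if it needs more
        # than one byte.
        code = ord(ch)
        if code > 0xFF:
            self._widen()
        return code

    def append(self, ch):
        self._data.append(self._store_code(ch))

    def __setitem__(self, index, ch):
        self._data[index] = self._store_code(ch)

    def __getitem__(self, index):
        return chr(self._data[index])

    def __len__(self):
        return len(self._data)

    def __str__(self):
        return "".join(chr(code) for code in self._data)

    @property
    def is_wide(self):
        return self._wide

    @property
    def bytes_per_char(self):
        return self._data.itemsize if self._wide else 1


s = AdaptiveString("Squeak")
print(s.is_wide, s.bytes_per_char)   # still in the compact form
s[0] = "\u05e9"                      # Hebrew letter shin forces the coercion
print(s.is_wide, s.bytes_per_char)   # now in the wide form
```

Callers see the same indexing protocol before and after the switch; only the hidden storage changes.  A string that never receives a wide character never pays for one.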

Now, since by definition all the strings presently on the system can be LatinUnicodeStrings, the only space cost (if done right) would be the space for the code and extra classes.  I don't see any reason why there would be an

On the other hand, if you live in Israel, China, Japan or Korea, for example, this can permit a seamless tradeoff of space for functionality.





More information about the Squeak-dev mailing list