Unicode support

Peter William Lount peter at smalltalk.org
Tue Sep 21 19:49:03 UTC 1999


Hi,

My main complaint with the design of Strings that use "byte encodings" is
that they bottom out to bytes too soon. The existing String class bottoms
out immediately to bytes. I don't consider this a good object design. It
was good at the time Smalltalk was built, due to limited hardware resources,
but these concerns matter less on today's systems unless you are running on a
PDA or require an extremely small footprint. The "objectness" (if that's a
word) of byte based strings is lower than it would be for a "string made
up of character objects".

I agree that Strings based on "byte encodings" are important and must be
provided to be compatible with the various standard character sets of the
world. However, these strings are "optimizations" and thus are not pure
objects since they bottom out in bytes.

Yes, all of Smalltalk must eventually bottom out into bytes or primitives.
There does not seem to be any way around this fact of computing nature (for
the current technologies), although there have been some attempts to
create memory systems based on objects rather than bytes. As long as
we are using the popular computer architectures of the day, memory will
continue to be allocated in bytes.

I just want Strings to be vastly more object oriented than they currently
are. Nobody else in our email group seems to support the notion or to see
the problems that I see with strings being bytes. If you do see these
problems I'd appreciate hearing from you. In the discussion so far, people
have cited "black box encapsulation" and "protocols" as reasons why we
shouldn't care about the implementation details of a generic object based
string class. I say that implementation is very important. While all string
classes should share a common protocol where it makes sense, I don't think
that this is a very strong argument for keeping a String implementation
byte oriented instead of object oriented.

The space consumption argument against an object oriented String
implementation is that each character object takes more space. Yes, a
character object based string would use 4 bytes per character as opposed to
1 byte per character. This is a definite concern and needs some study to
find out how serious it is.
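To put rough numbers on it, here is a back-of-the-envelope sketch (plain
Python, purely illustrative, not Squeak code) of the storage difference:

```python
# Back-of-the-envelope storage comparison (illustrative only, not Squeak):
# a byte encoded string stores 1 byte per character, while a string of
# full character values needs 4 bytes per character to cover any character.

def byte_string_bytes(n_chars):
    """Storage for a byte based string: 1 byte per character."""
    return n_chars * 1

def character_string_bytes(n_chars):
    """Storage for a 32-bit character based string: 4 bytes per character."""
    return n_chars * 4

n = 1_000_000  # a million-character document
print(byte_string_bytes(n))       # 1000000 bytes, about 1 MB
print(character_string_bytes(n))  # 4000000 bytes, about 4 MB
```

A fixed 4x factor, in other words; whether that matters depends on how much
of a typical image is actually string data.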

Speed issues may be a concern for "complicated byte encodings" like
Unicode, due to the "magic" that must be done to meet a specification that
uses special encodings. No wonder OpenStep's text system implementation
slows down when dealing with large strings, even though their system is
largely written in C for speed, with lots of Objective-C object wrappers.
In many cases it is actually faster to process 32 bits at a time instead of
bytes.
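The "magic" in question is mostly variable-length decoding. A quick sketch
in present-day Python (used only as an illustration, not as Squeak code)
shows that a UTF-8 character can occupy anywhere from 1 to 4 bytes, which is
why finding the Nth character requires scanning rather than simple
arithmetic:

```python
# In UTF-8 the byte length of a character varies with the character,
# so the Nth character cannot be found by simple pointer arithmetic.
for ch in ["a", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# 1, 2, 3, and 4 bytes respectively.
# With fixed 32-bit characters, character N is simply at byte offset 4 * N.
```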

Regardless of these reasons not to have an "object oriented string class",
it still makes sense to have one. A byte oriented string class is just too
limited: I can't put any characters I want into it. A pure object oriented
string class enables ALL characters from ALL languages to exist in the same
string object. This simplifies displaying characters of ANY character set
or font on the screen as a single unit. It also doesn't require the user to
choose which character encoding to use: Japanese this or that (Japanese has
at least three separate encodings excluding Unicode), Unicode, ASCII,
UTF-8, etc., since users can mix any characters they wish into a string
when typing into an edit box or into document text.

Conversion between string encodings is very important. However, it is
limited by the fact that different encodings don't have characters that
other encodings have. This simple fact is the reason that a generic object
oriented implementation of Strings (one that doesn't bottom out in bytes)
is required. It would be able to contain ANY and ALL characters from all
encodings, making it the most general string representation possible. It
also makes it ideal as a "neutral" format for converting from one string
encoding to another.
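The lossiness is easy to demonstrate. In this Python sketch (again an
analogy, with Python's str standing in for the proposed neutral character
based string), pushing text through a limited byte encoding destroys
characters, while the neutral form preserves them:

```python
# Converting through a limited byte encoding loses characters it cannot hold.
text = "café ☕"
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)  # 'caf? ?' -- the é and the ☕ are gone for good
# A character based string round-trips losslessly through a full-coverage
# encoding, so it can serve as the neutral hub between all the others.
assert text.encode("utf-8").decode("utf-8") == text
```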

As for the comments about "this design including that one or vice versa":
it would seem that the most general design includes the less general, and
in this case the "general object oriented string class" design must include
all the other encodings in order to work. I have never said "do away with
byte encodings", as this would be counterproductive and would ignore the
existing character set standards. I am simply talking about creating the
most general string implementation possible.

The current design, with separate limited byte based encodings, means that
you are always manipulating strings in one limited encoding or another.
Each byte based encoding has its limits and may not respond the way you
think it will. A generic object based string class would always behave the
same. Yes, when converting to byte based encodings it would potentially
lose information, but this is a problem with the nature of byte based
encodings and will exist as long as byte based encodings are popular and in
use.

In summary, byte based string encodings are limited by their
specifications. They can represent only a "small" subset of all the
possible characters in all the languages. While Unicode provides a large
number of characters, it does not cover them all. Converting between byte
based character encodings entails a potential loss of information due to
the limits of the various byte based character encodings. This is a
significant problem. When programmers choose or enforce their "human
language" on their users, they limit the range of "human languages", and
thus of users, that can make use of their programs. A "generic object
oriented character based string class" is the most general string object
design possible and has many benefits. Special byte based string encodings
are still required to support the different standards around the world. A
generic object based string is a perfect neutral format to convert byte
based encodings into when converting between string encodings, reducing the
number of converters needed: only one converter needs to be added between
the neutral object oriented string and each byte based encoding that is
supported.
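The arithmetic behind the converter claim (my own sketch, in Python for
concreteness): with N supported byte encodings, direct pairwise conversion
needs a converter for every ordered pair, while routing through one neutral
format needs only one converter per encoding:

```python
# Converter counts for N byte based encodings.
def pairwise(n):
    """One converter for each ordered pair of distinct encodings."""
    return n * (n - 1)

def via_neutral(n):
    """One (two-way) converter between each encoding and the neutral string."""
    return n

for n in (3, 10, 20):
    print(n, pairwise(n), via_neutral(n))
# 3 encodings: 6 vs 3; 10 encodings: 90 vs 10; 20 encodings: 380 vs 20
```

The gap grows quadratically, which is why hub-and-spoke conversion schemes
are attractive as the number of supported encodings rises.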

Please remember that Smalltalk was born on machines with limited power.
Today we have 32 bit processors and are on the verge of 64 bit processors
becoming mainstream. We have huge RAM and vast disk storage that didn't
exist 30 years ago when Smalltalk was conceived. It makes sense to
modernize every aspect of the design of Smalltalk. It seems to me that this
is what Squeak is about - pushing the boundaries. Sticking to limited
byte-only encodings is not progress. Assuming that space is not an issue,
the only "required" reason to keep byte encodings is to support the current
character set standards such as ASCII, Unicode, UTF-8, etc.

There are many ways to invent the future. The question is will you be
creating a future that contains pure object systems or will you be
supporting a future that keeps things limited by thrusting the low level
bytes into our faces all the time? (By this I mean objects like the
existing String class that bottom out into bytes with just one level of
objects).

Since this issue is not likely to be resolved, and it's unlikely that some
of you who responded with strong opinions will be swayed, I propose that we
concentrate on ensuring that the "string protocols" and hierarchy include a
generic object based string in their design. This is to support the pure
object future as much as is possible at this time.

Finally, I propose this generic object based string class (and related
objects and protocols) in order to fully support a true international
version of Smalltalk. Remember the new Smalltalk motto: all objects, all
the time.

All the best,

Peter William Lount
 




