Unicode support

John Duncan jddst19+ at pitt.edu
Wed Sep 22 23:29:38 UTC 1999


> 1.  Complex systems are EVOLUTIONARY - they grow incrementally.
> They DO NOT spring fully formed from Zeus's forehead.
>
> 2. Nothing useful ever gets done by committee.
>
> 3. Just DO something.
>
> Ok, enough of my cynical rant for now.  I don't mean to take the
> wind out of anybody's sails, but I've seen too many projects start out
> with "Let's really do this right and show 'em all!" only to end up going
> nowhere, fast because they're trying to do too much all at once.  Small,
> incremental improvements always have, and always will rule. Revolution is
> the result of undersampling the observed system.
>

Sounds good on paper, but:

1. I'm not proposing that we design the thing by committee. I'm saying
that we, as a community or group, do the research. We aren't in 1951
saying "Wouldn't it be neat if a computer could store text?" Rather,
we are in 1999, after about 75 "standard" text technologies have come out,
wondering about making Squeak a good text processor. The activity on
this list has demonstrated that a variety of backgrounds have
contributed to many differing ideas about text.

2. For six years, a short while compared to almost 50 years of text
processing, we have had an international standard character set that
actually expresses virtually all communication on the planet. This
character set was made possible by community research and a consortium
of advocates. It comes complete with a large number of ancillary
standards, such as encoding formats for strings, canonical forms,
normalization algorithms, etc. These are all the result of someone
else "doing it", usually many people "doing it", and then a
precipitation of the available technology into a solid standard.

3. Any new text system for Squeak, whether starting with the
representation of just Hebrew or the full array of Unicode, is
evolutionary in consideration of the vast amount of research that has
gone into text processing.

4. If we "just do it", we will reinvent a hundred wheels, make
thousands of mistakes already made, and end up with something that no
one wants to touch. These discussions are intended to get a big
picture.
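
To make point 2 a little more concrete, here is a minimal sketch of what
the canonical forms and normalization algorithms buy us. It is written in
Python, not Squeak, and leans on Python's standard unicodedata module;
the helper name same_text is hypothetical and only there for
illustration. The same accented letter can be stored as one code point or
as a base letter plus a combining mark, and a naive comparison treats the
two as different strings.

```python
import unicodedata

# Two spellings of the same visible text, "e" with an acute accent.
composed = "\u00e9"      # single code point U+00E9
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A raw comparison sees different code point sequences.
raw_equal = composed == decomposed

# Comparing the normalized forms treats them as the same text.
normalized_equal = same_text(composed, decomposed)

# NFD goes the other way and splits the composed form into two code points.
nfd_length = len(unicodedata.normalize("NFD", composed))
```

Any text system we build would have to decide where this kind of
normalization happens, which is one of the questions the standard has
already worked through.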

We have a number of competing ideas for protocol and representation.
Which one do we choose? I say that first we could look at a few
existing systems that seem to work, and find out why some of them fall
short. We can also reason intuitively about the problems with certain
approaches. For example, if we just subclass String and make the
representation UTF-8, it'll be a dog to work with large amounts of
text that may contain characters outside ASCII, because each of those
takes a variable number of bytes and finding the nth character means
scanning from the start. If we make the representation UCS-4, strings
will be easier to work with but will take more space. Is that bad? If
we take Peter's idea of using character objects, there will be more
potential for waste, but the representation may be even better. If we
take the "string is a list of substrings" idea, we have no idea where
that will go.
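
The UTF-8 versus UCS-4 trade-off can be sketched in a few lines. This is
Python rather than Squeak, the helper names (utf8_width, utf8_char_at,
ucs4_char_at) are hypothetical, and it assumes well-formed input; it is
only meant to show why indexing is cheap in a fixed-width representation
and requires a scan in a variable-width one.

```python
def utf8_width(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Variable width: walk over the first `index` characters to find the offset."""
    pos = 0
    for _ in range(index):
        pos += utf8_width(data[pos])
    return data[pos:pos + utf8_width(data[pos])].decode("utf-8")


def ucs4_char_at(data: bytes, index: int) -> str:
    """Fixed width: character i always occupies bytes 4*i to 4*i+3."""
    return data[4 * index:4 * index + 4].decode("utf-32-be")


# Four ASCII characters followed by four Hebrew letters.
text = "abc \u05e9\u05dc\u05d5\u05dd"
utf8 = text.encode("utf-8")
ucs4 = text.encode("utf-32-be")

utf8_size = len(utf8)   # 4 one-byte characters + 4 two-byte characters
ucs4_size = len(ucs4)   # 8 characters at 4 bytes each
```

For this sample, UTF-8 needs 12 bytes and UCS-4 needs 32, but
ucs4_char_at is a single slice while utf8_char_at has to step over every
earlier character. On a line of text that hardly matters; on a large
document it is exactly the cost I am worried about.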

So we have to look at the possibilities, and put everything in
perspective. The merits of one system over another will not show until
someone writes a doctoral thesis about totem use by ancient Egyptians
in Squeak. For strings of 70 or 80 characters, all of these approaches
will serve about equally well.

So, go ahead, come up with prototypes. I didn't say that was wrong.
What I think would be bad is for the Multilingual project to just come
out with an implementation without seriously considering all the work
that has been done before us.

At the very least, all of us who are interested in this should sit
down one evening and read the standards and notes about Unicode, GX
Typography, and any other documents that include useful answers to
many of the problems we have and haven't thought of. Then, when we
write our prototypes, each one of us will have a much better frame of
reference.

My committee research idea is based on the concept that we get most of
our information from the Swiki. When we read little bits of material
that are not fully summarized in the big documents, we can put them on
the Swiki so that we all get a better understanding. This still does
not preclude anyone from deviating from anything, because there will
be no actual "decided upon" best practice until we see some
implementations.

But I don't understand your sentiment. Do requirements (stories,
whatever) usually come from someone who just "did them"? Or are they
usually written and debated by a committee? I'd say the latter,
because that's what I'm used to. We are talking about requirements
(and study topics) for a seriously important change to the way Squeak
does business. The text implementation will underlie a lot of what
goes on. It's not something that can be "just done".

We already have a preliminary implementation as well, in the existing
Character-String-Paragraph classes. Some have already explained why
they think merely extending it will not be sufficient. Others have
expressed the desire to keep it similar to the way it is.

-John




