Unicode support

John Duncan jddst19+ at pitt.edu
Wed Sep 22 04:50:40 UTC 1999


> If, for instance, one needed to convey a
> language which wasn't at all character based --
> say, Egyptian hieroglyphics perhaps?  -- don't
> you think the string model would just break?
> I don't think it could be used for general
> glyphs, because they would suffer kerning and
> other ills a good deal.

Hmm. I don't know. Oddly enough, there is a script proposed for
ISO 10646-2 (UCS-4), Plane 1, that would provide Basic Egyptian
Hieroglyphics (BEH), containing 798 characters. The proposal does not
describe implementation details for writing in BEH, but it shows the
case is being considered. Linear B is also proposed for Plane 1, which
is where characters that see little use in everyday communication go.
Most people will only need BEH for academic work.
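
To make the Plane 1 arithmetic concrete: a character up there fits in
one 32-bit word in UCS-4, but needs a surrogate pair in 16-bit
encodings. A rough sketch in Python, using 0x10000, the first Plane 1
position, as a stand-in since no hieroglyph code points are assigned
yet:

    # Surrogate-pair math from the Unicode standard: a code point
    # above 0xFFFF is split into a high and a low surrogate so it
    # can travel through 16-bit encodings.
    def utf16_surrogates(code_point):
        assert 0x10000 <= code_point <= 0x10FFFF
        v = code_point - 0x10000
        high = 0xD800 + (v >> 10)     # leading surrogate
        low = 0xDC00 + (v & 0x3FF)    # trailing surrogate
        return high, low

    print([hex(u) for u in utf16_surrogates(0x10000)])
    # ['0xd800', '0xdc00']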

Someone on the list already suggested developing a fast cache
algorithm, sized at 256 characters. I suggest that this is an
implementation detail, and we may well find that (1) the ISO-8859-1
characters should be cached separately in 256 fixed positions, and (2)
another cache should be developed that is expandable to at least 4 *
32K entries, one character per 32-bit word. Most Japanese
communication should need no more than about 2,400 characters: roughly
100 kana and some 2,200 standardized kanji. But in Chinese, and
especially Korean, communication, that number could easily grow to
20,000. The system should cache the characters actually in use.
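
A rough sketch of the "cache what's actually in use" idea, in Python;
the render_glyph callback and the capacity are illustrative
assumptions, not a design:

    from collections import OrderedDict

    class GlyphCache:
        """Bounded LRU cache of rendered glyphs, keyed by code point."""
        def __init__(self, capacity=4 * 32 * 1024):
            self.capacity = capacity
            self.entries = OrderedDict()  # code point -> glyph

        def get(self, code_point, render_glyph):
            if code_point in self.entries:
                self.entries.move_to_end(code_point)  # recently used
                return self.entries[code_point]
            glyph = render_glyph(code_point)  # miss: render and insert
            self.entries[code_point] = glyph
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the stalest
            return glyph

The fixed 256-entry ISO-8859-1 cache could then sit in front of this
as a plain array.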

Indexing text in languages other than English is a sticky business in
any encoding. Unlike French, German, and Spanish, whose accented
letters have precomposed forms, most languages use combining
sequences: several code points that should be treated as one
character. Indexing by code point can then fail.
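
A small Python illustration of the failure mode:

    # U+00E9 is precomposed 'e-acute'; 'e' + U+0301 (combining
    # acute) is the same character written as two code points.
    precomposed = "\u00e9"
    combined = "e\u0301"
    print(len(precomposed), len(combined))  # 1 2 -- indexes disagree
    print(precomposed == combined)          # False, yet both read 'e-acute'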

Developing algorithms to collate languages properly is another
concern. Unicode specifies a collation algorithm, and it requires
normalizing the string before comparing it. This is because there are
at least two ways of encoding "office" in Unicode: o-f-f-i-c-e, and
o-ffi-c-e with the ffi ligature.
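
Python's unicodedata module (standing in for whatever Squeak would
provide) shows why normalization has to come first:

    import unicodedata

    plain = "office"                # o-f-f-i-c-e
    ligature = "o\ufb03ce"          # o-ffi-c-e, with U+FB03
    print(plain == ligature)        # False: naive comparison fails
    # Compatibility normalization (NFKC) folds the ligature apart,
    # giving a canonical form that collation can work on.
    print(unicodedata.normalize("NFKC", ligature) == plain)  # True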

As I said, this is not a situation where someone should "get his hands
dirty" and rush something out the door. There's too much prior
research to ignore. Let's not talk implementation until we have a Wiki
sub-site that discusses the important features of character encoding
and text implementation. I think text handling is more or less broken
in all systems, and here we have a chance to do it right, open source,
for all to see. Then we can come out with a Squeak word processor that
25 experts from 25 countries use to produce the demo for their paper
on archaic languages. And we can make it fast, beautiful, and easy to
use, as long as we have our heads on straight while we do the work.

(Having the Swiki up and running would greatly aid this effort :) )

-John




