Unicode support

agree at carltonfields.com
Wed Sep 22 19:23:21 UTC 1999


> > Indexing of text in languages other than English is a sticky business
> > in any encoding, because most languages other than French, German and
> > Spanish use composed characters that should be treated as one. Thus,
> > the indexing can fail.
>
> It's also sticky in English -- case.  Also it would be nice to
> index compoundWordTokens by their individual words so that a search for
> 'top' would find 'getTopElement', but not 'forgetOperation'.
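The subword idea above can be sketched quickly; here is a minimal illustration in Python rather than Squeak's Smalltalk, with `subwords` and `matches` as made-up helper names. It splits an identifier at camelCase boundaries so a case-folded search for 'top' matches 'getTopElement' but not 'forgetOperation':

```python
import re

def subwords(identifier):
    # Split a camelCase identifier at lowercase->uppercase boundaries,
    # keeping runs of capitals and digits as their own tokens.
    return re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+', identifier)

def matches(query, identifier):
    # A query hits only whole subwords, compared case-insensitively.
    return query.lower() in (w.lower() for w in subwords(identifier))

# matches('top', 'getTopElement') is true: subwords are get/Top/Element.
# matches('top', 'forgetOperation') is false: subwords are forget/Operation.
```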

It appears to me there is a reasonable generalization that captures some of these notions.  First, since an index subject *is* a GeneralizedString, one can express compositions in the same manner as is done in the string itself.  Second, for handling character-by-character transforms, like case, one can add to the index the notion of an equivalence map, which maps characters to canonical characters, probably expressed with a block as is presently done with filtering conventions.  Of course, this doesn't adequately handle composed characters and/or glyphs.  Indexing on a generalized string, by its nature, presumes index-based access to characters, regardless of the underlying representation in a physical array.  You can expose the underlying composition sequence if you like, but generalized indexing will treat a composition sequence as more than one character.
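To make the equivalence-map notion concrete, here is a small sketch in Python (the block-based version in Squeak would be analogous); `make_equivalence_map` and `canonical_key` are hypothetical names, and the map here handles only the character-by-character case transform, not composed characters:

```python
def make_equivalence_map():
    # The equivalence map takes each character to its canonical character;
    # case folding stands in for the block-based filtering convention.
    return lambda ch: ch.lower()

def canonical_key(s, equiv):
    # Index entries are stored under their canonical form, applying the
    # map character by character, so lookups ignore folded distinctions.
    return ''.join(equiv(ch) for ch in s)

# Build a toy index: 'Top' and 'TOP' land under the same canonical key.
equiv = make_equivalence_map()
index = {}
for word in ['Top', 'TOP', 'bottom']:
    index.setdefault(canonical_key(word, equiv), []).append(word)
```

A lookup then canonicalizes the query the same way, so `index[canonical_key('top', equiv)]` finds both 'Top' and 'TOP'.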

Or not.  What do you think?


More information about the Squeak-dev mailing list