Unicode support
agree at carltonfields.com
Wed Sep 22 19:23:21 UTC 1999
> > Indexing of text in languages other than English is a sticky business
> > in any encoding, because most languages other than French, German, and
> > Spanish use composed characters that should be treated as one. Thus,
> > the indexing can fail.
>
> It's also sticky in English -- case. Also, it would be nice to index
> compoundWordTokens by their individual words, so that a search for
> 'top' would find 'getTopElement', but not 'forgetOperation'.
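The compound-word idea above can be sketched quickly. This is not Squeak code, just a Python illustration of the tokenization being proposed; `camel_tokens` and `matches` are hypothetical names, not anything in the image:

```python
import re

def camel_tokens(identifier):
    """Split a camelCase identifier into its component words, lowercased.

    'getTopElement' -> ['get', 'top', 'element']
    """
    return [t.lower()
            for t in re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', identifier)]

def matches(query, identifier):
    # A query matches only whole component words, so 'top' hits
    # 'getTopElement' but not 'forgetOperation'.
    return query.lower() in camel_tokens(identifier)

print(matches('top', 'getTopElement'))    # True
print(matches('top', 'forgetOperation'))  # False
```

The point is that indexing on word boundaries inside identifiers, rather than raw substrings, is what keeps 'forgetOperation' out of the results.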
It appears to me there is a reasonable generalization that captures some of these notions. First, since an index subject *is* a GeneralizedString, one can express compositions in the same manner as is done in the string itself. Second, for handling character-by-character transforms, like case, one can add to the index the notion of an equivalence map, which maps characters to canonical characters, probably expressed with a block as is presently done with filtering conventions. Of course, this doesn't adequately handle composed characters and/or glyphs.

Indexing on a generalized string, by its nature, presumes index-based access to characters, regardless of the underlying representation in a physical array. You can expose the underlying composition sequence if you like, but generalized indexing will treat a composition sequence as more than one character.
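A minimal sketch of the equivalence-map idea, in Python rather than Smalltalk: the index runs every character through a caller-supplied mapping (standing in for the block) before storing or comparing keys. `SimpleIndex` and `canonical` are hypothetical names for illustration; as noted above, a per-character map handles case but not composed characters:

```python
def canonical(string, equivalence=lambda ch: ch.lower()):
    """Map each character through the equivalence function before indexing."""
    return ''.join(equivalence(ch) for ch in string)

class SimpleIndex:
    """Toy index that canonicalizes keys character by character."""

    def __init__(self, equivalence=lambda ch: ch.lower()):
        self.equivalence = equivalence
        self.entries = {}

    def add(self, key, value):
        # Store under the canonical form so equivalent keys collide.
        self.entries.setdefault(canonical(key, self.equivalence), []).append(value)

    def lookup(self, key):
        # Canonicalize the query the same way, so 'UNICODE' finds 'Unicode'.
        return self.entries.get(canonical(key, self.equivalence), [])

idx = SimpleIndex()
idx.add('Unicode', 1)
print(idx.lookup('UNICODE'))  # [1]
```

Because the map is applied one character at a time, a precomposed character and its base-plus-combining-mark spelling still canonicalize differently, which is exactly the gap conceded above.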
Or not. What do you think?
More information about the Squeak-dev mailing list