Unicode support
Michael Klein
Mklein at nts.net
Wed Sep 22 20:34:47 UTC 1999
I think that canonicalization is good for fixing deviations from the
Unicode ideal.
o-<ffi ligature>-c-e -> o-f-f-i-c-e
K-<combining umlaut>-o-n-i-g-s-b-e-r-g
canonicalizes to
K-<o with umlaut>-n-i-g-s-b-e-r-g
This is just approximate; I left my Unicode book at home. I'm sure it
has more to say on this subject.
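
A minimal sketch of the two examples above, using Python's standard unicodedata module rather than anything from Squeak. Two details are worth hedging: in Unicode the combining mark follows its base letter (the example above writes it first), and the ffi ligature has only a compatibility decomposition, so it is the compatibility forms (NFKC/NFKD), not the canonical ones, that expand it. The variable names are illustrative only.

```python
import unicodedata

# "Ko" + COMBINING DIAERESIS (U+0308) + "nigsberg": a decomposed spelling.
decomposed = "Ko\u0308nigsberg"

# NFC composes the base letter and the combining mark into U+00F6.
composed = unicodedata.normalize("NFC", decomposed)

# "o" + LATIN SMALL LIGATURE FFI (U+FB03) + "ce".
ligature = "o\ufb03ce"

# Canonical normalization (NFC) leaves the ligature alone;
# compatibility normalization (NFKC) expands it to "ffi".
nfc_office = unicodedata.normalize("NFC", ligature)
nfkc_office = unicodedata.normalize("NFKC", ligature)
```

After normalization, the composed form is one code point shorter than the decomposed one, and only the NFKC result compares equal to the plain string "office".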
As far as equivalence maps go... I think what is needed is more like a
metric space than equivalence classes.
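
One possible reading of the metric-space idea, sketched in Python: instead of asking whether two strings fall in the same equivalence class, ask how far apart they are. The choice of edit distance and the helper names fold and edit_distance are assumptions made for illustration; the message does not say which distance is intended. Because the strings are folded first, distinct spellings can be at distance zero, so strictly this is a pseudometric.

```python
import unicodedata

def fold(text):
    # Assumed folding step: compatibility normalization plus case folding.
    return unicodedata.normalize("NFKC", text).casefold()

def edit_distance(a, b):
    """Levenshtein distance between the folded forms of a and b."""
    a, b = fold(a), fold(b)
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

With this sketch, a search could accept any indexed string within some small distance of the query, so "Konigsberg" would be one step away from the spelling with the umlaut rather than simply unequal to it.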
-- Mike Klein
>-----Original Message-----
>From: agree at carltonfields.com [SMTP:agree at carltonfields.com]
>Sent: Wednesday, September 22, 1999 12:23 PM
>To: Michael Klein; squeak at cs.uiuc.edu
>Cc: The recipient's address is unknown.
>Subject: RE: Re: Unicode support
>
>> > Indexing of text in languages other than English is a sticky business
>> > in any encoding, because most languages other than French, German and
>> > Spanish use composed characters that should be treated as one. Thus,
>> > the indexing can fail.
>> > It's also sticky in English -- case. Also it would be nice to index
>> compoundWordTokens by their individual words so that a search for
>> 'top' would find 'getTopElement', but not 'forgetOperation'.
>
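
The compound-token indexing described in the quoted text could be sketched as follows, in Python rather than Squeak. The splitting rule (break at capital letters) and the names split_compound and build_index are assumptions for illustration, not an existing API.

```python
import re
from collections import defaultdict

def split_compound(token):
    """Split a camelCase token into lowercase words."""
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", token)
    return [part.lower() for part in parts]

def build_index(tokens):
    """Map each component word to the set of tokens containing it."""
    index = defaultdict(set)
    for token in tokens:
        for word in split_compound(token):
            index[word].add(token)
    return index

index = build_index(["getTopElement", "forgetOperation"])
# Look up a whole word, not a substring.
matches = sorted(index.get("top", set()))
```

Because the lookup is by whole component word, 'top' matches 'getTopElement' but not 'forgetOperation', even though the latter contains the letters t-o-p across a word boundary.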
>It appears to me there is a reasonable generalization that captures some of
>these notions. First, as an index subject *is* a GeneralizedString, one can
>express compositions in the same manner as is done in the string itself.
>Second, for handling character-by-character transforms, like case, one can
>add to the index the notion of an equivalence map, which maps characters to
>canonical characters, probably expressed with a block as is presently done
>with filtering conventions. Of course, this doesn't adequately handle
>composed characters and/or glyphs. Indexing on a generalized string, by its
>nature, presumes index-based access to characters, regardless of the
>underlying representation in a physical array. You can expose the underlying
>composition sequence if you like, but generalized indexing will treat a
>composition sequence as more than one character.
>
>Or not. What do you think?
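
A rough sketch of the equivalence-map idea from the quoted message, in Python, with a plain function standing in for the Smalltalk block. The names build_folded_index and lookup, and the use of str.lower as the default map, are assumptions for illustration. The last lookup shows the limitation the quoted text points out: a character-by-character map does not reconcile a composition sequence with its precomposed form.

```python
from collections import defaultdict

def build_folded_index(words, equivalence=str.lower):
    """Index words under keys built by mapping each character separately."""
    index = defaultdict(set)
    for word in words:
        key = "".join(equivalence(ch) for ch in word)
        index[key].add(word)
    return index

def lookup(index, query, equivalence=str.lower):
    """Find indexed words whose mapped form equals the mapped query."""
    key = "".join(equivalence(ch) for ch in query)
    return index.get(key, set())

index = build_folded_index(["Squeak", "K\u00f6nigsberg"])

# Case differences disappear under the per-character map.
case_hit = lookup(index, "SQUEAK")

# A decomposed query (o + combining diaeresis) is still a different
# character sequence, so the per-character map alone does not find it.
composed_miss = lookup(index, "Ko\u0308nigsberg")
```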