Unicode support

Michael Klein Mklein at nts.net
Wed Sep 22 20:34:47 UTC 1999


I think that canonicalization is good for fixing deviations from the
Unicode ideal.

o-<ffi ligature>-c-e  ->  o-f-f-i-c-e

K-o-<combining diaeresis (umlaut)>-n-i-g-s-b-e-r-g
	canonicalizes to
K-<o with umlaut>-n-i-g-s-b-e-r-g
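
For anyone who wants to try the two cases above, here is a minimal sketch
using Python's standard unicodedata module (Python is my choice for
illustration, not something from this thread). One caveat: the ffi ligature
has only a compatibility decomposition, so it is expanded by the NFKC/NFKD
forms rather than by the purely canonical NFC/NFD forms. The variable names
are made up for the example.

```python
import unicodedata

# "office" spelled with the single ffi ligature character U+FB03.
ligature_word = "o\uFB03ce"

# Canonical composition (NFC) leaves the ligature untouched ...
nfc_word = unicodedata.normalize("NFC", ligature_word)

# ... while compatibility composition (NFKC) expands it to f, f, i.
nfkc_word = unicodedata.normalize("NFKC", ligature_word)

# "o" followed by the combining diaeresis U+0308 (the mark comes after
# the base letter) composes canonically to the single character U+00F6.
decomposed = "Ko\u0308nigsberg"
composed = unicodedata.normalize("NFC", decomposed)
```

After normalization both spellings of the city name compare equal as plain
strings, which is the property an index needs.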

This is just approximate; I left my Unicode book at home.  I'm sure it
has more to say on this subject.

As far as equivalence maps go... I think what is needed is more like a
metric space than equivalence classes.
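
One possible reading of "metric space", sketched in Python: instead of asking
whether two strings fall into the same class, ask how far apart they are.
Levenshtein edit distance is a standard metric on strings; it is only an
assumption that something like it is what is wanted here, and the function
name is made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            cur.append(min(
                prev[j] + 1,                       # delete char_a
                cur[j - 1] + 1,                    # insert char_b
                prev[j - 1] + substitution_cost,   # substitute or match
            ))
        prev = cur
    return prev[-1]
```

With this, a search could rank "Konigsberg" as one step away from the
spelling with the umlaut rather than reporting a flat mismatch.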

-- Mike Klein

>-----Original Message-----
>From:	agree at carltonfields.com [SMTP:agree at carltonfields.com]
>Sent:	Wednesday, September 22, 1999 12:23 PM
>To:	Michael Klein; squeak at cs.uiuc.edu
>Cc:	The recipient's address is unknown.
>Subject:	RE: Re: Unicode support
>
>> > Indexing of text in languages other than English is a sticky business
>> > in any encoding, because most languages other than French, German and
>> > Spanish use composed characters that should be treated as one. Thus,
>> > the indexing can fail.
>> It's also sticky in English -- case.  Also it would be nice to index
>> compoundWordTokens by their individual words so that a search for
>> 'top' would find 'getTopElement', but not 'forgetOperation'.
>
>It appears to me there is a reasonable generalization that captures some of
>these notions.  First, as an index subject *is* a GeneralizedString, one can
>express compositions in the same manner as is done in the string itself.
>Second, for handling character-by-character transforms, like case, one can
>add to the index the notion of an equivalence map, which maps characters to
>canonical characters, probably expressed with a block as is presently done
>with filtering conventions.  Of course, this doesn't adequately handle
>composed characters and/or glyphs.  Indexing on a generalized string, by its
>nature, presumes an index-based access to characters, regardless of the
>underlying representation in a physical array.  You can expose the underlying
>composition sequence if you like, but generalized indexing will treat a
>composition sequence as more than one character.
>
>Or not.  What do you think?
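
The equivalence-map idea in the quoted message, together with the
compound-word indexing mentioned earlier, could be sketched roughly as
follows. This is Python rather than Smalltalk, and WordIndex,
split_compound, and canonical_char are hypothetical names invented for the
sketch; the callable stands in for the "block" that maps each character to
its canonical character. The splitter only handles plain ASCII camelCase.

```python
import re
from collections import defaultdict


def split_compound(token: str) -> list[str]:
    """Split an ASCII camelCase token into its words,
    e.g. 'getTopElement' -> ['get', 'Top', 'Element']."""
    return re.findall(r"[A-Z]?[a-z]+", token)


class WordIndex:
    """Index tokens by their individual words, compared through a
    character-by-character equivalence map (hypothetical design)."""

    def __init__(self, canonical_char=str.lower):
        # canonical_char maps one character to its canonical character.
        self.canonical_char = canonical_char
        self.entries = defaultdict(set)

    def _key(self, word: str) -> str:
        # Apply the equivalence map to every character in turn.
        return "".join(self.canonical_char(ch) for ch in word)

    def add(self, token: str) -> None:
        for word in split_compound(token):
            self.entries[self._key(word)].add(token)

    def lookup(self, word: str) -> set:
        return self.entries.get(self._key(word), set())


index = WordIndex()
index.add("getTopElement")
index.add("forgetOperation")
hits_for_top = index.lookup("top")
```

Because whole words are the index keys, 'top' matches the word inside
'getTopElement' but not the letters that happen to span 'forgetOperation'.
As the quoted message says, a per-character map like this does not deal with
composed characters; those would need normalizing before indexing.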




