[squeak-dev] Re: [Unicode] Normalization -- canonical equivalence vs compatibility equivalence

EuanM euanmee at gmail.com
Tue Dec 8 19:59:57 UTC 2015


canonical equivalence
================
  aString = anotherString
where both strings are the same sequence of abstract characters,
BUT each individual character MAY (or may not) be represented by
a different valid Unicode representation of that character

e.g. #(00c5) = #(212b) = #(0041 030a)

"The same meaning when printed or displayed"

Every tuple of canonically equivalent sequences is also compatibility
equivalent.

Actually yes - the easy way to define it would be:
a canonically equivalent string is the same sequence of the same
abstract characters as the string you are comparing it to (but it
may use different codepoints).

The sequence will have the same glyph appearance.
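
A minimal sketch of a canonical-equivalence test for the three representations above. It uses Python's standard unicodedata module rather than Squeak, and the helper name canonically_equivalent is hypothetical. The idea is that two strings are canonically equivalent when they normalise to the same canonically decomposed form (NFD in Unicode's naming).

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


a_ring = "\u00c5"             # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"           # ANGSTROM SIGN
a_plus_ring = "\u0041\u030a"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# The raw codepoint sequences are all different...
raw_all_differ = (a_ring != angstrom
                  and angstrom != a_plus_ring
                  and a_ring != a_plus_ring)

# ...but every pair is canonically equivalent.
all_equivalent = (canonically_equivalent(a_ring, angstrom)
                  and canonically_equivalent(angstrom, a_plus_ring)
                  and canonically_equivalent(a_ring, a_plus_ring))
```

A plain codepoint-by-codepoint comparison would treat these three as different strings, which is why normalisation has to come before comparison.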



compatibility equivalence is
=====================
  aString = anotherString
where both strings have the same meaning in some particular contexts,
but may have a different glyph appearance

e.g. comparing a ligature with the decomposed sequence of characters
making up that ligature, as in  "fi" = "ﬁ" (U+FB01)

These are compatibility equivalent: they *can* have the same meaning,
but they need not have the same appearance.
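
A hedged sketch of the difference in practice, again in Python with the standard unicodedata module. It uses the ligature U+FB01 (LATIN SMALL LIGATURE FI) as the example, and the helper name compatibility_equivalent is hypothetical. Canonical decomposition (NFD) leaves the ligature alone, while compatibility decomposition (NFKD) expands it to its component letters.

```python
import unicodedata


def compatibility_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a single codepoint
plain = "fi"         # two ordinary letters

# Canonical decomposition does not touch the ligature, so the strings still differ.
canonical_match = (unicodedata.normalize("NFD", ligature)
                   == unicodedata.normalize("NFD", plain))

# Compatibility decomposition expands the ligature, so the strings now match.
compat_match = compatibility_equivalent(ligature, plain)
```

Here canonical_match is False and compat_match is True: the pair is compatibility equivalent without being canonically equivalent.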


The normalisation forms either
- expand all precomposed codepoints into fully decomposed sequences,
to create the "fully decomposed normal form" of the string
OR
- fully condense all decomposed sequences into precomposed codepoints,
to create a "fully composed normal form".

(Note that, as there can be multiple precomposed codepoints for an
abstract character, it is necessary to create fully composed forms
only after first converting to fully decomposed forms, or two fully
composed strings could fail equivalence testing *despite* being
compatibility equivalent.)

e.g.

#(00c5 00c5 00c5) = #(212b 212b 212b)  is actually true, for
canonical equivalence (and therefore also for compatibility equivalence)

A process of
 fully decomposing
any of
#(00c5 212b #(0041 030a) )
#(00c5 00c5 00c5)
#(212b 212b 212b)

would always produce #( #(0041 030a) #(0041 030a) #(0041 030a) )

and then a subsequent full composition would produce

#(00c5 00c5 00c5)

for all three cases.
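
The two-step process above can be sketched with Python's standard unicodedata module. This is only an illustration under the assumption that Python's normaliser is acceptable as a reference, not a Squeak implementation. The first pass expands every input to the 0041 030a sequence (Unicode calls this form NFD), and the second pass condenses the result to the single codepoint 00c5 (NFC). The helper name codepoints is hypothetical.

```python
import unicodedata


def codepoints(s: str) -> list[str]:
    """Hypothetical helper: render a string as a list of hex codepoints."""
    return [f"{ord(ch):04x}" for ch in s]


# The three inputs from the example above.
inputs = [
    "\u00c5\u212b\u0041\u030a",  # mixed representations
    "\u00c5\u00c5\u00c5",        # A WITH RING ABOVE, three times
    "\u212b\u212b\u212b",        # ANGSTROM SIGN, three times
]

# Step 1: expand everything to base letter + combining mark.
expanded = [unicodedata.normalize("NFD", s) for s in inputs]

# Step 2: condense the expanded sequences back to single codepoints.
condensed = [unicodedata.normalize("NFC", s) for s in expanded]

for s in expanded:
    print(codepoints(s))
for s in condensed:
    print(codepoints(s))
```

After step 1 all three strings are the same six-codepoint sequence, and after step 2 all three are the same three-codepoint sequence, so they compare equal at either stage.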


On 8 December 2015 at 16:09, H. Hirzel <hannes.hirzel at gmail.com> wrote:
> It seems that going for Normalization first in order to implement
> String comparison is the next necessary step. This is already a
> result by itself.
>
> Normalization (Unicode)
> Last updated at 4:05 pm UTC on 8 December 2015
> http://wiki.squeak.org/squeak/6229
>
> Can you explain to me what the difference between
>
> canonical equivalence
> and
> compatibility equivalence
>
> and for which we should go?
>
> --Hannes

