UTC-8 (was Re: Celeste encoding (was: Duplicate messages in Celeste))

Richard A. O'Keefe ok at atlas.otago.ac.nz
Fri Mar 17 04:07:30 UTC 2000


	Then there is string comparison.  Probably the letter A appears in
	oodles of places throughout Unicode, and all of them should be
	considered equal in a case-insensitive comparison.
	
Why speculate about this when you can grep?

a% egrep 'LATIN (CAPITAL|SMALL) LETTER A;' unidata2.txt
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
249C;PARENTHESIZED LATIN SMALL LETTER A;So;0;ON;0028 0061 0029;;;;N;;;;;
24B6;CIRCLED LATIN CAPITAL LETTER A;Lu;0;ON;<circle> 0041;;;;N;;;;24D0;
24D0;CIRCLED LATIN SMALL LETTER A;Ll;0;ON;<circle> 0061;;;;N;;;24B6;;24B6
FF21;FULLWIDTH LATIN CAPITAL LETTER A;Lu;0;L;<wide> 0041;;;;N;;;;FF41;
FF41;FULLWIDTH LATIN SMALL LETTER A;Ll;0;L;<wide> 0061;;;;N;;;FF21;;FF21

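It's all free, folks!  The sixth field of each line above is the
character's decomposition: empty for plain A and a, and a mapping back
to 0041 or 0061 for the rest.  There's *canonical* decomposition, which
unpacks the precomposed characters that are there for round-trip
compatibility with other character set standards, and *compatibility*
decomposition, which erases a few presentation details, like FULLWIDTH.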
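
For anyone who would rather poke at this interactively than grep, here
is a minimal sketch in Python, assuming its standard unicodedata module
(which ships a much later version of the same database).  The helper
name fold_for_comparison is mine, not part of any library, and the
folding it does is a simplification that ignores language.

```python
import unicodedata

# Code points taken from the grep output above.
FULLWIDTH_CAPITAL_A = "\uFF21"
CIRCLED_SMALL_A = "\u24D0"
PARENTHESIZED_SMALL_A = "\u249C"
A_WITH_RING = "\u00C5"  # a precomposed character, not in the grep output

def fold_for_comparison(text):
    """Hypothetical helper: compatibility-decompose, then case-fold.

    This is language-neutral, so it does not handle Turkish and friends.
    """
    return unicodedata.normalize("NFKD", text).casefold()

# Canonical decomposition (NFD) unpacks precomposed characters...
canonical = unicodedata.normalize("NFD", A_WITH_RING)

# ...but leaves presentation variants such as FULLWIDTH alone.
still_wide = unicodedata.normalize("NFD", FULLWIDTH_CAPITAL_A)

# Compatibility decomposition (NFKD) erases the presentation detail.
plain = unicodedata.normalize("NFKD", FULLWIDTH_CAPITAL_A)
spelled_out = unicodedata.normalize("NFKD", PARENTHESIZED_SMALL_A)
```

With that in hand, fold_for_comparison gives the same result for "A",
"a", FULLWIDTH_CAPITAL_A and CIRCLED_SMALL_A, which is roughly what the
quoted paragraph was asking for.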

Of course, the really serious problem is that case equivalence is itself
language dependent.  The example in the Unicode book is that
capital I = small i in English, but capital I = small DOTLESS i != small i
in Turkish.  Take a good hard look at locale support for strings in Java
some time.  There's a *reason* why it's so complicated.
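
To make the Turkish example concrete, here is a rough sketch in Python.
Python's built-in str.lower() is not locale-aware, so lower_turkish
below is a hypothetical hand-rolled helper that covers only the two
I/i pairs; a real program would want proper locale support.

```python
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

def lower_english(text):
    """Language-neutral lowercasing: capital I becomes small i."""
    return text.lower()

def lower_turkish(text):
    """Hypothetical helper: Turkish lowercasing for the I/i pairs only.

    Capital I becomes small dotless i, and capital dotted I becomes
    small i; everything else falls through to the default mapping.
    """
    text = text.replace("I", DOTLESS_SMALL_I)
    text = text.replace(DOTTED_CAPITAL_I, "i")
    return text.lower()

def equal_ignoring_case(left, right, lower):
    """Compare two strings under a caller-chosen lowercasing rule."""
    return lower(left) == lower(right)
```

Under these assumptions, equal_ignoring_case("I", "i", lower_english)
holds but equal_ignoring_case("I", "i", lower_turkish) does not, so the
comparison has to be told which language it is working in.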

I repeat that Interlisp-D made the change from 8-bit strings to 8-bit/16-bit
strings remarkably smoothly, with *no* blowup of space for strings in the
existing image.
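
I am not reproducing Interlisp-D's actual representation here, but the
general idea of a mixed 8-bit/16-bit string can be sketched like this in
Python, assuming a string stays in 8-bit units unless some character
needs more.  The names pack_string and unpack_string are made up for
the illustration.

```python
from array import array

def pack_string(text):
    """Hypothetical helper: store text as narrowly as its codes allow.

    Uses one unsigned byte per character when every code fits in 8 bits,
    and falls back to unsigned 16-bit units otherwise.
    """
    codes = [ord(ch) for ch in text]
    if any(code > 0xFFFF for code in codes):
        raise ValueError("this sketch only handles 16-bit character codes")
    typecode = "B" if all(code <= 0xFF for code in codes) else "H"
    return array(typecode, codes)

def unpack_string(packed):
    """Turn either representation back into an ordinary string."""
    return "".join(chr(code) for code in packed)

narrow = pack_string("Squeak")        # stays in 8-bit units
wide = pack_string("\uFF21\uFF41")    # FULLWIDTH A and a need 16 bits
```

Existing 8-bit strings keep their size under a scheme like this; only
strings that actually contain wider characters pay for them.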





More information about the Squeak-dev mailing list