UTF-8, UCS, Unicode, 21 bit characters? (Re: Adding Accufonts to
the update stream)
Hannes Hirzel
hannes.hirzel.squeaklist at bluewin.ch
Wed Feb 26 08:06:17 UTC 2003
Ian Piumarta <ian.piumarta at inria.fr> wrote:
> On Wed, 26 Feb 2003, Richard A. O'Keefe wrote:
> >
> > UTF-8 decodes into *21-bit* characters.
>
> I may be missing some relevant context in this debate (or if not then
> maybe misunderstanding how Unicode works and/or its relationship to the
> UCS) but I understood that UTF-8 was defined as an 8-bit transport for the
> 31-bit UCS (universal character set, ISO-10646) and as such decodes into
> 31-bit characters. (The current correspondence between UCS and Unicode is
> by design of the respective standards bodies, not because they're the same
> thing -- which they aren't.)
>
> UCS (or Unicode)     UTF-8
> 00000000-0000007F    0xxxxxxx
> 00000080-000007FF    110xxxxx 10xxxxxx
> 00000800-0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
> 00010000-001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> 00200000-03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 04000000-7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> Regardless of whether there will ever be any _Unicode_ characters assigned
> outside the currently-planned 21-bit Unicode limit, UTF-8 does nonetheless
> provide for up to 31 bits of charcode since this is the limit for the UCS.
>
> Pedantically,
> Ian
Ian,
Thank you for emphasizing and clarifying this.
I do not consider this to be pedantic ;-) In fact we need a good
basic understanding of the concepts and consequences of these issues.
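
To make the table quoted above concrete, here is a minimal sketch in
Python (not Squeak code) of the 1-to-6-byte scheme it describes. The
function names utf8_encode_ucs and utf8_decode_ucs are made up for this
illustration. Python's built-in codec stops at U+10FFFF, so the bytes
are assembled by hand. The decoder is deliberately simple and does not
reject overlong forms.

```python
# Rows of the table: (highest code in range, lead-byte marker,
# number of 10xxxxxx continuation bytes that follow).
_RANGES = (
    (0x7FF, 0xC0, 1),
    (0xFFFF, 0xE0, 2),
    (0x1FFFFF, 0xF0, 3),      # 21 bits fit in four bytes
    (0x3FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),    # 31 bits need six bytes
)


def utf8_encode_ucs(code):
    """Encode an integer 0..0x7FFFFFFF as 1 to 6 bytes, per the table."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code outside the 31-bit UCS range")
    if code < 0x80:
        return bytes([code])
    for limit, lead, ncont in _RANGES:
        if code <= limit:
            break
    # The lead byte carries the top bits; each continuation byte carries 6.
    out = [lead | (code >> (6 * ncont))]
    for i in range(ncont - 1, -1, -1):
        out.append(0x80 | ((code >> (6 * i)) & 0x3F))
    return bytes(out)


def utf8_decode_ucs(data):
    """Decode exactly one sequence back to its integer code."""
    first = data[0]
    if first < 0x80:
        if len(data) != 1:
            raise ValueError("trailing bytes after a one-byte sequence")
        return first
    # (mask selecting the marker bits, expected marker, continuation count)
    for mask, lead, ncont in (
        (0xE0, 0xC0, 1),
        (0xF0, 0xE0, 2),
        (0xF8, 0xF0, 3),
        (0xFC, 0xF8, 4),
        (0xFE, 0xFC, 5),
    ):
        if first & mask == lead:
            break
    else:
        raise ValueError("invalid lead byte")
    if len(data) != ncont + 1:
        raise ValueError("wrong sequence length")
    code = first & ~mask & 0xFF       # payload bits of the lead byte
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        code = (code << 6) | (b & 0x3F)
    return code
```

For example, utf8_encode_ucs(0x10FFFF) gives the same four bytes as
Python's own '\U0010FFFF'.encode('utf-8'), while
utf8_encode_ucs(0x7FFFFFFF) yields a six-byte sequence that decodes back
to a 31-bit value.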
Actually, Yoshiki Ohshima and Kazuhiro Abe mention the 21-bit characters
in their paper as well:
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/node3.html
I am reading it at the moment:
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/index.html
It is a must-read for all Squeakers interested in multilingual
Squeak.
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/node2.html
treats the overall design:
* Universal Character Set
* Memory Usage
* Text Scanning Performance and Flexibility
* MacRoman vs. ISO-8859-1
* Keyboard Input
* Text File Export and Exchange
* Conclusion on Design
It would be nice to have some feedback on the paper from list members.
Regards
Hannes