UTF-8, UCS, Unicode, 21 bit characters? (Re: Adding Accufonts to the update stream)

Wed Feb 26 08:06:17 UTC 2003

Ian Piumarta <ian.piumarta at inria.fr> wrote:
> On Wed, 26 Feb 2003, Richard A. O'Keefe wrote:
> > 
> > UTF-8 decodes into *21-bit* characters.
> 
> I may be missing some relevant context in this debate (or if not then
> maybe misunderstanding how Unicode works and/or its relationship to the
> UCS) but I understood that UTF-8 was defined as an 8-bit transport for the
> 31-bit UCS (universal character set, ISO-10646) and as such decodes into
> 31-bit characters.  (The current correspondence between UCS and Unicode is
> by design of the respective standards bodies, not because they're the same
> thing -- which they aren't.)
> 
>   UCS (or Unicode)  UTF-8
>   00000000-0000007F 0xxxxxxx
>   00000080-000007FF 110xxxxx 10xxxxxx
>   00000800-0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
>   00010000-001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>   00200000-03FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>   04000000-7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 
> Regardless of whether there will ever be any _Unicode_ characters assigned
> outside the currently-planned 21-bit Unicode limit, UTF-8 does nonetheless
> provide for up to 31 bits of charcode since this is the limit for the UCS.  
> 
> Pedantically,
> Ian

Ian, 
Thank you for emphasizing and clarifying this.
I do not consider this to be pedantic  ;-)  In fact we need a good
basic understanding of the concepts and consequences of these issues.
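
To make the table you quote concrete, here is a minimal Python sketch of
the 1-to-6-byte scheme. It is only my own illustration, not code from
Squeak or from the paper mentioned below; the names RANGES and encode_ucs
are made up for this example. It also shows how the two views fit
together: the highest Unicode code point, 10FFFF, needs 21 bits and falls
in the four-byte row, while the five- and six-byte rows cover the rest of
the 31-bit UCS range.

```python
# Upper bound of each row in the quoted table, paired with the number of
# bytes in the resulting sequence. (Hypothetical names, for illustration.)
RANGES = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def encode_ucs(code):
    """Encode an integer with the 1-6 byte scheme shown in the table.

    Assumption: no special handling of surrogates or other reserved
    values; every integer in the 31-bit range is encoded mechanically.
    """
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])
    length = next(n for limit, n in RANGES if code <= limit)
    # Each continuation byte is 10xxxxxx and carries six payload bits;
    # they are collected low-order first and reversed at the end.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    # Lead byte: `length` one-bits, a zero bit, then the remaining bits.
    lead = ((0xFF << (8 - length)) & 0xFF) | code
    return bytes([lead] + tail[::-1])
```

For example, encode_ucs(0x10FFFF) yields four bytes, whereas
encode_ucs(0x200000) already needs five.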

Yoshiki Ohshima and Kazuhiro Abe also mention 21-bit characters in
their paper:
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/node3.html

I am reading it at the moment:
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/index.html
It is a must-read for all Squeakers interested in multilingual
Squeak.

http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/node2.html 
covers the overall design:

    * Universal Character Set
    * Memory Usage
    * Text Scanning Performance and Flexibility
    * MacRoman vs. ISO-8859-1
    * Keyboard Input
    * Text File Export and Exchange
    * Conclusion on Design 

It would be nice to have some feedback on the paper from list members.

Regards
Hannes