UTF-8, UCS, Unicode, 21 bit characters? (Re: Adding Accufonts to
the update stream)
hannes.hirzel.squeaklist at bluewin.ch
Wed Feb 26 08:06:17 UTC 2003
Ian Piumarta <ian.piumarta at inria.fr> wrote:
> On Wed, 26 Feb 2003, Richard A. O'Keefe wrote:
> > UTF-8 decodes into *21-bit* characters.
> I may be missing some relevant context in this debate (or if not then
> maybe misunderstanding how Unicode works and/or its relationship to the
> UCS) but I understood that UTF-8 was defined as an 8-bit transport for the
> 31-bit UCS (universal character set, ISO-10646) and as such decodes into
> 31-bit characters. (The current correspondence between UCS and Unicode is
> by design of the respective standards bodies, not because they're the same
> thing -- which they aren't.)
>   UCS (or Unicode)    UTF-8
>   00000000-0000007F   0xxxxxxx
>   00000080-000007FF   110xxxxx 10xxxxxx
>   00000800-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
>   00010000-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>   00200000-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>   04000000-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> Regardless of whether there will ever be any _Unicode_ characters assigned
> outside the currently-planned 21-bit Unicode limit, UTF-8 does nonetheless
> provide for up to 31 bits of charcode since this is the limit for the UCS.
Thank you for emphasizing and clarifying this.
I do not consider this to be pedantic ;-) In fact we need a good
basic understanding of the concepts and consequences of these issues.
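
To make the table Ian quotes concrete, here is a minimal sketch in Python
(not from the original thread) of the 1-to-6-byte scheme it describes. The
names RANGES, encode_ucs and decode_ucs are made up for illustration, and
the decoder handles a single sequence only and does not reject overlong
forms. It also shows where the two figures in this debate come from: the
highest Unicode code point, 10FFFF, needs 21 bits and fits in the 4-byte
form, while the 6-byte form carries 31 bits.

```python
# Rows of the quoted table: (upper limit of range, byte count, lead-byte prefix).
RANGES = [
    (0x7F, 1, 0x00),
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def encode_ucs(code):
    """Encode one UCS code (0..7FFFFFFF) using the table's 1-6 byte forms."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code outside the 31-bit UCS range")
    for limit, length, prefix in RANGES:
        if code <= limit:
            break
    if length == 1:
        return bytes([code])
    tail = []
    for _ in range(length - 1):
        # Each continuation byte is 10xxxxxx and carries the low 6 bits.
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    # Whatever bits remain go into the lead byte after its prefix.
    return bytes([prefix | code] + tail[::-1])


def decode_ucs(data):
    """Decode exactly one sequence produced by encode_ucs."""
    first = data[0]
    if first < 0x80:
        if len(data) != 1:
            raise ValueError("wrong sequence length")
        return first
    for _, length, prefix in RANGES[1:]:
        # Mask covering the prefix bits of a lead byte of this length.
        mask = (0xFF << (7 - length)) & 0xFF
        if first & mask == prefix:
            break
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong sequence length")
    code = first & ~mask & 0xFF
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        code = (code << 6) | (byte & 0x3F)
    return code


print((0x10FFFF).bit_length(), len(encode_ucs(0x10FFFF)))
print((0x7FFFFFFF).bit_length(), len(encode_ucs(0x7FFFFFFF)))
```

For codes up to 10FFFF (surrogates aside) the result should agree with
Python's own chr(code).encode("utf-8"); the 5- and 6-byte forms are only
reachable through a hand-written routine like this one.
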
Actually, Yoshiki Ohshima and Kazuhiro Abe also mention 21-bit
characters in their paper.
I am reading it at the moment.
It is a must-read for all those Squeakers interested in multilingual
treats the overall design
* Universal Character Set
* Memory Usage
* Text Scanning Performance and Flexibility
* MacRoman vs. ISO-8859-1
* Keyboard Input
* Text File Export and Exchange
* Conclusion on Design
It would be nice to have some feedback on the paper by the list members.