UTC-8 (was Re: Celeste encoding (was: Duplicate messages in Celeste))

Marcel Weiher marcel at metaobject.com
Thu Mar 16 21:44:23 UTC 2000


> From: AGREE at CarltonFields.com
>
> In a sense, isn't a pure ASCII string just a subset of UTF-8?

Yes, and the beauty of it is that (a) all the characters relevant to  
understanding XML structure fall within ASCII and (b) no plain ASCII  
character codes are used in UTF-8 multi-byte escapes.  So, you can  
simply ignore any UTF-8 issues for the parser itself, but the content  
it delivers won't be normalized.
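
To make points (a) and (b) concrete, here is a minimal Python sketch (not
anything from Squeak or an actual XML parser; the function name
find_tag_starts and the sample text are made up for illustration). It checks
that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a scanner
looking for the ASCII byte '<' can work on the raw bytes without decoding:

```python
def find_tag_starts(data: bytes) -> list[int]:
    """Return the byte offsets of every '<' (0x3C) in UTF-8 encoded data."""
    return [i for i, b in enumerate(data) if b == 0x3C]


# Hypothetical sample: ASCII markup around non-ASCII content.
text = "<p>naïve 日本語</p>"
encoded = text.encode("utf-8")

# Pure ASCII text encodes to the same bytes either way.
assert "abc".encode("ascii") == "abc".encode("utf-8")

# Every byte of a multi-byte sequence has its high bit set, so it can
# never be mistaken for an ASCII character such as '<', '>' or '&'.
for ch in text:
    raw = ch.encode("utf-8")
    if len(raw) > 1:
        assert all(b >= 0x80 for b in raw)

# Scanning raw bytes finds only the real markup characters.
tag_offsets = find_tag_starts(encoded)
```

The offsets are byte positions, not character positions, and the content
between the tags still has to be decoded (and, if needed, normalized) by
whoever consumes it.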

> Can't a hierarchy with built-in coercion be used to preserve ALL of the
> efficiencies of the status quo, while still permitting (or at least
> paving the way) toward the full generality of UTF-8 and Unicode?

Yes.

> Why can't the ASCII string be the SmallInteger of a new STRINGTHING
> hierarchy, where operations within the string world would be seamless?

Yup.

> Every time I raise this point, there were countless objections about
> things Squeak so configured could not do (the biggest deal was
> auto-reversing Hebrew/Anglo-Numeric text),

Except that these problems are at a higher level, when dealing with  
words.  Strings really just deal with characters and have no idea  
about languages.

> but it seems that we could still accommodate many of the advantages of
> Unicode, integrate the whole into Squeak, while preserving ALL of the
> efficiencies of the present ASCII world for unmixed ASCII and Character
> stuff.

Exactly.

> Or at least we should try real hard to think (or hack) through the
> question before doing nothing because of an apparent lack of purity.

If it helps, I can probably provide class-documentation of the  
NeXT/Apple NSString class-cluster, which does exactly that,  
successfully.
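
As a rough illustration of the idea (a sketch in Python, not NSString's
actual API and not Squeak code; the names Text, AsciiText and WideText are
invented here), a class cluster can hide a compact ASCII representation and
a wide one behind a single abstract class, coercing to the wide form only
when non-ASCII content gets mixed in:

```python
from abc import ABC, abstractmethod


class Text(ABC):
    """Abstract string; callers never choose a concrete subclass themselves."""

    @staticmethod
    def of(s: str) -> "Text":
        # The factory picks the cheapest representation that fits.
        if s.isascii():
            return AsciiText(s.encode("ascii"))
        return WideText([ord(c) for c in s])

    @abstractmethod
    def code_points(self) -> list[int]:
        """Return the string as a list of Unicode code points."""

    def __len__(self) -> int:
        return len(self.code_points())

    def __add__(self, other: "Text") -> "Text":
        # Stay compact while both sides are ASCII; otherwise coerce to the
        # wide form, much as SmallInteger arithmetic spills into large
        # integers when a result no longer fits.
        if isinstance(self, AsciiText) and isinstance(other, AsciiText):
            return AsciiText(self.data + other.data)
        return WideText(self.code_points() + other.code_points())

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Text) and self.code_points() == other.code_points()

    def __str__(self) -> str:
        return "".join(chr(p) for p in self.code_points())


class AsciiText(Text):
    """One byte per character; the common, efficient case."""

    def __init__(self, data: bytes):
        self.data = data

    def code_points(self) -> list[int]:
        return list(self.data)


class WideText(Text):
    """One integer per code point; used only when needed."""

    def __init__(self, points: list[int]):
        self.points = list(points)

    def code_points(self) -> list[int]:
        return self.points


ascii_part = Text.of("abc")
mixed_part = Text.of("déf")
joined = ascii_part + mixed_part
```

Code that only ever touches ASCII keeps the byte-per-character storage, and
nothing outside the cluster needs to know which representation it is holding.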

Marcel




