Adding Accufonts to the update stream (was Re: Licences Question : Squeak-L Art 6.)

tblanchard at mac.com
Wed Feb 26 09:39:56 UTC 2003


On Wednesday, February 26, 2003, at 06:05  AM, Richard A. O'Keefe wrote:

> This is a lot like saying you don't like the letter E.

No, E is a character.  Straight versus curly quote pairs are a 
typographical convention masquerading as characters.
Yet another place the Unicode consortium has failed to adhere to its 
"characters not glyphs" mantra.

>  - if we switch to CP1252, that will *REDUCE* compatibility issues
>    for Squeak importing text.

You mean use CP1252 internally - and assume it on import?  I don't see 
how this takes us closer to Unicode.
My preference would be to use UTF-8 and read/write real Unicode, 
recoding the internal representation as necessary.  I don't much care 
what Squeak uses internally (although I was up very late last night 
reading Yoshiki's paper and think he has done a wonderful job working 
around many of Unicode's quirks).
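
To sketch the shape I have in mind (Python here purely as illustration; 
the function names are made up), the UTF-8 conversion lives at the I/O 
boundary and the internal form is whatever the image prefers:

    # UTF-8 only at the boundary; recode to the internal
    # representation on the way in, and back out on the way out.
    def import_text(raw_bytes):
        return raw_bytes.decode('utf-8')    # bytes -> internal string

    def export_text(text):
        return text.encode('utf-8')         # internal string -> bytes

    original = '\u201csmart quotes\u201d - na\u00efve caf\u00e9'
    assert import_text(export_text(original)) == original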

> You should not assume that someone you disagree with is ignorant.

I never did.

> All of the characters in the 128-159 block are perfectly standard
> Unicode characters.

Taken literally that is a meaningless statement, but I'll assume you 
mean the characters Win-1252 places in that range.
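
For reference, here is what that block decodes to (a throwaway Python 
illustration; five of the bytes are simply undefined in CP1252):

    import unicodedata

    # The CP1252 bytes 0x80-0x9F and the Unicode code points they denote.
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode('cp1252')
            print(f"0x{byte:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
        except UnicodeDecodeError:
            print(f"0x{byte:02X} -> undefined in CP1252")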

> The character *repertoire* is a proper subset of Unicode.
> The encoding is close enough to deal with quite simply.



> While I loathe, detest, and utterly abominate Microsoft and all its 
> works,

Complete agreement here. :-)

> the CP 1252 character set definition is a matter of public record, it
> is useful, and a lot of people are using it.
>
> 	> It is difficult to support UTF-8 _anywhere_ correctly without
> 	> supporting 21-bit characters practically _everywhere_ except for
> 	> display.
> 	
> 	Could you explain this?  UTF-8 is just variable byte length encoding
> 	and you only use it for streaming.  It generally decodes into 8 or 16
> 	bit characters.
> 	
> (a) No, you DON'T only use it for 'streaming'; it is very commonly 
> used as
>     an internal storage format by programs.

For vanishingly small definitions of "common", I'm sure, unless you are 
talking about using it to "fake" compatibility with code written with 
ASCII in mind.  Let me put it another way: as chief architect for 
eTranslate I worked with a number of companies on internationalization 
issues in most of the commonly used languages, and designed and 
implemented an architecture that let us host web apps in 34 different 
languages (we made extensive use of ICU's converters by giving them 
Objective-C wrappers and using them from WebObjects).  I never once saw 
anybody consider an in-memory character size that wasn't a convenient 
power of 2 for manipulating text, unless they were streaming it.  
Usually 16 bits (i.e. UCS-2).
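
Roughly the pattern I mean (Python standing in for any of those 
systems; the point is fixed-width units in memory, a byte stream only 
at the edges):

    text = 'internationalization - \u56fd\u969b\u5316'

    # Fixed-width 16-bit units for in-memory work (UCS-2 style)...
    in_memory = text.encode('utf-16-le')
    assert len(in_memory) == 2 * len(text)   # one 16-bit unit per BMP character

    # ...and a variable-length byte stream only when streaming it.
    on_the_wire = text.encode('utf-8')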

I'll grant you that you can often get away with piping UTF-8 as a 
stream through older programs, but this is the exception rather than 
the rule.

As for the 21-bit character point, Unicode 3.0 fit into 16 bits (and 
was the logical stopping point - but I guess standards people have to 
eat too).  From a practical standpoint, that version is pretty much the 
end of the line unless you are working with Byzantine Musical Symbols 
or something equally obscure (and at that point I have to ask: with 
whom are you trying to stay compatible?).  All code points that don't 
fit into 16 bits can safely be ignored.
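
(If anyone wants to see what those obscure code points cost, a quick 
Python illustration using the first Byzantine Musical Symbol, U+1D000:)

    symbol = '\U0001D000'                        # first Byzantine Musical Symbol

    print(len(symbol.encode('utf-8')))           # 4 bytes in UTF-8
    print(len(symbol.encode('utf-16-le')) // 2)  # 2 sixteen-bit units, i.e. a surrogate pair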

But we digress.  I don't think anything at all should be done until 
Yoshiki's work can be more fully evaluated.  I personally think he's on 
to something.

-Todd Blanchard


