Adding Accufonts to the update stream (was Re: Licences Question : Squeak-L Art 6.)
tblanchard at mac.com
Wed Feb 26 09:39:56 UTC 2003
On Wednesday, February 26, 2003, at 06:05 AM, Richard A. O'Keefe wrote:
> This is a lot like saying you don't like the letter E.
No, E is a character. Straight quotes vs. curly pairs are a
typographical convention masquerading as characters.
Yet another place the Unicode consortium has failed to adhere to their
"characters not glyphs" mantra.
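To make the point concrete (a quick Python illustration, not Squeak): the straight quote and the curly pair are three distinct code points, even though they arguably denote one abstract character.

```python
# The straight apostrophe and the typographer's curly pair each get
# their own Unicode code point, despite the "characters not glyphs" mantra.
straight = "'"                      # U+0027 APOSTROPHE
left, right = "\u2018", "\u2019"    # LEFT/RIGHT SINGLE QUOTATION MARK

print(hex(ord(straight)))                 # 0x27
print(hex(ord(left)), hex(ord(right)))    # 0x2018 0x2019
```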
> - if we switch to CP1252, that will *REDUCE* compatibility issues
> for Squeak importing text.
You mean use CP1252 internally - and assume it on import? I don't see
how this takes us closer to Unicode.
My preference would be to make use of UTF-8 and read/write real Unicode
- recoding the internal representation as necessary. I don't much care
what Squeak uses internally (although I was up very late last night
reading Yoshiki's paper and think he has done a wonderful job working
around many of Unicode's quirks).
> You should not assume that someone you disagree with is ignorant.
I never did.
> All of the characters in the 128-159 block are perfectly standard
> Unicode
> characters.
This is a meaningless statement on its own, but I'll assume you mean
the characters Windows-1252 locates in that range.
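For anyone following along, here's what that mapping looks like (a Python sketch, since its codec tables are handy; I'm skipping the five bytes CP1252 leaves undefined):

```python
# In ISO-8859-1 the bytes 0x80-0x9F are C1 control codes, but
# Windows-1252 assigns most of them to printable characters that live
# at ordinary code points elsewhere in Unicode.
for b in (0x80, 0x91, 0x92, 0x93, 0x94):
    ch = bytes([b]).decode('cp1252')
    print(f"0x{b:02X} -> U+{ord(ch):04X}")
# 0x80 -> U+20AC (euro sign), 0x91/0x92 -> U+2018/U+2019 (curly singles),
# 0x93/0x94 -> U+201C/U+201D (curly doubles)
```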
> The character *repertoire* is a proper subset of Unicode.
> The encoding is close enough to deal with quite simply.
> While I loathe, detest, and utterly abominate Microsoft and all its
> works,
Complete agreement here. :-)
> the CP 1252 character set definition is a matter of public record, it
> is
> useful, and a lot of people are using it.
>
> > It is difficult to support UTF-8 _anywhere_ correctly without
> > supporting
> > 21-bit characters practically _everywhere_ except for display.
>
> Could you explain this? UTF-8 is just variable byte length encoding
> and you only use it for streaming. It generally decodes into 8 or 16
> bit characters.
>
> (a) No, you DON'T only use it for 'streaming'; it is very commonly
> used as
> an internal storage format by programs.
For vanishingly small definitions of "common", I'm sure, unless you are
talking about using it to "fake" compatibility with code written with
ASCII in mind. Let me put it another way: as chief architect for
eTranslate I worked with a number of companies on internationalization
issues in most of the commonly used languages, and I designed and
implemented an architecture that let us host web apps in 34 different
languages (we made extensive use of ICU's converters by giving them
Objective-C wrappers and using them from WebObjects). I never once saw
anybody consider an in-memory character size that wasn't a convenient
power of 2 for in-memory manipulation of text, unless they were
streaming it. Usually 16 bits (i.e. UCS-2).
I'll grant you that you can often get away with piping UTF-8 as a
stream through older programs, but this is more the exception.
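A quick sketch of why fixed-width internal forms win for manipulation (Python, purely illustrative):

```python
# UTF-8 is variable-width: 1-4 bytes per code point. That makes byte
# offsets useless for character indexing, which is why programs prefer
# a fixed-width internal form (UCS-2/UTF-16 or UTF-32) and keep UTF-8
# for the wire.
for ch in ('A', '\u00e9', '\u20ac'):
    width = len(ch.encode('utf-8'))
    print(f"U+{ord(ch):04X} -> {width} byte(s) in UTF-8")
```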
As for the 21-bit character point: Unicode 3.0 fit into 16 bits (and
was the logical stopping point - but I guess standards people have to
eat too). From a practical standpoint, that version is pretty much the
end of the line unless you are working with Byzantine Musical Symbols
or something equally obscure (and at that point I have to ask: with
whom are you trying to stay compatible?). All code points that don't
fit into 16 bits can safely be ignored.
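For the record, here is what those past-16-bit code points cost you (Python sketch; the Byzantine Musical Symbols block starts at U+1D000):

```python
# Code points above U+FFFF don't fit in a single 16-bit unit; UTF-16
# spends a surrogate pair (two 16-bit units) on each of them.
bms = '\U0001D000'              # first Byzantine Musical Symbols code point
utf16 = bms.encode('utf-16-be')
print(len(utf16))               # 4 bytes = two 16-bit code units
print(utf16.hex())              # d834dc00 - high/low surrogate pair
```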
But we digress. I don't think anything at all should be done until
Yoshiki's work can be more fully evaluated. I personally think he's on
to something.
-Todd Blanchard