Adding Accufonts to the update stream (was Re: LicencesQuestion : Squeak-L Art 6.)
Richard A. O'Keefe
ok at cs.otago.ac.nz
Wed Feb 26 05:05:05 UTC 2003
Todd Blanchard continues to throw blame on innocent characters:
> Well, not to belabor the point too much I'll simply put it that I don't
> like "curly" quotes and friends and find them to be of dubious value.
This is a lot like saying you don't like the letter E.
These characters *are* part of our writing system,
they are used in a *vast* amount of English text,
it is advantageous for computers to be able to represent them,
and you wouldn't _believe_ the amount of mail I get that includes them.
Above all, some of these characters ARE in the Squeak sources.
> In my experience they produce an unending stream of headaches and

I'm old enough to remember when the incompatibility issues between ASCII
and EBCDIC were a major headache. (And to have used EBCDIC terminals that
didn't display curly braces, a major headache when you are using TeX or C.)

> your insistence that they "shouldn't" notwithstanding.
I am unaware of having "insisted" any such thing.
What I _have_ said is that if HTML and XML documents identify their
encoding the way all relevant standards say they should, then the
characters are not any problem. The problem is not the characters,
but the failure to identify the character encoding. Blame must be
pointed at the right culprit!
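To make that concrete: here is a small illustration in Python (not Squeak, just for familiarity). The same bytes that cause "mystery character" complaints are perfectly unambiguous the moment the document says what encoding it uses, as the HTML and XML standards require:

```python
# An HTML fragment whose bytes include CP1252 curly quotes (0x93, 0x94).
# The meta element declares the encoding, as the standards say it should.
html = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1252">'
        b'<p>\x93quoted\x94</p>')

# A conforming consumer honours the declaration and decodes correctly:
text = html.decode("windows-1252")
assert "\u201cquoted\u201d" in text   # real curly quotes, no guesswork
```

The decode only goes wrong when the declaration is missing and the reader has to guess; that, not the characters, is where the blame belongs.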
My point is simply that
- the characters *ARE* in the Squeak character set right now and have
been for years whether you or anyone else likes it or not;
- if we switch to a character set that does not include them, that
will *CREATE* headaches and compatibility issues;
- if we switch to CP1252, that will *REDUCE* compatibility issues
for Squeak importing text.
We're in complete agreement that Squeak *exporting* text to other
systems will raise compatibility issues, BUT THAT IS ALREADY THE CASE.
In short, I'm saying "what can we do to REDUCE compatibility problems?"
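A Python sketch (again, merely illustrative) of why treating incoming bytes as CP1252 rather than Latin-1 reduces problems on import:

```python
# Bytes as commonly produced by Windows word processors and mailers:
data = b"he said \x93hello\x94"

# Latin-1 maps 0x80-0x9F to invisible C1 control characters,
# which is how text turns into apparent garbage:
as_latin1 = data.decode("latin-1")

# CP1252 maps the same bytes to the characters the sender meant:
as_cp1252 = data.decode("cp1252")
print(as_cp1252)   # he said “hello”

assert as_cp1252 == "he said \u201chello\u201d"
assert as_latin1 != as_cp1252
```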
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html explains things
better than I could.
You should not assume that someone you disagree with is ignorant.
> Apart from which I can't be bothered to remember what 6 keys I have to
> hold down together to make them. So I'm happier with them out of my
> life. But that's my personal problem I guess.
On a Mac keyboard, there are only three rather obvious keys plus the
option and shift keys, so the number of keys is 5, not 6. Rather more to
the point, the majority of people generating curly quotation marks are
NOT pressing any special key at all: the program they use applies a
"smart quote" convention, and they may not even realise that the
quotation marks they are generating _are_ curly.
> My concern was simply to point out that something WILL be lost by the
> move, and that we don't actually need to lose quite that much.
> But adopting the windows extensions just gets us a new proprietary
> encoding, not closer to unicode.
All of the characters in the 128-159 block are perfectly standard Unicode
characters. The character *repertoire* is a proper subset of Unicode.
The encoding is close enough to deal with quite simply.
While I loathe, detest, and utterly abominate Microsoft and all its works,
the CP 1252 character set definition is a matter of public record, it is
useful, and a lot of people are using it.
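You can verify the "proper subset" claim mechanically. A Python sketch (my illustration, nothing Squeak-specific): every CP1252 byte in the disputed 0x80-0x9F block either maps to an ordinary Unicode character or is simply unassigned; nothing falls outside Unicode.

```python
# Walk the 0x80-0x9F block of CP1252 and classify each byte.
unassigned = []
for b in range(0x80, 0xA0):
    try:
        cp = ord(bytes([b]).decode("cp1252"))
        # Every assigned code maps to a standard Unicode character;
        # the highest is U+2122 TRADE MARK SIGN.
        assert cp <= 0x2122
    except UnicodeDecodeError:
        unassigned.append(b)

print([hex(b) for b in unassigned])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Five bytes are unassigned; the other twenty-seven are all standard Unicode characters, so mapping CP1252 into Unicode is a fixed 27-entry table, nothing more.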
> It is difficult to support UTF-8 _anywhere_ correctly without
> 21-bit characters practically _everywhere_ except for display.
> Could you explain this? UTF-8 is just variable byte length encoding
> and you only use it for streaming. It generally decodes into 8 or 16
> bit characters.
(a) No, you DON'T only use it for 'streaming'; it is very commonly used as
an internal storage format by programs. It is, after all, the easiest
way for old 8-bit code to cope with 21-bit characters.
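To show what "cope with 21-bit characters" means in practice, here is a Python sketch (illustrative only): a plane-1 character rides through byte-oriented code as four ordinary bytes, and decoding recovers the full 21-bit value.

```python
# 'G' plus U+1D11E MUSICAL SYMBOL G CLEF, a character from plane 1.
s = "G\U0001D11E"

# As UTF-8 it is plain bytes, which 8-bit-era code can store and copy:
encoded = s.encode("utf-8")
print(len(encoded))   # 5 bytes: 1 for 'G', 4 for the plane-1 character

# Decoding gives back the full code point, well beyond 16 bits:
decoded = encoded.decode("utf-8")
print(hex(ord(decoded[1])))   # 0x1d11e
```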
(b) The way to tell if someone knows much about Unicode is to ask
"How many planes have defined codes in the current Unicode standard?"
The last time I looked, planes 0, 1, 2 and 14 all had codes defined.
Unicode hasn't been a 16-bit character set for several years.
Someone who is beginning to be clued-up about Unicode knows the answer
to the question "How many characters can Unicode represent?".
The answer is approximately 1088*1024 (17 planes of 2^16 code points).
In particular, if you represent a "high" character by the UTF-8 for two
surrogates, you are doing UTF-8 wrong. (As Java does, IIRC.) UTF-8
decodes into *21-bit* characters.
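The difference is easy to demonstrate; a Python sketch (illustrative, not any particular Java version's behaviour):

```python
# Correct UTF-8 encodes U+1D11E directly, in four bytes:
right = "\U0001D11E".encode("utf-8")
print(right.hex())   # f09d849e

# Encoding the UTF-16 surrogate pair D834 DD1E as two three-byte
# sequences instead (the variant sometimes called CESU-8) is NOT
# valid UTF-8, and a strict decoder refuses it:
wrong = b"\xed\xa0\xb4\xed\xb4\x9e"
try:
    wrong.decode("utf-8")
except UnicodeDecodeError:
    print("surrogate sequence rejected")
```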