Adding Accufonts to the update stream (was Re: Licences Question : Squeak-L Art 6.)

Richard A. O'Keefe ok at cs.otago.ac.nz
Wed Feb 26 05:05:05 UTC 2003


Todd Blanchard continues to throw blame on innocent characters:
	Well, not to belabor the point too much I'll simply put it that I don't 
	like "curly" quotes and friends and find them to be of dubious value.  

This is a lot like saying you don't like the letter E.
These characters *are* part of our writing system,
they are used in a *vast* amount of English text,
it is advantageous for computers to be able to represent them,
and you wouldn't _believe_ the amount of mail I get that includes them.
Above all, some of these characters ARE in the Squeak sources.

	In my experience they produce an unending stream of headaches and 
	compatibility issues, 

I'm old enough to remember when the incompatibility issues between ASCII
and EBCDIC were a major headache.  (And to have used EBCDIC terminals that
didn't display curly braces, no small problem when you are using TeX or C.)

	your insistence that they "shouldn't" notwithstanding.
	
I am unaware of having "insisted" any such thing.
What I _have_ said is that if HTML and XML documents identify their
encoding the way all the relevant standards say they should, then the
characters pose no problem at all.  The problem is not the characters,
but the failure to identify the character encoding.  Blame must be
pointed at the right culprit!
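
To make that concrete, here is a minimal sketch (Python purely for
illustration; the file name is made up).  Once a document declares its
encoding, as the standards require, turning its bytes into the right
characters is a one-liner:

    # The declarations the standards call for:
    #   XML:  <?xml version="1.0" encoding="windows-1252"?>
    #   HTML: <meta http-equiv="Content-Type"
    #               content="text/html; charset=windows-1252">
    # Given such a declaration, decoding is trivial:
    with open('page.html', 'rb') as f:      # 'page.html' is hypothetical
        raw = f.read()
    text = raw.decode('windows-1252')       # curly quotes decode correctly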

My point is simply that
 - the characters *ARE* in the Squeak character set right now and have
   been for years whether you or anyone else likes it or not;
 - if we switch to a character set that does not include them, that
   will *CREATE* headaches and compatibility issues;
 - if we switch to CP1252, that will *REDUCE* compatibility issues
   for Squeak importing text.

We're in complete agreement that Squeak *exporting* text to other
systems will raise compatibility issues, BUT THAT IS ALREADY THE CASE.

In short, I'm saying "what can we do to REDUCE compatibility problems?"
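
By way of illustration, a sketch (Python, with a made-up byte string
standing in for typical "smart-quoted" mail): reading incoming 8-bit
text as CP1252 recovers the punctuation, where a Latin-1 reading turns
it into invisible control characters.

    incoming = b'\x93Hello,\x94 she said.'
    # As CP1252, bytes 0x93 and 0x94 are the curly double quotes:
    print(incoming.decode('cp1252'))        # prints the curly-quoted text
    # As ISO 8859-1, the same bytes are C1 control characters, which is
    # exactly where the "garbage character" complaints come from:
    print(repr(incoming.decode('latin-1')))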

	http://www.cs.tut.fi/~jkorpela/www/windows-chars.html explains things 
	better than I could.
	
You should not assume that someone you disagree with is ignorant.

	Apart from which I can't be bothered to remember what 6 keys I have to 
	hold down together to make them.  So I'm happier with them out of my 
	life.  But that's my personal problem I guess.
	
Exactly so.

On a Mac keyboard, there are only three rather obvious keys plus the
option and shift keys, so the number of keys is 5, not 6.  Rather more to
the point, the majority of people generating curly quotation marks are
NOT pressing any special key at all: the program they use applies a
"smart quote" convention, and they may not even realise that the
quotation marks they are generating _are_ curly.

	> My concern was simply to point out that something WILL be lost by the
	> move, and that we don't actually need to lose quite that much.
	
	But adopting the windows extensions just gets us a new proprietary 
	encoding, not closer to unicode.

All of the characters in the 128-159 block are perfectly standard Unicode
characters.  The character *repertoire* is a proper subset of Unicode.
The encoding differs from Latin-1 only in that block, so converting it to
Unicode takes nothing more than a 32-entry lookup table.
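
For the sceptical, a sketch (Python, whose bundled codec stands in for
the published CP1252 table) that prints exactly what each byte in the
128-159 block denotes, including the five bytes CP1252 leaves unassigned:

    import unicodedata

    for b in range(0x80, 0xA0):
        try:
            ch = bytes([b]).decode('cp1252')
            print('0x%02X -> U+%04X %s' % (b, ord(ch), unicodedata.name(ch)))
        except UnicodeDecodeError:
            print('0x%02X    (unassigned in CP1252)' % b)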

While I loathe, detest, and utterly abominate Microsoft and all its works,
the CP 1252 character set definition is a matter of public record, it is
useful, and a lot of people are using it.

	> It is difficult to support UTF-8 _anywhere_ correctly without 
	> supporting
	> 21-bit characters practically _everywhere_ except for display.
	
	Could you explain this?  UTF-8 is just variable byte length encoding 
	and you only use it for streaming.  It generally decodes into 8 or 16 
	bit characters.
	
(a) No, you DON'T only use it for 'streaming'; it is very commonly used as
    an internal storage format by programs.  It is, after all, the easiest
    way for old 8-bit code to cope with 21-bit characters.
(b) The way to tell if someone knows much about Unicode is to ask
    "How many planes have defined codes in the current Unicode standard?"
    The last time I looked, planes 0, 1, 2 and 14 all had codes defined.
    Unicode hasn't been a 16-bit character set for several years.
    Someone who is beginning to be clued-up about Unicode knows the answer
    to the question "How many characters can Unicode represent?".
    The answer is 17 planes of 65,536 code points, i.e. 1088*1024 =
    1,114,112 code points (1,112,064 once the 2,048 surrogate code points
    are excluded).  In particular, if you represent a "high" character by
    the UTF-8 for two surrogates, you are doing UTF-8 wrong.  (As Java's
    "modified UTF-8" does, IIRC.)  UTF-8 decodes into *21-bit* characters.
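
A sketch of that last point (Python purely as illustration; the
'surrogatepass' handler is used only to manufacture the wrong byte
sequence on purpose):

    ch = '\U0001D11E'           # a plane-1 character, code point > U+FFFF
    good = ch.encode('utf-8')   # one 4-byte sequence: f0 9d 84 9e
    # The CESU-8 / "modified UTF-8" mistake: encode the two UTF-16
    # surrogates separately, yielding six bytes instead of four.
    bad = '\ud834\udd1e'.encode('utf-8', 'surrogatepass')
    assert good.decode('utf-8') == ch   # decodes to the 21-bit character
    try:
        bad.decode('utf-8')             # strict UTF-8 rejects surrogates
    except UnicodeDecodeError:
        print('surrogate pairs are not legal UTF-8')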


