Adding Accufonts to the update stream (was Re: LicencesQuestion : Squeak-L Art 6.)

Richard A. O'Keefe ok at cs.otago.ac.nz
Wed Feb 26 23:57:55 UTC 2003


Todd Blanchard wrote:
	> This is a lot like saying you don't like the letter E.
	
	No, E is a character.  Straight quotes vs curly pairs are a
	typographical convention masquerading as characters.


As a matter of fact, "straight" quotes are the typographic convention.
The handwriting I was taught at school used 66...99 and 6..9 quotes;
they are as distinct as left and right parenthesis.
The so-called "straight quotes" are a typographic kluge introduced
for typewriters so that two characters ' " could do the work of six
(or seven, depending on whether you were taught to draw apostrophes
differently from right single quotes or not):  ' does duty for left
and right single quotes and on some typewriters for the top half of
an exclamation mark, and " does duty for left and right double quotes
and the diaeresis.
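To make that six-into-two collapse concrete, here it is in terms of
Unicode code points (a sketch in Python rather than Squeak, purely for
illustration; the character names come from the Unicode code charts):

```python
import unicodedata

# The six distinct characters the typewriter collapsed into ' and ":
distinct = {
    '\u2018': 'LEFT SINGLE QUOTATION MARK',    # folded into ' (U+0027)
    '\u2019': 'RIGHT SINGLE QUOTATION MARK',   # folded into ' (U+0027)
    '\u201C': 'LEFT DOUBLE QUOTATION MARK',    # folded into " (U+0022)
    '\u201D': 'RIGHT DOUBLE QUOTATION MARK',   # folded into " (U+0022)
    '\u00A8': 'DIAERESIS',                     # folded into " (U+0022)
}
for ch, name in distinct.items():
    # each is a separate, named character in Unicode
    assert unicodedata.name(ch) == name
```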

You might as well say that "E" -vs- "e" is just a "typographical
convention masquerading as characters" and in fact just that position
has been argued.

Let's put it this way, whatever words we wrap around it,
the distinction between left double quote, right double quote,
and diaeresis/umlaut is as real a distinction as between E and e
or between ( and ).   (And yes, there are European typewriter
conventions that use /.../ instead of (...), so the analogy is exact.)

	Yet another place the Unicode consortium has failed to adhere to
	their "characters not glyphs" mantra.
	
I think you mean "slogan", not "mantra".  In Unicode 3, that slogan
has not so much been abandoned as hanged, drawn, and quartered.  If
you don't already know what I mean, take a look at the umpteen copies
of ASCII, in different typographic styles, which have been added
allegedly for mathematics.

Let's be fair to the Unicode consortium, though.  In the early days they
were desperately trying to stick within the 16-bit limit, and "characters
not glyphs" was the only way they could _possibly_ have done so.  Now
that the 16-bit limit has been passed (and how!) that constraint doesn't
apply any more (and Java is left looking rather sick).

There were two far more important slogans:
 - faithful round trip conversion.
   It had to be possible to convert any of a large list of existing
   standard encodings to Unicode and back again without losing anything.
   This meant that if any existing standard made a distinction,
   Unicode had to.  And that made for a number of otherwise highly
   unpleasant distinctions, like the distinctions between Angstrom unit
   sign and A-with-ring, between micro and mu, between summation and
   capital sigma, and so on for a long list.

 - it had to cope with existing human writing systems.
   And whether Todd Blanchard or anyone else likes it or not,
   the distinction between left and right quotation marks is just as
   much a part of the English writing system as the distinction between
   left and right guillemets (which are in Latin-1) is part of the
   French writing system.
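Those round-trip distinctions are easy to see in practice; normalization
is where the duplicates get folded back together (a Python sketch, not
Squeak):

```python
import unicodedata

# Round-trip fidelity forced Unicode to keep these apart:
pairs = [
    ('\u212B', '\u00C5'),   # ANGSTROM SIGN vs A WITH RING ABOVE
    ('\u00B5', '\u03BC'),   # MICRO SIGN vs GREEK SMALL LETTER MU
    ('\u2211', '\u03A3'),   # N-ARY SUMMATION vs GREEK CAPITAL SIGMA
]
for a, b in pairs:
    assert a != b           # distinct code points, so both can round-trip

# Normalization folds the duplicates back together:
assert unicodedata.normalize('NFC',  '\u212B') == '\u00C5'
assert unicodedata.normalize('NFKC', '\u00B5') == '\u03BC'
```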

	>  - if we switch to CP1252, that will *REDUCE* compatibility issues
	>    for Squeak importing text.
	
	You mean use CP1252 internally - and assume it on import?  I don't see 
	how this takes us closer to Unicode.

I never said it did.  What I said is that it will reduce compatibility
issues for Squeak importing text.  And I had previously given the
explanation.
    If someone originates text in Latin 1,
    Squeak will be able to handle it.

    If someone originates text in CP 1252,
    Squeak will be able to handle it,
    even if it is incorrectly labelled as Latin 1.

    If someone originates text in MacRoman,
    Squeak won't necessarily be able to handle ALL of it,
    but it will be able to handle MORE of it than if
    Squeak were restricted to Latin 1.
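The three cases above can be checked mechanically (a Python sketch,
using Python's codec names for the encodings discussed):

```python
# Bytes as typically produced on Windows: curly quotes plus an accent.
data = b'\x93smart quotes\x94 \xe9'

# Latin 1 maps 0x80-0x9F to (useless) C1 control characters:
as_latin1 = data.decode('latin-1')
assert as_latin1[0] == '\x93'          # a control code, not a quote

# CP 1252 agrees with Latin 1 everywhere Latin 1 has printable
# characters, but puts real characters in 0x80-0x9F:
as_cp1252 = data.decode('cp1252')
assert as_cp1252[0] == '\u201C'        # LEFT DOUBLE QUOTATION MARK
assert as_cp1252[-1] == '\u00E9'       # é, exactly what Latin 1 gives too
```

So a reader that assumes CP 1252 handles correctly-labelled Latin 1 text
unchanged, and handles mislabelled CP 1252 text as a bonus.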

By the way, anyone who is interested in converting to Unicode in the
long term should not be happy with Latin 9, if that's the one that is
just like Latin 1 except for having the Euro instead of the international
currency sign.  Why?  Because Latin 1 maps onto the bottom 256 codes of
Unicode, and Latin 9 doesn't.  You would require precisely the same
decoding mechanism (a translation table) for Latin 9 as you would for
CP 1252.
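A quick check of that point (Python sketch; `iso8859-15` is Python's
name for Latin 9):

```python
# Latin 1 is the identity map onto the bottom 256 Unicode code points:
assert all(ord(bytes([i]).decode('latin-1')) == i for i in range(256))

# Latin 9 is not: 0xA4 is the Euro sign, not the currency sign, so a
# Latin 9 decoder needs a translation table just as a CP 1252 one does.
assert bytes([0xA4]).decode('iso8859-15') == '\u20AC'   # Euro
assert bytes([0xA4]).decode('latin-1')    == '\u00A4'   # currency sign
assert bytes([0x80]).decode('cp1252')     == '\u20AC'   # Euro, elsewhere
```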

	My preference would be to make use of UTF-8 and read/write real unicode 
	- recoding internal representation as necessary.

My long term preference is for Unicode internally.
I just want to stress LONG TERM.

As for UTF-8, given the amount of stuff that is NOT in UTF-8 externally,
surely it is clear that Squeak needs the ability to read 8-bit
character sets now and will continue to need it for several years at least.
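Concretely, legacy 8-bit text is generally not even well-formed UTF-8
(Python sketch):

```python
# A perfectly ordinary Latin 1 file is malformed as UTF-8:
legacy = 'café'.encode('latin-1')      # b'caf\xe9'
try:
    legacy.decode('utf-8')
    raise SystemExit('unexpected: decoded cleanly')
except UnicodeDecodeError:
    pass                               # a bare 0xE9 byte is invalid UTF-8
assert legacy.decode('latin-1') == 'café'
```

So a UTF-8-only Squeak would reject, rather than merely mangle, a great
deal of existing text.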

	> You should not assume that someone you disagree with is ignorant.
	
	I never did.
	
You definitely implied that I didn't know about the problems caused
by "windows characters".

To put it bluntly, the problems are *caused* by Microsoft,
but *evoked* by programs that *don't* understand the "windows characters"
(some of which, as I seem to have to keep pointing out, are in fact
"Squeak characters", that being the point.)
That's why making sure that the Squeak fonts can display them will
*reduce* compatibility problems for Squeakers *importing* text.
Including, in this case, text imported FROM EARLIER VERSIONS OF SQUEAK.

	> All of the characters in the 128-159 block are perfectly standard 
	> Unicode
	> characters.
	
	This is a meaningless statement but I'll assume you mean the characters 
	win-1252 has located in that range.
	
It is not only not meaningless, it is true.
The context was crystal clear.  I was talking about
"the characters in the 128-159 block [of Windows CP 1252]".
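For the record, here is what CP 1252 assigns in that block (Python
sketch; five of the 32 positions are unassigned and are skipped):

```python
import unicodedata

for i in range(0x80, 0xA0):
    try:
        ch = bytes([i]).decode('cp1252')
    except UnicodeDecodeError:
        continue                  # 0x81, 0x8D, 0x8F, 0x90, 0x9D: unassigned
    # every assigned byte maps to a named, standard Unicode character
    assert unicodedata.name(ch)

print(bytes(range(0x91, 0x95)).decode('cp1252'))   # prints ‘’“”
```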

	For vanishingly small definitions of "common" I'm sure, unless you are 
	talking about using it to "fake" compatibility with code written with 
	ascii in mind.  Let me put it another way - as chief architect for 
	eTranslate I worked with a number of companies on internationalization 
	issues in most of the commonly used languages as well as designed and 
	implemented an architecture that let us host web apps in 34 different 
	languages (we made extensive use of ICU's converters by giving them 
	ObjectiveC wrappers and using them from WebObjects).  I never once saw 
	anybody consider using an in-memory character size that wasn't a 
	convenient even power of 2 for in-memory manipulation of text unless 
	they were streaming it.  Usually 16 bits (ie UCS-2).
	
You're talking about new or heavily maintained programs produced by
an organisation whose very reason for existence is internationalisation
(if I have understood you correctly).  Naturally your view of what counts
as typical code is biased.

I'm seeing textbooks written in the last year or two that are STILL teaching
people about character handling as if ASCII was king.

I've used XML parsers that use UTF-8 internally.  Heck, I've written one.

	As for the 21 bit character point, Unicode 3.0 fit into 16 bits (and 
	was the logical stopping point - but I guess standards people have to 
	eat too).

The goal of UCS/Unicode is to be a "Universal" character set.
As long as there are known human writing systems that are not covered,
the work is not finished.  Unicode Technical Report #4, dated 2002-05,
by Michael Everson, reports that no fewer than 96 scripts then remained
to be covered, compared with 52 that were covered.  (A really interesting
paper.  If you want to know why Tengwar _will_ be covered and Klingon
_won't_, that's the place to look.)

Whether or not "standards people have to eat", they have to do their job.

	From a practical standpoint, that version is pretty much the 
	end of the line unless you are working with Byzantine Musical Symbols 
	or something equally obscure (and at that point I have to ask - with 
	whom are you trying to stay compatible).

The majority of the musical symbols that were added in Unicode 3.1
are current Western symbols, not Byzantine.  Some of them relate to
mediaeval western music.  As a matter of fact, we had someone do a
computational musicology thesis here a couple of years ago.  

Unicode 3.1 added the Supplementary Multilingual Plane, the
Supplementary Ideographic Plane, and the Supplementary Special-Purpose
Plane (plane 14) to the old Basic Multilingual Plane.

Unicode 4.0 adds more than 1200 new characters to Unicode 3.2.
Unicode 3.2 added more than 1000 new characters to Unicode 3.1.

	All code points that don't fit into 16 bits can safely be ignored.
	
Let me quote Mark Davis and Vladimir Weinstein,
"Migrating Software to Supplementary Characters", May 2002:

    Until recently, it was not necessary for software to deal with
    supplementary code points, those from U+10000 to U+10FFFF.

(so far so good)

    With the assignment of over 40,000 supplementary characters in
    Unicode 3.1 and the definition of new national codepage standards
    that map to these new characters, it is important to modify
    BMP-only software to handle the full range of Unicode code points.

(so they don't agree with Todd Blanchard)

    [The new national code page standards include]
    GB 18030, JIS X 0213 and Big5-HKSCS.

The paper is actually well worth reading to find out what you can DO
about all these new characters.
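The mechanical consequence for BMP-only software is the surrogate pair
(Python sketch; GB 18030 is one of the national standards just quoted):

```python
# One supplementary code point becomes TWO 16-bit code units (a
# surrogate pair) in UTF-16, which is what breaks 16-bit-only software:
ch = '\U00010000'                       # first code point past the BMP
utf16 = ch.encode('utf-16-be')
assert len(utf16) == 4                  # two 16-bit units, not one
assert utf16 == b'\xd8\x00\xdc\x00'     # high surrogate D800, low DC00

# GB 18030 maps straight into the supplementary range, so software that
# must accept it cannot "safely ignore" code points past 16 bits:
assert ch.encode('gb18030').decode('gb18030') == ch
```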

	I don't think anything at all should be done until Yoshiki's
	work can be more fully evaluated.  I personally think he's on to
	something.
	
And _this_ is what has me scared.
"The best is the enemy of the good."
As things stand, if I want to process text in a UNIX environment
(or have students process text in a Windows environment) using
*both* Squeak *and* native tools, we can work around the line
termination problem, but we are stuck in the bad old days of ASCII
characters.  (While the second official language in this country
really requires vowels with macrons, people get by using Latin 1 &c
with butchered fonts that display diaeresis as macron.  I suspect
that being able to process "macronised" text may even be a "treaty
obligation".)  Unicode (with sufficiently capable fonts) would be
WONDERFUL.  But Latin 1 will do.  (And CP 1252 will do even better,
because it will preserve more existing Squeak characters.)
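The macron workaround described above is easy to demonstrate (a Python
sketch, not Squeak):

```python
# Maori text needs vowels with macrons; Latin 1 has none of them.
macron_a    = '\u0101'   # ā  LATIN SMALL LETTER A WITH MACRON
diaeresis_a = '\u00E4'   # ä  LATIN SMALL LETTER A WITH DIAERESIS (in Latin 1)

# ā simply cannot be encoded in Latin 1:
assert macron_a.encode('latin-1', errors='ignore') == b''

# ...so people store ä instead, and a doctored font draws the two dots
# as a bar; that is the "butchered fonts" workaround.
assert diaeresis_a.encode('latin-1') == b'\xe4'
```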

I'm afraid that too much agitation for Unicode will result in it being
put off (because Unicode really is dauntingly complex) and I'll STILL
be stuck with the ASCII subset, when getting Latin 1 or CP 1252 looks
as though it might actually happen _soon_.

I for one do not believe that a new 8-bit coding for Squeak is an
*alternative* to Unicode for Squeak.  I think it's very nearly a
precondition.  It's a task which is doable in a relatively short time,
but doing it will mean that there's a larger group of people who are
much more familiar with Squeak character issues than might otherwise be
the case.

Whether the 8-bit character set is Latin 1 or CP 1252 is largely a matter
of font display; Squeak has had accented letters for a long time without
regarding them as letters.


