Adding Accufonts to the update stream (was Re: LicencesQuestion : Squeak-L Art 6.)

Andreas Raab andreas.raab at gmx.de
Thu Feb 27 00:18:49 UTC 2003


<much of an interesting discussion snipped>
> As things stand, if I want to process text in a UNIX environment
> (or have students process text in a Windows environment) using
> *both* Squeak *and* native tools, we can work around the line
> termination problem, but we are stuck in the bad old days of ASCII
> characters.
[...]
> I'm afraid that too much agitation for Unicode will result in it being
> put off (because Unicode really is dauntingly complex) and I'll STILL
> be stuck with the ASCII subset, when getting Latin 1 or CP 1252 looks
> as though it might actually happen _soon_.

I think that's the major point here. Changing to another 8-bit encoding
requires only a few adjustments, maybe a bit of juggling with the current
characters, but that's essentially it. I think that someone should just
make up an SM package which changes the image side of things (with
appropriate workarounds for the VM's character events) so that something
like this can be tried out.

The reason being that I think only a few people _really_ care, but those
who do care a lot. They should be given a chance to see whether any new
scheme would match their expectations better - be that Latin-X or CP1252
or whatever else.
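To make the "bit of juggling" concrete: at bottom such a package is a
256-entry translation table. Here is a sketch in Python rather than
Smalltalk (the codec names are Python's; a Squeak version would build its
table from the font's actual character layout):

```python
def remap(data: bytes, src: str = "mac_roman", dst: str = "cp1252") -> bytes:
    """Re-encode 8-bit text from one single-byte set to another via Unicode."""
    # Build the 256-entry translation table once, then apply it per byte.
    # Characters with no home in the target set degrade to '?'.
    table = bytes(
        bytes([b]).decode(src).encode(dst, "replace")[0] for b in range(256)
    )
    return data.translate(table)

# 'é' sits at 0x8E in MacRoman but at 0xE9 in CP1252 (and Latin-1):
print(remap(b"caf\x8e"))  # b'caf\xe9'
```

The ASCII range passes through unchanged, so only the top half of the
table is ever in play.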

Still, I can't resist that question:
> If you want to know why Tengwar _will_ be covered and Klingon
> _won't_, that's the place to look.)

I couldn't find the paper - do you have a reference?! I'm _really_ curious
how Tengwar could end up in Unicode ;-)

Cheers,
  - Andreas

> -----Original Message-----
> From: squeak-dev-bounces at lists.squeakfoundation.org 
> [mailto:squeak-dev-bounces at lists.squeakfoundation.org] On 
> Behalf Of Richard A. O'Keefe
> Sent: Thursday, February 27, 2003 12:58 AM
> To: squeak-dev at lists.squeakfoundation.org
> Subject: Re: Adding Accufonts to the update stream (was Re: 
> LicencesQuestion : Squeak-L Art 6.)
> 
> 
> Todd Blanchard wrote:
> 	> This is a lot like saying you don't like the letter E.
> 	
> 	No, E is a character.  Straight quotes vs. curly pairs are a
> 	typographical convention masquerading as characters.
> 
> 
> As a matter of fact, "straight" quotes are the typographic convention.
> The handwriting I was taught at school used 66...99 and 6..9 quotes;
> they are as distinct as left and right parenthesis.
> The so-called "straight quotes" are a typographic kluge introduced
> for typewriters so that two characters ' " could do the work of six
> (or seven, depending on whether you were taught to draw apostrophes
> differently from right single quotes or not):  ' does duty for left
> and right single quotes and on some typewriters for the top half of
> an exclamation mark, and " does duty for left and right double quotes
> and the diaeresis.
> 
> You might as well say that "E" -vs- "e" is just a "typographical
> convention masquerading as characters" and in fact just that position
> has been argued.
> 
> Let's put it this way, whatever words we wrap around it,
> the distinction between left double quote, right double quote,
> and diaeresis/umlaut is as real a distinction as between E and e
> or between ( and ).   (And yes, there are European typewriter
> conventions that use /.../ instead of (...), so the analogy is exact.)
> 
> 	Yet another place the Unicode consortium has failed to adhere to
> 	their "characters not glyphs" mantra.
> 	
> I think you mean "slogan", not "mantra".  In Unicode 3, that slogan
> has not so much been abandoned as hanged, drawn, and quartered.  If
> you don't already know what I mean, take a look at the umpteen copies
> of ASCII, in different typographic styles, which have been added
> allegedly for mathematics.
> 
> Let's be fair to the Unicode consortium, though.  In the early days
> they were desperately trying to stick within the 16-bit limit, and
> "characters not glyphs" was the only way they could _possibly_ have
> done so.  Now that the 16-bit limit has been passed (and how!) that
> constraint doesn't apply any more (and Java is left looking rather
> sick).
> 
> There were two far more important slogans:
>  - faithful round trip conversion.
>    It had to be possible to convert any of a large list of existing
>    standard encodings to Unicode and back again without losing
>    anything.  This meant that if any existing standard made a
>    distinction, Unicode had to.  And that made for a number of
>    otherwise highly unpleasant distinctions, like the distinctions
>    between Angstrom unit sign and A-with-ring, between micro and mu,
>    between summation and capital sigma, and so on for a long list.
> 
>  - it had to cope with existing human writing systems.
>    And whether Todd Blanchard or anyone else likes it or not,
>    the distinction between left and right quotation marks is just as
>    much a part of the English writing system as the distinction
>    between left and right guillemets (which are in Latin-1) is part
>    of the French writing system.
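Both slogans are still visible in today's Unicode data. A quick
illustration with Python's unicodedata module (nothing Squeak-specific):
the round-trip distinctions survive as separate code points, and
normalization is what later relates them.

```python
import unicodedata

# ANGSTROM SIGN (U+212B) exists separately from LATIN CAPITAL LETTER A
# WITH RING ABOVE (U+00C5) purely so legacy encodings that distinguish
# them can round-trip.  Canonically they are the same letter:
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"

# MICRO SIGN (U+00B5, inherited from Latin-1) vs GREEK SMALL LETTER MU
# (U+03BC): here the equivalence is only "compatibility", not canonical,
# so NFC keeps them apart while NFKC folds them together:
assert unicodedata.normalize("NFKC", "\u00b5") == "\u03bc"
assert unicodedata.normalize("NFC", "\u00b5") == "\u00b5"
```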
> 
> 	>  - if we switch to CP1252, that will *REDUCE* compatibility
> 	>    issues for Squeak importing text.
> 	
> 	You mean use CP1252 internally - and assume it on import?  I
> 	don't see how this takes us closer to Unicode.
> 
> I never said it did.  What I said is that it will reduce compatibility
> issues for Squeak importing text.  And I had previously given the
> explanation.
>     If someone originates text in Latin 1,
>     Squeak will be able to handle it.
> 
>     If someone originates text in CP 1252,
>     Squeak will be able to handle it,
>     even if it is incorrectly labelled as Latin 1.
> 
>     If someone originates text in MacRoman,
>     Squeak won't necessarily be able to handle ALL of it,
>     but it will be able to handle MORE of it than if
>     Squeak were restricted to Latin 1.
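That trichotomy is easy to demonstrate, with Python's codecs standing in
for the encodings under discussion:

```python
# Text that originated as Latin 1 decodes identically under CP1252,
# because CP1252 agrees with Latin 1 everywhere except 0x80-0x9F:
latin1_bytes = "caf\u00e9".encode("latin-1")       # é = 0xE9 in both
assert latin1_bytes.decode("cp1252") == "caf\u00e9"

# Text that originated as CP1252 keeps its curly quotes if read as
# CP1252, but degenerates to C1 control characters if read as Latin 1:
cp1252_bytes = "\u201ccaf\u00e9\u201d".encode("cp1252")
assert cp1252_bytes == b"\x93caf\xe9\x94"
assert cp1252_bytes.decode("latin-1") == "\x93caf\u00e9\x94"
```

(CP1252 leaves five bytes in 0x80-0x9F unassigned, which is why the
claim is "handle MORE of it", not "handle all of it".)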
> 
> By the way, anyone who is interested in converting to Unicode in the
> long term should not be happy with Latin 9, if that's the one that is
> just like Latin 1 except for having the Euro instead of the
> international currency sign.  Why?  Because Latin 1 maps onto the
> bottom 256 codes of Unicode, and Latin 9 doesn't.  You would require
> precisely the same decoding mechanism (a translation table) for
> Latin 9 as you would for CP 1252.
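For the curious, the Latin 9 point can be checked mechanically; a quick
Python sketch (the codec names are Python's):

```python
# Latin 1 is the identity map onto Unicode's first 256 code points:
assert all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

# Latin 9 (ISO 8859-15) is not: eight positions differ, starting with
# the Euro sign replacing the international currency sign at 0xA4:
assert b"\xa4".decode("iso8859-15") == "\u20ac"   # EURO SIGN
assert b"\xa4".decode("latin-1") == "\u00a4"      # CURRENCY SIGN

diff = [b for b in range(256)
        if bytes([b]).decode("iso8859-15") != chr(b)]
assert len(diff) == 8   # so Latin 9, like CP1252, needs a lookup table
```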
> 
> 	My preference would be to make use of UTF-8 and read/write real
> 	unicode - recoding internal representation as necessary.
> 
> My long term preference is for Unicode internally.
> I just want to stress LONG TERM.
> 
> As for UTF-8, given the amount of stuff that is NOT in UTF-8
> externally, surely it is clear that Squeak needs the ability to read
> 8-bit character sets now and will continue to need it for several
> years at least.
> 
> 	> You should not assume that someone you disagree with is
> 	> ignorant.
> 	
> 	I never did.
> 	
> You definitely implied that I didn't know about the problems caused
> by "windows characters".
> 
> To put it bluntly, the problems are *caused* by Microsoft,
> but *evoked* by programs that *don't* understand the "windows
> characters" (some of which, as I seem to have to keep pointing out,
> are in fact "Squeak characters", that being the point.)
> That's why making sure that the Squeak fonts can display them will
> *reduce* compatibility problems for Squeakers *importing* text.
> Including, in this case, text imported FROM EARLIER VERSIONS OF SQUEAK.
> 
> 	> All of the characters in the 128-159 block are perfectly
> 	> standard Unicode characters.
> 	
> 	This is a meaningless statement but I'll assume you mean the
> 	characters win-1252 has located in that range.
> 	
> It is not only not meaningless, it is true.
> The context was crystal clear.  I was talking about
> "the characters in the 128-159 block [of Windows CP 1252]".
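If anyone wants to see exactly which characters those are, Python's
cp1252 codec will enumerate them (the five unassigned bytes in the block
simply fail to decode):

```python
import unicodedata

# Where CP1252 assigns a byte in the 0x80-0x9F block, it maps to an
# ordinary, named Unicode character (Euro sign, curly quotes, dashes,
# dagger, trademark, ...):
mapped = {}
for b in range(0x80, 0xA0):
    try:
        mapped[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        pass  # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned

for b, ch in sorted(mapped.items()):
    print(f"0x{b:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```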
> 
> 	For vanishingly small definitions of "common" I'm sure, unless
> 	you are talking about using it to "fake" compatibility with
> 	code written with ascii in mind.  Let me put it another way -
> 	as chief architect for eTranslate I worked with a number of
> 	companies on internationalization issues in most of the
> 	commonly used languages as well as designed and implemented an
> 	architecture that let us host web apps in 34 different
> 	languages (we made extensive use of ICU's converters by giving
> 	them ObjectiveC wrappers and using them from WebObjects).  I
> 	never once saw anybody consider using an in-memory character
> 	size that wasn't a convenient even power of 2 for in-memory
> 	manipulation of text unless they were streaming it.  Usually
> 	16 bits (ie UCS-2).
> 	
> You're talking about new or heavily maintained programs produced by
> an organisation whose very reason for existence is internationalisation
> (if I have understood you correctly).  Naturally your view of what
> counts as typical code is biased.
> 
> I'm seeing textbooks written in the last year or two that are STILL
> teaching people about character handling as if ASCII was king.
> 
> I've used XML parsers that use UTF-8 internally.  Heck, I've written
> one.
> 
> 	As for the 21 bit character point, Unicode 3.0 fit into 16
> 	bits (and was the logical stopping point - but I guess
> 	standards people have to eat too).
> 
> The goal of UCS/Unicode is to be a "Universal" character set.
> As long as there are known human writing systems that are not covered,
> the work is not finished.  Unicode Technical Report #4, dated 2002-05,
> by Michael Everson, reports that no fewer than 96 scripts then
> remained to be covered, compared with 52 that were covered.  (A really
> interesting paper.  If you want to know why Tengwar _will_ be covered
> and Klingon _won't_, that's the place to look.)
> 
> Whether or not "standards people have to eat", they have to do their
> job.
> 
> 	From a practical standpoint, that version is pretty much the
> 	end of the line unless you are working with Byzantine Musical
> 	Symbols or something equally obscure (and at that point I have
> 	to ask - with whom are you trying to stay compatible).
> 
> The majority of the musical symbols that were added in Unicode 3.1
> are current Western symbols, not Byzantine.  Some of them relate to
> mediaeval western music.  As a matter of fact, we had someone do a
> computational musicology thesis here a couple of years ago.  
> 
> Unicode 3.1 added the Supplementary Multilingual Plane, the
> Supplementary Ideographic Plane, and the Supplementary Special-Purpose
> Plane (plane 14) to the old Basic Multilingual Plane.
> 
> Unicode 4.0 adds more than 1200 new characters to Unicode 3.2.
> Unicode 3.2 added more than 1000 new characters to Unicode 3.1.
> 
> 	All code points that don't fit into 16 bits can safely be
> 	ignored.
> 	
> Let me quote Mark Davis and Vladimir Weinstein,
> "Migrating Software to Supplementary Characters", May 2002:
> 
>     Until recently, it was not necessary for software to deal with
>     supplementary code points, those from U+10000 to U+10FFFF.
> 
> (so far so good)
> 
>     With the assignment of over 40,000 supplementary characters in
>     Unicode 3.1 and the definition of new national codepage standards
>     that map to these new characters, it is important to modify
>     BMP-only software to handle the full range of Unicode code points.
> 
> (so they don't agree with Todd Blanchard)
> 
>     [The new national code page standards include]
>     GB 18030, JIS X 0213 and Big5-HKSCS.
> 
> The paper is actually well worth reading to find out what you can DO
> about all these new characters.
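A concrete taste of the migration problem, in Python, using one of the
Unicode 3.1 musical symbols (U+1D11E MUSICAL SYMBOL G CLEF, which lies
beyond the 16-bit Basic Multilingual Plane):

```python
# MUSICAL SYMBOL G CLEF, a Unicode 3.1 addition above U+FFFF:
clef = "\U0001D11E"
assert ord(clef) == 0x1D11E

# In UTF-16 it needs a surrogate pair - this is exactly what breaks
# 16-bit ("UCS-2") software that assumes one code unit per character:
assert clef.encode("utf-16-be") == b"\xd8\x34\xdd\x1e"

# In UTF-8 it is simply a four-byte sequence, nothing special:
assert clef.encode("utf-8") == b"\xf0\x9d\x84\x9e"
```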
> 
> 	I don't think anything at all should be done until Yoshiki's
> 	work can be more fully evaluated.  I personally think he's on to
> 	something.
> 	
> And _this_ is what has me scared.
> "The best is the enemy of the good."	
> 	
> As things stand, if I want to process text in a UNIX environment
> (or have students process text in a Windows environment) using
> *both* Squeak *and* native tools, we can work around the line
> termination problem, but we are stuck in the bad old days of ASCII
> characters.  (While the second official language in this country
> really requires vowels with macrons, people get by using Latin 1 &c
> with butchered fonts that display diaeresis as macron.  I suspect
> that being able to process "macronised" text may even be a "treaty
> obligation".)  Unicode (with sufficiently capable fonts) would be
> WONDERFUL.  But Latin 1 will do.  (And CP 1252 will do even better,
> because it will preserve more existing Squeak characters.)
> 
> I'm afraid that too much agitation for Unicode will result in it being
> put off (because Unicode really is dauntingly complex) and I'll STILL
> be stuck with the ASCII subset, when getting Latin 1 or CP 1252 looks
> as though it might actually happen _soon_.
> 
> I for one do not believe that a new 8-bit coding for Squeak is an
> *alternative* to Unicode for Squeak.  I think it's very nearly a
> precondition.  It's a task which is doable in a relatively short time,
> but doing it will mean that there's a larger group of people who are
> much more familiar with Squeak character issues than might otherwise
> be the case.
> 
> Whether the 8-bit character set is Latin 1 or CP 1252 is largely a
> matter of font display; Squeak has had accented letters for a long
> time without regarding them as letters.
> 
