[ENH] Display := when pretty printing ( [sm][et][er][cd] [approved] )

Thu Oct 16 00:02:34 UTC 2003

	You are right and I'm wrong. This was easy because I'm complete
	naive in this topic.

I'm assuming this is sarcasm.

	My first name is spelled with an accent é so this is just now
	that I start to receive international mail, bills or email not
	written st&*phane.

ISO 8859-1 has been around for over 15 years, hasn't it?
MS-DOS codepage 850 had all the ISO 8859-1 characters, just not in
the same places.  (And Xerox were using the 16-bit XNS character set
back in 1984, at least.)

So it's taken _this_ long for people to catch on that ASCII's dead?

	So sure ASCII is bad.  Now Squeak can have a really fancy
	wonderful character set.

I don't say "fancy".  In fact, when it comes to Unicode, I say
"insanely complicated".  I don't say "wonderful", just "better than ASCII".
The important thing about Unicode is that it's *one* character set that
the computing world is converging on.  Windows, MacOS, and UNIX all support
it.  Perl, Python, and even Tcl support it, not to mention Java, Ada, C++,
and C.  There are several editors out there that support Unicode; sam and
Yudit spring to mind.  XML is based on it.  It's the one character set we
_have_ to have if we're not going to have the Python folk sniggering at us.

	Three remarks:
	- My point that it should be consistent,

I don't understand.  What should be consistent with what?

	- Then I really wonder why Squeak that tries to be ANSI (you
	were the first one to claim that for initialize) is not ANSI
	with assignment.  This is fun to have different compatibility
	policy.  But I can live with that because I do not care about a
	bad standard.

I'm having a bit of trouble parsing that, so I'll respond to what I
_think_ you meant, fully aware that I could be quite wrong.

The ANSI Smalltalk standard is flawed in a number of ways.  It looks as
though nobody bothered to do any proof-reading at all.  It is sloppy; it
looks as though very little thought went into corner cases.  It is verbose
and occasionally vague, suffering from the ills common to informal
specifications.  And it leaves a lot out.

BUT it is the only standard we have, and the Smalltalk community is not
so large that building artificial barriers is a good idea.

The ANSI Smalltalk standard says that "_" is a character that is usable
in identifiers.  I think this is a Good Thing and I would like Squeak to
support it.  The idea of putting space between words to make text much
more readable was discovered about 500 years ago.  It's time Squeak caught up.

(Oddly enough, the S programming language used both _ and <- for assignment.
The 1.8.0 release of R has finally killed off the _ spelling of <- .)

HOWEVER, while I think it is important that ANSI-compatible identifiers
with underscores should be supported by Squeak, I *don't* want to lose
the assignment arrow.  If Unicode didn't exist, I'd be suggesting that we
add the four arrows at code positions 31 (left arrow) 30 (up arrow)
29 (down arrow) and 28 (right arrow) -- it's no coincidence that Ctrl-_
is 31 and Ctrl-^ is 30 -- to all the Squeak fonts.  Since Unicode does
exist, U+2190 will do perfectly.  I want ALL THREE things supported:
	_	in identifiers (ANSI-compatible)
	:=	assignmentOperator (ANSI-compatible)
	U+2190	assignmentOperator (historic, readable)
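
The Ctrl-key arithmetic behind those code positions can be checked with a
small sketch.  It assumes the usual ASCII convention that Ctrl-<key> yields
the low five bits of the key's code; the function name control_code is made
up for illustration, and pairing Ctrl-] and Ctrl-\ with the down and right
arrows is only my reading of positions 29 and 28 above.

```python
def control_code(key: str) -> int:
    """Return the ASCII code sent by Ctrl-<key>: the low five bits of key."""
    return ord(key) & 0x1F


if __name__ == "__main__":
    # '_' is 0x5F and '^' is 0x5E, so masking with 0x1F gives 31 and 30.
    for key, arrow in [("_", "left"), ("^", "up"), ("]", "down"), ("\\", "right")]:
        print(f"Ctrl-{key} -> {control_code(key)} ({arrow} arrow)")
```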

To quote the ANSI standard:
  "An implementation may define characters in addition to those listed
   below in each character category.  While the meaning of a program
   that uses any such characters is well defined it may not be
   portable between conforming implementations."
and:
  "Three types of operator tokens are defined in Smalltalk: binary
   selectors, the return operator, and the assignment operator.  ...
   An implementation may define additional binaryCharacters but their
   use may result in a non-portable program."

This really isn't what I wanted to hear.  It appears that while you
_are_ allowed to define new digits, new letters, and new binary
characters, you _aren't_ allowed to define new comment delimiters,
new assignment operators, or new return operators.  However, I don't
think that needs to stop us.  We can give non-conformant meanings to
non-ANSI-Smalltalk characters as long as we don't expect code using
those characters to port to other Smalltalks.  Using left arrow is
quite harmless, because when you ask for code to be saved out in a
portable way (which .cs and .st files are not) left arrows not in a
string or comment can be automatically converted to :=.

Note, by the way, that ":=" is incompatible with mathematical usage.
In mathematics "x := y" doesn't mean "change x to have y as value",
it means "define x to be y all the time".

Similarly, if we are to render (X)HTML and other files correctly,
we want ^ to display as ^.  And the ANSI Smalltalk standard says that
^ is the return operator.  But it doesn't say that the up arrow isn't
_also_ a return operator, and that's what I want:

	^	returnOperator (ANSI-compatible) displays as ^
	U+2191	returnOperator (historic, readable)

While not strictly kosher according to ANSI, it's less of a syntactic
extension than curly brace notation for arrays.  When you ask for code
to be saved out in a portable way, up arrows not in a string or comment
can be automatically converted to ^ .
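
The "portable save" conversion described above could look roughly like the
following sketch.  It is not how Squeak does anything today; to_portable is
a hypothetical name, and the scanner only knows about '...' strings, "..."
comments, and $x character literals, which I am assuming is enough to decide
whether an arrow is "in a string or comment".

```python
LEFT_ARROW = "\u2190"  # historic assignment arrow
UP_ARROW = "\u2191"    # historic return arrow


def to_portable(source: str) -> str:
    """Replace arrows with := and ^ except inside strings, comments,
    and character literals, which are copied through unchanged."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in "'\"":
            # Copy a '...' string or "..." comment verbatim.  A doubled ''
            # inside a string is copied as two adjacent strings, which
            # yields the same text.
            j = source.find(ch, i + 1)
            if j < 0:          # unterminated: copy the rest as-is
                j = n - 1
            out.append(source[i:j + 1])
            i = j + 1
        elif ch == "$" and i + 1 < n:
            # Character literal such as $' or $<arrow>: copy both characters.
            out.append(source[i:i + 2])
            i += 2
        elif ch == LEFT_ARROW:
            out.append(":=")
            i += 1
        elif ch == UP_ARROW:
            out.append("^")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

So a method body written with arrows comes out with := and ^, while an arrow
that is part of a string, a comment, or a character literal is left alone.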

With this approach, we can translate old Squeak sources to Unicode:
basically use the MacRoman -> Unicode mapping given in
MAPPINGS/VENDORS/APPLE/ROMAN.TXT, modified to map ^ and _ to the
appropriate arrows.  (I have a modified ROMAN.TXT called SQUEAK.TXT that I
can post if anyone would be interested.)
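
A minimal sketch of that translation, assuming Python's built-in "mac_roman"
codec is close enough to ROMAN.TXT for the purpose (it is not SQUEAK.TXT,
and may differ from it in details).  The function name is made up for
illustration.

```python
# Code points that old Squeak fonts drew as arrows, mapped to real arrows.
SQUEAK_ARROWS = {
    ord("_"): "\u2190",  # underscore was displayed as a left arrow
    ord("^"): "\u2191",  # caret was displayed as an up arrow
}


def squeak_bytes_to_unicode(data: bytes) -> str:
    """Decode MacRoman bytes, then turn _ and ^ into Unicode arrows."""
    return data.decode("mac_roman").translate(SQUEAK_ARROWS)
```

Because this is a pure character mapping, every _ and ^ becomes an arrow,
including those in strings and comments, which matches what the old fonts
showed on screen.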

	- Third my goal is that Smalltalk or any new
	Smalltalk-based-better system grows and get more programmers.

Good.  We share that goal.  This is one reason why ANSI compatibility
is important.  This is one reason why a book should use the ANSI symbols.

	Now if I want to attract programmers of other languages:  I do
	not know any mainstream language (I'm certainly wrong here again
	I'm not expert but just an experienced promotor of Smalltalk)
	that does not have an ASCII basis

I guess you haven't looked at C, C++, or Java lately.  Java, C++, and
C99 allow any defined non-formatting Unicode character in (some) strings
and in comments, and while not being exactly compatible with the Unicode
rules for identifiers they allow *most* of the Unicode identifier characters
in identifiers.  For transport purposes these languages provide
	\uxxxx		16-bit Unicode character U+xxxx
	\U00xxxxxx	21-bit Unicode character U+xxxxxx
for use when a character is not expressible in a particular encoding.
Quoting a draft of the C++ standard,
    [Translation phase 1]
	Physical source file characters are mapped, in an implementation-
	defined manner, to the source character set (introducing new-line
	characters for end-of-line indicators) if necessary.  Trigraph
	sequences [...] are replaced by corresponding single-character
	internal representations.  Any source file character not in the
	basic source character set [...] is replaced by the universal-
	character-name that designates that character.
    [Footnote]
	The process of handling extended characters is specified in terms
	of mapping to an encoding that uses only the basic source character
	set, and, in the case of character literals and strings, further
	mapping to the execution character set.  In practical terms, however,
	any internal encoding may be used, so long as an actual extended
	character encountered in the input, and the same extended character
	expressed in the input as a universal-character-name (i.e. using
	the [\u and \U] notation), are handled equivalently.
I suppose you could argue that having \u00e9 as a transport/internal
representation for é means that C99, C++, and Java still have
"an ASCII basis", but the intent is not that people should write
\u00e9gal as an identifier, but rather that they should write égal.
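
The phase-1 mapping quoted above can be sketched as follows.  This is only
an illustration: to_ucn is a made-up name, and it treats all of 7-bit ASCII
as the "basic source character set", which is a simplification (the real
basic set is somewhat smaller).

```python
def to_ucn(text: str) -> str:
    """Replace every non-ASCII character by a universal-character-name:
    \\uXXXX for code points up to U+FFFF, \\UXXXXXXXX above that."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # stand-in for "basic" characters
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")    # four hex digits
        else:
            out.append(f"\\U{cp:08x}")    # eight hex digits
    return "".join(out)
```

Under these assumptions an identifier typed with an e-acute in front of
"gal" is carried as \u00e9gal, and the programmer never has to type the
escape by hand.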

	meaning for me:  that I can type with vi (that I hate), emacs or
	**any** text editor.

Thanks to Squeak's insistence on using Ctrl-M as line terminator,
I can't use "any" text editor on Squeak sources right now.  Try using
vi or ex or edit or ed or even view on a .st or .cs file.  Even using
Emacs, what you see is a sea of ^Ms.
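
For what it's worth, getting such a file into a state that vi and friends
can cope with is a one-liner.  A sketch, assuming the file uses a bare CR
(Ctrl-M) or CR LF as line terminator; normalize_line_ends is a hypothetical
helper name.

```python
def normalize_line_ends(data: bytes) -> bytes:
    """Turn CR LF and bare CR line terminators into LF."""
    # Handle CR LF first so that it does not become two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```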

Limit yourself to what you can use in "**any** text editor", and you
can forget about using any accented letters anywhere.

Sam, Yudit, and MULE exist.  And I am NOT saying that nobody should be
allowed to use ":=" for assignment and "^" for returning, quite the
contrary.  What I am saying is that nobody should be FORCED to use them.

	So sure we can be the one that does not have the problems you
	mention but right now the only thing I see is inconsistencies
	everywhere.  So may be I'm just too pragmatic.

No, someone who's too pragmatic would put up with inconsistency.

	So as I say I'm wrong but I feel like in a museum sometimes

Insisting on being able to use any old text editor certainly seems that way.
But then, I write HTML in an ASCII editor, and still get to use &larr;
and &uarr; ...