[ENH] CharacterTimes-rsb

Wed Jun 23 03:25:38 UTC 2004

I wrote:
	> The obvious question is, "just how common is this operation, anyway?"
	> The only time I've ever wanted $c repeatedTimes: n, $c has been
	> Character space, and what I've _really_ wanted has been aStream space: 
	> n.

	I must say that I find this suggestion, to presume that space: exists 
	on streams and is somehow more natural, to be a horrible idea, whose 
	only factor in its favor is legacy presence from Smalltalk-80.

The Stream classes in Squeak are a horrible mess.  Even in ANSI Smalltalk,
which is pretty minimalist, there are some quirks, such as
  - requiring that all <gettable streams> support #nextLine, even
    binary streams
  - requiring that all <puttable streams> support #cr, #space, and #tab,
    even binary streams
  - (aStream space) is equivalent to (aStream nextPut: Character space);
    (aStream tab)   is equivalent to (aStream nextPut: Character tab);
    but (aStream cr) might NOT be equivalent to (aStream nextPut:
    Character cr).
and even requirements that are obviously IMPOSSIBLE to satisfy, such as
  - requiring that EVERY kind of stream support the <sequenced stream>
    protocol, which means supporting #position and #position: and #contents.
    * The various requirements for #position cannot be jointly satisfied
      when reading character data from a UTF-8-encoded file, where
      character counts and byte offsets do not line up.  (Section 5.10.1
      only envisages #'binary' and #'text' external stream types, but
      #'utf8' is an obvious thing to want.)
    * #position: cannot be implemented AT ALL on many kinds of <file stream>;
      how the heck do you reposition a keyboard or a serial line or a socket?
    * To implement #contents means that you cannot open files write-only;
      they must be opened read-write.  So if you have write-only or
      append-only access to a file, you can't use ANSI Smalltalk to write
      to it.  Presumably it also means that the number of bytes you are
      allowed to pump down a socket is limited by the amount of memory
      available to hold (aSocket contents).  This is ridiculous.

Input and output operations are so mingled together in Stream that on
several occasions when I've wanted to define a new kind of stream, I've
*NOT* made it a subclass of Stream.

So I don't want to defend the status quo in any kind of detail.

What I *do* want to defend is the idea that something *very like* a
Stream is the right way to construct Strings.

This opinion goes back a long way, to my experience using Burroughs
Algol 'POINTER' for manipulating strings and my experience of
implementing XPL-like strings in Burroughs Algol (my first published
paper).  'POINTER', doing vaguely-Stream-like things, was both easier to
use and considerably more efficient than XPL-like strings.  Then a
friend of mine wrote a batch editor based on Prime's line editor and
made a fair bit of pocket money selling it to IBM mainframe users.  It
was the commands that were based on Prime's editor; the code was a thick
wodge of PL/I all his very own, fighting PL/I strings every inch of the
way.  As my first serious exercise in learning C on a PDP-11, I wrote a
much more powerful version of his editor (it even had undo!)  in 10
pages of C taking 5 days.  What was it about C that helped?  *NOT*
having strings built into the language.  Another similar experience was
with Xerox Quintus Prolog.  This was on the Xerox Dandelions,
Dandetigers, and so on (1108, 1109, 1185, 1186); the machines were
running Interlisp-D and we'd been given a tiny corner of the microcode
and just 4MB of the virtual memory to put Prolog in.  I made the
pleasant discovery that manipulating strings as Prolog lists of
characters using definite clause grammars (a rough equivalent to using
streams, in this context) was considerably more efficient than using
Interlisp-D's native strings (about 5 times faster, in fact, on the
benchmarks I was using) as well as being considerably easier to do.

Java, of course, has Strings.  But for any serious String construction,
Sun warn you over and over and over NOT to use String concatenation but
to use StringBuffers.  And what is a Java StringBuffer but
(WriteStream on: (String new: someDefault))?

So, while I don't want to defend the details of Stream, I cherish the
way Smalltalk uses Streams for building Strings as one of the excellent
core ideas (marred by the typical sloppiness in followup design).
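
To make the comparison concrete, here is a minimal workspace sketch of
building a String through a stream.  It is an illustration only: the
variable name ws and the sample text are made up, and it assumes an image
(such as Squeak) where WriteStream understands #space: and #print:.

```smalltalk
| ws |
"Start with a small buffer; the stream grows it as needed."
ws := WriteStream on: (String new: 16).
ws nextPutAll: 'abc'.
"Write three spaces in one send."
ws space: 3.
ws nextPutAll: 'def'; tab; print: 42.
"Answers 'abc   def', a tab, then '42'."
ws contents
```

The stream appends into its own buffer instead of allocating a fresh
String for every #, send, which is the same bargain StringBuffer offers.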

	Perhaps Squeak is interested in this for its own sake, but I see
	no reason to continue in this path.  #space: is nondescript,

#space: is to #space as #tab: is to #tab and #crtab: is to #crtab.
The name could be better; maybe #nextPutSpaces: would have been clearer.

	presumes that streams have knowledge of characters,

which they do, fairly pervasively.

	and presumes that they care about printing of any kind.

which they do, fairly pervasively.

I don't see any problem with there being *some* kinds of Stream that
know about characters and are "for" printing, so I presume that the
complaint is that *every* kind of Stream (even, say, a WriteStream on
a FloatArray) is burdened with these things.

I would love to see a redesign of the Stream classes, probably based
on Traits, with a proper factoring between reading, writing, positioning,
and text.

	While removing space: from Squeak would take a bit of work,

Actually, the Squeak 3.6 image I'm running contains no senders of
#space: outside my own code.  It's #tab: and #crtab: that would be
hard to expunge.

	encouraging continued reliance on it as though its design were
	truly justified instead of haphazard, is intellectually
	irresponsible.

Just what is bad about having an operation for writing several spaces
to a character stream?  Why is, for example,
    (format a-stream "~6@T")
OK, but
    aStream space: 6
is 'intellectually irresponsible'?

	What Slate does for timesRepeated: is totally element-type-compatible 
	and abstract, creating a Repetition sequence object, with slots for 
	element and size.

"Preview" can't find "Repetition" in progman.pdf.  Is there a more
recent version of the Slate programmer's manual that describes this?
(And is it available in A4 format, pretty please?)

	But then in my eyes the only reason this use-case is 
	rare in Smalltalk is because it isn't natural with the given idioms.

Like I said, we already have
    aCollectionClass new: someSize withAll: initialElement
We also have
    aCollection add: anElement withOccurrences: someNaturalNumber
and it is a great pity that RunArray introduces a new name
    aRunArray addLast: anElement times: someNaturalNumber
for that and doesn't even map #add:withOccurrences: to it.

In fact,
    (RunArray new: size withAll: element)
pretty much *is* just such a Repetition sequence object; you can if
you wish write things like

    'abc', (RunArray new: 3 withAll: $*), 'def'

to get 'abc***def'.  (Doing it that way is inefficient in Smalltalk, as is
all non-trivial non-stream-based concatenation.  It could be more efficient
if there were a Concatenation object, but the result _in Smalltalk_ is
expected to be mutable.  If you want to argue for an implementation of
sequences that supports O(1) concatenation, I'm with you.  But that's not
what we have at the moment.)
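
As a rough workspace sketch of the existing selectors mentioned above
(the variable names are mine, and it assumes a Squeak-like image that
has #new:withAll:, #next:put:, and #add:withOccurrences:):

```smalltalk
| stars viaComma viaStream bag |
"A String of three stars, using the ordinary collection constructor."
stars := String new: 3 withAll: $*.

"Concatenation with #, builds a new String at every step."
viaComma := 'abc', stars, 'def'.

"The stream version writes the repeated element straight into the buffer."
viaStream := String streamContents: [:s |
    s nextPutAll: 'abc'; next: 3 put: $*; nextPutAll: 'def'].

"#add:withOccurrences: does the same job for growable collections."
bag := Bag new.
bag add: $* withOccurrences: 3.
```

Both viaComma and viaStream answer 'abc***def', and
(bag occurrencesOf: $*) answers 3.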

	I modified my suggestion since there'd be an uproar about a new class 
	just for this apparently distasteful idiom.

Not really.  You'd just have had me pointing out that RunArray already
does that job.

	I'm of the opinion that if there's /any/ kind of idiom, it should be 
	refactored into something concise and natural, even if only for the 
	sake of seeing whether the idiom is common, can be common, or just to 
	reify it for your code-aware environment (senders-browsing).

That really is a "motherhood and apple pie" sentiment.
How could I disagree without appearing stupid?
Well, for one thing, I could point out that natural languages tend to
follow a Zipfian distribution:  a small number of constructs are *VERY*
common and a large number of constructs are extremely rare.  Let's take
a concrete case.

The Hebrew bible (removing the bits in Aramaic) contains

    418,732	morpheme tokens, of
      8,689     morpheme types, of which
      2,896     morpheme types only occur once.

Half of the tokens come from just 27 different types.
In the transliteration used by the Wheeler/Groves Morphology, they are

 [1] W         H         L         B.        )"T_1     MIN       YHWH     
 [8] (AL_2     )EL       ):A$ER    K.OL      )MR_1     LO)       B."N_1   
[15] K.IY_2    HYH       K.        (&H_1     ):ELOHIYM B.W)      MELEK:_1 
[22] YI&:RF)"L )EREC     YOWM      )IY$      P.FNEH    B.AYIT_1 

You will notice that the very high frequency words are short.  In fact,
if you do

    plot(heb$count, nchar(heb$morpheme))

you'll see a clear pattern: rare words can be any length but common words
are only short.

You can produce similar results for most languages (and only caution stops
me saying 'all').  By and large, if a word is common enough, people will
chop and mangle and squeeze it in their speech until it is about as short
as the phonological rules of the language will allow, because (a) it's
less work that way and (b) if a word is common enough, you can get away
with mangling it as people will figure it out.

There's one fairly obvious example of this principle at work in Smalltalk:
blocks are so common that they just get [...] around them.

An operation has to be *extremely* common to justify a single-character
name, or else it has to be justified by strong 'cultural' requirements.

What about (aCharacter * anInteger)?  Well, so far it ISN'T common.
The 3.6 image has quite a few calls to #new:withAll:, making Arrays,
ByteArrays, Bitmaps, and all sorts of things.  There are even a couple
of (String new: i-1 withAll: $ ) calls and in OldSocket a bunch of
(String new: n withAll: $x) calls where any character would probably
have done, maybe even (String new: n).  We also have
    aString padded: (#left or #right) to: length with: aCharacter
which can be used like this:
    (' ' padded: #right to: n with: $ )
to make strings.  Then of course, if you don't like
    String streamContents: [:s | s space: n]
there's
    String streamContents: [:s | s next: n put: Character space].
which can of course be used to make array-like collections with any
element:
    FloatArray streamContents: [:s |
        s next: 10 put: -1.0; next:  5 put:  0.0; next: 10 put:  1.0]

Squeak is littered with ways to make blocks of repeated values.  If
this idiom were going to be common, it would already be common.  And
with 59 senders of #new:withAll: in the 3.6 image, I suppose it is.
Commoner than some things anyway.  But it's nowhere near as common as
#+ (4599 senders), #, (2105), #* (1812), #at: (2925), #do: (2210),
or even #error: (948 senders, many of which account for a lot of the
#, sends).

What about cultural pressure?  Well, mathematically, the concatenation
operator acts like multiplication, not addition, so, appropriately, the
conventional sign for repetition is exponentiation.  So $* is
*definitely* not a useful name for "make a sequence with n copies of this".

What about #repeatedTimes:?  Well, it doesn't fail that criterion.  It
is not so short as to be absurd (which #* was for this use).  It is not
inappropriate (as using a multiplication sign for an operation that acts
like exponentiation would be).  It is quite clear, as long as you have
never heard of #timesRepeat:, and as long as it is clear to you what
"repeat" means.

And that's problem 1 with #repeatedTimes:.
Looking for methods containing 'repeat' I found a bunch of things
that weren't there, nor is this a complete list of what _was_ there,
but it's reasonably thorough.

Utilities class>>awaitMouseUpIn: box repeating: doBlock ifSucceed: succBlock
    -- repeats a block
ButtonPropertiesMorph>>toggleTargetRepeatingWhileDown
ButtonPropertiesMorph>>paneForRepeatingInterval
    -- concern how often an *action* is to be repeated.
BlockContext>>repeat
    -- repeats a block
BlockContext>>repeatWithGCIf: testBlock
    -- repeats a block, used somehow to get a valid Socket?
Benchmark>>time: aBlock repeated10K: tenKTimes
Benchmark>>time: aBlock repeated: nTimes
    -- repeat a block
FormEditor>>repeatCopy
    -- repeats an *action* as long as the red button is pressed.
Benchmark>>report: label timedAt: time repeated: numberOfTimes
Benchmark>>reportStringFor: label timedAt: time repeated: numberOfTimes
Benchmark>>test: aBlock labeled: label repeated: nTimes
    -- has to do with how often a block is done.
Parser>>keylessMessagePartTest: level repeat: repeat
    -- empty body, no senders called; what's it for?
Parser>>messagePart: level repeat: repeat
    -- repeat is a Boolean saying whether to do it many times or just once
    -- it is an *action* that is repeated.
DialectParser>>messagePart: level repeat: repeat initialKeyWord: kwdIfAny
    -- repeat is a Boolean again.
MPEGDisplayMorph>>toggleRepeat
MPEGDisplayMorph>>repeat[: aBoolean]
ScorePlayer>>repeat[: aBoolean]
StreamingMP3Sound>>repeat[: aBoolean]
StreamingMonoSound>>repeat[: aBoolean]
    -- report/set a Boolean saying whether an *action* is to be repeated
BalloonEnginePlugIn>>repeatValue: delta max: maxValue
    -- basically (delta \\ maxValue) done slowly in Slang.
    -- but it's about repeating a drawing action

SampledSound class>>defaultSample: anArray repeated: n
    -- concatenates many copies of a sequence
RepeatingSound class>>repeat: aSound count: anInteger
    -- concatenates many copies of a sound to make a sound
RepeatingSound class>>repeatForever: aSound
    -- concatenate infinitely many copies of a sound

RunArray>>repeatLast: times ifEmpty: defaultBlock
    -- roughly, x addLast: x last times: times
RunArray>>repeatLastIfEmpty: defaultBlock
    -- x repeatLast: 1 ifEmpty: defaultBlock
    -- These two don't seem to be used in the image.  I think they are
    -- bad names.  I would prefer
    -- aRunArray addLastAgainIfEmpty: defaultBlock [times: times]

So we have three meanings for "repeat" in Squeak at the moment:
(1) Repeat a block or some other kind of action.
(2) Concatenate multiple copies of a sound to make another sound.
(3) A meaning peculiar to RunArray (does anyone other than the 'ar'
    who added those methods use them?)

Only the two methods in (3) refer to putting multiple copies of an
element into a sequence, and those two methods do not mention the element.

Overwhelmingly, 'repeat' in Squeak means to repeat a block or other kind
of action.  The name #repeatedTimes: sits uncomfortably with Squeak.

Cultural pressure from other languages could perhaps override this.
But the languages I've used that had a REPEAT or RPT or RPT$ function
gave it the same kind of input (a string) as result (a string).

This runs us headlong into the second problem with #repeatedTimes:.
If we look at existing uses of #new:withAll: we find a variety of
collections being constructed.  Most are Arrays or Strings, but many
are not.

What good is an idiom for a construction (making a collection of a particular
size filled with a particular element) which fails to cover so many
instances of that construction?

Both the name and the coverage problem end up suggesting that something
like

    SequenceableCollection>>copied: nTimes
        ^self species streamContents: [:s |
            nTimes timesRepeat: [s nextPutAll: self]]

might work.  This *is* like a (shallow) copy, so the name fits.
It *is* a fairly short name.  It means that

    instead of					you write
    nil repeatedTimes: 27			#(nil) copied: 27
    x isNil repeatedTimes: 10			{x isNil} copied: 10
    ($ ) repeatedTimes: 80			' ' copied: 80
    ($x repeatedTimes: 80) asArray		{$x} copied: 80
    (255 repeatedTimes: 256) asByteArray	#[255] copied: 256
    ?no can do?					'<>' copied: 16

Funny, an approach which is more general *and* shorter *and* doesn't
confuse people by looking like #timesRepeat:.
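
If you want to try the idea in a workspace without adding a method to
SequenceableCollection, the same logic can be sketched as a block.  The
block name copied is only a stand-in for the proposed selector, and the
sketch assumes that #streamContents: is available on the receiver's
species, as it is for String and Array in Squeak.

```smalltalk
| copied |
"Same body as the proposed method, with the receiver passed in as seq."
copied := [:seq :n |
    seq species streamContents: [:s |
        n timesRepeat: [s nextPutAll: seq]]].

"Works for a multi-character String, which a Character receiver cannot express."
copied value: '<>' value: 3.

"Works for an Array, answering an Array."
copied value: #(0) value: 2.
```

The first expression answers '<><><>' and the second answers #(0 0).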

With specific reference to characters and strings, is there any reason
to prefer #copied: to #repeatedTimes:?

Yes.  It's called Unicode.  Maybe one day Squeak will support Unicode.
When it does, there are things that occupy a single printing position
which can be written as a String but NOT as a Character.  For example,
if we assume that #<hex hex ... hex> is a way of writing a Unicode
string, then #<0078 20DB> represents "x with three dots above", which
might mean "d^3x/dt^3".  Now (#<0078 20DB> copied: 2) makes perfect
sense, but there is no one character (not even in Unicode 4.0) that
we could send #repeatedTimes: to get the same printed effect.

	It's too bad then that stream operations in Smalltalk are so
	clumsy and relatively restricted.  :-)

When I look at Streams, I feel that you are being too polite.
But at least they are a lot better than Java, and #printOn: is a
better way to go than .toString().