ANSI Smalltalk streams

Mon Mar 3 04:18:21 UTC 2003

Before you say "take this to comp.lang.smalltalk", I don't get
comp.lang.smalltalk, and I'm interested in the relationship between
ANSI Smalltalk and Squeak, not other Smalltalks.

The thing which strikes me forcibly is how few changes there are between
the freely available 1.9 draft and the official standard.  This includes
very little in the way of typo correction...

Here's a list of issues I have found in sections 5.9 "Stream Protocols"
and 5.10 "File Stream Protocols".  Note the word "ISSUE"; this is not the
same as "ERROR".

1.  "Some stream classes will build sequenceable collections or report
    the values of a sequenceable collection."  p255

    These are the only occurrences of "sequenceable collection";
    SequenceableCollection is not an ANSI class or protocol and
    there is no "sequenceable collection" concept elsewhere in the ANSI ST
    standard.

    MINOR.    

2.  "Other types of streams may operate on files, positive integers,
    random numbers, and so forth."  p255

    There _are_ ANSI Smalltalk streams that operate on files, but not
    any that operate on random numbers.  That's a pity, because most
    Smalltalk users will not be expert in random number generation.

    SCOPE issue; the standardisers no doubt had good reason for keeping
    RNG out and there are public algorithms one can adapt.  However, a
    common protocol one could follow would have been nice.  In particular,
    should one ensure that 0.0 _is_ a possible result of #next or that
    it is _not_?

3.  "Transcript is a stream that may be used to log textual message[s]
    generated by a Smalltalk program."  p255

    TYPO.

4.  The protocol hierarchy lumps "gettable streams" (basically, things that
    can support #atEnd and #next) and "peekable streams" (basically, things
    that can in addition support #peek) into one <gettableStream> protocol.
    As a means of specifying the streams that happen to be present in ANSI
    Smalltalk itself, that's OK.  But it provides people with no guidance
    about which methods they should implement if they _can't_ implement #peek
    (efficiently).  This matters to me because I have several times been in
    just that situation.  #atEnd and #next, no problem.  #peek, only if I
    implemented lookahead myself (an extra two instance variables
    'peeked lookahead' and slowdown in #atEnd and #next checking/maintaining
    them).

    SCOPE.  It is important for readers of the ANSI protocols to realise
    that they are primarily meant for specifying ANSI classes; if they are
    useful as models for your own classes, that's nice, but don't expect it.
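
    The lookahead described above can be sketched roughly as follows.
    This is only an illustration, not anything from the standard: it
    assumes a wrapper with three instance variables, 'source' (the
    underlying stream, which understands only #atEnd and #next),
    'peeked' (initially false) and 'lookahead'.  Answering nil from
    #peek at end of stream is also an assumption of the sketch.

	peek
	  "Fetch one element ahead if we have not already done so."
	  peeked ifFalse: [
	    source atEnd ifTrue: [^nil].
	    lookahead := source next.
	    peeked := true].
	  ^lookahead

	next
	  "Hand back the buffered element first, if there is one."
	  |result|
	  peeked ifFalse: [^source next].
	  result := lookahead.
	  lookahead := nil.
	  peeked := false.
	  ^result

	atEnd
	  "A buffered element means we are not at the end yet."
	  ^peeked not and: [source atEnd]

    Every #next and #atEnd now pays for a test of 'peeked', which is
    the slowdown mentioned above.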

5.  5.9.1.5 Definition, p257
    "move objects in sequence from the front of the receiver's
    future sequence values to the back of th[e] receiver's past..."
    This occurs twice.  Surely a spelling checker would have found
    "the" written as "th"???

    TYPO

6.  5.9.1.5 Errors, p257
    "If the receiver has any sequence values and amount is greater
    than or equal to the total number of sequence values of the receiver".
    But this forbids you moving the position to the end of the sequence.
    I think this should read "If amount is greater than the total number
    of sequence values of the receiver".

    ERROR.

-   5.9.1.6 Definition
    I find "appended with" grating.  Less contentiously, the two
    sentences are different in mood, for no apparent reason.  Better as
    "Sets the receiver's future sequence values to the concatenation of
    its past sequence values and its future sequence values, and makes the
    receiver's past sequence values empty."

    WRITING.

-   #contents and #upTo: are there, why not #upToEnd?
    The irony here is that #upToEnd is implementable even in some
    streams where #contents is not.  But we've had discussions about
    whether the standardisers were willing to require implementors to
    add even quite tiny methods, and the answer was that they weren't.
    This is an issue for me trying to put together an ANSI-like implementation
    I could use, but the standard is what the standardisers wanted.

7.  5.9.2.2 page 259
    "and, passed as the argument to an evaluation of operand."
    The comma should not be there, and the argument is called
    "operation" not "operand", as the very next sentence points out.

    TYPOS.

8.  5.9.2.4 #next: p259

    "The result is undefined if amount is larger than the number of objects
    in the receiver's future sequence values".

    It really is not clear to me why the last block in a sequence of #next:
    calls could not be short.  (That's how read(2) works in UNIX, after all.)
    This prompted me to look at Squeak.  I find that in Squeak *some*
    implementations of #next: will return a short block and some will go
    rumble-rumble-CRASH.  But it is easy to implement safely:

	next: amount
	  |result|
	  result := self species new: amount.
	  1 to: amount do: [:index |
	    self atEnd ifTrue: [^result copyFrom: 1 to: index - 1].
	    result at: index put: self next].
	  ^result

    Why this matters is that if you have a stream that is not positionable,
    you don't KNOW whether there are that many items left.  As it is, the
    ANSI definition of #next: pushes the responsibility on the user without
    providing the user with any means of carrying out that responsibility,
    other than to program his/her own version of #next:.

    This is particularly important because some ANSI Smalltalk streams
    are supposed to be positionable, but if you try it, you may be in for
    a nasty surprise.  (See below.)

    The "style", so to speak, is a little inconsistent here, because
    #upTo: *will* accept a final block that doesn't end in the usual way.

    I'm not quite sure how to classify this, but a method you cannot use
    when you most need it (see below) sounds like an issue to me.
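
    With a #next: that may answer a short final block, as in the version
    given above, copying a non-positionable stream in blocks needs no
    advance knowledge of its length.  A sketch, in which 'input' and
    'output' are hypothetical streams and 512 is an arbitrary block size:

	"Relies on the short-block #next: above, not on the ANSI definition."
	[input atEnd] whileFalse: [
	  output nextPutAll: (input next: 512)]

    Under the ANSI wording the last iteration of this loop is undefined
    unless the length happens to be a multiple of 512.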

9.  5.9.2.6 p260

    "The results are undefined if there are no future sequence values
    in the receiver."  But there is only one result, a <boolean>.

    TYPO.

10. 5.9.4.4 p264
    "Has the effect of enumerating the (sic.) aCollection with the
    message #do: and adding each element to the receiver with #nextPut:."
    (a) The second "the" shouldn't be there.
    (b) Collection methods generally say that elements are enumerated
        _in the same order as_ they would be by #do: but don't actually
        require that #do: be used.  Squeak's implementations of #nextPutAll:
        do *not* always use #do:; sometimes they use a block copy, and it is
        a pity that the standard doesn't actually let them do this.
        (The difference is observable in Squeak.)

    TYPO
    OVERSPECIFICATION

11. 5.9.10 <ReadWriteStream factory> Description p267
    "<ReadWriteStreamfactory> provides for the creation of objects
    conforming to the <WriteStream> protocol whose sequence values
    are supplied by a collection."
    (1) "ReadWriteStreamfactory" should be "ReadWriteStream factory"
    (2) "WriteStream" should be "ReadWriteStream"
    (3) "objects conforming to the ... protocol whose ... values ..."
         should be
         "objects, conforming to the ... protocol, whose ... values ..."

    TYPOS

-   ReadStreams are created with #on:.
    WriteStreams are created with #with:.
    Why is there no #on: for ReadWriteStreams?  You can get the same
    effect by using (ReadWriteStream with: aCollection) reset; yourself
    but why should you have to?

    SCOPE

12. 5.10 pervasive

    This is possibly the single biggest problem.  Explicit reference is
    made to POSIX, yet the strong assumption is made that every nameable
    file is positionable.

    To that I can only say '/dev/tty'.

    (Well, I can also say pipes, named pipes.  On my machine, there's
    /dev/mouse and /dev/pty*; on Linux machines there's /proc.
    Unix systems have lots of unseekable "files".  Rather strangely,
    when I tested lseek() on /dev/tty or stdin, on one UNIX system it
    claimed to have succeeded, and on another it closed my connection.
    (Yes, really!)  I think some people have interpreted ESPIPE too
    narrowly.)

    It's not just a matter of #position and #position: being in the
    interface, it's a matter of there being no provision at all for
    them failing.

    Oh yes, there's another way that positioning can fail.  Consider    
    VM/CMS.  (My knowledge of VM/CMS ends with major version 6, and is
    a little fuzzy because I haven't used it for a while.)  VM/CMS is
    record-oriented.  It has fixed length records and variable length
    records.  Unusually, if you have a disc file full of variable length
    records, you _can_ seek to an arbitrary record in the file.  What
    you can't do is seek to an arbitrary byte/character.  The only way I
    can see to implement #position: on such files (which are quite a nice
    way to represent text) is
	position: amount
	    self reset.
	    self skip: amount.
    which isn't very nice.

    Then there is a third way that positioning can fail, and I'm surprised
    that no-one who spent much time working on Xerox systems managed to get
    the point across.  It was certainly an issue in Xerox Quintus Prolog,
    and the reason why I fought hard to keep byte-addressing for files
    out of the Prolog standard.  Consider a file of 16-bit text (our
    original concern was the XNS 16-bit character set used in Xerox Lisp)
    or wider (such as Unicode), represented in "compressed" form (these
    days, think of UTF-8 or better still, UTR-6).  With UTF-8, position
    _as measured in characters_ is not the same as position _as measured
    in bytes_, a problem that already exists in MS-DOS and Windows when
    you do CRLF->cr mapping.  (That's why the C standard does not require
    byte addressing for text streams.)  The simplest way out then is to
    define #position to return a magic cookie that increases monotonically
    with position in the file (as C does) and to define #position: to work
    only for magic cookies recorded by #position, or possibly with 0.
    But it gets worse.  The "compressed" form used with XNS characters,
    the SBCS->DBCS/DBCS->SBCS shift codes commonly used in VM/CMS, the
    similar code set escapes used in ISO 2022, and the rather nice UTR-6
    have the property that you cannot interpret bytes at some position in
    a file without knowing what "shift state" you are in, even if the
    position is known to be where a valid character starts.  So a "position"
    has to be an object that may contain a shift state as well as a byte
    position.  Indeed, it could encode position in character sequence,
    position in underlying byte sequence, and shift state (if any).
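
    A sketch of what such a position object might look like, purely as
    an illustration: the class StreamPosition, its accessors, the
    instance variables 'charCount', 'byteOffset' and 'shiftState', and
    the helper #seekToByte: are all hypothetical names, not part of
    ANSI Smalltalk or Squeak.

	position
	  "Answer an opaque cookie capturing everything needed to resume."
	  ^StreamPosition new
	    charIndex: charCount;
	    byteOffset: byteOffset;
	    shiftState: shiftState;
	    yourself

	position: aStreamPosition
	  "Accept only cookies previously answered by #position."
	  self seekToByte: aStreamPosition byteOffset.
	  charCount := aStreamPosition charIndex.
	  shiftState := aStreamPosition shiftState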

    The CRLF->cr problem is a serious one even for the file streams in
    ANSI Smalltalk.  A further problem here is that #cr is defined to
    deposit the same sequence of characters for all streams, so that
    aStream cr is not the same as aStream nextPut: Character cr.
    Since Character cr is a single character, an ANSI Smalltalk implementation
    appears to be *forbidden* to convert it to CR+LF on writing.

13. Also, there seems to be an assumption that every file that
    can be opened for output can be opened in read-write mode.

    Well, in POSIX it is quite possible for a user to have permission
    to write to a file without having permission to read from it.  So
    suppose there is a log file that I am supposed to append to:

        logStream := FileStream write: logFileName mode: #'append'
                                check: true type: #'text'.

    This makes sense, does it not?  But #contents is in the <writeFileStream>
    protocol, and it is absolutely unimplementable in this case, because
    the file system permissions flatly refuse to let me see the "past
    sequence values" of the file, yet there is no provision for #contents
    failing.  Nor is there any way for a program to determine whether it
    _would_ fail without trying it and catching the entirely unspecified
    error that might (or then again, might not) be raised.

    There really _really_ need to be methods

    canChangePosition -> <boolean>
    canReportContents -> <boolean>

    Ah, you say, if someone is writing ANSI Smalltalk, it is up to them
    to ensure that they stay within the limits.  But there is no way for
    a programmer to find out whether they are staying within the limits
    or not.  Suppose you open a file (which _can_ be positioned), read a
    file name from it, and try to open _that_ file.  You have no way of
    telling whether #position: or #contents can work, and in the case of
    a file with '-w--w----' permission, no way of telling even whether the
    attempt to open it will work, short of trying.

    Failure to adequately consider current file systems.
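
    With the two queries proposed above, a program could at least test
    before it leaps.  A sketch only: #canChangePosition is the proposed
    message, not an ANSI one, and 'aStream' and 'savedPosition' are
    hypothetical.

	"Ask first, rather than trying and catching an unspecified error."
	aStream canChangePosition
	  ifTrue: [aStream position: savedPosition]
	  ifFalse: [self error: 'stream cannot be repositioned']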

14. 5.10.1 Description p270
    "translated from or two" should be "translated from or to"

    TYPO.

15.  "is treated as a sequenced of 8-bit characters"
                              ^ lose the "d"

    TYPO.

16. 5.10.2.1 #next: Definition p271
    "The result is undefined if amount is larger than the number of objects
    in the receiver's future sequence values."

    Remember that we're talking about reading from file system objects here.
    The only way to know in advance whether there are that many characters
    left is to ask whether
	theStream contents size - theStream position >= amount
    but because #contents is unimplementable for many file system objects,
    the programmer cannot use this.  Even where it does work, it is not a
    cheap way to find the size of a file, especially in a 64-bit file
    system.  So you might try
	t := theStream position.
	theStream setToEnd.
	size := theStream position - t.
	theStream position: t.
    and cache the size.  (It's a little odd that there is no #size message
    for positionable streams so that this dance can be avoided.)  However,
    in many systems, the size of a file may change, even be reduced.
    So between the time that you determine the amount of data left and 
    the time that you ask for the next n items, the amount left may have
    changed, so even in a positionable stream, you CAN'T predict whether
    it will be safe to call #next:.

    So in the very case where #next: has the greatest payoff (from using
    block copies) it is least safe to use it.