ANSI Smalltalk streams
Richard A. O'Keefe
ok at cs.otago.ac.nz
Mon Mar 3 04:18:21 UTC 2003
Before you say "take this to comp.lang.smalltalk", I don't get
comp.lang.smalltalk, and I'm interested in the relationship between
ANSI Smalltalk and Squeak, not other Smalltalks.
The thing which strikes me forcibly is how few changes there are between
the freely available 1.9 draft and the official standard. This includes
very little in the way of typo correction...
Here's a list of issues I have found in sections 5.9 "Stream Protocols"
and 5.10 "File Stream Protocols". Note the word "ISSUE"; this is not the
same as "ERROR".
1. "Some stream classes will build sequenceable collections or report
the values of a sequenceable collection." p255
These are the only occurrences of "sequenceable collection";
SequenceableCollection is not an ANSI class or protocol and
there is no "sequenceable collection" concept elsewhere in the ANSI ST
standard.
MINOR.
2. "Other types of streams may operate on files, positive integers,
random numbers, and so forth." p255
There _are_ ANSI Smalltalk streams that operate on files, but not
any that operate on random numbers. That's a pity, because most
Smalltalk users will not be expert in random number generation.
SCOPE issue; the standardisers no doubt had good reason for keeping
RNG out and there are public algorithms one can adapt. However, a
common protocol one could follow would have been nice. In particular,
should one ensure that 0.0 _is_ a possible result of #next or that
it is _not_?
3. "Transcript is a stream that may be used to log textual message[s]
generated by a Smalltalk program." p255
TYPO.
4. The protocol hierarchy lumps "gettable streams" (basically, things that
can support #atEnd and #next) and "peekable streams" (basically, things
that can in addition support #peek) into one <gettableStream> protocol.
As a means of specifying the streams that happen to be present in ANSI
Smalltalk itself, that's OK. But it provides people with no guidance
about which methods they should implement if they _can't_ implement #peek
(efficiently). This matters to me because I have several times been in
just that situation. #atEnd and #next, no problem. #peek, only if I
implemented lookahead myself (an extra two instance variables
'peeked lookahead' and slowdown in #atEnd and #next checking/maintaining
them).
SCOPE. It is important for readers of the ANSI protocols to realise
that they are primarily meant for specifying ANSI classes; if they are
useful as models for your own classes, that's nice, but don't expect it.
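The lookahead workaround mentioned in this item can be sketched roughly as follows. This is only an illustration, not anything taken from the standard or from Squeak: the class name LookaheadStream, the category, and the selectors #on: and #setSource: are invented here, and 'source' is assumed to be any object that answers #atEnd and #next. Answering nil from #peek at end of stream is a choice made by this sketch.

```smalltalk
"Hypothetical wrapper class; all names are invented for illustration."
Object subclass: #LookaheadStream
    instanceVariableNames: 'source peeked lookahead'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Sketches'

"class-side method"
on: aStream
    ^self new setSource: aStream

"instance-side methods"
setSource: aStream
    source := aStream.
    peeked := false

atEnd
    "A buffered element has not been consumed yet."
    ^peeked not and: [source atEnd]

next
    | element |
    peeked ifFalse: [^source next].
    element := lookahead.
    peeked := false.
    lookahead := nil.
    ^element

peek
    "This sketch answers nil when nothing is left."
    peeked ifTrue: [^lookahead].
    source atEnd ifTrue: [^nil].
    lookahead := source next.
    peeked := true.
    ^lookahead
```

The extra test of 'peeked' in both #atEnd and #next is exactly the slowdown described above; a stream that never needs #peek pays for it anyway.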
5. 5.9.1.5 Definition, p257
"move objects in sequence from the front of the receiver's
future sequence values to the back of th[e] receiver's past..."
This occurs twice. Surely a spelling checker would have found
"the" written as "th"???
TYPO
6. 5.9.1.5 Errors, p257
"If the receiver has any sequence values and amount is greater
than or equal to the total number of sequence values of the receiver".
But this forbids you moving the position to the end of the sequence.
I think this should read "If amount is greater than the total number
of sequence values of the receiver".
ERROR.
- 5.9.1.6 Definition
I find "appended with" grating. Less contentiously, the two
sentences are different in mood, for no apparent reason. Better as
"Sets the receiver's future sequence values to the concatenation of
its past sequence values and its future sequence values, and makes the
receiver's past sequence values empty."
WRITING.
- #contents and #upTo: are there, why not #upToEnd?
The irony here is that #upToEnd is implementable even in some
streams where #contents is not. But we've had discussions about
whether the standardisers were willing to require implementors to
add even quite tiny methods, and the answer was that they weren't.
This is an issue for me trying to put together an ANSI-like implementation
I could use, but the standard is what the standardisers wanted.
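To show how little is needed, here is a minimal sketch of #upToEnd that relies only on #atEnd and #next, so it would work even on a stream that cannot answer #contents. It assumes the elements are characters (hence the String); a stream over other objects would have to collect into some other kind of collection.

```smalltalk
upToEnd
    "Answer everything that is left, consuming the stream.
     Assumption: the elements are characters, so a String is collected."
    | out |
    out := WriteStream with: String new.
    [self atEnd] whileFalse: [out nextPut: self next].
    ^out contents
```

With this in place, (ReadStream on: 'abc') next; upToEnd would answer 'bc'.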
7. 5.9.2.2 page 259
"and, passed as the argument to an evaluation of operand."
The comma should not be there, and the argument is called
"operation" not "operand", as the very next sentence points out.
TYPOS.
8. 5.9.2.4 #next: p259
"The result is undefined if amount is larger than the number of objects
in the receiver's future sequence values".
It really is not clear to me why the last block in a sequence of #next:
calls could not be short. (That's how read(2) works in UNIX, after all.)
This prompted me to look at Squeak. I find that in Squeak *some*
implementations of #next: will return a short block and some will go
rumble-rumble-CRASH. But it is easy to implement safely:
    next: amount
        "Answer up to amount elements; the final block may be short."
        |result|
        result := self species new: amount.
        1 to: amount do: [:index |
            self atEnd ifTrue: [^result copyFrom: 1 to: index - 1].
            result at: index put: self next].
        ^result
Why this matters is that if you have a stream that is not positionable,
you don't KNOW whether there are that many items left. As it is, the
ANSI definition of #next: pushes the responsibility on the user without
providing the user with any means of carrying out that responsibility,
other than to program his/her own version of #next:.
This is particularly important because some ANSI Smalltalk streams
are supposed to be positionable, but if you try it, you may be in for
a nasty surprise. (See below.)
The "style", so to speak, is a little inconsistent here, because
#upTo: *will* accept a final block that doesn't end in the usual way.
I'm not quite sure how to classify this, but a method you cannot use
when you most need it (see below) sounds like an issue to me.
9. 5.9.2.6 p260
"The results are undefined if there are no future sequence values
in the receiver." But there is only one result, a <boolean>.
TYPO.
10. 5.9.4.4 p264
"Has the effect of enumerating the (sic.) aCollection with the
message #do: and adding each element to the receiver with #nextPut:."
(a) The second "the" shouldn't be there.
(b) Collection methods generally say that elements are enumerated
_in the same order as_ they would be by #do: but don't actually
require that #do: be used. Squeak's implementations of #nextPutAll:
do *not* always use #do:; sometimes they use a block copy, and it is
a pity that the standard doesn't actually let them do this.
(The difference is observable in Squeak.)
TYPO
OVERSPECIFICATION
11. 5.9.10 <ReadWriteStream factory> Description p267
"<ReadWriteStreamfactory> provides for the creation of objects
conforming to the <WriteStream> protocol whose sequence values
are supplied by a collection."
(1) "ReadWriteStreamfactory" should be "ReadWriteStream factory"
(2) "WriteStream" should be "ReadWriteStream"
(3) "objects conforming to the ... protocol whose ... values ..."
should be
"objects, conforming to the ... protocol, whose ... values ..."
TYPOS
- ReadStreams are created with #on:.
WriteStreams are created with #with:.
Why is there no #on: for ReadWriteStreams? You can get the same
effect by using (ReadWriteStream with: aCollection) reset; yourself
but why should you have to?
SCOPE
12. 5.10 pervasive
This is possibly the single biggest problem. Explicit reference is
made to POSIX, yet the strong assumption is made that every nameable
file is positionable.
To that I can only say '/dev/tty'.
(Well, I can also say pipes, named pipes. On my machine, there's
/dev/mouse and /dev/pty*. On Linux machines there's /proc.
Unix systems have lots of unseekable "files". Rather strangely,
when I tested lseek() on /dev/tty or stdin, on one UNIX system it
claimed to have succeeded, and on another it closed my connection.
(Yes, really!) I think some people have interpreted ESPIPE too
narrowly.)
It's not just a matter of #position and #position: being in the
interface, it's a matter of there being no provision at all for
them failing.
Oh yes, there's another way that positioning can fail. Consider
VM/CMS. (My knowledge of VM/CMS ends with major version 6, and is
a little fuzzy because I haven't used it for a while.) VM/CMS is
record-oriented. It has fixed length records and variable length
records. Unusually, if you have a disc file full of variable length
records, you _can_ seek to an arbitrary record in the file. What
you can't do is seek to an arbitrary byte/character. The only way I
can see to implement #position: on such files (which are quite a nice
way to represent text) is
    position: amount
        self reset.
        self skip: amount.
which isn't very nice.
Then there is a third way that positioning can fail, and I'm surprised
that no-one who spent much time working on Xerox systems managed to get
the point across. It was certainly an issue in Xerox Quintus Prolog,
and the reason why I fought hard to keep byte-addressing for files
out of the Prolog standard. Consider a file of 16-bit text (our
original concern was the XNS 16-bit character set used in Xerox Lisp)
or wider (such as Unicode), represented in "compressed" form (these
days, think of UTF-8 or better still, UTR-6). With UTF-8, position
_as measured in characters_ is not the same as position _as measured
in bytes_, a problem that already exists in MS-DOS and Windows when
you do CRLF->cr mapping. (That's why the C standard does not require
byte addressing for text streams.) The simplest way out then is to
define #position to return a magic cookie that increases monotonically
with position in the file (as C does) and to define #position: to work
only for magic cookies recorded by #position, or possibly with 0.
But it gets worse. The "compressed" form used with XNS characters,
the SBCS->DBCS/DBCS->SBCS shift codes commonly used in VM/CMS, the
similar code set escapes used in ISO 2022, and the rather nice UTR-6
have the property that you cannot interpret bytes at some position in
a file without knowing what "shift state" you are in, even if the
position is known to be where a valid character starts. So a "position"
has to be an object that may contain a shift state as well as a byte
position. Indeed, it could encode position in character sequence,
position in underlying byte sequence, and shift state (if any).
The CRLF->cr problem is a serious one even for the file streams in
ANSI Smalltalk. A further problem here is that #cr is defined to
deposit the same sequence of characters for all streams, so that
aStream cr is not the same as aStream nextPut: Character cr.
Since Character cr is a single character, an ANSI Smalltalk implementation
appears to be *forbidden* to convert it to CR+LF on writing.
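The "position as an object" idea from earlier in this item might look something like the sketch below. Everything in it is hypothetical: the class name StreamPosition, its category, and all of its selectors are invented for illustration, and nothing like it appears in the standard. The intent is that #position would answer one of these and #position: would accept only objects previously answered by #position.

```smalltalk
"Hypothetical class; every name here is invented for illustration."
Object subclass: #StreamPosition
    instanceVariableNames: 'characterIndex byteOffset shiftState'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Sketches'

"class-side method"
characterIndex: chars byteOffset: bytes shiftState: state
    ^self new setCharacterIndex: chars byteOffset: bytes shiftState: state

"instance-side methods"
setCharacterIndex: chars byteOffset: bytes shiftState: state
    characterIndex := chars.
    byteOffset := bytes.
    shiftState := state

characterIndex
    ^characterIndex

byteOffset
    ^byteOffset

shiftState
    ^shiftState

< aStreamPosition
    "Cookies increase monotonically with position in the file."
    ^byteOffset < aStreamPosition byteOffset
```

For a UTF-8 file, the cookie taken after reading a single two-byte character such as 'é' would carry characterIndex 1 and byteOffset 2, with no shift state (nil); an encoding with shift codes would record the current state in the third slot.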
13. Also, there seems to be an assumption that every file that
can be opened for output can be opened in read-write mode.
Well, in POSIX it is quite possible for a user to have permission
to write to a file without having permission to read from it. So
suppose there is a log file that I am supposed to append to:
    logStream := FileStream write: logFileName mode: #'append'
                            check: true type: #'text'.
This makes sense, does it not? But #contents is in the <writeFileStream>
protocol, and it is absolutely unimplementable in this case, because
the file system permissions flatly refuse to let me see the "past
sequence values" of the file, yet there is no provision for #contents
failing. Nor is there any way for a program to determine whether it
_would_ fail without trying it and catching the entirely unspecified
error that might (or then again, might not) be raised.
There really _really_ need to be methods
    canChangePosition -> <boolean>
    canReportContents -> <boolean>
Ah, you say, if someone is writing ANSI Smalltalk, it is up to them
to ensure that they stay within the limits. But there is no way for
a programmer to find out whether they are staying within the limits
or not. Suppose you open a file (which _can_ be positioned), read a
file name from it, and try to open _that_ file. You have no way of
telling whether #position: or #contents can work, and in the case of
a file with '-w--w----' permission, no way of telling even whether the
attempt to open it will work, short of trying.
Failure to adequately consider current file systems.
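As a rough sketch of how a program might use such a query, the helper below asks before it repositions. #canChangePosition is the hypothetical method proposed above, which no ANSI stream is required to answer, so the sketch treats any stream that does not understand it as unpositionable. The selector #remainingCountIn: is also invented; the method could live in any utility class.

```smalltalk
remainingCountIn: aStream
    "Answer the number of elements left in aStream, or nil if that
     cannot be found without consuming them.
     #canChangePosition is the hypothetical query proposed above; a
     stream that does not understand it is treated as unpositionable."
    | here end |
    ((aStream respondsTo: #canChangePosition)
        and: [aStream canChangePosition]) ifFalse: [^nil].
    here := aStream position.
    aStream setToEnd.
    end := aStream position.
    aStream position: here.
    ^end - here
```

A caller that gets nil back knows it has to read element by element and test #atEnd as it goes, rather than trusting #next:.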
14. 5.10.1 Description p270
"translated from or two" should be "translated from or to"
TYPO.
15. "is treated as a sequenced of 8-bit characters"
"sequenced" should be "sequence"
TYPO.
16. 5.10.2.1 #next: Definition p271
"The result is undefined if amount is larger than the number of objects
in the receiver's future sequence values."
Remember that we're talking about reading from file system objects here.
The only way to know in advance whether there are that many characters
left is to ask whether
    theStream contents size - theStream position >= amount
but because #contents is unimplementable for many file system objects,
the programmer cannot use this. Nor is it a cheap way to find the
size of a file, especially in a 64-bit file system. So you might try
    t := theStream position.
    theStream setToEnd.
    size := theStream position - t.
    theStream position: t.
and cache the size. (It's a little odd that there is no #size message
for positionable streams so that this dance can be avoided.) However,
in many systems, the size of a file may change, even be reduced.
So between the time that you determine the amount of data left and
the time that you ask for the next n items, the amount left may have
changed, so even in a positionable stream, you CAN'T predict whether
it will be safe to call #next:.
So in the very case where #next: has the greatest payoff (from using
block copies) it is least safe to use it.