Extending FileList with CrLf
Ned Konz
ned at bike-nomad.com
Thu Jul 24 15:24:53 UTC 2003
On Thursday 24 July 2003 02:05 am, Daniel Vainsencher wrote:
> Just to ask for some more information, does anyone know how low
> level text reading C libraries handle line endings in a file that
> do not agree with the platform convention/currently used
> convention?
You see garbage characters in the stream if they don't exactly match
the platform convention.
> I am even less partial to CrLfFS's extreme liberality during reading
> than to its detection responsibilities.
Daniel, what's bothering you about the idea of a text file stream that
tries to be liberal when reading text files?
Again, my definition of text files is something like: "a file
containing lines of text with delimiters".
Using this definition, it's the text itself (not the delimiters) that
matters to a text file. Delimiters aren't content. When I see a
"Character cr" in a read stream on a text file, I don't think
"there's an ASCII CR character", I think "there's the end of the
line".
Since each OS has its preferred scheme for delimiting text lines, we
should probably write new text files using that scheme.
Since we should be robust when dealing with files from other systems
(even, I would argue, ones that aren't as robust and might insert
spurious delimiter characters), we should be able to read text that
contains other flavors of line delimiters.
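A minimal sketch of that kind of liberal reading, written in Python rather than Squeak and not taken from CrLfFileStream. The function name is made up for illustration:

```python
import re


def to_logical_lines(text):
    """Split text into lines, accepting CR, LF, or CRLF as one delimiter each."""
    # CRLF is tried before a lone CR so that the pair counts as a single
    # line ending rather than as two.
    return re.split(r"\r\n|\r|\n", text)


print(to_logical_lines("alpha\r\nbeta\rgamma\ndelta"))
```

The caller only ever sees the text of each line; which delimiter flavor separated them is not part of the result.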
You were worried about "asymmetry" between reading and writing. There
isn't any if you realize that delimiters aren't content. Text lines
in, text lines out.
However, I understand your concern about reading a pre-existing file
and writing it back out with different line endings. The scheme we
ended up with in Monticello (for a while, anyway) and the one that
CrLfFileStream uses is to auto-detect the existing convention and use
that as the default for re-writing (of course, one can override that
default).
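The auto-detect-then-rewrite scheme could be sketched roughly as follows. This is Python, not the Monticello or CrLfFileStream code, and both function names are hypothetical:

```python
import os
import re


def detect_delimiter(text, default=os.linesep):
    """Return the first delimiter sequence found in text, else the default."""
    match = re.search(r"\r\n|\r|\n", text)
    return match.group(0) if match else default


def rewrite(lines, delimiter):
    """Join lines back together, terminating each with the given delimiter."""
    return "".join(line + delimiter for line in lines)


existing = "first\r\nsecond\r\n"
detected = detect_delimiter(existing)
# Re-writing with the detected convention leaves the file's flavor unchanged;
# passing a different delimiter is the "override" case.
print(repr(rewrite(["first", "second", "third"], detected)))
```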
I don't understand your concern about accepting different flavors of
line endings within a single file. As long as they're unambiguous,
there's no problem. CrLfFileStream already accepts CR, LF, or CRLF on
read, and writes using whatever the first delimiter sequence was. I
can see that it would be useful for some applications to know that
there were mixed delimiters in a pre-existing file.
My suggestion would extend this to handle damaged files (certainly not
the common case, I'd hope!) in a more sensible way. For instance, it
should be possible to read a *text* file that has CR/CR/LF between
every line as a bunch of lines separated by single (logical CR)
delimiters. And in this case it might make sense to *not* use
CR/CR/LF when re-writing, but to attempt to fix the file and use the
platform (or preferred) default, which might be CR/LF in this case
because that is the last "accepted" sequence of delimiters in each
line.
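One possible reading of this, as a Python sketch. It assumes that any run of CRs immediately followed by an LF is a single damaged delimiter; that rule is a guess at what "sensible" means here, and it would also swallow a genuinely empty CR-terminated line that happens to precede an LF. The names are hypothetical:

```python
import re

# Assumption: one or more CRs followed by LF is ONE (possibly damaged)
# delimiter. A lone CR or a lone LF is still accepted as a delimiter.
DELIMITER = re.compile(r"\r+\n|\r|\n")


def read_lines_repairing(text):
    """Split text into lines, collapsing CR/CR/LF into one line ending."""
    lines = DELIMITER.split(text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing delimiter ends the last line; no extra line
    return lines


def suggested_delimiter(text):
    """Suggest a delimiter for re-writing, or None if text has no delimiter."""
    match = DELIMITER.search(text)
    if match is None:
        return None
    found = match.group(0)
    # A damaged run such as CR/CR/LF is repaired to its trailing CR/LF.
    return "\r\n" if len(found) > 1 else found


damaged = "one\r\r\ntwo\r\r\nthree\r\r\n"
print(read_lines_repairing(damaged), repr(suggested_delimiter(damaged)))
```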
I'd suggest:
- default file opens are *not* text unless you explicitly use a text
stream class, wrapper, or constructor method. You read one character
per character in the file.
- we review the reading and writing of text files in the Basic image
to make sure that the behavior is what we want. For instance, we may
decide to maintain the Mac delimiters in ChangeSet's file-out format,
but to make the "save as text" from the Workspaces save in the
default text format.
- we very carefully review the uses of #position and #position: on
text streams in the image, especially if there's math being done on
these values. These are potential sources of bugs. It may be possible
to come up with a safer idiom for many of these uses, for instance:
  - come up with a StreamPosition object that knows about line numbers
  and offsets within a line
  - make a #mark and #returnToMark: (whatever their names should be)
  that can be used in the common cases where you just want to return to
  where you were
- we add either:
  - text flavors of the open calls
      FileStream readOnlyTextFileNamed: ...
  - a message that will change the stream behavior (or wrap the stream)
  to text (this is my preference)
      someStream asText ...
- there should be a preference for the default delimiter flavor for
new text files. The choices should include "OS default" as well as
specific flavors (CR, LF, or CRLF probably).
  - this preference should be set to "OS default" in the distributed
  images
- the default write behavior for new text files should be to use the
preferred delimiters
- the default read behavior of text files should be to handle
arbitrary delimiters at least as well as the CrLfFileStream does.
(that is, transparent handling of at least CR, LF, or CRLF
delimiters).
  - but it should be possible to get a Notification (whose default
  handler just ignores it) when you encounter a different (unexpected)
  line ending. Or at least to query the stream to see if such
  unexpected delimiter sequences have been read.
- the default write behavior for pre-existing text files should be to
use the auto-detected line endings (which could be as simple as
CrLfFileStream's search for the first delimiter).
- it should be possible to override the defaults and specify:
  - read delimiter translation mode (i.e. strict or liberal)
  - write delimiter sequences
  - behavior on encountering non-default delimiter sequences on read
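To make the list above a bit more concrete, here is a rough Python sketch of a text wrapper with those defaults and overrides. It is not Squeak code and not a proposed API: the class name, method names, and parameters are all hypothetical, the callback stands in for the Notification, and the wrapper works on characters that have already been read.

```python
import os
import re

ANY_DELIMITER = re.compile(r"\r\n|\r|\n")


class TextLines:
    """Hypothetical text wrapper over already-read characters."""

    def __init__(self, text, strict=False, expected=None, on_unexpected=None):
        first = ANY_DELIMITER.search(text)
        # Pre-existing text defaults to the first delimiter found;
        # text without any delimiter falls back to the OS default.
        self.delimiter = expected or (first.group(0) if first else os.linesep)
        self.unexpected = []  # non-default delimiters seen, for later queries
        self.lines = []
        self._next = 0  # index of the next line to hand out
        # Strict mode splits only on the expected delimiter;
        # liberal mode accepts CR, LF, or CRLF.
        pattern = re.compile(re.escape(self.delimiter)) if strict else ANY_DELIMITER
        start = 0
        for match in pattern.finditer(text):
            self.lines.append(text[start:match.start()])
            if match.group(0) != self.delimiter:
                self.unexpected.append(match.group(0))
                if on_unexpected is not None:  # default: silently ignore
                    on_unexpected(len(self.lines), match.group(0))
            start = match.end()
        if start < len(text):
            self.lines.append(text[start:])  # last line had no delimiter

    def at_end(self):
        return self._next >= len(self.lines)

    def next_line(self):
        line = self.lines[self._next]
        self._next += 1
        return line

    def mark(self):
        # A token in line units, so callers never do arithmetic on offsets.
        return self._next

    def return_to_mark(self, mark):
        self._next = mark

    def contents(self, delimiter=None):
        """Write all lines back out, by default with the detected delimiter."""
        chosen = delimiter or self.delimiter
        return "".join(line + chosen for line in self.lines)


stream = TextLines("a\r\nb\nc\r\n")
print(stream.lines, stream.unexpected, repr(stream.contents()))
```

Keeping marks in line units is one way to avoid the #position arithmetic problem: once delimiters can be one or two characters wide, an offset computed from line lengths no longer matches an offset in the file.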
--
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE