Extending FileList with CrLf

Ned Konz ned at bike-nomad.com
Thu Jul 24 15:24:53 UTC 2003


On Thursday 24 July 2003 02:05 am, Daniel Vainsencher wrote:
> Just to ask for some more information, does anyone know how
> low-level text-reading C libraries handle line endings in a file
> that do not agree with the platform convention/currently used
> convention?

You see stray characters in the stream (a leftover CR, for instance) 
if the line endings don't exactly match the platform convention.

> I am even less partial to CrLfFS's extreme liberality during reading
> than to its detection responsibilities.

Daniel, what's bothering you about the idea of a text file stream that 
tries to be liberal when reading text files?

Again, my definition of text files is something like: "a file 
containing lines of text with delimiters".

Using this definition, it's the text itself (not the delimiters) that 
matters to a text file. Delimiters aren't content. When I see a 
"Character cr" in a read stream on a text file, I don't think 
"there's an ASCII CR character", I think "there's the end of the 
line".

Since each OS has its preferred scheme for delimiting text lines, we 
should probably write new text files using that scheme.

Since we should be robust when dealing with files from other systems 
(even, I would argue, ones that aren't as robust and might insert 
spurious delimiter characters), we should be able to read text that 
contains other flavors of line delimiters.

You were worried about "asymmetry" between reading and writing. There 
isn't any if you realize that delimiters aren't content. Text lines 
in, text lines out.

However, I understand your concern about reading a pre-existing file 
and writing it back out with different line endings. The scheme we 
ended up with in Monticello (for a while, anyway) and the one that 
CrLfFileStream uses is to auto-detect the existing convention and use 
that as the default for re-writing (of course, one can override that 
default).

I don't understand your concern about accepting different flavors of 
line endings within a single file. As long as they're unambiguous, 
there's no problem. CrLfFileStream already accepts CR, LF, or CRLF on 
read, and writes using whatever the first delimiter sequence was. I 
can see that it would be useful for some applications to know that 
there were mixed delimiters in a pre-existing file.
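
To make the read side concrete, here is a minimal sketch of that 
behavior. It is written in Python purely as an illustration; it is not 
Squeak code and not CrLfFileStream's implementation, and the names 
DELIMITER and read_lines are made up for the example. It splits on CR, 
LF, or CRLF, remembers the first delimiter seen (the default for 
re-writing), and reports whether the file mixed delimiters.

```python
import re

# Try CRLF first so it is not read as a CR followed by an LF.
DELIMITER = re.compile(r"\r\n|\r|\n")

def read_lines(text):
    """Split text on CR, LF, or CRLF.

    Returns (lines, first_delimiter, mixed). first_delimiter is None
    when the text contains no delimiter at all.
    """
    lines = []
    seen = []    # distinct delimiters, in order of first appearance
    start = 0
    for match in DELIMITER.finditer(text):
        lines.append(text[start:match.start()])
        if match.group() not in seen:
            seen.append(match.group())
        start = match.end()
    if start < len(text):
        lines.append(text[start:])    # last line has no delimiter
    first = seen[0] if seen else None
    return lines, first, len(seen) > 1

lines, first, mixed = read_lines("one\r\ntwo\nthree\rfour")
```

Writing the file back with first.join(lines) would then use the 
detected convention throughout, and the mixed flag is the piece of 
information an application could ask for.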

My suggestion would extend this to handle damaged files (certainly not 
the common case, I'd hope!) in a more sensible way. For instance, it 
should be possible to read a *text* file that has CR/CR/LF between 
every line as a bunch of lines separated by single (logical CR) 
delimiters. And in this case it might make sense to *not* use 
CR/CR/LF when re-writing, but to attempt to fix the file and use the 
platform (or preferred) default, or perhaps CR/LF, since that is the 
last "accepted" sequence of delimiters in each line.
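
One possible reading of that rule, sketched in Python only as an 
illustration (the names LIBERAL_DELIMITER and repair are hypothetical, 
and the choice to treat any run of CRs followed by an LF as one line 
end is my assumption about what "sensible" means here):

```python
import re

# A run of CRs followed by an LF counts as ONE line end. That covers
# plain CRLF as well as the damaged CR/CR/LF. Otherwise a lone CR or a
# lone LF ends the line.
LIBERAL_DELIMITER = re.compile(r"\r+\n|\r|\n")

def repair(text, delimiter="\r\n"):
    """Re-join the logical lines of text using a single delimiter."""
    return delimiter.join(LIBERAL_DELIMITER.split(text))

fixed = repair("first\r\r\nsecond\r\r\nthird")
```

The cost of this liberality is that a CR-delimited empty line that 
happens to sit directly before a CRLF would be swallowed, which is 
one reason it should stay possible to ask for strict reading.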

I'd suggest:

- default file opens are *not* text unless you explicitly use a text 
stream class, wrapper, or constructor method. You read one character 
per character in the file.

- we review the reading and writing of text files in the Basic image 
to make sure that the behavior is what we want. For instance, we may 
decide to maintain the Mac delimiters in ChangeSet's file-out format, 
but to make the "save as text" from the Workspaces save in the 
default text format.

- we very carefully review the uses of #position and #position: on 
text streams in the image, especially if there's math being done on 
these values. These are potential sources of bugs. It may be possible 
to come up with a safer idiom for many of these uses, for instance:
	- come up with a StreamPosition object that knows about line numbers 
and offsets within a line
	- make a #mark and #returnToMark: (whatever their names should be) 
that can be used in the common cases where you just want to return to 
where you were

- we add either:
	- text flavors of the open calls
		FileStream readOnlyTextFileNamed: ...
	- a message that will change the stream behavior (or wrap the stream) 
to text (this is my preference)
		someStream asText ...

- there should be a preference for the default delimiter flavor for 
new text files. The choices should include "OS default" as well as 
specific flavors (CR, LF, or CRLF probably).

- this preference should be set to "OS default" in the distributed 
images

- the default write behavior for new text files should be to use the 
preferred delimiters

- the default read behavior of text files should be to handle 
arbitrary delimiters at least as well as the CrLfFileStream does. 
(that is, transparent handling of at least CR, LF, or CRLF 
delimiters).

- but it should be possible to get a Notification (whose default 
handler just ignores it) when you encounter a different (unexpected) 
line ending. Or at least to query the stream to see if such 
unexpected delimiter sequences have been read.

- the default write behavior for pre-existing text files should be to 
use the auto-detected line endings (which could be as simple as 
CrLfFileStream's search for the first delimiter).

- it should be possible to override the defaults and specify:
	- read delimiter translation mode (i.e. strict or liberal)
	- write delimiter sequences
	- behavior on encountering non-default delimiter sequences on read
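
To show how these pieces might fit together, here is a rough sketch, 
again in Python rather than Squeak and only under my own assumptions. 
TextOptions, read_text, and ignore are hypothetical names, not an 
existing API. A plain callback stands in for the Notification, and 
its default does nothing, just as an unhandled Notification would.

```python
import os
import re
from dataclasses import dataclass
from typing import Callable, Optional

LIBERAL = re.compile(r"\r\n|\r|\n")

def ignore(delimiter, line_number):
    """Default handler: do nothing."""

@dataclass
class TextOptions:
    preferred: Optional[str] = None    # None stands for "OS default"
    strict: bool = False               # strict: only the expected delimiter ends a line
    on_unexpected: Callable[[str, int], None] = ignore

    def new_file_delimiter(self):
        """Delimiter for files that have no detectable convention."""
        return os.linesep if self.preferred is None else self.preferred

def read_text(text, options):
    """Return (lines, delimiter to use when re-writing).

    The expected delimiter is the first one found in the text, falling
    back to the preference when the text contains none.
    """
    first = LIBERAL.search(text)
    expected = first.group() if first else options.new_file_delimiter()
    if options.strict:
        return text.split(expected), expected
    lines, start = [], 0
    for number, match in enumerate(LIBERAL.finditer(text), 1):
        if match.group() != expected:
            options.on_unexpected(match.group(), number)
        lines.append(text[start:match.start()])
        start = match.end()
    lines.append(text[start:])    # text after the last delimiter (may be empty)
    return lines, expected

# A caller that wants to know about unexpected delimiters collects them.
unexpected = []
options = TextOptions(on_unexpected=lambda d, n: unexpected.append((n, d)))
lines, delimiter = read_text("a\r\nb\nc\r\n", options)
```

Here the second delimiter (a bare LF in a CRLF file) is reported to 
the callback, while a caller using the default options never hears 
about it.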

-- 
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE


