Extending FileList with CrLf

Daniel Vainsencher danielv at netvision.net.il
Thu Jul 24 17:03:04 UTC 2003


I am obviously having a hard time explaining what I mean here, so
forgive me if I'm being too explicit in the following. But I prefer to
write one long email rather than 30 more short ones. Please let me know
if some point is not clear.

Ned Konz <ned at bike-nomad.com> wrote:
> On Thursday 24 July 2003 02:05 am, Daniel Vainsencher wrote:
> > Just to ask for some more information, does anyone know how low
> > level text reading C libraries handle line endings in a file that
> > do not agree with the platform convention/currently used
> > convention?
> You see garbage characters in the stream if they don't exactly match 
> the platform convention.
This is the behavior I think Squeak should have. C does have a text
abstraction, and as some of you have pointed out, it is OS dependent,
but the abstraction doesn't mean that irregularities should be hidden
from higher levels.

> > I am even less partial to CrLfFS's extreme liberality during reading
> > than to its detection responsibilities.
> 
> Daniel, what's bothering you about the idea of a text file stream that 
> tries to be liberal when reading text files?
I think the very worthy "be liberal in input, strict in output" is a
useful rule of thumb at the interoperation level. That is, an
application should be liberal in input, to save the user from being the
only person on the block who can't read a file, and strict in output so
that all other implementations of the format are able to read its
files.

This is very generally true - from applications to electric devices,
users want interoperation to be seamless, and therefore the device as a
whole should BLIISIO. However, streams are too low a level to implement
this.

What is a useful property in a product for the end user can be a bug in
a component. A text abstraction should hide one very specific thing - in
this case, what bytes are used to represent line endings, or more
generally, the text encoding. Within that abstraction, it should allow
the application writer maximal freedom to do whatever they want. For
example, the application writer might want to edit a text file that
uses a mixed convention by design, where the line endings that don't
follow the OS convention are meaningful (because they are later
processed by a line-ending-sensitive program, for example).

I know that the application programmer would always have the option of
deciding to be picky, by using binary for example. But my point is that
the programmer expects a certain level of smartness from a format
abstraction (such as the text encoding abstraction), and no more. If the
text abstraction uses the OS encoding then I will naturally rely on
everything being treated as being in the OS encoding, *and nothing
else*. 

IMO, the policies of when to detect/decide/convert different formats are
a higher-level concern that the mechanisms implementing formats should
be unaware of. People mention (from a user's perspective, as far as I
can tell) that they want their Squeak tools to work in a "friendly" way
with strangely delimited text files. I have no quarrel with this.

I am saying that from a developer's view, the implementation should be
more along the lines of the following. The image would include a class
that holds CrLfFS's algorithms for encoding detection (probably with
more elaboration when m17n comes), and tools should be able to use this
facility in a very friendly way, with a message such as
FileStream>>asSmartText. But this should remain an explicit decision in
each application/tool, made somewhere between the application and the
format abstraction.
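
To make the shape of that concrete, here is a rough sketch. Neither
TextConventionDetector nor #asSmartText exists today - the names are
made up only to show where the explicit decision would live:

  "In a tool that has decided it wants the friendly behavior
   (the file name is just an example):"
  | file text |
  file := FileStream readOnlyFileNamed: 'foo.txt'.
  text := file asSmartText.    "explicit opt-in, per tool"

  "asSmartText would delegate to the detection algorithms that are
   currently buried inside CrLfFileStream, e.g.:"
  FileStream >> asSmartText
      ^ TextConventionDetector adaptStream: self

The point is only that detection is something a tool asks for, not
something the stream class does behind its back.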

From another design perspective, a general reason abstractions should be
fine-grained is that this allows them to be reused. For example, one
might conceivably want to detect line endings for a string that is in
memory, or coming off a socket, which is clumsy if the line-ending
detection is mixed in with a file stream. Or, as above, one might want
to use the text abstraction to deal with reading lines, without doing
any other transformations.
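
For instance, a detector that only needs the generic stream protocol can
be reused on strings and sockets alike. A minimal sketch (the class name
LineEndingDetector is made up; only standard ReadStream messages are
assumed):

  LineEndingDetector class >> detect: aReadStream
      "Answer #cr, #lf or #crlf according to the first line-ending
       sequence encountered, or nil if there is none."
      | ch |
      [aReadStream atEnd] whileFalse: [
          ch := aReadStream next.
          ch = Character lf ifTrue: [^ #lf].
          ch = Character cr ifTrue: [
              ^ (aReadStream atEnd not and: [aReadStream peek = Character lf])
                  ifTrue: [#crlf]
                  ifFalse: [#cr]]].
      ^ nil

  "Works on in-memory data just as well as on a file or socket stream:"
  LineEndingDetector detect:
      (ReadStream on: 'one', (String with: Character lf), 'two').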

> Again, my definition of text files is something like: "a file 
> containing lines of text with delimiters".
> 
> Using this definition, it's the text itself (not the delimiters) that 
> matters to a text file. Delimiters aren't content. When I see a 
> "Character cr" in a read stream on a text file, I don't think 
> "there's an ASCII CR character", I think "there's the end of the 
> line".
But when you detect a CR character in an LF-based file (possibly right
at its beginning), and what you are implementing is an RFC that
designates LF as the line ending, with CRs being part of the payload,
you would be most surprised at things disappearing on you. "What are
line endings" is an application-level question. That you have a liberal
position on it as a user, one that works 95% of the time, doesn't mean
that the distinction should be lost, and it doesn't even mean that the
distinction should be ignored by default. It should be ignored when it
is correct for the application to do so.
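
A tiny made-up example of the failure: suppose the format uses LF as the
record separator and allows a bare CR inside a record. Once that data is
read through a liberal stream that maps CR, LF and CRLF to one logical
end-of-line, the payload CR and the real delimiter become
indistinguishable:

  "Hypothetical data: one record boundary (the LF), one payload CR."
  | raw |
  raw := 'abc', (String with: Character cr), 'def',
         (String with: Character lf), 'ghi'.
  "A liberal text stream presents this as three lines instead of two
   records - the payload CR has silently become a line ending."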

> Since each OS has its preferred scheme for delimiting text lines, we 
> should probably write new text files using that scheme.
> 
> Since we should be robust when dealing with files from other systems 
> (even, I would argue, ones that aren't as robust and might insert 
> spurious delimiter characters), we should be able to read text that 
> contains other flavors of line delimiters.
You are assuming that you are smarter than the originator of your data.
That may be true, but as someone who has done data conversions
professionally, and has had to deal with the results of information lost
because people thought they were smarter than a legacy application
("that's strange. doesn't make any sense. probably doesn't mean
anything, let's drop it."), I'd prefer that assumption be made
sparingly. And it certainly shouldn't be made by our low-level stream
classes, and by default at that.

> You were worried about "asymmetry" between reading and writing. There 
> isn't any if you realize that delimiters aren't content. Text lines 
> in, text lines out.
> 
> However, I understand your concern about reading a pre-existing file 
> and writing it back out with different line endings. The scheme we 
> ended up with in Monticello (for a while, anyway) and the one that 
> CrLfFileStream uses is to auto-detect the existing convention and use 
> that as the default for re-writing (of course, one can override that 
> default).
I objected quite adamantly then, too... never mind.

> I don't understand your concern about accepting different flavors of 
> line endings within a single file. As long as they're unambiguous, 
> there's no problem. CrLfFileStream already accepts CR, LF, or CRLF on 
> read, and writes using whatever the first delimiter sequence was. I 
> can see that it would be useful for some applications to know that 
> there were mixed delimiters in a pre-existing file.

As I hinted before, auto-detection in this case is a heuristic decision
- if the file includes a few CRs and a few LFs, and one of them should
be treated as data, you have no guarantee that the "right one" will
appear first in the file. It is impossible to evaluate this risk
globally. It is application-dependent.
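
Concretely, running the hypothetical detector sketched above over a file
whose first record happens to start with a payload CR answers #cr even
though LF is the actual record separator; a stream that then writes
"using whatever the first delimiter sequence was" would rewrite every
real delimiter in the wrong convention:

  LineEndingDetector detect: (ReadStream on:
      (String with: Character cr), 'payload',
      (String with: Character lf), 'rest').
  "=> #cr, although only the LF delimits records here."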

> My suggestion would extend this to handle damaged files (certainly not 
> the common case, I'd hope!) in a more sensible way. For instance, it 
> should be possible to read a *text* file that has CR/CR/LF between 
> every line as a bunch of lines separated by single (logical CR) 
> delimiters. And in this case it might make sense to *not* use 
> CR/CR/LF when re-writing, but to attempt to fix the file and use the 
> platform (or preferred) default. Which might be CR/LF in this case 
> because that is the last "accepted" sequence of delimiters in each 
> line.
I don't care if you have an AI algorithm evaluate every character in
the file stream, guess the application format (including which
fractional version of Word it comes from), and tell the application
what Stream to use. I just don't think that any such decisions should
be made by Streams, by default.

So I have no objection to making text files default to the OS
convention. In fact, being compatible with C's text abstraction in this
manner sounds reasonable to me. I object to any kind of autodetection
that happens implicitly. I object to any kind of conversion of
unexpected input that happens implicitly.

Daniel


