Extending FileList with CrLf
Ned Konz
ned at bike-nomad.com
Tue Jul 22 23:27:48 UTC 2003
On Tuesday 22 July 2003 02:33 pm, Daniel Vainsencher wrote:
> I used to have a Celeste DB. I work in Linux most days. I go to
> ESUG with a Windows laptop. On it I start another DB. I come back and
> try to merge. Blech, my files are not bit compatible because they
> happened to have been created on different OSes... And because my
> images used CrLfFS by default, I wasn't even aware of this.
>
> After that, I stopped using it.
>
> Philosophically:
> When you say that "text" automatically means one specific mode, in
> which in-the-image representations do not match in-the-file
> representations, you are ignoring other possibilities. I should be
> able to specify text behavior, and decide what mapping I want
> exactly, but the default mapping should be the transparent one -
> the identity mapping.
The Celeste problem comes from Celeste not being robust enough with
respect to differing text file formats. I don't know if this is
because it keeps indexes into the bytes of the file (which tend to
break when you change the line endings), or because of something
else.
It looks to me like EMAIL.index might be keeping character offsets.
But Celeste should be able to import (and re-index) EMAIL.message
files that have any kind of line ending (and should deal with missing
EMAIL.index files if it doesn't already).
I think it's quite valid to consider some files "text", which to me
means they should be written in whatever the native text format is.
It also means that on reading we should be more robust with respect
to differing line-ending conventions, even if you have a mixture of
them in a single file (which can happen when you append text with one
kind of line ending to a file you moved from another system).
CrLfFileStream isn't robust during reads, because it decides early on
what the line-ending convention is and sticks to it.
So if you have a file that starts off with CR line ends, and has a
chunk with CRLF line ends later, you'll have garbage LF characters.
What we *should* have, I think (as I've said before) is a wrapper that
lets you read any stream as filtered text. That is, it should convert
any of:
CRLF
LF
CR
to a single "Character cr" on read, and should convert "Character cr"
to whatever the line end convention in the system is on write (by
default; it should be possible to specify the line end convention).
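A minimal sketch of such a filtering wrapper, written here in Python
for illustration (the names to_internal and to_external are my own,
not existing Squeak code):

```python
import os
import re

def to_internal(text):
    """On read: map any of CRLF, LF, or bare CR to a single internal CR."""
    return re.sub(r'\r\n|\r|\n', '\r', text)

def to_external(text, eol=os.linesep):
    """On write: map internal CRs to the platform (or a chosen) convention."""
    return text.replace('\r', eol)
```

Note that this naive per-match conversion treats an ambiguous run
like CRLFLF as two line ends; the counting algorithm described next
handles such mixed runs more gracefully.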
A reasonable algorithm that will deal with most mixed-ending files
pretty well would be to take each run of consecutive CRs and/or LFs
and count how many CRs and how many LFs it contains. Then take the
minimum of those two counts (as long as that minimum is > 0; if one
count is zero, report the other count instead) and report that many
Squeak crs.
So:
CR -> cr
CRLF -> cr
LF -> cr
CRLFLF -> cr
CRCRLF -> cr
CRLFCRLF -> cr cr
LFCRLF -> cr
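Here is one way the counting rule above could be sketched in Python
(an illustration of the algorithm, not actual Squeak code):

```python
import re

def normalize_runs(text):
    """Collapse each maximal run of CR/LF characters to internal CRs.

    For each run, emit min(#CR, #LF) CRs when that minimum is positive;
    otherwise the run is all CRs or all LFs, so emit its full count.
    """
    def collapse(match):
        run = match.group(0)
        crs, lfs = run.count('\r'), run.count('\n')
        return '\r' * (min(crs, lfs) or max(crs, lfs))
    return re.sub(r'[\r\n]+', collapse, text)
```

This reproduces the table above: CRLFCRLF has two CRs and two LFs, so
it yields two crs, while LFCRLF (one CR, two LFs) yields one.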
--
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE
More information about the Squeak-dev
mailing list