Extending FileList with CrLf
Ned Konz
ned at bike-nomad.com
Tue Jul 22 23:27:48 UTC 2003
On Tuesday 22 July 2003 02:33 pm, Daniel Vainsencher wrote:
> I used to have a Celeste DB. I work in Linux most days. I go to
> ESUG with a Windows laptop. On it I start another DB. I come back and
> try to merge. Blech, my files are not bit compatible because they
> happened to have been created on different OSes... And because my
> images used CrLfFS by default, I wasn't even aware of this.
>
> After that, I stopped using it.
>
> Philosophically:
> When you say that "text" automatically means one specific mode, in
> which in-the-image representations do not match in-the-file
> representations, you are ignoring other possibilities. I should be
> able to specify text behavior, and decide what mapping I want
> exactly, but the default mapping should be the transparent one -
> the identity mapping.
The Celeste problem comes from Celeste not being robust enough with
respect to differing text file formats. I don't know if this is
because it keeps indexes into the bytes of the file (which tend to
break when you change the line endings), or because of something
else.
It looks to me like EMAIL.index might be keeping character offsets.
But Celeste should be able to import (and re-index) EMAIL.message
files that have any kind of line ending (and should deal with missing
EMAIL.index files if it doesn't already).
I think it's quite valid to consider some files "text", which to me
means they should be written in whatever the native text format is.
It also means that on reading we should be more robust with respect
to differing line-ending conventions, even if you have a mixture of
them in a single file (which can happen when you append text with one
kind of line ending to a file you moved from another system).
CrLfFileStream isn't robust during reads, because it decides early on
what the line-ending convention is and sticks to it.
So if you have a file that starts off with CR line ends, and has a
chunk with CRLF line ends later, you'll have garbage LF characters.
What we *should* have, I think (as I've said before) is a wrapper that
lets you read any stream as filtered text. That is, it should convert
any of:
CRLF
LF
CR
to a single "Character cr" on read, and should convert "Character cr"
to whatever the line end convention in the system is on write (by
default; it should be possible to specify the line end convention).
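A minimal sketch of such a filtering wrapper, written here in Python
for illustration (the names to_internal and to_external are my own,
not existing Squeak code):

```python
import os
import re

def to_internal(text):
    """On read: map any of CRLF, LF, or bare CR to a single internal CR."""
    return re.sub(r'\r\n|\r|\n', '\r', text)

def to_external(text, eol=os.linesep):
    """On write: map internal CRs to the platform (or a chosen) convention."""
    return text.replace('\r', eol)
```

Note that this naive per-match conversion treats an ambiguous run
like CRLFLF as two line ends; the counting algorithm described next
handles such mixed runs more gracefully.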
A reasonable algorithm that will deal with most mixed-ending files
pretty well would be to take each run of consecutive CRs and/or LFs
and count how many CRs and how many LFs it contains. Then take the
minimum of those two counts (as long as that minimum is > 0; if one
count is zero, report the other count instead) and report that many
Squeak crs.
So:
CR -> cr
CRLF -> cr
LF -> cr
CRLFLF -> cr
CRCRLF -> cr
CRLFCRLF -> cr cr
LFCRLF -> cr
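Here is one way the counting rule above could be sketched in Python
(an illustration of the algorithm, not actual Squeak code):

```python
import re

def normalize_runs(text):
    """Collapse each maximal run of CR/LF characters to internal CRs.

    For each run, emit min(#CR, #LF) CRs when that minimum is positive;
    otherwise the run is all CRs or all LFs, so emit its full count.
    """
    def collapse(match):
        run = match.group(0)
        crs, lfs = run.count('\r'), run.count('\n')
        return '\r' * (min(crs, lfs) or max(crs, lfs))
    return re.sub(r'[\r\n]+', collapse, text)
```

This reproduces the table above: CRLFCRLF has two CRs and two LFs, so
it yields two crs, while LFCRLF (one CR, two LFs) yields one.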
--
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE
More information about the Squeak-dev
mailing list