CrLfFileStream as default?

Dwight Hughes dwighth at ipa.net
Sun Oct 25 23:48:36 UTC 1998


For a quick and modestly sick example of eol formatting, see:

ftp://st.cs.uiuc.edu/pub/Smalltalk/Squeak/Goodies/XBaseFile.st

For a seriously ill example of eol formatting, check out the New
Compiler Implementation and its change set at:

http://www.hellblazer.com/personal/squeak/

These problems seem to come from code developed on several versions of
Smalltalk and/or several platforms and cut-and-pasted - and probably
given some unneeded help by over-helpful decompression programs or
certain servers.

I haven't looked any deeper into the problem than just being aggravated
by it and whining :-) -- it is easy enough to deal with in other ways.
Thinking a bit more about it, I think the difficulties in creating a
one-size-fits-all pathological file filter (which Richard Harmon talks
about in his messages) make putting this filtering in as a default a
bad idea - the assumptions one has to make to filter in any reasonably
effective way will also turn around and bite you sooner or later.

Since the real problem I see is in filing in such files and getting
screwy formatting when I browse the code, it might be better to have
this filter applied by nextChunk during fileIn. It would be nice to base
this filter on a very general and stupid nextLine method in FileStream:
read in anything that is not cr or lf as part of the line, and stop and
return the line at the first cr or lf. If a cr or lf is the first char
seen by nextLine, ignore it and get the next char, and so on. Then put
whatever end-of-line char you want between the lines read.
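
To make the "stupid nextLine" idea concrete, here is a rough sketch. It is
written in Python rather than Squeak purely for illustration, and the names
next_line and normalize are made up for this example; nothing here is an
existing FileStream method. Note that it treats any run of cr/lf characters
as a single separator, so intentionally blank lines are lost, which is
exactly the kind of assumption that can bite you.

```python
import io

EOL_CHARS = "\r\n"  # the only characters treated as line ends


def next_line(stream):
    """Return the next run of non-cr/lf characters, or None at end of stream.

    Hypothetical helper: leading cr/lf characters are skipped, and reading
    stops at the first cr or lf seen after at least one ordinary character.
    """
    chars = []
    while True:
        ch = stream.read(1)
        if ch == "":            # end of stream
            break
        if ch in EOL_CHARS:
            if chars:           # a line end after some content: line is done
                break
            continue            # a line end before any content: ignore it
        chars.append(ch)
    return "".join(chars) if chars else None


def normalize(text, eol="\r"):
    """Re-join the lines of text with a single end-of-line string of choice."""
    stream = io.StringIO(text, newline="")  # newline="" disables translation
    lines = []
    while (line := next_line(stream)) is not None:
        lines.append(line)
    return eol.join(lines)
```

For example, normalize("a\nb\r\nc\n\r\nd") gives "a\rb\rc\rd": LF, CRLF and
the LFCRLF mutant all come out as one CR.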

However, since my time is focused on some other things, I leave it to
the interested reader :-).

-- Dwight


lex at cc.gatech.edu wrote:
> 
> Dwight Hughes <dwighth at ipa.net> wrote:
> > I would also like to see it handle pathological files -- I seem to run
> > into a number of source files that have (for example) LF in one part, CR
> > in another, CRLF in another, and, just for fun, mutants like LFCRLF,
> > CRLFLF, CRLFCR, and CRCRLF sprinkled here and there. Perhaps
> > CrLfFileStream should simply chomp all of the above combinations into a
> > single CR on a read (which seems to be the "right thing" in most cases
> > I've seen), and only bother with line end conventions when filing out.
> >
> > I guess I'm wondering if this would be useful to others or if I'm just
> > lucky.
> >
> 
> I think it would be, though I think you're quite "lucky" if you are actually seeing things quite as bad as you describe!  It's easy to do.  I think you only need to change next and next: and remove references to lineEndConvention:
 ----code snipped---- 
> There is still a problem with using CrLfFileStream as the default, however: dealing with file positions.  Right now, file positions in a CrLfFileStream will sometimes jump up by 2 even though you've only read 1 Squeak-visible character.  Either this needs to be sanctioned like it is in C (yick!), or code needs to be added to keep up with the "virtual" position in the file.
> 
> With the latter approach, it's interesting that the position in ASCII mode is different than the position in binary mode.  But a file with mixed modes is a strange beast indeed, I would think.  If there is any Squeak binary data in a file, then the whole file may as well be considered a binary file, yes?
> 
> Lex
----snip----
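
On Lex's second option above (keeping up with a "virtual" position), a
minimal sketch of the bookkeeping might look like the following. This is
Python for illustration only, not Squeak, and the class and attribute names
are invented for the example. It assumes CRLF, bare LF and bare CR are each
presented to the reader as one CR, and it counts raw bytes and visible
characters separately.

```python
CR = 0x0D
LF = 0x0A


class EolTranslatingReader:
    """Hypothetical reader that maps CRLF, LF and CR to a single CR.

    raw_position counts bytes consumed from the underlying data;
    virtual_position counts characters handed to the caller.
    """

    def __init__(self, data: bytes):
        self._data = data
        self.raw_position = 0
        self.virtual_position = 0

    def next(self):
        """Return the next visible character, or None at end of data."""
        if self.raw_position >= len(self._data):
            return None
        byte = self._data[self.raw_position]
        self.raw_position += 1
        if byte == CR:
            # Swallow an LF that immediately follows a CR (the CRLF case).
            if (self.raw_position < len(self._data)
                    and self._data[self.raw_position] == LF):
                self.raw_position += 1
            ch = "\r"
        elif byte == LF:
            ch = "\r"
        else:
            ch = chr(byte)
        self.virtual_position += 1  # always one visible character per call
        return ch
```

After reading the "a" and the line end from b"a\r\nb", raw_position is 3
while virtual_position is 2, which is the mismatch described above.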




