Dwight Hughes <dwighth@ipa.net> wrote:
I would also like to see it handle pathological files -- I seem to run into a number of source files that have (for example) LF in one part, CR in another, CRLF in another, and, just for fun, mutants like LFCRLF, CRLFLF, CRLFCR, and CRCRLF sprinkled here and there. Perhaps CrLfFileStream should simply chomp all of the above combinations into a single CR on a read (which seems to be the "right thing" in most cases I've seen), and only bother with line end conventions when filing out.
I guess I'm wondering if this would be useful to others or if I'm just lucky.
I think it would be, though I think you're quite "lucky" if you are actually seeing things quite as bad as you describe! It's easy to do. I think you only need to change next and next: and remove references to lineEndConvention:
    CrLfFileStream.next
        | char nextC |
        char _ super next.
        self isBinary ifTrue: [^ char].
        char = Cr ifTrue: [
            "funny code because of how peek is implemented"
            nextC _ super next.
            "back up only if a character other than Lf was actually read;
             at end of file super next answers nil and nothing was consumed"
            (nextC notNil and: [nextC ~= Lf])
                ifTrue: [super position: super position - 1].
            ^ Cr].
        char = Lf ifTrue: [^ Cr].
        ^ char

    CrLfFileStream.next: n
        | string |
        string _ super next: n.
        string size = 0 ifTrue: [^ string].
        self isBinary ifTrue: [^ string].
        string _ string withSqueakLineEndings.
        string size = n ifTrue: [^ string].
        "string shrunk due to embedded crlfs; make up the difference"
        ^ string, (self next: n - string size)
There is still a problem with using CrLfFileStream as the default, however: dealing with file positions. Right now, file positions in a CrLfFileStream will sometimes jump up by 2 even though you've only read 1 Squeak-visible character. Either this needs to be sanctioned like it is in C (yick!), or code needs to be added to keep up with the "virtual" position in the file.
With the latter approach, it's interesting that the position in ASCII mode is different than the position in binary mode. But a file with mixed modes is a strange beast indeed, I would think. If there is any Squeak binary data in a file, then the whole file may as well be considered a binary file, yes?
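To make the position mismatch concrete, here is a minimal workspace sketch, not CrLfFileStream's actual code. It assumes an in-memory ReadStream standing in for the file, and the names raw, visible and virtualPosition are made up for illustration. It folds CRLF and lone LF into CR while counting Squeak-visible characters separately from the raw position.

```smalltalk
"Workspace sketch (assumption: raw stands in for the underlying file stream;
 virtualPosition is a hypothetical counter, not an existing instance variable)."
| crChar lfChar raw visible virtualPosition char |
crChar := Character cr.
lfChar := Character lf.
raw := ReadStream on: 'ab', (String with: crChar with: lfChar), 'cd'.
visible := WriteStream on: String new.
virtualPosition := 0.
[raw atEnd] whileFalse: [
    char := raw next.
    "A CR followed by LF counts as one visible character: swallow the LF"
    (char = crChar and: [raw peek = lfChar]) ifTrue: [raw next].
    "A lone LF is also reported as CR"
    char = lfChar ifTrue: [char := crChar].
    visible nextPut: char.
    virtualPosition := virtualPosition + 1].
Transcript show: 'raw position: ', raw position printString; cr.
Transcript show: 'virtual position: ', virtualPosition printString; cr.
```

For this six-character input the raw position ends at 6 while the virtual position ends at 5. Counting forward like this is the easy half. Setting the stream to a given virtual position would presumably need either a rescan from the start or some index of where the two-character line ends sit.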
Lex
-- Dwight
Lex Spoon wrote:
If a file has a CRLF pair in it, and CrLfFileStream converts that to a CR, then the idea of "position" gets messed up. It looks like 1 character, but "file position" will act like 2 characters have gone by. For CrLfFileStream to go the final mile, it should probably have some code to make positions truly transparent to the user.
Other than that, CrLfFileStream seems very nice. I've been using it as the default for several months now and found it nothing but convenient.
Lex
"Pennell" <pennell@tiac.net> wrote:
Dan - thanks for adding this in 2.2. Is there any reason not to make this the default? If you don't change the default for 2.2, can you add it as a preference?
- david
As a quick and modestly sick example of eol formatting there is:
ftp://st.cs.uiuc.edu/pub/Smalltalk/Squeak/Goodies/XBaseFile.st
For a seriously ill example of eol formatting, check out the New Compiler Implementation and its change set at:
http://www.hellblazer.com/personal/squeak/
These problems seem to come from code developed on several versions of Smalltalk and/or several platforms and cut-and-pasted - and probably given some unneeded help by over-helpful decompression programs or certain servers.
I haven't looked any deeper into the problem than just being aggravated by it and whining :-) -- it is easy enough to deal with in other ways. Thinking a bit more about it, I think the difficulties in creating a one-size-fits-all pathological file filter (which Richard Harmon talks about in his messages) make putting this filtering in as a default a bad idea - the assumptions one has to make to filter in any reasonably effective way will also turn around and bite you sooner or later.
Since the real problem I see is in filing in such files and getting screwy formatting when I browse the code, it might be better to have this filter applied by nextChunk during fileIn. It would be nice to base this filter on a very general and stupid nextLine method in FileStream: read in anything that is not a cr or lf as a line, and stop and return the line when you see either a cr or an lf; if a cr or lf is the first char seen by nextLine, ignore it and get the next char, etc. Then put whatever end-of-line char you want between the lines read.
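The "general and stupid" line reader described above could look roughly like the following workspace sketch. It runs over an in-memory ReadStream rather than a real FileStream, and the variable names are invented for the example; no such nextLine method is claimed to exist in FileStream.

```smalltalk
"Workspace sketch (assumptions: in stands in for the file being filed in;
 CR is chosen as the end-of-line char written between lines)."
| crChar lfChar in out line |
crChar := Character cr.
lfChar := Character lf.
in := ReadStream on: 'one', (String with: crChar with: lfChar with: lfChar),
    'two', (String with: lfChar), 'three'.
out := WriteStream on: String new.
[in atEnd] whileFalse: [
    "Ignore any cr or lf seen before the line starts"
    [in atEnd not and: [in peek = crChar or: [in peek = lfChar]]]
        whileTrue: [in next].
    "Collect everything up to the next cr, lf, or end of input"
    line := WriteStream on: String new.
    [in atEnd or: [in peek = crChar or: [in peek = lfChar]]]
        whileFalse: [line nextPut: in next].
    line contents isEmpty
        ifFalse: [out nextPutAll: line contents; nextPut: crChar]].
out contents
```

Here the mutant CRLFLF after 'one' and the lone LF after 'two' both come out as a single CR. The same rule swallows every intentional blank line, since any run of cr and lf characters collapses to one line end. That is one of the assumptions that may turn around and bite, as noted above.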
However, since my time is focused on some other things, I leave it to the interested reader :-).
-- Dwight
squeak-dev@lists.squeakfoundation.org