If a file has a CRLF pair in it, and CrLfFileStream converts that to a CR, then the idea of "position" gets messed up. It looks like 1 character, but "file position" will act like 2 characters have gone by. For CrLfFileStream to go the final mile, it should probably have some code to make positions truly transparent to the user.
Other than that, CrLfFileStream seems very nice. I've been using it as the default for several months now and found it nothing but convenient.
Lex
"Pennell" pennell@tiac.net wrote:
Dan - thanks for adding this in 2.2. Is there any reason not to make this the default? If you don't change the default for 2.2, can you add it as a preference?
- david
I would also like to see it handle pathological files -- I seem to run into a number of source files that have (for example) LF in one part, CR in another, CRLF in another, and, just for fun, mutants like LFCRLF, CRLFLF, CRLFCR, and CRCRLF sprinkled here and there. Perhaps CrLfFileStream should simply chomp all of the above combinations into a single CR on a read (which seems to be the "right thing" in most cases I've seen), and only bother with line end conventions when filing out.
I guess I'm wondering if this would be useful to others or if I'm just lucky.
-- Dwight
Lex Spoon wrote:
If a file has a CRLF pair in it, and CrLfFileStream converts that to a CR, then the idea of "position" gets messed up. It looks like 1 character, but "file position" will act like 2 characters have gone by. For CrLfFileStream to go the final mile, it should probably have some code to make positions truly transparent to the user.
Other than that, CrLfFileStream seems very nice. I've been using it as the default for several months now and found it nothing but convenient.
Lex
"Pennell" pennell@tiac.net wrote:
Dan - thanks for adding this in 2.2. Is there any reason not to make this the default? If you don't change the default for 2.2, can you add it as a preference?
- david
Lex Spoon wrote:
If a file has a CRLF pair in it, and CrLfFileStream converts that to a CR, then the idea of "position" gets messed up. It looks like 1 character, but "file position" will act like 2 characters have gone by. For CrLfFileStream to go the final mile, it should probably have some code to make positions truly transparent to the user.
I've been bitten by the "position mismatch" in a number of places. Us poor windoze users probably get it more than most, having CrLf as the default. Attached is a change set that fixes the problems I've seen.
Doing a 'save as' of the image messes up the source pointers to the changes file when the Cr's get written as CrLf's. I changed the copy to use 'binary' mode, which seems appropriate.
Both #peek and #upTo: in StandardFileStream assume the 1:1 correlation between characters and positions. The re-writes for CrLfFileStream fix that. I'm sure there are other parts to the protocol that also need fixing -- I just haven't run across them yet.
Also included is a #correctLineEndings method that makes a copy of a file while changing every type of line ending into the current default. I've found it to be handy in dealing with certain "problem" files where the line endings have been mixed and matched.
I also explored the slowness of some file parsing I was doing. By changing CrLfFileStream>>next (and 3 other methods) I got better than a factor of 5 speed up.
I agree with others here that the FileStream (and probably the whole Stream) hierarchy needs some rework to both clean it up and make it more flexible. However, I'm hoping that the attached changes can get incorporated into the standard image, while we figure out what we really want to do.
I've added below the more detailed description from the change set.
------------------------------------------- Bill Dargel wdargel@shoshana.com Shoshana Technologies 100 West Joy Road, Ann Arbor, MI 48105 USA ------------------------------------------------------------------
Change Set: CrLfFileStream updates Date: 5 November 1998 Author: William O. Dargel
Some bug fixes and enhancements for CrLfFileStream for Squeak 2.2.
FileDirectory>>copyFileNamed:toFileNamed: - Set both files to #binary mode in order to make an exact copy. Fixes problem with 'saveAs' possibly converting Cr to CrLf in the changes file, thereby messing up the image offsets to the source code.
CrLfFileStream>>peek - Fixes the erroneous assumption made in StandardFileStream that #next will always change the position by 1. Note that this implementation in CrLfFileStream could be promoted, replacing StandardFileStream's.
CrLfFileStream>>upTo: - Fixes the assumption made by StandardFileStream that positions and character counts are directly correlated. Uses the simple, one character at a time implementation similar to PositionableStream. At one point I implemented a complicated buffer based version, but found that after optimizing #next, there was no appreciable speed difference between it and the simple version.
CrLfFileStream>>next - Optimized to be about 5.5 times faster when just doing #next through a large file. This was accomplished by adding a caching instance variable for 'lineEndFirstCharacter', used instead of doing a dictionary lookup each time. This is updated whenever the line-ending convention is set. The speedup was measured for Windows (NT 4.0), and is predicated on a fairly efficient primitive for getting a single character from the file system, YMMV.
CrLfFileStream>>lineEndConvention: - Setter method for lineEndConvention. Updates the 'lineEndFirstCharacter' caching instance variable. Also useful for explicit external control of the lineEndConvention on individual file instances.
CrLfFileStream>>detectLineEndConvention - Uses the lineEndConvention: setter. Also changed to use a 'nil' lineEndConvention while it is operating in order to have next and peek operate in a basic manner.
CrLfFileStream>>binary - Changed to use the lineEndConvention: setter.
FileStream>>correctLineEndings - A utility that re-writes a file changing all flavors of line endings, Cr, Lf and CrLf, into the current CrLfFileStream default line ending. Makes use of a temporary file, renaming it after copying and deleting the original. I hooked this method up as an option on the FileList menu. 'From Squeak 2.2 of Sept 23, 1998 on 5 November 1998 at 8:23:18 pm'! "Change Set: CrLfFileStream updates Date: 5 November 1998 Author: William O. Dargel
Some bug fixes and enhancements for CrLfFileStream for Squeak 2.2.
FileDirectory>>copyFileNamed:toFileNamed: - Set both files to #binary mode in order to make an exact copy. Fixes problem with 'saveAs' possibly converting Cr to CrLf in the changes file, thereby messing up the image offsets to the source code.
CrLfFileStream>>peek - Fixes the erroneous assumption made in StandardFileStream that #next will always change the position by 1. Note that this implementation in CrLfFileStream could be promoted, replacing StandardFileStream's.
CrLfFileStream>>upTo: - Fixes the assumption made by StandardFileStream that positions and character counts are directly correlated. Uses the simple, one character at a time implementation similar to PositionableStream. At one point I implemented a complicated buffer based version, but found that after optimizing #next, there was no appreciable speed difference between it and the simple version.
CrLfFileStream>>next - Optimized to be about 5.5 times faster when just doing #next through a large file. This was accomplished by adding a caching instance variable for 'lineEndFirstCharacter', used instead of doing a dictionary lookup each time. This is updated whenever the line-ending convention is set. The speedup was measured for Windows (NT 4.0), and is predicated on a fairly efficient primitive for getting a single character from the file system, YMMV.
CrLfFileStream>>lineEndConvention: - Setter method for lineEndConvention. Updates the 'lineEndFirstCharacter' caching instance variable. Also useful for explicit external control of the lineEndConvention on individual file instances.
CrLfFileStream>>detectLineEndConvention - Uses the lineEndConvention: setter. Also changed to use a 'nil' lineEndConvention while it is operating in order to have next and peek operate in a basic manner.
CrLfFileStream>>binary - Changed to use the lineEndConvention: setter.
FileStream>>correctLineEndings - A utility that re-writes a file changing all flavors of line endings, Cr, Lf and CrLf, into the current CrLfFileStream default line ending. Makes use of a temporary file, renaming it after copying and deleting the original. I hooked this method up as an option on the FileList menu. "!
StandardFileStream subclass: #CrLfFileStream instanceVariableNames: 'lineEndConvention lineEndFirstCharacter ' classVariableNames: 'Cr CrLf Lf LineEndDefault LineEndStrings LookAheadCount ' poolDictionaries: '' category: 'System-Files'!
!FileDirectory methodsFor: 'file operations' stamp: 'wod 11/5/1998 18:41'! copyFileNamed: fileName1 toFileNamed: fileName2 "Copy the contents of the existing file with the first name into a new file with the second name. Both files are assumed to be in this directory." "FileDirectory default copyFileNamed: 'todo.txt' toFileNamed: 'todocopy.txt'"
| file1 file2 buffer | file1 _ (self readOnlyFileNamed: fileName1) binary. file2 _ (self newFileNamed: fileName2) binary. buffer _ String new: 50000. [file1 atEnd] whileFalse: [file2 nextPutAll: (file1 nextInto: buffer)]. file1 close. file2 close. ! !
!FileStream methodsFor: 'as yet unclassified' stamp: 'wod 11/5/1998 17:46'! correctLineEndings "Correct the line-endings in the receiver by writing a new copy of the file. All combinations of line-endings found, Cr, Lf and CrLf, are converted into the default used by CrLfFileStream." | tempName out char | tempName _ self directory fullNameFor: '~LinEnd~.tmp'. out _ CrLfFileStream newFileNamed: tempName. self reset. [self atEnd] whileFalse: [char _ self next. char == Character cr ifTrue: [out cr. self peek == Character lf ifTrue: [self next]] ifFalse: [char == Character lf ifTrue: [out cr] ifFalse: [out nextPut: char]]]. self close. out close. self directory deleteFileNamed: self name. self directory rename: tempName toBe: self name.! !
!CrLfFileStream methodsFor: 'open/close' stamp: 'wod 11/5/1998 17:31'! lineEndConvention: aSymbolOrNil "Set the line-ending convention of the receiver to be #cr, #lf, #crlf, or nil." lineEndFirstCharacter _ aSymbolOrNil ifNotNil: [(LineEndStrings at: aSymbolOrNil) first]. ^ lineEndConvention _ aSymbolOrNil. ! !
!CrLfFileStream methodsFor: 'access' stamp: 'wod 6/18/1998 14:00'! binary super binary. self lineEndConvention: nil! !
!CrLfFileStream methodsFor: 'access' stamp: 'wod 11/5/1998 15:01'! detectLineEndConvention "Detect the line end convention used in this stream. The result may be either #cr, #lf or #crlf." | char numRead pos | self isBinary ifTrue: [^ self error: 'Line end conventions are not used on binary streams']. self lineEndConvention: nil. numRead _ 0. pos _ self position. self position: 0. [self atEnd not and: [numRead < LookAheadCount]] whileTrue: [char _ self next. char = Lf ifTrue: [self position: pos. ^ self lineEndConvention: #lf]. char = Cr ifTrue: [self peek = Lf ifTrue: [self lineEndConvention: #crlf] ifFalse: [self lineEndConvention: #cr]. self position: pos. ^ lineEndConvention]. numRead _ numRead + 1]. self position: pos. ^ lineEndConvention ifNil: [self lineEndConvention: LineEndDefault]! !
!CrLfFileStream methodsFor: 'access' stamp: 'wod 11/5/1998 17:49'! next | char secondChar | char _ super next. self isBinary ifTrue: [^char]. char == Cr ifTrue: [secondChar _ super next. secondChar ifNotNil: [secondChar == Lf ifFalse: [self skip: -1]]. ^Cr]. char == Lf ifTrue: [^Cr]. ^char! !
!CrLfFileStream methodsFor: 'access' stamp: 'wod 6/18/1998 13:52'! peek "Answer what would be returned if the message next were sent to the receiver. If the receiver is at the end, answer nil. " | next pos | self atEnd ifTrue: [^ nil]. pos _ self position. next _ self next. self position: pos. ^ next! !
!CrLfFileStream methodsFor: 'access' stamp: 'wod 11/5/1998 14:15'! upTo: aCharacter | newStream char | newStream _ WriteStream on: (String new: 100). [(char _ self next) isNil or: [char == aCharacter]] whileFalse: [newStream nextPut: char]. ^ newStream contents ! !
At 17:38 -0500 9/24/98, Lex Spoon wrote:
If a file has a CRLF pair in it, and CrLfFileStream converts that to a CR, then the idea of "position" gets messed up. It looks like 1 character, but "file position" will act like 2 characters have gone by. For CrLfFileStream to go the final mile, it should probably have some code to make positions truly transparent to the user.
Other than that, CrLfFileStream seems very nice. I've been using it as the default for several months now and found it nothing but convenient.
Lex
I've just discovered this thread and read it quickly. The root seems a good a spot as any as a base a few comments.
I'm wearing my Here's-What-Other-System-Do hat.
IBM Smalltalk has a different design which I've kind of liked. It gets around the problems of having explicit classes for CrLf streams or of guesing.
There are line delimiter values in the same pool dictionary that holds character definitions:
MACLineDelimiter There's no Mac version but you might get Mac files. PMLineDelimiter OS/2 UNIXLineDelimiter UNIX (tm) WINLineDelimiter Windows LineDelimiter The appropriate one of the above for this platform
Streams and file streams all have a #lineDelimiter: method that lets the value be changed (and a #lineDelimiter method too). By default, the value is that of the current host platform. Line delimiters are strings.
So, if I open a file on Windows, I get CrLf by default. I can do a #nextLine which internally does a (self #upToAll: lineDelimiter) to read a line. A #cr puts the stream's current line delimiter value.
For example, here is some (untested) code:
| in out | in := CfsReadFileStream open: 'UnixFile'. in lineDelimiter: UNIXLineDelimiter.
out := CfsWriteFileStream open: 'WinFile'. out lineDelimiter: WINLineDelimiter.
[ in atEnd ] whileFalse: [ out nextPutAll: in nextLine; cr ]. in close. out close.
This reads a file in UNIX format and writes one in Windows format, regardless of what platform it is run on. Omit the 5th line of code and it converts a UNIX format file to whatever the native platform format is.
There is a general mechanism for calling code at image startup (and at image shutdown) and at other times. With such a mechanism it is easy to set the defaults, and it is an easy mechanism to build (if there isn't already one in Squeak that I've missed).
There is one flaw, though: if you read the file character by character then you get whatever is in the file. There is no attempt to map, say CrLf, to a native Cr. One usually avoids this by reading lines but that is not always what you want. I suspect that it'd be easy to add a #mapToLineDelimiter: method which specifies another line delimiter for a mapping from the stream's line delimiter but then you'd have the length of the read text differ from the stream position.
Dave
_______________________________ David N. Smith IBM T J Watson Research Center Hawthorne, NY _______________________________ Any opinions or recommendations herein are those of the author and not of his employer.
squeak-dev@lists.squeakfoundation.org