Extending FileList with CrLf

Richard A. O'Keefe ok at cs.otago.ac.nz
Tue Aug 5 02:48:56 UTC 2003


Ned Konz <ned at bike-nomad.com> wrote:
	Not in Squeak. We already have the #ascii/#binary distinction which 
	only affects whether reads will return Character/String or 
	SmallInteger/ByteArray.
	
	Unfortunately, this meaning of #binary conflicts with the C(++) 
	meaning of binary (which means: no line ending translation).
	
It suddenly occurred to me that Catch 28 applies:
    Things are always more complicated than they appear,
    even when you take Catch 28 into account.

C has a *two*-way distinction:
    "binary" => get the "raw" bytes (for some value of "raw"; mapping the
        C I/O model onto record-oriented systems is non-trivial)
    "text" => do line terminator translation, end of file truncation, &c
        As far as I am aware, nothing in the C89 or C99 standards prevents
        a C system from adopting "CR+LF *or* CR *or* LF -> \n" on input
        and "\n -> platform-specific" on output.
It manages without Squeak's #ascii/#binary distinction by "punning" on
"character" and "byte integer".
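
To make the text-mode mapping concrete, here is a minimal sketch in Python
(not C, and not anything Squeak does today). The function names are made up
for illustration; the only behaviour assumed is the "CR+LF or CR or LF -> \n"
rule on input and "\n -> platform-specific" on output described above.

```python
import os


def normalize_newlines(raw: bytes) -> bytes:
    """Input side: map CR+LF, lone CR, and lone LF to a single LF."""
    # Handle CR+LF first so that it is not counted as two line ends.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_platform_newlines(data: bytes,
                         eol: bytes = os.linesep.encode("ascii")) -> bytes:
    """Output side: map LF to the given (by default, the platform's) terminator."""
    return data.replace(b"\n", eol)


mixed = b"one\r\ntwo\rthree\nfour"
unified = normalize_newlines(mixed)
dos_style = to_platform_newlines(unified, b"\r\n")
```

Note that the round trip is lossy: once the input has been normalized, there
is no way to tell which convention each line originally used.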

A three-way distinction has been proposed:
    - pure binary
    - characters without end of line/end of file conversion
    - characters with end of line/end of file conversion

But of course, all of this presumes that the character set is unproblematic.
Trying to read Latin-1 characters into Squeak doesn't work very well because
Squeak thinks it is dealing with (modified) MacRoman.

We _can_ figure out the end of line convention quite easily just by
looking at the data; we _can't_ figure out nearly so easily what the
byte 16rD4 stands for.
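
A rough Python sketch of both halves of that claim follows. The helper
guess_eol is hypothetical, and "mac_roman" here is Python's standard MacRoman
codec, which may differ in places from Squeak's modified variant.

```python
def guess_eol(raw: bytes) -> str:
    """Guess the dominant line terminator by counting occurrences."""
    crlf = raw.count(b"\r\n")
    cr = raw.count(b"\r") - crlf   # lone CRs only
    lf = raw.count(b"\n") - crlf   # lone LFs only
    counts = {"CRLF": crlf, "CR": cr, "LF": lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"


# The same byte decodes to different characters under different encodings,
# and nothing in the byte itself says which one was intended.
byte = bytes([0xD4])
as_latin1 = byte.decode("latin-1")      # U+00D4, capital O with circumflex
as_macroman = byte.decode("mac_roman")  # U+2018, left single quotation mark
```

Counting terminators settles the first question from the data alone; the
second needs information from outside the data.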

Ideally, we'd be able to pick up the encoding from the file system.
I have a vague memory that NTFS can use file attributes to do this.
I have equally vague memories that MacOS (pre-X) didn't do this, but there
is probably some way to hack it using resource forks.
Traditional UNIX file systems have, alas, no notion of file attributes or
properties, but you can pick up a "default" encoding from the locale
environment variables ($LC_ALL, $LC_CTYPE, $LANG).
Really nasty are things like XML, where you have an encoding directive
embedded in the data, and have to switch encoding dynamically.
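
As a hedged illustration of why this is nasty, the Python sketch below reads
the encoding declaration out of the raw bytes and only then decodes them.
The function decode_xml is hypothetical and deliberately simplistic: it
assumes an ASCII-compatible encoding and ignores byte order marks and UTF-16,
all of which a real XML parser has to deal with.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def decode_xml(raw: bytes, default: str = "utf-8") -> str:
    """Decode raw XML bytes using the encoding they declare, if any."""
    m = _DECL.match(raw)
    # The declaration must be read as bytes before we know how to read
    # anything as characters.
    encoding = m.group(1).decode("ascii") if m else default
    return raw.decode(encoding)


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xd4</a>'
text = decode_xml(doc)
```

The reader has to start in one mode (bytes), peek at the content, and then
switch to another (characters in the declared encoding).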

What I'm getting at is that a simple binary/"raw text"/"useful text"
distinction is a dead end; there are really two separate questions:
    - how are the elements of the file encoded?
    - do we want those elements as characters or integers?
and long term, we'll *have* to have
    FileStream encoding "answer the encoding"
    FileStream encoding: anEncoding "set the encoding"
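
To show how the two questions separate, here is a small sketch in Python
rather than Smalltalk. The class EncodedStream and its methods are invented
for illustration and are not a proposal for Squeak's actual FileStream. It
also assumes a single-byte encoding, so that n bytes always make n characters.

```python
class EncodedStream:
    """A byte source with a settable encoding and two ways of reading."""

    def __init__(self, raw: bytes, encoding: str = "latin-1"):
        self._raw = raw
        self._pos = 0
        self.encoding = encoding  # may be reassigned between reads

    def next_bytes(self, n: int) -> bytes:
        """Answer the next n elements as integers (a bytes object)."""
        chunk = self._raw[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk

    def next_chars(self, n: int) -> str:
        """Answer the next n elements as characters in the current encoding."""
        # Valid only for single-byte encodings, as assumed above.
        return self.next_bytes(n).decode(self.encoding)


s = EncodedStream(b"\xd4\xd4\xd4", encoding="latin-1")
first = s.next_chars(1)        # decoded as Latin-1
s.encoding = "mac_roman"       # switch encoding mid-stream
second = s.next_chars(1)       # same byte, different character
third = list(s.next_bytes(1))  # same byte again, as an integer
```

The encoding and the choice between characters and integers vary
independently, which is the point of the two questions above.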

When you realise that, "raw text" looks like a very very very specific
encoding, and even "useful text" looks like a band-aid.


