FileStream and TextConverters etc (reposting from another thread)

Tue Apr 11 13:46:57 UTC 2006

Hi Yoshiki!

Yoshiki Ohshima <yoshiki at squeakland.org> wrote:
>   Göran,
> 
> > Btw, yesterday I was staring at the MultiByteFileStream stuff and...
> > well, IMHO it would have been better *for me* (other users may have
> > other stories to tell) if the default was binary and not ascii. The
> > principle of least surprise. If I open a filestream and don't tell it
> > *anything*, then I would expect it to just feed me the bits and bytes -
> > as Strings or ByteArrays, but not doing any conversions or line end
> > mumbo jumbo or any other unexpected "nice things". An example of this
> > is inspecting a file in the file list - I really appreciated the fact
> > that filelist didn't do *any* conversion on the stuff it showed me - now
> > it does. And I also wonder where the hex view went... anyway:
> 
>   Again, "Strings" now include WideStrings, so "no conversion" would not
> work for the users of such strings.

Hmmmm. Looking at where the instvar converter gets assigned in
MultiByteFileStream, I can't say it looks fully "sound" (in a 7022 image).

First of all it is lazily set in #converter using logic that looks
"right" to me - it ends up with the Latin1TextConverter for me, which
you below explain indeed is the Null converter (and yes, looking at it,
it sure is - not sure why you do the "(Character value: aCharacter
charCode)" though).

But then #open:forWrite: has logic setting it to MacRoman or UTF-8 - and
#reset sets it lazily to UTF-8. I don't get it. :)

> > What I ended up doing was creating NullTextConverter (which does no
> > conversion at all, trivial to write) and then it worked fine.
> 
>   Sorry about that, but we actually have it.... It is just called
> Latin1TextConverter.  (There was some argument for intention-
> revealing names and we were almost about to add an empty subclass of
> Latin1TextConverter, but we didn't get around to it.)

As always a slightly more "revealing" class comment would have saved me.
:) It doesn't mention that the internal encoding these days actually is
iso8859-1 (right?) and that this converter does no conversion at all.
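
To make the "Latin1 is really the null converter" point concrete: in
ISO 8859-1 every byte value 0..255 is the code point with the same number,
so converting to and from it changes nothing. A minimal sketch of that
property, using Python's built-in latin-1 codec purely as a stand-in (this
is not Squeak code and says nothing about how Latin1TextConverter is
implemented):

```python
# Illustration only: Python's 'latin-1' codec, not Squeak's Latin1TextConverter.
all_bytes = bytes(range(256))

# Decoding maps each byte value to the code point with the same number.
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# Encoding maps it straight back, so the round trip is lossless.
assert text.encode("latin-1") == all_bytes

# UTF-8, by contrast, changes the byte sequence for anything above 127.
utf8_len = len("ö".encode("utf-8"))      # two bytes
latin1_len = len("ö".encode("latin-1"))  # one byte
```

So a class comment along the lines of "identity mapping, use this when you
want no conversion" would match what the encoding actually does.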

> > It seems to me that a cleaner approach here would be to:
> > 
> > 1. Do line end conversions or not, regardless of the 2 choices below.
> > 2. Binary or ascii - only decides if we use ByteArrays or Strings,
> > doesn't concern conversions or line ends.
> > 3. Selection of converter where we also have a NullConverter that does
> > nothing.
> > 
> > IMHO (having not dissected this in total detail) the above three options
> > should be combinable. So for example, in our case we have utf8 strings
> > that we want to write out *as is* and use #cr to get platform specific
> > line endings.
> 
>   Mostly I agree, as we do have almost independent choice of 1. and
> 2., as well as NullConverter under the name of Latin1TextConverter.
> 
>   But, isn't the combination of #binary and a line end conversion confusing?

Possibly. :) Given that I can trust the stream to not muck about with
the strings I feed it (which Latin1TextConverter indeed seems to make
sure) then sure, perhaps we could say that binary means no line end
conversions at all.
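
The three choices discussed above (line end conversion, binary vs. ascii,
converter) can be sketched as independent knobs, with the one restriction
we seem to agree on: binary implies no line end conversion. This is a
hypothetical model in Python, not Squeak code; the names write_out and
LINE_ENDS are made up for illustration, and it assumes CR is the
in-memory line end and that the chosen encoding is ASCII-compatible.

```python
# Hypothetical model of the three knobs; not the MultiByteFileStream API.
LINE_ENDS = {"cr": b"\r", "lf": b"\n", "crlf": b"\r\n"}

def write_out(data, *, binary=False, converter="latin-1", line_end=None):
    """Return the bytes that would end up in the file.

    binary    -- True: data is bytes and is passed through untouched.
    converter -- codec name used in text mode; 'latin-1' acts as "no conversion".
    line_end  -- None for no conversion, else 'cr', 'lf' or 'crlf'.
    """
    if binary:
        if line_end is not None:
            raise ValueError("binary mode does not do line end conversion")
        return bytes(data)
    encoded = data.encode(converter)
    if line_end is not None:
        # Assumes CR is the internal line end and the codec is ASCII-compatible.
        encoded = encoded.replace(b"\r", LINE_ENDS[line_end])
    return encoded

# Our case: a String whose characters are already UTF-8 bytes, ending in a CR.
utf8_in_string = "gö\r".encode("utf-8").decode("latin-1")
out = write_out(utf8_in_string, converter="latin-1", line_end="crlf")
assert out == "gö".encode("utf-8") + b"\r\n"
```

The point of the sketch is only that the converter and the line end
handling do not need to know about each other.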

Ehm, btw, does this mean that the only way to make it do no conversions
and still operate using Strings and not ByteArrays is to use the latin1
converter which operates one character at a time? Because that is way
too slow.

And yes, Andreas is probably right that most of all these issues should
better be dealt with in the TextFiles package.

> > I also think that a default FileStream should not do any line end
> > conversions or conversions at all by default (but still use Strings
> > instead of ByteArrays). In other words - I would like the "least
> > surprise" principle to hold. Am I alone in this idea? I love the work of
> > Yoshiki and friends in this area - I just want to iron out the small
> > "gotchas" with it.
> >
> > Now, Yoshiki and all the rest of you - feel free to correct me with the
> > real facts. :)
> 
>   I wrote a reply to you in this regard last week.

Yes, I just thought it was a reply to my first mention of this and
not the second, sorry. And I also think most people didn't read that
thread too closely. :)

> For the least
> surprise principle, I would say using UTF8 conversion for text would
> make sense.

Not to me. Least surprise to me is "no conversion" - which indeed the
system default converter would have given me (albeit one char at a
time).

>   And, as Andreas wrote, the best thing is to separate the concerns.
> If somebody manages to separate the fileOut and fileIn aspect from
> FileStream (there were discussions to move to XML-based external
> format...), it would be a great advance on that front.
> 
> -- Yoshiki

Right.

regards, Göran