[Seaside-dev] Re: encoded stream

Julian Fitzell jfitzell at gmail.com
Sat Jan 24 15:49:20 UTC 2009


(moving this to the list)

On Sat, Jan 24, 2009 at 3:16 PM, Lukas Renggli <renggli at gmail.com> wrote:
> On Sat, Jan 24, 2009 at 1:14 PM, Julian Fitzell <julian at fitzell.ca> wrote:
>> Philippe says the decoding needs to be done based on the same setting
>> as the encoding (which makes sense), so if we are encoding based on an
>> Application setting, we need to be decoding based on that same setting.
>> I guess this means putting raw data in the Request object and
>> specifying a Codec when you read from it (or just configuring it with
>> one as it goes by, so things further down the handling pipeline use
>> that encoding automatically). All seems a little gross...
>
> Then I wonder why there is a char-set setting in the application at
> all. Is there a reason to set it to something different from the
> server adaptor?

As best I can understand, the Codec is supposed to specify what the
encoding is *guaranteed* to be, while the setting in the Application
is what encoding is *reported*. By specifying a NullCodec, you are
making no guarantee (you just pass the data through untouched in both
directions), but you might still want to report UTF-8 because that's
what you think the data is, and you'll just deal with the breakage if
it isn't.

Personally, though, while I sort of understand the logic behind this,
I think it is a major source of confusion for people (it certainly is
for me). And since, most of the time, we don't know for sure what
encoding a request actually uses, it isn't really a guarantee anyway.

I think (and I'm really not certain, I'm just thinking aloud here, so
go easy on me!!) I would rather see something where you specified an
external/internal encoding *pair* in the configuration and an
appropriate Codec was selected based on that. For example:

UTF-8/Smalltalk:
 + Response data is assumed to be provided in Smalltalk encoding and
is converted to UTF-8.
 + UTF-8 is reported to the browser as the encoding.
 + Request data is assumed to be UTF-8 unless an encoding is specified
in the request.
 + Data is converted to Smalltalk encoding, either from UTF-8 or from
whatever encoding was specified.

Latin-1/Smalltalk:
 + Same as above but read "Latin-1" instead of "UTF-8"

UTF-8/UTF-8:
 + Response data is assumed to be provided in UTF-8 encoding. Since
that is the same as the external encoding, no conversion is
necessary.
 + UTF-8 is reported to the browser.
 + Request data is assumed to be UTF-8 unless an encoding is specified
in the request.
 + No conversion is needed unless a different encoding was specified
in the request, in which case the data is converted to UTF-8.

And so on... Basically, when moving data in either direction, you:
  + assume the default encoding specified on the input side unless
told otherwise; and
  + convert to the encoding specified on the output side (if different).
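
Roughly, the request side of that rule could be little more than this
(purely a sketch - I'm making the selectors up as I go, none of this
exists anywhere):

  decodeRequestString: aString charSet: requestCharSet
      "Assume the configured external encoding unless the request
       names one, then convert to the configured internal encoding."
      | from |
      from := requestCharSet ifNil: [ self externalEncoding ].
      ^ (self codecFrom: from to: self internalEncoding) decode: aString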

It might be that these would be better specified as some kind of
encoding *policy*, with the specific Codec selected as needed based on
the two encodings you actually end up with.
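
Something like this, maybe (again, just a sketch with made-up class
names - NullCodec here stands for whatever "no conversion" codec we
end up with, and ConversionCodec is invented on the spot):

  EncodingPolicy >> codecFor: externalName
      "Answer a codec converting from the named external encoding to
       the configured internal one; answer a null codec when the two
       already match, so no conversion work is done."
      ^ externalName = self internalEncoding
          ifTrue: [ NullCodec new ]
          ifFalse: [ ConversionCodec from: externalName
                                     to: self internalEncoding ]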

But I think this makes what the different options are doing much
clearer than the current implementation does, and it might actually
help people make decisions based on their real needs. Given the lack
of understanding around this, just saying "use WAKom" or "use
WAKomEncoded" seems a bit opaque.

As for whether the setting is on the Application or the ServerAdaptor,
I think the above works in either case, but at least everything can
use the same setting, removing a source of confusion. Obviously it
would be nice to have a per-application setting so you can have
(legacy?) applications with one behaviour alongside applications with
another. But since you can now easily have multiple ServerAdaptors
running, this is somewhat less of an issue.

(note that the above doesn't deal with URL encoding - I don't want to
make this email any longer by delving into that right now)

Thoughts?

Julian

