[Seaside] requests and encodings (was Re: fix for issue 21)

Sat Jun 28 09:10:29 UTC 2008

Moving to seaside-dev...

On Sat, Jun 28, 2008 at 2:56 PM, Philippe Marschall
<philippe.marschall at gmail.com> wrote:
> 2008/6/27, Julian Fitzell <julian at fitzell.ca>:
>>  On Fri, Jun 27, 2008 at 1:07 PM, Philippe Marschall
>>  <philippe.marschall at gmail.com> wrote:
>>  > 2008/6/26, Julian Fitzell <julian at fitzell.ca>:
>>  >>  - I wonder whether we should add a "path" instVar to WARequest.
>>  >>  Currently the (unfortunately-named) "url" instvar doesn't provide any
>>  >>  way to tell the difference between a '/' and a '%2f' in the original
>>  >>  URL. I broke my fix up into two methods so that we could store the
>>  >>  result of #pathSegmentsFrom: in another instvar.
>>  >
>>  > Ideally IMHO "url" would hold a WAUrl that is the request URL parsed.
>>  > I don't know though if this is enough and what it all will break.
>>  > Right now "url" is also always utf-8 decoded which made me create
>>  > issue 79.
>>
>>
>> Well, I thought that too but it would kind of break things to change
>>  it from a string to a WAUrl. Also, after more thought, I realized that
>>  an HTTP request doesn't have a protocol, port, or (necessarily)
>>  server.
>
> Yes it does. The server is in the Host header. The protocol is either
> http or https; we can get this from the configuration. Same for the
> port.

Yeah, ok, I suppose you /could/ fake it with the information from the
configuration (there is no Host: header in HTTP/1.0 but that's likely
not a big problem these days). Is that misleading though, since the
user might actually have connected differently (particularly for an
initial connection, where Seaside's configuration doesn't enter into
the equation)? You could also presumably find the port and protocol of
the Kom connection from Kom itself somehow...

In either case, it seems to me that changing #url from a string to a
WAUrl would break existing code. Maybe that's acceptable... code that
does break would not be difficult to fix, and it would probably break
pretty obviously.
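
To make the '/' versus '%2f' point from earlier in the thread concrete,
here is a rough sketch in Python (not Seaside code; the function name
path_segments_from is made up for illustration and is not the actual
#pathSegmentsFrom: implementation). The only idea it shows is that the
raw path has to be split on '/' before percent-decoding, otherwise an
encoded slash becomes indistinguishable from a real separator.

```python
from urllib.parse import unquote


def path_segments_from(raw_path):
    """Split the still percent-encoded path on '/', then decode each segment.

    Hypothetical helper for illustration only. Empty segments (leading,
    trailing or doubled slashes) are dropped, which is a simplification.
    """
    segments = [segment for segment in raw_path.split("/") if segment]
    return [unquote(segment, encoding="utf-8") for segment in segments]


raw = "/files/a%2Fb/caf%C3%A9"

# Decoding the whole string first loses the distinction:
print(unquote(raw))             # /files/a/b/café

# Splitting first keeps 'a/b' together as one segment:
print(path_segments_from(raw))  # ['files', 'a/b', 'café']
```

Keeping the decoded segments alongside the existing string would leave
current users of #url untouched, at the cost of storing the path twice.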

>>  >>  - do you know if the header values in HTTPRequest also need to be
>>  >>  decoded? They aren't currently and I don't know if they support UTF-8
>>  >>  values or not...
>>  >
>>  > If they are really UTF-8 that would be good. An example is cookie
>>  > values which are transmitted through headers. See also issue 63.
>>  > Before adding such a thing, please make sure it really works with IE
>>  > 6, Firefox 2, Safari 2 and Opera 9 with utf-8, ISO-8859-1 and utf-16.
>>  > Ideally also Big5 and  Shift JIS though I have to admit I never tested
>>  > with those. Unfortunately the HTTP spec/theory and browsers/reality
>>  > are different.
>>
>>
>> Are you suggesting auto-detecting the encoding of headers sent by the
>>  browser?
>
> No not at all. But in Seaside 2.9 we now know the encoding of the web
> application. Even if there is a spec, you will simply have to try all
> browsers with at least iso-8859-1 and utf-8. Either there is a rule or
> we can't support it. It's as simple as that. A short googling suggests
> that headers are ASCII. We might or might not want to support a custom
> encoding for cookie values.
>
>> I don't think the browser specifies an encoding in the
>>  headers does it? I'm not sure I want to tackle this mess right now but
>>  I'll keep it in mind. :)
>
> It can, in the content-type header. Not that it often does.
>
>>  I'd have to think about this more but if we are supporting all those
>>  encodings, wouldn't it be nice to have a pair of encoders: one for
>>  what we want our Response encoding to be and one for the encoding we
>>  want to use internally (convert Request data *TO* and Response data
>>  *from*). So you could use a UTF-8 converter for "outside" and a Squeak
>>  encoding converter for "inside"; all incoming data would be converted
>>  to Squeak encoding and anything going out would be converted from
>>  Squeak encoding to UTF-8. If you had UTF-8 encoders for both then you
>>  wouldn't have to do any encoding going out but incoming might still
>>  have to be converted to UTF-8 if it was, for example, UTF-16.
>
> No, internally we ideally want only Squeak/Smalltalk encoding.
> Otherwise we can throw String away and just use ByteArray. The problem
> is that WideStrings are bugged and slow and for legacy reasons we have
> to support "null encoding". Everything else is insanity. Same goes for
> using utf-8 internally and utf-16 externally. Second, for some external
> parts (like URLs) the external encoding is given.

It doesn't appear quite that simple to me... if you have data in UTF
format in a database, you might well prefer to use UTF encoding
internally (or at the very least be able to specify the encoding of
that data when giving it to the canvas). Squeak encoding doesn't
support anything outside basic accented characters, does it? Same goes
for incoming form data if you have to put it in a database... you
don't want to be storing it in Squeak encoding.
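
A rough sketch of the encoder-pair idea quoted above, in Python rather
than Smalltalk. Everything here is an assumption for illustration: the
CodecPair class and its method names are invented, and it assumes that
request data always arrives in the "external" encoding, which real
browsers do not guarantee. It is only meant to show the shape of
"convert incoming data to the internal encoding, convert outgoing data
back", with a pass-through when both sides use the same encoding.

```python
import codecs


class CodecPair:
    """Hypothetical helper: 'external' is the wire encoding,
    'internal' is the encoding the application stores data in."""

    def __init__(self, external, internal):
        # Normalize names so that "UTF-8" and "utf8" compare equal.
        self.external = codecs.lookup(external).name
        self.internal = codecs.lookup(internal).name

    def incoming(self, data):
        """Convert request bytes from the external to the internal encoding."""
        if self.external == self.internal:
            return data  # same encoding on both sides: nothing to convert
        return data.decode(self.external).encode(self.internal)

    def outgoing(self, data):
        """Convert response bytes from the internal to the external encoding."""
        if self.external == self.internal:
            return data
        return data.decode(self.internal).encode(self.external)


pair = CodecPair(external="utf-16", internal="utf-8")
stored = pair.incoming("café".encode("utf-16"))
print(stored)  # b'caf\xc3\xa9'

# With UTF-8 on both sides the bytes pass through untouched.
same = CodecPair(external="UTF-8", internal="utf8")
print(same.outgoing(stored) is stored)  # True
```

With UTF-8 as the internal encoding, form data could go into a UTF
database unchanged; whether the canvas could accept it without a
further conversion is exactly the open question.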

Julian