[Seaside] Re: requests and encodings (was Re: fix for issue 21)

Philippe Marschall philippe.marschall at gmail.com
Sun Jun 29 13:48:18 UTC 2008


2008/6/28, Julian Fitzell <jfitzell at gmail.com>:
> Moving to seaside-dev...
>
>  On Sat, Jun 28, 2008 at 2:56 PM, Philippe Marschall
>  <philippe.marschall at gmail.com> wrote:
>  > 2008/6/27, Julian Fitzell <julian at fitzell.ca>:
>  >>  On Fri, Jun 27, 2008 at 1:07 PM, Philippe Marschall
>  >>  <philippe.marschall at gmail.com> wrote:
>  >>  > 2008/6/26, Julian Fitzell <julian at fitzell.ca>:
>  >>  >>  - I wonder whether we should add a "path" instVar to WARequest.
>  >>  >>  Currently the (unfortunately-named) "url" instvar doesn't provide any
>  >>  >>  way to tell the difference between a '/' and a '%2f' in the original
>  >>  >>  URL. I broke my fix up into two methods so that we could store the
>  >>  >>  result of #pathSegmentsFrom: in another instvar.
>  >>  >
>  >>  > Ideally IMHO "url" would hold a WAUrl that is the request URL parsed.
>  >>  > I don't know though if this is enough and what it all will break.
>  >>  > Right now "url" is also always utf-8 decoded which made me create
>  >>  > issue 79.
>  >>
>  >>
>  >> Well, I thought that too but it would kind of break things to change
>  >>  it from a string to a WAUrl. Also, after more thought, I realized that
>  >>  an HTTP request doesn't have a protocol, port, or (necessarily)
>  >>  server.
>  >
>  > Yes it does. The server is in the HOST header. The protocol is either
>  > http or https; we can get this from the configuration. Same for the
>  > port.
>
>  Yeah, ok, I suppose you /could/ fake it with the information from the
>  configuration (there is no Host: header in HTTP/1.0 but that's likely
>  not a big problem these days). Is that misleading though since the
>  user might actually have connected differently (particularly for an
>  initial connection where seaside's configuration doesn't enter into
>  the equation)? You could also presumably find the port and protocol of
>  the Kom connection from Kom itself somehow...

Well then, let's exclude the port and scheme:

WAUrl new parsePath: '/ch/de/index.html'

works quite well.
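
To illustrate the '/' versus '%2f' distinction raised earlier in the thread, here is a minimal workspace sketch. It is not the WAUrl implementation; it only uses plain String methods that I assume are present in the image (#findTokens: and #unescapePercents), and the path is a made-up example:

```smalltalk
| rawPath segments |
"Hypothetical request path; the %2F is an escaped slash inside one segment."
rawPath := '/files/a%2Fb/index.html'.

"Split on the literal slashes first, then percent-decode each segment."
segments := (rawPath findTokens: '/')
	collect: [ :each | each unescapePercents ].

"The middle segment now contains a real slash character.
Decoding the whole path before splitting would lose that distinction."
segments do: [ :each | Transcript show: each; cr ].
```

The order matters: once the path has been decoded as a whole, an escaped slash can no longer be told apart from a segment separator.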

>  In either case, it seems to me that changing #url from a string to a
>  WAUrl would break existing code. Maybe it's desirable...

Breaking client code is never desirable.

> not a
>  difficult fix to code that does break and it would probably break
>  pretty obviously.

and there should be pretty few users.

>  >>  >>  - do you know if the header values in HTTPRequest also need to be
>  >>  >>  decoded? They aren't currently and I don't know if they support UTF-8
>  >>  >>  values or not...
>  >>  >
>  >>  > If they are really UTF-8 that would be good. An example is cookie
>  >>  > values which are transmitted through headers. See also issue 63.
>  >>  > Before adding such a thing, please make sure it really works with IE
>  >>  > 6, Firefox 2, Safari 2 and Opera 9 with utf-8, ISO-8859-1 and utf-16.
>  >>  > Ideally also Big5 and  Shift JIS though I have to admit I never tested
>  >>  > with those. Unfortunately the HTTP spec/theory and browsers/reality
>  >>  > are different.
>  >>
>  >>
>  >> Are you suggesting auto-detecting the encoding of headers sent by the
>  >>  browser?
>  >
>  > No not at all. But in Seaside 2.9 we now know the encoding of the web
>  > application. Even if there is a spec, you will simply have to try all
>  > browsers with at least iso-8859-1 and utf-8. Either there is a rule or
>  > we can't support it. It's as simple as that. A short googling suggests
>  > that headers are ASCII. We might or might not want to support a custom
>  > encoding for cookie values.
>  >
>  >> I don't think the browser specifies an encoding in the
>  >>  headers does it? I'm not sure I want to tackle this mess right now but
>  >>  I'll keep it in mind. :)
>  >
>  > It can, in the content-type header. Not that it often does.
>  >
>  >>  I'd have to think about this more but if we are supporting all those
>  >>  encodings, wouldn't it be nice to have a pair of encoders: one for
>  >>  what we want our Response encoding to be and one for the encoding we
>  >>  want to use internally (convert Request data *TO* and Response data
>  >>  *from*). So you could use a UTF-8 converter for "outside" and a Squeak
>  >>  encoding converter for "inside"; all incoming data would be converted
>  >>  to Squeak encoding and anything going out would be converted from
>  >>  Squeak encoding to UTF-8. If you had UTF-8 encoders for both then you
>  >>  wouldn't have to do any encoding going out but incoming might still
>  >>  have to be converted to UTF-8 if it was, for example, UTF-16.
>  >
>  > No, internally we ideally want only Squeak/Smalltalk encoding.
>  > Otherwise we can throw String away and just use ByteArray. The problem
>  > is that WideStrings are bugged and slow and for legacy reasons we have
>  > to support "null encoding". Everything else is insanity. Same goes for
>  > using utf-8 internally and utf-16 externally. Second, for some external
>  > parts (like URLs) the external encoding is given.
>
>  It doesn't appear quite that simple to me... if you have data in UTF
>  format in a database, you might well prefer to use UTF encoding
>  internally

There is no such thing as UTF encoding. Using an encoding other than
Squeak fixes #= but breaks _every_ method except #,. The only reason
you might want this is to avoid the performance penalties of
WideString. But then again have you profiled your application and can
you prove to me that WideStrings are your performance bottleneck? Else
this is pure premature optimization.
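
A rough workspace sketch of what "breaks" means here, assuming #squeakToUtf8 is available in the image (the sample word is only an illustration):

```smalltalk
| decoded utf8 |
"Build the u-umlaut from its code point to avoid depending on source encoding."
decoded := 'Z', (String with: (Character value: 252)), 'rich'.

"The same text kept as UTF-8 bytes inside a String."
utf8 := decoded squeakToUtf8.

Transcript
	show: decoded size printString; cr.   "one element per character"
Transcript
	show: utf8 size printString; cr.      "one element per byte, so larger here"
Transcript
	show: (decoded at: 2) printString; cr.
Transcript
	show: (utf8 at: 2) printString; cr.   "first byte of a two-byte sequence"
```

Anything that counts, indexes, or compares characters (#size, #at:, #copyFrom:to:, #asUppercase and so on) operates on bytes rather than characters in the second string.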

> (or at very least be able to specify the encoding of that
>  data when giving it to the canvas).

No, you must adhere to the Seaside contract. You give Strings to
Seaside in the same encoding you expect Seaside to give Strings to
you. Everything else is a pure horror.
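
As a sketch of that contract for an application that uses Squeak encoding internally, the conversion happens exactly once in each direction at the boundary. The variable names are hypothetical, and #utf8ToSqueak and #squeakToUtf8 are assumed to be available in the image:

```smalltalk
| fromClient internal toClient |
"Simulated UTF-8 bytes as they might arrive from a browser."
fromClient := ('Z', (String with: (Character value: 252)), 'rich') squeakToUtf8.

"Decode once on the way in; application code only ever sees this form."
internal := fromClient utf8ToSqueak.

"Encode once on the way out."
toClient := internal squeakToUtf8.

"The round trip is expected to give back the original bytes."
Transcript show: (toClient = fromClient) printString; cr.
```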

> Squeak encoding doesn't
>  support anything outside basic accented characters, does it?

Squeak supports a superset of Unicode including astral planes.

> Same goes
>  for incoming form data if you have to put it in a database... you
>  don't want to be putting it in in Squeak encoding.

That's between you and your database driver. That doesn't include
Seaside at all.

I still think this belongs on seaside-dev.

Cheers
Philippe