[Seaside] Re: requests and encodings (was Re: fix for issue 21)

Sun Jun 29 15:19:54 UTC 2008

2008/6/29, Julian Fitzell <julian at fitzell.ca>:
> Hi Philippe,
>
>  I did send my previous message to seaside-dev.

All my headers show your message went to
seaside at lists.squeakfoundation.org and not
seaside-dev at lists.squeakfoundation.org

>  I feel like maybe I've offended you somehow, which was absolutely not
>  my intention. If so, I apologize. As I said, I love the pluggability
>  of this new encoding stuff... it's very clean and well done. My
>  intention was only to fix the bug in issue 21, which I did. The rest
>  was just thinking aloud.
>
>  My knowledge of the encoding in Squeak must be out of date (I was
>  familiar with it before the internationalization stuff went in). At
>  the time, MacRoman was used and, as I understand it, MacRoman only has
>  256 characters.

That was the state of Squeak 3.7 to my knowledge. Squeak 3.8 switched
to a layer-violating Unicode implementation. So all the MacRoman issues
fall away and new ones appear: #= and a lot of methods in WideString
broke. I don't know whether all of this is fixed in Squeak 3.10, but
WideString had some show stoppers in Squeak 3.8 and 3.9.
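
To illustrate the MacRoman issue that falls away, here is a rough sketch
in Python rather than Squeak, chosen only because it is easy to run. The
function name is made up for this example and nothing here is Seaside or
Squeak code. It assumes unencodable characters are replaced with "?"
instead of raising an error.

```python
def roundtrip_via_mac_roman(text):
    """Encode text to MacRoman (a 256-character encoding) and decode it again.

    Characters that MacRoman cannot represent are replaced with '?',
    so the round trip is lossy for anything outside its repertoire.
    """
    as_bytes = text.encode("mac_roman", errors="replace")
    return as_bytes.decode("mac_roman")


if __name__ == "__main__":
    for sample in ["café", "日本語"]:
        result = roundtrip_via_mac_roman(sample)
        status = "lossless" if result == sample else "lossy"
        print(repr(sample), "->", repr(result), status)
```

Accented Latin text survives the round trip, while text outside the 256
available characters does not.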

Seriously, when dealing with Strings we must be sure that they are
Strings. That is only the case if the String has Smalltalk encoding.
Otherwise the String is a mere ByteArray. A byte in it has no semantics
at all. It is not possible to do anything meaningful with such an
abstraction because we cannot assume anything about it. Say it has the
byte value 60 in it. Is that $<? We don't know and can't know because
it has no semantics.
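
A small sketch of the byte 60 point, again in Python rather than Squeak
and with a made-up function name. It decodes the same two bytes under a
few encodings; the choice of encodings is only an assumption for
illustration.

```python
# Two bytes with value 60 (0x3C); their encoding is unknown.
RAW = bytes([60, 60])


def interpretations(raw):
    """Decode the same bytes under several encodings and return the results."""
    encodings = ("ascii", "utf-8", "utf-16-le", "cp037")
    return {name: raw.decode(name) for name in encodings}


if __name__ == "__main__":
    for name, text in interpretations(RAW).items():
        print(name, "->", repr(text), "contains '<':", "<" in text)
```

Under ASCII and UTF-8 the bytes read as two "<" characters. Under
UTF-16 (little endian) they form one single, unrelated character, and
under the EBCDIC code page cp037 neither byte is "<". Without knowing
the encoding, the byte alone does not tell you which reading is right.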

Cheers
Philippe

> Obviously you want to be dealing with string literals,
>  etc. in squeak's encoding but data coming out of an existing database
>  is going to be in something else and outputting data from such a
>  database is going to be a common case.
>
>  Assuming my understanding of MacRoman is correct, you obviously can't
>  convert UTF-8 database data to MacRoman, then back to UTF-8 for output
>  back to the browser because the conversion would be lossy. It sounds
>  like you're saying MacRoman is no longer the encoding used. As long as
>  the full character space is available in the native encoding, then I
>  agree that having seaside deliver everything in that native encoding
>  is a reasonable implementation.
>
>  I don't necessarily agree that being able to specify the encoding of a
>  piece of data is "pure horror" but I agree what is there now is going
>  to be adequate as long as the internal encoding is appropriate for the
>  task. Again, sorry for any offense.
>
>  Julian
>
>  On Sun, Jun 29, 2008 at 9:48 PM, Philippe Marschall
>  <philippe.marschall at gmail.com> wrote:
>  > 2008/6/28, Julian Fitzell <jfitzell at gmail.com>:
>  >> Moving to seaside-dev...
>  >>
>  >>  On Sat, Jun 28, 2008 at 2:56 PM, Philippe Marschall
>  >>  <philippe.marschall at gmail.com> wrote:
>  >>  > 2008/6/27, Julian Fitzell <julian at fitzell.ca>:
>  >>  >>  On Fri, Jun 27, 2008 at 1:07 PM, Philippe Marschall
>  >>  >>  <philippe.marschall at gmail.com> wrote:
>  >>  >>  > 2008/6/26, Julian Fitzell <julian at fitzell.ca>:
>  >>  >>  >>  - I wonder whether we should add a "path" instVar to WARequest.
>  >>  >>  >>  Currently the (unfortunately-named) "url" instvar doesn't provide any
>  >>  >>  >>  way to tell the difference between a '/' and a '%2f' in the original
>  >>  >>  >>  URL. I broke my fix up into two methods so that we could store the
>  >>  >>  >>  result of #pathSegmentsFrom: in another instvar.
>  >>  >>  >
>  >>  >>  > Ideally IMHO "url" would hold a WAUrl that is the request URL parsed.
>  >>  >>  > I don't know though if this is enough and what it all will break.
>  >>  >>  > Right now "url" is also always utf-8 decoded which made me create
>  >>  >>  > issue 79.
>  >>  >>
>  >>  >>
>  >>  >> Well, I thought that too but it would kind of break things to change
>  >>  >>  it from a string to a WAUrl. Also, after more thought, I realized that
>  >>  >>  an HTTP request doesn't have a protocol, port, or (necessarily)
>  >>  >>  server.
>  >>  >
>  >>  > Yes, it does. The server is in the Host header. The protocol is either
>  >>  > http or https; we can get this from the configuration. Same for the
>  >>  > port.
>  >>
>  >>  Yeah, ok, I suppose you /could/ fake it with the information from the
>  >>  configuration (there is no Host: header in HTTP/1.0 but that's likely
>  >>  not a big problem these days). Is that misleading, though, since the
>  >>  user might actually have connected differently (particularly for an
>  >>  initial connection where Seaside's configuration doesn't enter into
>  >>  the equation)? You could also presumably find the port and protocol of
>  >>  the Kom connection from Kom itself somehow...
>  >
>  > Well then, let's exclude the port and scheme:
>  >
>  > WAUrl new parsePath: '/ch/de/index.html'
>  >
>  > works quite well.
>  >
>  >>  In either case, it seems to me that changing #url from a string to a
>  >>  WAUrl would break existing code. Maybe it's desirable...
>  >
>  > Breaking client code is never desirable.
>  >
>  >> not a
>  >>  difficult fix to code that does break and it would probably break
>  >>  pretty obviously.
>  >
>  > and there should be pretty few users.
>  >
>  >>  >>  >>  - do you know if the header values in HTTPRequest also need to be
>  >>  >>  >>  decoded? They aren't currently and I don't know if they support UTF-8
>  >>  >>  >>  values or not...
>  >>  >>  >
>  >>  >>  > If they are really UTF-8 that would be good. An example is cookie
>  >>  >>  > values which are transmitted through headers. See also issue 63.
>  >>  >>  > Before adding such a thing, please make sure it really works with IE
>  >>  >>  > 6, Firefox 2, Safari 2 and Opera 9 with utf-8, ISO-8859-1 and utf-16.
>  >>  >>  > Ideally also Big5 and Shift JIS, though I have to admit I never tested
>  >>  >>  > with those. Unfortunately the HTTP spec/theory and browsers/reality
>  >>  >>  > are different.
>  >>  >>
>  >>  >>
>  >>  >> Are you suggesting auto-detecting the encoding of headers sent by the
>  >>  >>  browser?
>  >>  >
>  >>  > No, not at all. But in Seaside 2.9 we now know the encoding of the web
>  >>  > application. Even if there is a spec, you will simply have to try all
>  >>  > browsers with at least iso-8859-1 and utf-8. Either there is a rule or
>  >>  > we can't support it. It's as simple as that. A short googling suggests
>  >>  > that headers are ASCII. We might or might not want to support a custom
>  >>  > encoding for cookie values.
>  >>  >
>  >>  >> I don't think the browser specifies an encoding in the
>  >>  >>  headers does it? I'm not sure I want to tackle this mess right now but
>  >>  >>  I'll keep it in mind. :)
>  >>  >
>  >>  > It can, in the content-type header. Not that it often does.
>  >>  >
>  >>  >>  I'd have to think about this more but if we are supporting all those
>  >>  >>  encodings, wouldn't it be nice to have a pair of encoders: one for
>  >>  >>  what we want our Response encoding to be and one for the encoding we
>  >>  >>  want to use internally (convert Request data *TO* and Response data
>  >>  >>  *from*). So you could use a UTF-8 converter for "outside" and a Squeak
>  >>  >>  encoding converter for "inside"; all incoming data would be converted
>  >>  >>  to Squeak encoding and anything going out would be converted from
>  >>  >>  Squeak encoding to UTF-8. If you had UTF-8 encoders for both then you
>  >>  >>  wouldn't have to do any encoding going out but incoming might still
>  >>  >>  have to be converted to UTF-8 if it was, for example, UTF-16.
>  >>  >
>  >>  > No, internally we ideally want only Squeak/Smalltalk encoding.
>  >>  > Otherwise we can throw String away and just use ByteArray. The problem
>  >>  > is that WideStrings are bugged and slow and for legacy reasons we have
>  >>  > to support "null encoding". Everything else is insanity. Same goes for
>  >>  > using utf-8 internally and utf-16 externally. Second, for some external
>  >>  > parts (like URLs) the external encoding is given.
>  >>
>  >>  It doesn't appear quite that simple to me... if you have data in UTF
>  >>  format in a database, you might well prefer to use UTF encoding
>  >>  internally
>  >
>  > There is no such thing as UTF encoding. Using an encoding other than
>  > Squeak fixes #= but breaks _every_ method except #,. The only reason
>  > you might want this is to avoid the performance penalties of
>  > WideString. But then again have you profiled your application and can
>  > you prove to me that WideStrings are your performance bottleneck? Else
>  > this is pure premature optimization.
>  >
>  >> (or at very least be able to specify the encoding of that
>  >>  data when giving it to the canvas).
>  >
>  > No, you must adhere to the Seaside contract. You give Strings to
>  > Seaside in the same encoding you expect Seaside to give Strings to
>  > you. Everything else is a pure horror.
>  >
>  >> Squeak encoding doesn't
>  >>  support anything outside basic accented characters, does it?
>  >
>  > Squeak supports a superset of Unicode including astral planes.
>  >
>  >> Same goes
>  >>  for incoming form data if you have to put it in a database... you
>  >>  don't want to be putting it in in Squeak encoding.
>  >
>  > That's between you and your database driver. That doesn't include
>  > Seaside at all.
>  >
>  > I still think this belongs to seaside-dev.
>  >
>  > Cheers
>  > Philippe
>
>  > _______________________________________________
>  > seaside mailing list
>  > seaside at lists.squeakfoundation.org
>  > http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside