[Seaside-dev] Re: [Seaside] Re: requests and encodings (was Re: fix for issue 21)

Julian Fitzell julian at fitzell.ca
Mon Jun 30 02:56:52 UTC 2008


Oops... you're right. I didn't know there was a separate seaside-dev
so I missed the distinction you were making.

I've moved this over there now, though I think you've sufficiently
clarified the issue for me at this point anyway.

Julian

On Sun, Jun 29, 2008 at 11:19 PM, Philippe Marschall
<philippe.marschall at gmail.com> wrote:
> 2008/6/29, Julian Fitzell <julian at fitzell.ca>:
>> Hi Philippe,
>>
>>  I did send my previous message to seaside-dev.
>
> All my headers show your message went to
> seaside at lists.squeakfoundation.org and not
> seaside-dev at lists.squeakfoundation.org
>
>>  I feel like maybe I've offended you somehow, which was absolutely not
>>  my intention. If so, I apologize. As I said, I love the pluggability
>>  of this new encoding stuff... it's very clean and well done. My
>>  intention was only to fix the bug in issue 21, which I did. The rest
>>  was just thinking aloud.
>>
>>  My knowledge of the encoding in Squeak must be out of date (I was
>>  familiar with it before the internationalization stuff went in). At
>>  the time, MacRoman was used and, as I understand it, MacRoman only has
>>  256 characters.
>
> That was the state of Squeak 3.7 to my knowledge. Squeak 3.8 switched
> to layer violated Unicode. So all the MacRoman issues fall away and
> new ones appear, like #=, and a lot of methods in WideString broke. I
> don't know if all of this is fixed in Squeak 3.10 but
> WideString had some show stoppers in Squeak 3.8 and 3.9.
>
> Seriously when dealing with Strings we must be sure that they are
> Strings. That is only the case if the String has Smalltalk encoding.
> Else the String is a mere ByteArray. A byte in it has no semantics at
> all. It is not possible to do anything meaningful at all with such an
> abstraction because we can not assume anything about it. So it has the
> byte value 60 in it. Is that $<? We don't know and can't know because
> it has no semantics.
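
To make the point above concrete, here is a minimal sketch in Python
(chosen only because it is easy to run; the byte values and codec names
are picked for the example and do not come from Squeak or Seaside). The
same byte turns into different characters depending on which encoding
is assumed:

```python
# Byte value 60 (0x3C) means nothing until an encoding is assumed.
raw = bytes([60])

as_ascii = raw.decode("ascii")    # '<' under ASCII
as_ebcdic = raw.decode("cp037")   # not '<' under EBCDIC (cp037)

# A byte above 127 shows the same ambiguity with everyday encodings.
high = bytes([0xE9])
as_latin1 = high.decode("latin-1")      # 'é'
as_macroman = high.decode("mac_roman")  # 'È'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# On its own, 0xE9 is not even a complete UTF-8 sequence.
high_is_utf8 = is_valid_utf8(high)
```

Without knowing the encoding, none of these readings can be preferred
over the others.
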
>
> Cheers
> Philippe
>
>> Obviously you want to be dealing with string literals,
>>  etc. in squeak's encoding but data coming out of an existing database
>>  is going to be in something else and outputting data from such a
>>  database is going to be a common case.
>>
>>  Assuming my understanding of MacRoman is correct, you obviously can't
>>  convert UTF-8 database data to MacRoman, then back to UTF-8 for output
>>  back to the browser because the conversion would be lossy. It sounds
>>  like you're saying MacRoman is no longer the encoding used. As long as
>>  the full character space is available in the native encoding, then I
>>  agree that having seaside deliver everything in that native encoding
>>  is a reasonable implementation.
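
A rough sketch of the lossy round trip described above, in Python
rather than Smalltalk. The function name and the sample strings are
made up for illustration, and "utf-32" merely stands in for an internal
form that covers all of Unicode:

```python
def round_trip(text: str, internal: str) -> str:
    """Simulate: UTF-8 from a database -> internal encoding -> UTF-8 out."""
    from_db = text.encode("utf-8")          # bytes as they sit in the database
    decoded = from_db.decode("utf-8")
    # Characters the internal encoding cannot represent become '?'.
    internal_bytes = decoded.encode(internal, errors="replace")
    to_browser = internal_bytes.decode(internal).encode("utf-8")
    return to_browser.decode("utf-8")


fits = round_trip("café", "mac_roman")    # every character exists in MacRoman
lost = round_trip("日本語", "mac_roman")   # none of these characters do
wide = round_trip("日本語", "utf-32")      # a full-Unicode internal form
```

Only the 256-character internal encoding loses data; once the internal
form covers the full character space, the round trip is exact.
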
>>
>>  I don't necessarily agree that being able to specify the encoding of a
>>  piece of data is "pure horror" but I agree what is there now is going
>>  to be adequate as long as the internal encoding is appropriate for the
>>  task. Again, sorry for any offense.
>>
>>  Julian
>>
>>  On Sun, Jun 29, 2008 at 9:48 PM, Philippe Marschall
>>
>> <philippe.marschall at gmail.com> wrote:
>>  > 2008/6/28, Julian Fitzell <jfitzell at gmail.com>:
>>  >> Moving to seaside-dev...
>>  >>
>>  >>  On Sat, Jun 28, 2008 at 2:56 PM, Philippe Marschall
>>  >>  <philippe.marschall at gmail.com> wrote:
>>  >>  > 2008/6/27, Julian Fitzell <julian at fitzell.ca>:
>>  >>  >>  On Fri, Jun 27, 2008 at 1:07 PM, Philippe Marschall
>>  >>  >>  <philippe.marschall at gmail.com> wrote:
>>  >>  >>  > 2008/6/26, Julian Fitzell <julian at fitzell.ca>:
>>  >>  >>  >>  - I wonder whether we should add a "path" instVar to WARequest.
>>  >>  >>  >>  Currently the (unfortunately-named) "url" instvar doesn't provide any
>>  >>  >>  >>  way to tell the difference between a '/' and a '%2f' in the original
>>  >>  >>  >>  URL. I broke my fix up into two methods so that we could store the
>>  >>  >>  >>  result of #pathSegmentsFrom: in another instvar.
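
The '/' versus '%2f' problem can be sketched as follows. This is a
Python illustration, not the Seaside code; path_segments is a
hypothetical helper that only mirrors the idea behind
#pathSegmentsFrom:, and the sample path is invented:

```python
from urllib.parse import unquote


def path_segments(raw_path: str) -> list:
    """Split on literal '/' first, then percent-decode each segment."""
    return [unquote(segment) for segment in raw_path.split("/") if segment]


raw = "/files/a%2Fb/index.html"

# Decoding the whole path first turns '%2F' into a separator.
decoded_first = [s for s in unquote(raw).split("/") if s]

# Splitting first keeps the encoded slash inside its segment.
split_first = path_segments(raw)
```

Once the whole string has been decoded, the distinction is gone, which
is why keeping the segments separately is attractive.
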
>>  >>  >>  >
>>  >>  >>  > Ideally IMHO "url" would hold a WAUrl that is the request URL parsed.
>>  >>  >>  > I don't know though if this is enough and what it all will break.
>>  >>  >>  > Right now "url" is also always utf-8 decoded which made me create
>>  >>  >>  > issue 79.
>>  >>  >>
>>  >>  >>
>>  >>  >> Well, I thought that too but it would kind of break things to change
>>  >>  >>  it from a string to a WAUrl. Also, after more thought, I realized that
>>  >>  >>  an HTTP request doesn't have a protocol, port, or (necessarily)
>>  >>  >>  server.
>>  >>  >
>>  >>  > Yes it does. The server is in the Host header. The protocol is either
>>  >>  > http or https; we can get this from the configuration. Same for the
>>  >>  > port.
>>  >>
>>  >>  Yeah, ok, I suppose you /could/ fake it with the information from the
>>  >>  configuration (there is no Host: header in HTTP/1.0 but that's likely
>>  >>  not a big problem these days). Is that misleading though since the
>>  >>  user might actually have connected differently (particularly for an
>>  >>  initial connection where seaside's configuration doesn't enter into
>>  >>  the equation)? You could also presumably find the port and protocol of
>>  >>  the Kom connection from Kom itself somehow...
>>  >
>>  > Well then, let's exclude the port and scheme:
>>  >
>>  > WAUrl new parsePath: '/ch/de/index.html'
>>  >
>>  > works quite well.
>>  >
>>  >>  In either case, it seems to me that changing #url from a string to a
>>  >>  WAUrl would break existing code. Maybe it's desirable...
>>  >
>>  > Breaking client code is never desirable.
>>  >
>>  >> not a
>>  >>  difficult fix to code that does break and it would probably break
>>  >>  pretty obviously.
>>  >
>>  > and there should be pretty few users.
>>  >
>>  >>  >>  >>  - do you know if the header values in HTTPRequest also need to be
>>  >>  >>  >>  decoded? They aren't currently and I don't know if they support UTF-8
>>  >>  >>  >>  values or not...
>>  >>  >>  >
>>  >>  >>  > If they are really UTF-8 that would be good. An example is cookie
>>  >>  >>  > values which are transmitted through headers. See also issue 63.
>>  >>  >>  > Before adding such a thing, please make sure it really works with IE
>>  >>  >>  > 6, Firefox 2, Safari 2 and Opera 9 with utf-8, ISO-8859-1 and utf-16.
>>  >>  >>  > Ideally also Big5 and  Shift JIS though I have to admit I never tested
>>  >>  >>  > with those. Unfortunately the HTTP spec/theory and browsers/reality
>>  >>  >>  > are different.
>>  >>  >>
>>  >>  >>
>>  >>  >> Are you suggesting auto-detecting the encoding of headers sent by the
>>  >>  >>  browser?
>>  >>  >
>>  >>  > No, not at all. But in Seaside 2.9 we now know the encoding of the web
>>  >>  > application. Even if there is a spec, you will simply have to try all
>>  >>  > browsers with at least iso-8859-1 and utf-8. Either there is a rule or
>>  >>  > we can't support it. It's as simple as that. A short googling suggests
>>  >>  > that headers are ASCII. We might or might not want to support a custom
>>  >>  > encoding for cookie values.
>>  >>  >
>>  >>  >> I don't think the browser specifies an encoding in the
>>  >>  >>  headers does it? I'm not sure I want to tackle this mess right now but
>>  >>  >>  I'll keep it in mind. :)
>>  >>  >
>>  >>  > It can, in the content-type header. Not that it often does.
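
For reference, a small Python sketch of reading the charset parameter
when a request does carry one. charset_of is a hypothetical helper and
the header values are invented; it relies only on the standard library's
email.message.Message:

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


with_charset = charset_of("application/x-www-form-urlencoded; charset=UTF-8")
without_charset = charset_of("application/x-www-form-urlencoded")
```

The second case is the common one, so the encoding usually has to come
from somewhere else, such as the application's own configuration.
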
>>  >>  >
>>  >>  >>  I'd have to think about this more but if we are supporting all those
>>  >>  >>  encodings, wouldn't it be nice to have a pair of encoders: one for
>>  >>  >>  what we want our Response encoding to be and one for the encoding we
>>  >>  >>  want to use internally (convert Request data *TO* and Response data
>>  >>  >>  *from*). So you could use a UTF-8 converter for "outside" and a Squeak
>>  >>  >>  encoding converter for "inside"; all incoming data would be converted
>>  >>  >>  to Squeak encoding and anything going out would be converted from
>>  >>  >>  Squeak encoding to UTF-8. If you had UTF-8 encoders for both then you
>>  >>  >>  wouldn't have to do any encoding going out but incoming might still
>>  >>  >>  have to be converted to UTF-8 if it was, for example, UTF-16.
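
A minimal sketch of converting at the boundary, assuming a single
external encoding and the language's native string type inside.
BoundaryCodec and its method names are hypothetical and written in
Python for illustration; this is not how Seaside implements it:

```python
class BoundaryCodec:
    """Hypothetical helper: one external encoding, native str inside."""

    def __init__(self, external: str) -> None:
        self.external = external

    def decode_request(self, data: bytes) -> str:
        # Incoming bytes become native strings once, at the edge.
        return data.decode(self.external)

    def encode_response(self, text: str) -> bytes:
        # Native strings become bytes once, on the way out.
        return text.encode(self.external)


codec = BoundaryCodec("utf-8")
incoming = b"Z\xc3\xbcrich"             # 7 bytes on the wire
name = codec.decode_request(incoming)   # 6 characters as a native string
shout = name.upper()                    # string methods are meaningful here
body = codec.encode_response(shout)
```

Inside the boundary, length and case conversion operate on characters;
applied to the raw bytes they would operate on encoding units instead.
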
>>  >>  >
>>  >>  > No, internally we ideally want only Squeak/Smalltalk encoding.
>>  >>  > Otherwise we can throw String away and just use ByteArray. The problem
>>  >>  > is that WideStrings are bugged and slow and for legacy reasons we have
>>  >>  > to support "null encoding". Everything else is insanity. Same goes for
>>  >>  > using utf-8 internally and utf-16 externally. Second, for some external
>>  >>  > parts (like URLs) the external encoding is given.
>>  >>
>>  >>  It doesn't appear quite that simple to me... if you have data in UTF
>>  >>  format in a database, you might well prefer to use UTF encoding
>>  >>  internally
>>  >
>>  > There is no such thing as UTF encoding. Using an encoding other than
>>  > Squeak fixes #= but breaks _every_ method except #,. The only reason
>>  > you might want this is to avoid the performance penalties of
>>  > WideString. But then again have you profiled your application and can
>>  > you prove to me that WideStrings are your performance bottleneck? Else
>>  > this is pure premature optimization.
>>  >
>>  >> (or at very least be able to specify the encoding of that
>>  >>  data when giving it to the canvas).
>>  >
>>  > No, you must adhere to the Seaside contract. You give Strings to
>>  > Seaside in the same encoding you expect Seaside to give Strings to
>>  > you. Everything else is a pure horror.
>>  >
>>  >> Squeak encoding doesn't
>>  >>  support anything outside basic accented characters, does it?
>>  >
>>  > Squeak supports a superset of Unicode including astral planes.
>>  >
>>  >> Same goes
>>  >>  for incoming form data if you have to put it in a database... you
>>  >>  don't want to be putting it in in Squeak encoding.
>>  >
>>  > That's between you and your database driver. That doesn't include
>>  > Seaside at all.
>>  >
>>  > I still think this belongs on seaside-dev.
>>  >
>>  > Cheers
>>  > Philippe
>>
>> > _______________________________________________
>>  > seaside mailing list
>>  > seaside at lists.squeakfoundation.org
>>  > http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside

