[Seaside-dev] Re: encoded stream

Philippe Marschall philippe.marschall at gmail.com
Thu Feb 5 08:22:58 UTC 2009


2009/2/1 Julian Fitzell <jfitzell at gmail.com>:
> On Thu, Jan 29, 2009 at 7:56 PM, Philippe Marschall
> <philippe.marschall at gmail.com> wrote:
>> I think there are several different issues:
>> 1. Should we make the configuration and its consequences more explicit?
>> 2. What should the configuration options look like?
>> 3. Should we provide some kind of accessor to what we think the
>> internal encoding is?
>> 4. Should we try to detect if incoming data is in the expected encoding?
>> 5. Should we support some non-Smalltalk encoding internally that is
>> different from the external encoding?
>> 6. Where should the encoding be specified?
>>
>> 1. Yes, "encoded server adapter" is not the best name ever.
>> 2. Dunno, see above.
>> 3. Sure, why not, for example:
>> context usesSmalltalkEncoding
>>   ifTrue: [ 'Smalltalk' ]
>>   ifFalse: [ server encoding ]
>
> Is there a reason we are calling it "Smalltalk"?

I needed a name ;-)

> Is that an accepted name for this?

I don't think so. Encoding has been generally disregarded by Smalltalk
(see ANSI), with dire consequences today.

> I can't even figure out what Smalltalk encoding is
> except that in Squeak it seems to be pretty close to Unicode but some
> characters don't seem to match up.

No, Squeak is Unicode plus 8 random high bits.

> Do we really just mean "native" encoding here?

Yes we mean native, where native means native to your Smalltalk
dialect and not native to your operating system. They may or may not
be the same depending on your Smalltalk dialect and your operating
system ;-)

>> 4. No, the browser does not tell us what encoding the data is in. I know
>> because all but the very latest version of Kom ("Dolphin & Monkey")
>> blow up if the browser sends the encoding. Additionally we have no way
>> of finding out what encoding incoming data is in by just looking at
>> it.
>
> That's not strictly true:

Yes it is.

>  * You told me that Opera does include the encoding

If you set up your page in a very special way (utf-16) and tell the
browser to send the data in a very special way (utf-16) and Opera
decides to send data in a different way (utf-8) then it will tell you.
That's how I found the bug. Every other browser will send in utf-8
without telling you, so relying on Opera's behavior is not possible.

>  * It is possible in many cases to detect the encoding. PHP does this.
> Or look at http://chardet.feedparser.org/ for example. So it *could*
> be done but...

Have you looked at their implementation and their FAQ? Their FAQ has an
entry "Isn't that impossible?" which tells you that it is impossible in
general; that's why they operate with probabilities. They have huge
tables with character frequencies for certain languages. They
additionally have the advantage that they operate on much bigger data
sets than we do. They have whole pages that they can analyze while we
have only short strings. We don't know the language of the input. And
if we detected that one string is likely not in the expected character
set, we would have to undo all previously done decoding and redo it
with a different character set. And the very best of it is that it
doesn't even work in Mozilla, where their code originally comes from.
It's not rare to get pages where apostrophes are wrong because the
author indicated a wrong character set and character set sniffing
didn't work.
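
To make the ambiguity concrete, here is a minimal sketch in Python (not
Seaside code; the helper name candidate_encodings is made up for this
illustration). It shows that a short byte string can decode without
error under several encodings, so the bytes alone do not tell you which
one the browser used.

```python
# Illustration only: the same short byte string is valid in several encodings.
raw = "é".encode("utf-8")          # the two bytes 0xC3 0xA9

as_utf8 = raw.decode("utf-8")      # one character
as_latin1 = raw.decode("latin-1")  # two characters, but still valid text


def candidate_encodings(data, encodings=("utf-8", "latin-1", "cp1252")):
    """Return every encoding from the list that decodes data without error.

    Hypothetical helper: a successful decode does not mean the guess is
    right, only that it cannot be ruled out.
    """
    ok = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


print(repr(as_utf8), repr(as_latin1))
print(candidate_encodings(raw))
```

All three candidates accept these two bytes. Picking between them would
need the kind of statistics over long texts that a detector relies on,
which short form fields do not provide.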

>  * If browsers are supposed to submit data in the same encoding as the
> page the form was on (or the character set specified on the form),

1. they are not
2. it doesn't matter what browsers are supposed to do, it only matters
what browsers do

> then we could record that information in the SessionContinuation so
> that when the callbacks were triggered we would know exactly what
> encoding was used for the page that generated that callback.
>
>> 5. No, until we actually have somebody who needs this I consider this
>> a purely theoretical use case.
>
> My point isn't that we need to support conversions to arbitrary
> encodings. My point is that it is easier to understand what is
> happening when you know what the two encodings *are*. We don't have to
> allow (for the moment) anything but UTF-8 for the external encoding
> when UTF-8 is specified as the internal encoding. I have no problem
> with that limitation for now.
>
> All I'm saying is if you know your options are "utf-8/native",
> "latin-1/native", or "utf-8/utf-8" then it is perfectly clear what is
> being sent to the browser *and* what you are supposed to be dealing
> with in your image.

If the options are like this, then I'm cool with it.
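
As a rough sketch of what such "external/internal" options could mean
in practice, here is some Python (again not Seaside code). The names
EncodingOption, decode_request and encode_response are hypothetical,
and the rule that the internal side is either "native" or identical to
the external side is an assumption taken from the limitation described
above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingOption:
    """Hypothetical 'external/internal' pair such as 'utf-8/native'."""
    external: str  # what goes over the wire, e.g. "utf-8" or "latin-1"
    internal: str  # "native" (decoded strings) or the same name as external

    @classmethod
    def parse(cls, text):
        external, internal = text.split("/")
        if internal != "native" and internal != external:
            raise ValueError("internal must be 'native' or match external")
        return cls(external, internal)

    def decode_request(self, data):
        # native: turn wire bytes into a real string
        # same as external: hand the bytes through untouched
        if self.internal == "native":
            return data.decode(self.external)
        return data

    def encode_response(self, value):
        # native: encode the string for the wire
        # same as external: the value is already wire bytes
        if self.internal == "native":
            return value.encode(self.external)
        return value


for name in ("utf-8/native", "latin-1/native", "utf-8/utf-8"):
    option = EncodingOption.parse(name)
    print(name, repr(option.decode_request(b"abc")))
```

With the pair spelled out, it is visible both what is sent to the
browser and what the application holds in memory.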

>> 6. Lukas made a pretty good argument for doing it in the server adapter.
>
> Well, I actually don't think it does make sense on the server adaptor
> in the long term. There is little reason to believe that all
> applications would necessarily be using the same encoding, let alone
> every request. JSP seems to provide setCharacterEncoding() on both
> their Request and Response objects and I'd rather see us do something
> along those lines (with a default specified per-application).

Have you ever used this? Do you know how many strings are attached to
this (pun)? Do you know that at the end of the day you still have to
configure it in the server to make it actually work, which means you
can't mix applications with different encodings in the same server?
Check out this bug:
https://issues.apache.org/bugzilla/show_bug.cgi?id=23929

> That said, I think it may be overkill to be starting on this now. I
> suggest Lukas finish his changes to have the Response encode on the
> fly using the codec specified in the server adaptor. We can revisit
> this again for the next release.

Sure
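
For readers unfamiliar with the idea, encoding on the fly can be
sketched like this in Python. This is not Lukas's implementation; the
class EncodedResponseStream and its methods are hypothetical, and the
codec name stands in for whatever the server adaptor is configured
with.

```python
import codecs
import io


class EncodedResponseStream:
    """Hypothetical wrapper that encodes text as it is written."""

    def __init__(self, sink, codec_name):
        self._sink = sink  # any object with a write(bytes) method
        self._encoder = codecs.getincrementalencoder(codec_name)()

    def write(self, text):
        # Encode each chunk immediately instead of building the whole
        # response as a string and converting it at the end.
        self._sink.write(self._encoder.encode(text))

    def finish(self):
        # Flush any state the encoder still holds.
        self._sink.write(self._encoder.encode("", final=True))


sink = io.BytesIO()
stream = EncodedResponseStream(sink, "utf-8")
stream.write("<p>caf")
stream.write("é</p>")
stream.finish()
print(sink.getvalue())
```

The application writes strings and the stream emits bytes in the
configured encoding, so the codec is chosen in one place.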

> It would be nice if we could find a way to address the confusion a bit
> in this release simply by improving naming or the way we present the
> encodings (as I suggest in reply to your #5, for example). If we can
> do this without changing the architecture, this would help pave the
> way for further improvements in a later release and not slow this one
> down any further.

Sure

Cheers
Philippe

