[Seaside-dev] WACodecTest>>testCodecUtf8ShortestForm

Mon Jun 29 19:36:39 UTC 2009

Philippe Marschall wrote:
> 2009/6/29 Michael Lucas-Smith <mlucas-smith at cincom.com>:
>   
>> Philippe Marschall wrote:
>> ....
>> But this attack is based on the idea that you would attempt to filter
>> certain words -before- you've decoded the UTF8.
>> That's insane. Period.
>>     
>
> It could also mean somebody at some point made a mistake, like forgetting
> to decode something when he should have, and somehow later some class
> that tries to be helpful fixes it up. That's not beyond imagination.
>
>   
>> I acknowledge the idea that it'd be nice to protect
>> our users from themselves.. hah. The post also mixes up illegal sequences
>> with non-shortest form - which the spec goes to pains to differentiate in
>> its verbiage.
>>
>> Maybe Java has decided that users don't want to decode UTF8 and therefore
>> it's a security risk, but I don't think that's necessarily the right thing
>> for us to do in Smalltalk.
>>
>> You won't get this kind of attack using Opentalk-HTTP ...unless you're using
>> Seaside with a WANullCodec. It's therefore possible to get this attack with
>> Seaside, but only if you're using WANullCodec - which from what I gather is
>> what everybody is using. However, it is also the intent to move off of
>> WANullCodec ...so crippling an otherwise correct UTF8 decoder to satisfy
>> WANullCodec would be the wrong thing to do.
>>     
>
> If the server has no bugs, which might well be the case with Opentalk.
> However some other server or implementation could have bugs (I
> wouldn't be surprised if my code has). I see it more as a safety net
> that is there if some other safety net breaks. It's not a big deal if
> the test isn't green, it's an expected failure on Squeak and probably
> will stay so for the foreseeable future.
>
>   
>> I'm all for rejecting the illegal sequences, but the spec is pretty specific
>> about non-shortest forms being parsable... and since when did we start
>> looking to Java for "the right thing to do" ? ;)
>>     
>
> They pretty much trash us when it comes to Unicode. And they have a
> stream hierarchy that's based on decoration and makes a clear separation
> between character oriented and byte oriented IO (which the compiler
> checks), in fact even between I and O. If I compare that with Squeak,
> well how does MultiByteBinaryOrTextStream sound?
>
>   
If they have byte arrays for encoded utf8 characters, then they 
shouldn't have the scenario described in the link.. ever. You can't do 
string operations on arrays of bytes. That's the case in VisualWorks 
too, our 'encoded' form is always a byte array, compared with our 
strings which are decoded and contain 'characters'. I really dislike the 
way Seaside puts "bytes" into strings.. it allows for bugs/security 
holes like this one and in general, is just wrong.
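
To make the scenario concrete, here is a minimal Python sketch (not Seaside or
VisualWorks code). It assumes a hypothetical decoder, lenient_utf8_decode,
that accepts non-shortest forms, and a filter that wrongly runs on the
still-encoded bytes. Both function names are made up for illustration.
Python's own built-in decoder rejects the same input, which is the
behaviour the test under discussion checks for.

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Hypothetical decoder that accepts non-shortest (overlong) forms."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, trail = lead, 0
        elif (lead & 0xE0) == 0xC0:
            cp, trail = lead & 0x1F, 1
        elif (lead & 0xF0) == 0xE0:
            cp, trail = lead & 0x0F, 2
        elif (lead & 0xF8) == 0xF0:
            cp, trail = lead & 0x07, 3
        else:
            raise ValueError(f"illegal lead byte at offset {i}")
        if i + trail >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, trail + 1):
            cont = data[i + k]
            if (cont & 0xC0) != 0x80:
                raise ValueError(f"illegal continuation byte at offset {i + k}")
            cp = (cp << 6) | (cont & 0x3F)
        # No shortest-form check here: that omission is the point of the sketch.
        out.append(chr(cp))
        i += trail + 1
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """True if Python's built-in UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# 0xC0 0xAF is a two-byte, non-shortest encoding of "/" (U+002F).
raw = b"..\xc0\xaf..\xc0\xafetc"

# A filter applied to the encoded bytes finds no slash...
slash_seen_before_decoding = b"/" in raw
# ...but the lenient decoder produces one afterwards.
decoded = lenient_utf8_decode(raw)
```

The filter only fails because it inspects bytes as if they were characters;
run on the decoded string, it would see the slashes either way.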

As far as I understood it, the only real change to support this is to 
require the adaptors to expect bytes to come out of a Seaside handler. 
If you think I'm being unreasonable, you should have a chat with our 
Opentalk engineers who really dislike how Seaside tries to do more of 
HTTP than they believe it should - such as encoding at all.

We can't please everyone, I grok that, that's fine - but breaking the 
UTF8 parser because we have issues with how bytes are stored in the 
Smalltalk image is just fixing the wrong thing in the wrong place IMHO.

I've changed the test and published it last week(?) now and no one has 
unpublished the change ... If no one objects, I'd like to keep it that 
way.. I feel more confident following the spec for utf8 than following 
java. We can fix the potential security hole by storing bytes in byte 
arrays and characters in strings.. I don't think this is unreasonable.
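
As a rough illustration of "bytes in byte arrays, characters in strings",
the Python sketch below keeps the two types apart so that a word filter
cannot be handed undecoded input. The names decode_request and
contains_forbidden are hypothetical and do not correspond to any Seaside or
Opentalk API.

```python
def decode_request(body: bytes) -> str:
    """Turn encoded bytes into characters; the only place decoding happens."""
    if not isinstance(body, (bytes, bytearray)):
        raise TypeError("decode_request expects encoded bytes")
    return bytes(body).decode("utf-8")


def contains_forbidden(text: str, words) -> bool:
    """String operations are only allowed on decoded characters."""
    if not isinstance(text, str):
        raise TypeError("contains_forbidden expects a decoded string")
    return any(word in text for word in words)


# Decoding first, filtering second.
hit = contains_forbidden(decode_request("a/b".encode("utf-8")), ["/"])
```

Passing the raw bytes straight to contains_forbidden raises TypeError
instead of quietly filtering the encoded form.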

Michael