[Seaside-dev] WACodecTest>>testCodecUtf8ShortestForm

Tue Jun 30 05:23:55 UTC 2009

2009/6/29 Michael Lucas-Smith <mlucas-smith at cincom.com>:
> Philippe Marschall wrote:
>>
>> 2009/6/22 Michael Lucas-Smith <mlucas-smith at cincom.com>:
>>
>>>
>>> Hi All,
>>>
>>> This test has a non-shortest form for the characters 'abc'.
>>>
>>
>> It should be 'ABC'.
>>
>
> Yep that's what the code tests for.
>>
>>
>>>
>>> The specification says
>>> that you should reject -illegal- sequences, but non-shortest are okay.
>>> You're not meant to -generate- non-shortest, so perhaps the test should be
>>> flipped to make sure the shortest form is produced when encoding
>>> 'ABC'.
>>> However, this seems a little redundant as I can't imagine any Smalltalker
>>> would go out of his/her way to make a UTF8 encoder that produces the
>>> non-shortest of the letters ABC.
>>>
>>
>> If you wanted to attack a system, e.g. bypass certain word filters, it
>> might well be the case [1]
>>
>>  [1] http://blogs.sun.com/xuemingshen/entry/the_big_overhaul_of_java
>>
>
> But this attack is based on the idea that you would attempt to filter
> certain words -before- you've decoded the UTF8.
> That's insane. Period. I acknowledge the idea that it'd be nice to protect
> our users from themselves.. hah. The post also mixes up illegal sequences
> with non-shortest form - which the spec goes to pains to differentiate in
> its verbiage.
>
> Maybe Java has decided that users don't want to decode UTF8 and therefore
> it's a security risk, but I don't think that's necessarily the right thing
> for us to do in Smalltalk.
>
> You won't get this kind of attack using Opentalk-HTTP ...unless you're using
> Seaside with a WANullCodec. It's therefore possible to get this attack with
> Seaside, but only if you're using WANullCodec - which from what I gather is
> what everybody is using. However, it is also the intent to move off of
> WANullCodec ...so crippling an otherwise correct UTF8 decoder to satisfy
> WANullCodec would be the wrong thing to do.
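
For concreteness, here is a minimal Python sketch of the attack being discussed: a word filter that runs on the raw bytes, followed by a decoder that accepts non-shortest forms. Both `lenient_decode` and `raw_filter_blocks` are hypothetical names made up for this illustration; neither is Seaside, Opentalk or Java code, and the decoder only handles one- and two-byte sequences.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical permissive decoder (1- and 2-byte sequences only).

    It performs no shortest-form check, so it accepts overlong encodings.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and 0x80 <= data[i + 1] <= 0xBF):
            # Two-byte sequence: 5 payload bits, then 6 payload bits.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed sequence at %d" % i)
    return "".join(out)


def raw_filter_blocks(data: bytes, banned: bytes = b"ABC") -> bool:
    """Hypothetical word filter applied to the bytes *before* decoding."""
    return banned in data


# 'A', 'B', 'C' each padded out to a two-byte (non-shortest) form.
overlong_abc = bytes([0xC1, 0x81, 0xC1, 0x82, 0xC1, 0x83])

blocked = raw_filter_blocks(overlong_abc)   # the filter sees no b"ABC"
decoded = lenient_decode(overlong_abc)      # yet the decoder yields "ABC"
```

The filter passes the input because the bytes `41 42 43` never appear, while the permissive decoder still turns it into the filtered word. Filtering after decoding, or rejecting the overlong input in the decoder, each close the gap on their own.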

The corrigendum linked in the method comment [1] says:

"To address this issue, the Unicode Technical Committee has modified
the definition of UTF-8 to forbid conformant implementations from
interpreting non-shortest forms for BMP characters, and clarified some
of the conformance clauses."

"(b) When a process interprets data in a Unicode Transformation
Format, it shall treat illegal code unit sequences as an error
condition."
"(c) A conformant process shall not interpret illegal UTF code unit
sequences as characters."

"The problematic "non-shortest form" byte sequences in UTF-8 were
those where BMP characters could be represented in more than one way.
These sequences are illegal, since they are not allowed by Table
3.1B."

So no, I don't think your parser is conformant.
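
To make the requirement concrete, a rough Python sketch of a decoder that treats non-shortest forms as errors could look like the following. The name `strict_decode` and the `MIN_CODE_POINT` table are assumptions made up for this illustration, not code from Seaside or any VisualWorks parser. The sketch rejects overlong sequences by checking the decoded value against the smallest code point that needs that many bytes, which is one common way to get the effect of the byte-range table; it is not a transcription of the table itself.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def strict_decode(data: bytes) -> str:
    """Hypothetical strict UTF-8 decoder: non-shortest forms raise ValueError."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, cp = 1, lead
        elif 0xC0 <= lead <= 0xDF:
            length, cp = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, cp = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("illegal lead byte at offset %d" % i)

        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any(not 0x80 <= t <= 0xBF for t in tail):
            raise ValueError("bad continuation bytes at offset %d" % i)
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)

        # The shortest-form check: the value must really need this many bytes.
        if cp < MIN_CODE_POINT[length]:
            raise ValueError("non-shortest form at offset %d" % i)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("code point out of range at offset %d" % i)

        out.append(chr(cp))
        i += length
    return "".join(out)
```

With this check, `strict_decode(b"ABC")` returns `'ABC'`, while the two-byte form of 'A' (`C1 81`) raises `ValueError` instead of being interpreted as a character. Python's built-in decoder behaves the same way for that input: `bytes([0xC1, 0x81]).decode("utf-8")` raises `UnicodeDecodeError`.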

 [1] http://www.unicode.org/versions/corrigendum1.html

Cheers
Philippe