[Seaside-dev] Seaside 3.0a6ish

Philippe Marschall philippe.marschall at gmail.com
Thu May 20 09:12:47 UTC 2010


2010/5/20 Philippe Marschall <philippe.marschall at gmail.com>:
> 2010/5/19 Michael Lucas-Smith <mlucas-smith at cincom.com>:
>> On 5/19/10 10:19 AM, Paolo Bonzini wrote:
>>>
>>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>>>
>>>> Can someone speak to the platforms that have trouble with #= here?
>>>
>>> GNU Smalltalk has problems comparing an encoded string with #latin1String.
>>>  The problem is that the GRCodecTest>>#asString: method does not store the
>>> encoding of the string in its result, so GNU Smalltalk assumes it is in the
>>> default encoding (typically UTF-8).  Then when "self latin1String" has to be
>>> compared with an ISO-8859-1 string (the output of "codec encode: self
>>> decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8
>>> sequence in "self latin1String".
>>>
>>> Comparing bytearrays instead takes encodings out of the picture and works.
>>>
>>> VisualWorks seems to have the opposite problem.  #encode: needs to know
>>> what encoding was applied in order to convert to raw bytes.  This seems to
>>> be a bug to me.  The #encode:-d representation should contain the raw bytes,
>>> not the Unicode characters.
>>
>> I think there's a misunderstanding here somewhere. The generic String object
>> (subclasses ByteString, TwoByteString, FourByteString) represents Unicode
>> characters. This is completely independent of any encoding at all. We have
>> some specific ByteEncodedString subclasses for ISO8859L1 and MSCP1252 but
>> they don't really enter the picture here.
>>
>> The terminology is important here, maybe that's where we're struggling -
>> when you start off with bytes, the bytes cannot represent their encoding
>> (auto-detecting is a fool's game) so you must -decode- the bytes into
>> characters. Once that is done, we have our String object. The String object
>> does not have a link back to the original bytes, nor does it know what
>> encoding you used to create the characters - nor should it. To turn the
>> string back into bytes, you have to -encode- the characters into bytes
>> using an encoding.
>>>
>>> So, I could fix it by adding a platform-specific hack to #asString:, but
>>> it seems wrong.  Can you check what breaks if you return a ByteArray from
>>> your codec's #encode: method?
>>>
>> The expectation of the GRCodec is that, unfortunately, you will get back a
>> String object no matter whether you're doing an encode: or a decode:. I
>> would *LOVELOVELOVE* to return a ByteArray when you call #encode: - but this
>> has never worked because Pharo/Squeak could never do it. Maybe this has
>> changed, but none of the code that calls #decode: gives us a ByteArray - all
>> of the tests, examples, and actual Seaside code pass in a String containing
>> characters representing bytes. I would really love it if the API had a
>> contract like this:
>>
>> GRCodec>>encode: (String)
>>    ^(ByteArray)
>>
>> GRCodec>>decode: (ByteArray)
>>    ^(String)
>
> I agree. I believe the first one is doable, I'll hack together a
> prototype today. The second is more tricky because the servers
> themselves (Comanche and Swazoo) already give us a String which is
> actually just a ByteArray.
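
The contract proposed above (characters in, bytes out for encode:, and the
reverse for decode:) can be illustrated with a minimal sketch. Python is used
only as neutral notation because it keeps str (characters) and bytes strictly
apart; the class name Codec and the helper is_valid_utf8 are hypothetical
stand-ins, not Grease or Seaside code.

```python
class Codec:
    """Hypothetical stand-in for a codec with the proposed contract."""

    def __init__(self, encoding: str) -> None:
        self.encoding = encoding

    def encode(self, text: str) -> bytes:
        # Characters in, raw bytes out.
        return text.encode(self.encoding)

    def decode(self, data: bytes) -> str:
        # Raw bytes in, characters out.
        return data.decode(self.encoding)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1 = Codec("iso-8859-1")
utf8 = Codec("utf-8")

text = "M\u00fcller"                 # "Müller" as characters
latin1_bytes = latin1.encode(text)   # b'M\xfcller'
utf8_bytes = utf8.encode(text)       # b'M\xc3\xbcller'
```

With this split, comparing latin1_bytes against another byte sequence never
involves a default encoding, which matches the observation that comparing
bytearrays takes encodings out of the picture. The Latin-1 bytes are also not
valid UTF-8 (is_valid_utf8(latin1_bytes) is False), which is the kind of
failure described for a platform that assumes UTF-8.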

There's one trouble point:
WAUrlEncoder >> #nextPutAll:

The trouble is we first need to convert a URL to bytes and then
interpret these bytes as Latin-1 and do percent encoding accordingly.
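
A rough sketch of those two steps, in Python as neutral notation rather than
the actual WAUrlEncoder code: the function name percent_encode and the
constant UNRESERVED are hypothetical, and the unreserved set is the one from
RFC 3986, which may not match what Seaside treats as safe.

```python
# Bytes that RFC 3986 allows to appear unescaped (the "unreserved" set).
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(text: str, encoding: str = "utf-8") -> str:
    """Percent-encode text: first characters -> bytes, then bytes -> escapes."""
    out = []
    # Step 1: convert the characters to bytes with the chosen encoding.
    for byte in text.encode(encoding):
        if byte in UNRESERVED:
            # Step 2a: chr(byte) reads the byte as a Latin-1 character,
            # i.e. the character whose code point equals the byte value.
            out.append(chr(byte))
        else:
            # Step 2b: every other byte becomes %XX.
            out.append("%{:02X}".format(byte))
    return "".join(out)


print(percent_encode("caf\u00e9 au lait"))       # caf%C3%A9%20au%20lait
print(percent_encode("caf\u00e9", "iso-8859-1"))  # caf%E9
```

The point of the sketch is that the percent-encoding step only ever sees byte
values, so it needs the bytes from the codec before it can do anything.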

Cheers
Philippe

