[Seaside-dev] Seaside 3.0a6ish

Thu May 20 05:32:03 UTC 2010

2010/5/19 Michael Lucas-Smith <mlucas-smith at cincom.com>:
> On 5/19/10 10:19 AM, Paolo Bonzini wrote:
>>
>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>>
>>> Can someone speak to the platforms that have trouble with #= here?
>>
>> GNU Smalltalk has problems comparing an encoded string with #latin1String.
>>  The problem is that the GRCodecTest>>#asString: method does not store the
>> encoding of the string in its result, so GNU Smalltalk assumes it is in the
>> default encoding (typically UTF-8).  Then when "self latin1String" has to be
>> compared with an ISO-8859-1 string (the output of "codec encode: self
>> decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8
>> sequence in "self latin1String".
>>
>> Comparing bytearrays instead takes encodings out of the picture and works.
>>
>> VisualWorks seems to have the opposite problem.  #encode: needs to know
>> what encoding was applied in order to convert to raw bytes.  This seems to
>> be a bug to me.  The #encode:-d representation should contain the raw bytes,
>> not the Unicode characters.
>
> I think there's a misunderstanding here somewhere. The generic String object
> (subclasses ByteString, TwoByteString, FourByteString) represents Unicode
> characters. This is completely independent of any encoding at all. We have
> some specific ByteEncodedString subclasses for ISO8859L1 and MSCP1252 but
> they don't really enter the picture here.
>
> The terminology is important here, maybe that's where we're struggling -
> when you start off with bytes, the bytes cannot represent their encoding
> (auto-detecting is a fool's game) so you must -decode- the bytes into
> characters. Once that is done, we have our String object. The String object
> does not have a link back to the original bytes, nor does it know what
> encoding you used to create the characters - nor should it. To turn the
> string back into bytes, you have to -encode- the characters into bytes
> using an encoding.
>>
>> So, I could fix it by adding a platform-specific hack to #asString:, but
>> it seems wrong.  Can you check what breaks if you return a ByteArray from
>> your codec's #encode: method?
>>
> The expectation of the GRCodec is that, unfortunately, you will get back a
> String object no matter whether you're doing an encode: or a decode:. I
> would *LOVELOVELOVE* to return a ByteArray when you call #encode: - but this
> has never worked because Pharo/Squeak could never do it. Maybe this has
> changed, but none of the code that calls #decode: gives us a ByteArray - all
> of the tests, examples, and actual Seaside code pass in a String containing
> characters representing bytes. I would really love it if the API had a
> contract like this:
>
> GRCodec>>encode: (String)
>    ^(ByteArray)
>
> GRCodec>>decode: (ByteArray)
>    ^(String)
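
For illustration only, the contract quoted above can be sketched in Python
rather than Smalltalk. The Codec class and its method names are hypothetical
and are not the Grease API; Python's str stands in for String and bytes for
ByteArray. The sketch also shows why comparing at the byte level avoids the
encoding question, and why Latin-1 bytes cannot be read back as UTF-8.

```python
class Codec:
    """Hypothetical stand-in for a GRCodec-like object (not the Grease API)."""

    def __init__(self, encoding: str) -> None:
        self.encoding = encoding

    def encode(self, text: str) -> bytes:
        # Characters in, raw bytes out: the result carries no encoding tag.
        return text.encode(self.encoding)

    def decode(self, data: bytes) -> str:
        # Raw bytes in, characters out: the caller has to name the encoding.
        return data.decode(self.encoding)


latin1 = Codec("iso-8859-1")
utf8 = Codec("utf-8")

text = "caf\u00e9"                      # "cafe" with an acute accent on the e
latin1_bytes = latin1.encode(text)      # one byte per character
utf8_bytes = utf8.encode(text)          # the accented character takes two bytes

# A round trip compared at the byte level needs no encoding knowledge.
same_bytes = latin1_bytes == latin1.encode(latin1.decode(latin1_bytes))

# Reading Latin-1 bytes as if they were UTF-8 hits an invalid sequence.
try:
    utf8.decode(latin1_bytes)
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
```

With this shape, the result of encode can never be mistaken for text in some
default encoding, because it is not text at all.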

I agree. I believe the first one is doable; I'll hack together a
prototype today. The second is trickier because the servers
themselves (Comanche and Swazoo) already give us a String which is
actually just a ByteArray.
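
A minimal sketch of what "a String which is actually just a ByteArray" can
mean, again in Python as a neutral illustration. It assumes the server's
string holds one character per byte, each with a code point from 0 to 255;
whether Comanche and Swazoo behave exactly this way is an assumption here,
and both helper names are hypothetical.

```python
def bytes_to_byte_string(data: bytes) -> str:
    # ISO-8859-1 maps byte n to code point n, so no byte value is lost.
    return data.decode("iso-8859-1")


def byte_string_to_bytes(byte_string: str) -> bytes:
    # Inverse mapping; raises UnicodeEncodeError if any character is
    # above code point 255, i.e. if the string is not really bytes.
    return byte_string.encode("iso-8859-1")


# UTF-8 bytes of "cafe" with an acute accent, as they would arrive on the wire.
raw = "caf\u00e9".encode("utf-8")

# What a server of this kind would hand over: five characters, not four.
wire = bytes_to_byte_string(raw)

# Recover the real bytes first, then decode them with the real encoding.
recovered = byte_string_to_bytes(wire).decode("utf-8")
```

Under that assumption, a decode: that wants a ByteArray could accept such a
string by first mapping each character back to its byte value.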

Cheers
Philippe