[Seaside-dev] Seaside 3.0a6ish

Michael Lucas-Smith mlucas-smith at cincom.com
Wed May 19 18:07:22 UTC 2010


On 5/19/10 10:19 AM, Paolo Bonzini wrote:
> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>
>> Can someone speak to the platforms that have trouble with #= here?
>
> GNU Smalltalk has problems comparing an encoded string with 
> #latin1String.  The problem is that the GRCodecTest>>#asString: method 
> does not store the encoding of the string in its result, so GNU 
> Smalltalk assumes it is in the default encoding (typically UTF-8).  
> Then when "self latin1String" has to be compared with an ISO-8859-1 
> string (the output of "codec encode: self decodedString"), GNU 
> Smalltalk fails because it finds an invalid UTF-8 sequence in "self 
> latin1String".
>
> Comparing bytearrays instead takes encodings out of the picture and 
> works.
>
> VisualWorks seems to have the opposite problem.  #encode: needs to 
> know what encoding was applied in order to convert to raw bytes.  This 
> seems to be a bug to me.  The #encode:-d representation should contain 
> the raw bytes, not the Unicode characters.
I think there's a misunderstanding here somewhere. The generic String 
object (with its subclasses ByteString, TwoByteString and FourByteString) 
represents unicode characters. This is completely independent of any 
encoding at all. We have some specific ByteEncodedString subclasses for 
ISO8859L1 and MSCP1252, but they don't really enter the picture here.

The terminology is important here; maybe that's where we're struggling. 
When you start off with bytes, the bytes cannot tell you their own 
encoding (auto-detecting is a fool's game), so you must -decode- the 
bytes into characters. Once that is done, we have our String object. The 
String object does not have a link back to the original bytes, nor does 
it know what encoding was used to create the characters - nor should it. 
To turn the string back into bytes, you have to -encode- the characters 
into bytes using an encoding.
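
To make the direction concrete, here is a tiny worked example in 
pseudo-Smalltalk ($é is code point 16rE9; the Converter class and the 
#decode:as: / #encode:as: selectors are made up purely for illustration 
and are not a real VisualWorks or Grease API):

     | bytes string |
     "two raw bytes -- the UTF-8 form of $é (16rC3 16rA9)"
     bytes := #[195 169].
     "decoding turns bytes plus an encoding into characters"
     string := Converter decode: bytes as: #utf8.    "-> 'é', one character"
     "the String keeps no link back to the bytes and no memory of #utf8;
      to get bytes again you must pick an encoding and -encode-"
     Converter encode: string as: #latin1.           "-> #[233], i.e. 16rE9"
     Converter encode: string as: #utf8.             "-> #[195 169] again"
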
>
> So, I could fix it by adding a platform-specific hack to #asString:, 
> but it seems wrong.  Can you check what breaks if you return a 
> ByteArray from your codec's #encode: method?
>
The expectation of the GRCodec is that, unfortunately, you will get back 
a String object no matter whether you're doing an #encode: or a #decode:. 
I would *LOVELOVELOVE* to return a ByteArray when you call #encode: - 
but this has never worked, because Pharo/Squeak could never do it. Maybe 
that has changed, but none of the code that calls #decode: gives us a 
ByteArray - all of the tests, the examples and the actual Seaside code 
pass in a String containing characters representing bytes. I would 
really love it if the API had a contract like this:

GRCodec>>encode: (String)
     ^(ByteArray)

GRCodec>>decode: (ByteArray)
     ^(String)

That would be absolutely ideal conceptually (a call site under that 
contract is sketched below), but I suspect we'd be bucking against:
     a) lots of existing code that currently works and that people 
would rather not break
     b) the Squeak/Pharo adaptors' existing expectations, which would 
require a fair bit of work to fix
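
A call site under that ideal contract would read roughly like this 
(only a sketch; GRCodec class>>#forEncoding: is the existing factory 
selector, the return types are the hypothetical ones from the contract 
above):

     | codec bytes string |
     codec := GRCodec forEncoding: 'utf-8'.
     bytes := codec encode: 'Héllo'.     "would answer a ByteArray of raw UTF-8 bytes"
     string := codec decode: bytes.      "would answer a String of unicode characters"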

Instead, the API works like this:

GRCodec>>encode: (String containing unicode characters)
     ^(String containing characters representing bytes)

GRCodec>>decode: (String containing characters representing bytes)
     ^(String containing unicode characters)
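
Concretely, today's round trip looks something like this (a sketch, not 
code lifted from Seaside; the literal is only for illustration):

     | codec encoded decoded |
     codec := GRCodec forEncoding: 'utf-8'.
     encoded := codec encode: 'é'.
     "encoded is a two-character ByteString whose character values are
      195 and 169 (16rC3 16rA9) -- characters standing in for the UTF-8 bytes"
     decoded := codec decode: encoded.
     "decoded is once again a one-character String equal to 'é'"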

This is the reality we're in right now, and rather than rock the boat 
too much, I have an obligation to make Seaside work in the least 
disruptive way possible. I'm all in favor of changing the contract, but 
only if everyone else is too. So for now, I accept your accusation that 
#encode: seemingly returns a String incorrectly, but I'd throw back that 
that's exactly what the API expects.

The two tests in question push on opposite ends of the problem: 
#testCodecLatin1 tests the encoded bytes, while #testCodecUtf8Bom tests 
the decoded characters. In the case of #testCodecLatin1 you can send 
#asByteArray to the ByteString containing characters representing bytes, 
because none of the characters goes over a value of 255 -- it is pure 
happenstance that this works, and we fully intend to one day deprecate 
String>>#asByteArray in VisualWorks.

#testCodecUtf8Bom does the opposite: it wants to compare strings 
containing unicode characters, and in VisualWorks we therefore end up 
with a TwoByteString, to which you cannot send #asByteArray. This is how 
I first noticed the problem.
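
For reference, the shape of the two comparisons is roughly this 
(paraphrased, not the actual GRCodecTest source; #latin1String and 
#decodedString are the helpers Paolo mentions, #utf8BomString is just a 
stand-in name, and codec stands for the codec under test):

     "testCodecLatin1 compares on the encoded side.  Sending #asByteArray
      to both sides, as Paolo suggests, takes encodings out of the
      picture -- and in VisualWorks it only happens to work because
      every character value involved is <= 255."
     self assert: (codec encode: self decodedString) asByteArray
                = self latin1String asByteArray.

     "testCodecUtf8Bom compares on the decoded side.  Both sides are
      Strings of unicode characters, and in VisualWorks one of them ends
      up a TwoByteString, to which you cannot send #asByteArray."
     self assert: (codec decode: self utf8BomString) = self decodedString.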

Oh, as a small side note, the #name API is inconsistent: #testCodecLatin1 
expects the name to be case-insensitive, while #testCodecUtf8 expects 
the name to be lowercase. I'm not sure how this 'came to be', but it's 
impossible for me to make both scenarios pass consistently :)

Michael

