[squeak-dev] Re: Zero bytes in Multilingual package

Bert Freudenberg bert at freudenbergs.de
Thu Sep 3 21:07:02 UTC 2009


On 03.09.2009, at 22:41, Nicolas Cellier wrote:

> That reminds me of http://bugs.squeak.org/view.php?id=5996

Ah, thanks! That's implementing Andreas' suggestion #1 below.

Does anyone know whether this was integrated into any MC version? The
ticket doesn't say.

- Bert -

> There are some other bugs sleeping there like this one:
>
> http://lists.gforge.inria.fr/pipermail/pharo-project/2009-May/008994.html
> http://code.google.com/p/pharo/issues/detail?id=830
>
> SystemDictionary>>#condenseChanges uses StandardFileStream when it
> had better not...
>
> Nicolas
>
> 2009/9/3 Bert Freudenberg <bert at freudenbergs.de>:
>>
>> On 02.09.2009, at 13:45, Bert Freudenberg wrote:
>>
>>>
>>> On 02.09.2009, at 07:28, Andreas Raab wrote:
>>>
>>>> Hi Bert -
>>>>
>>>> I figured it out, but you won't like it. The problem comes from a
>>>> combination of things going wrong. First, you are right, there are
>>>> non-Latin characters in the source. This causes the MCWriter to
>>>> silently go WideString when it writes source.st. The resulting
>>>> WideString gets passed into ZipArchive, which compresses it in
>>>> chunks of 4k. The funny thing is that when you pull 4k chunks out
>>>> of a WideString, it reduces the result to ByteString again if it
>>>> can fit into Latin1. Meaning that only those definitions that
>>>> happen to fall into the same 4k chunk that contains a non-Latin
>>>> character get screwed up (excuse me for a second while I walk out
>>>> and shoot myself).
>>>>
>>>> Ah, feeling better now. This is why nobody ever noticed it: it
>>>> won't affect all of the stuff, and since MC is reasonably smart and
>>>> doesn't need the source too often, screw-ups of the source do not
>>>> get noticed.
>>>>
>>>> I think there is a solution though, namely having the writer check
>>>> whether the source is wide and if so use UTF-8 instead. The big
>>>> issue is backwards compatibility though. I can see three
>>>> approaches:
>>>>
>>>> 1) Write a BOM marker in front of any UTF8 encoded source.st file.
>>>> This will work for any Monticello version which is aware of the
>>>> BOM; for the others YMMV (it depends on whether you're on 3.8 or
>>>> later - it *should* be okay for those but I haven't tested).
>>>>
>>>> 2) Assume all source is UTF8 all the time and allow conversion
>>>> errors to pass through assuming Latin-1. This will work both ways
>>>> (older Monticellos would get multiple characters in some situations
>>>> but be otherwise unaffected) at the cost of not detecting possibly
>>>> incorrect encodings in the file (which isn't a terrible choice
>>>> since the zip file has a CRC).
>>>>
>>>> 3) Write two versions of the source, one in snapshot/source and one
>>>> in snapshot.utf8/source. Works both ways too, at the cost of
>>>> doubling disk space requirements.
>>>>
>>>> One thing to keep in mind here is that MCDs may only work with #2
>>>> unless the servers get updated. I think we should also consult with
>>>> other MC users to ensure future compatibility. FWIW, my vote is
>>>> with option #2.
>>>>
>>>> Cheers,
>>>> - Andreas
>>>
>>>
>>> Yes, go UTF-8. This is precisely one of the backwards compatibility
>>> problems UTF-8 was designed to work around. In fact I had thought we
>>> did this already, must be an omission in our MC version.
>>>
>>> - Bert -
>>
>>
>> Looking closer into this I understand what you mean and why you
>> didn't fix it right away. It's a mess.
>>
>> I started by writing tests for MCStReader and MCStWriter but later
>> realized that's testing the wrong thing. The stream to file out and
>> in is created in the test, and the stream class used is actually what
>> we need to change.
>>
>> So I tried to change
>>
>>        RWBinaryOrTextStream on: String new.
>> to
>>        MultiByteBinaryOrTextStream on: String new encoding: 'utf-8'
>>
>> in MCStWriterTest>>setUp but it's not a drop-in replacement, I get 7
>> test failures from that change alone.
>>
>> E.g.:
>>        (RWBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents
>> gives
>>        'Hi'
>> whereas
>>        (MultiByteBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents
>> answers
>>        ''
>>
>> Giving up for now.
>>
>> - Bert -
>>
>>
>>
>>
>
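
The chunking failure Andreas describes in the quoted message can be
modelled outside of Squeak. The following is only a toy sketch in Python,
not Monticello or ZipArchive code. It assumes (the thread does not spell
this out) that a chunk which still contains a non-Latin-1 character is
written as 4 bytes per character, while a chunk that fits into Latin-1 is
written as 1 byte per character. The function names and the sample source
are made up for illustration.

```python
# Toy model of the 4k chunking problem; not actual Monticello code.
# ASSUMPTION: "wide" chunks are serialized as 4 big-endian bytes per
# character, "narrow" chunks as 1 byte per character.

CHUNK_SIZE = 4096  # the 4k chunk size mentioned in the message


def write_in_chunks(source: str, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Serialize chunk by chunk, narrowing each chunk when it fits Latin-1."""
    out = bytearray()
    for start in range(0, len(source), chunk_size):
        chunk = source[start:start + chunk_size]
        if all(ord(ch) < 256 for ch in chunk):
            out += chunk.encode("latin-1")    # 1 byte per character
        else:
            out += chunk.encode("utf-32-be")  # 4 bytes per character
    return bytes(out)


def write_as_utf8(source: str) -> bytes:
    """Serialize the whole source with one consistent encoding."""
    return source.encode("utf-8")


# Hypothetical source: three chunks, one wide character in the middle one.
source = (
    "a" * CHUNK_SIZE
    + "b" * 10 + "\u4e2d" + "c" * (CHUNK_SIZE - 11)
    + "d" * CHUNK_SIZE
)
mixed = write_in_chunks(source)
```

In this model only the middle chunk is written wide, so the zero bytes
show up there and nowhere else, which matches the observation that only
definitions sharing a chunk with a non-Latin character are damaged.
Writing the whole source as UTF-8 avoids the per-chunk switch entirely.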

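Options #1 and #2 from the quoted message can be sketched the same way.
Again this is a Python illustration under stated assumptions, not the
Monticello reader or writer: encode_source, decode_source and
decode_legacy are hypothetical names, and decode_legacy stands in for an
older reader that treats every byte as Latin-1.

```python
import codecs


def encode_source(source: str, with_bom: bool = False) -> bytes:
    """Write source as UTF-8, optionally prefixed with a BOM (option #1)."""
    data = source.encode("utf-8")
    return codecs.BOM_UTF8 + data if with_bom else data


def decode_source(data: bytes) -> str:
    """Read source as UTF-8, falling back to Latin-1 on errors (option #2)."""
    if data.startswith(codecs.BOM_UTF8):
        # An explicit BOM marks the rest of the file as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume an old Latin-1 file.
        return data.decode("latin-1")


def decode_legacy(data: bytes) -> str:
    """Stand-in for an older reader that assumes Latin-1 throughout."""
    return data.decode("latin-1")
```

A new reader round-trips both new UTF-8 files and old Latin-1 files. An
old reader still loads a UTF-8 file but sees each non-ASCII character as
two or more Latin-1 characters, which is the "multiple characters"
effect mentioned under option #2.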




