[squeak-dev] Re: Zero bytes in Multilingual package

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Thu Sep 3 20:41:22 UTC 2009


That reminds me of http://bugs.squeak.org/view.php?id=5996


There are some other bugs sleeping there, like these:

http://lists.gforge.inria.fr/pipermail/pharo-project/2009-May/008994.html
http://code.google.com/p/pharo/issues/detail?id=830

SystemDictionary>>#condenseChanges uses StandardFileStream when it
really shouldn't...
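
A possible fix direction (a hypothetical, untested sketch; MultiByteFileStream
and UTF8TextConverter are Squeak's converter-aware stream classes, but the
exact wiring inside #condenseChanges would need checking):

        "instead of"
        stream := StandardFileStream fileNamed: aFileName.
        "use something like"
        stream := MultiByteFileStream fileNamed: aFileName.
        stream converter: UTF8TextConverter new.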

Nicolas

2009/9/3 Bert Freudenberg <bert at freudenbergs.de>:
>
> On 02.09.2009, at 13:45, Bert Freudenberg wrote:
>
>>
>> On 02.09.2009, at 07:28, Andreas Raab wrote:
>>
>>> Hi Bert -
>>>
>>> I figured it out, but you won't like it. The problem comes from a
>>> combination of things going wrong. First, you are right, there are non-Latin
>>> characters in the source. This causes the MCWriter to silently go WideString
>>> when it writes source.st. The resulting WideString gets passed into
>>> ZipArchive which compresses it in chunks of 4k. The funny thing is that when
>>> you pull 4k chunks out of a WideString it reduces the result to ByteString
>>> again if it can fit into Latin1. Meaning that only those definitions that
>>> happen to fall into the same 4k chunk that contains a non-Latin character
>>> get screwed up (excuse me for a second while I walk out and shoot myself).
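>>>
>>> A minimal sketch of that species reduction (hypothetical snippet, not
>>> from the original message; the exact classes answered may vary by
>>> Squeak version):
>>>
>>>        | wide chunk |
>>>        wide := (String with: (Character value: 16r5000)) , 'abcd'.
>>>        wide class.                       "WideString"
>>>        chunk := wide copyFrom: 2 to: 5.  "chunk holds only Latin-1 characters"
>>>        chunk class.                      "may come back as ByteString"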
>>>
>>> Ah, feeling better now. This is why nobody ever noticed it, because it
>>> won't affect all of the stuff and since MC is reasonably smart and doesn't
>>> need the source too often, screw-ups of the source do not get noticed.
>>>
>>> I think there is a solution though, namely having the writer check
>>> whether the source is wide and if so use utf-8 instead. The big
>>> issue is backwards compatibility though. I can see three approaches:
>>>
>>> 1) Write a BOM marker in front of any UTF8 encoded source.st file. This
>>> will work for any Monticello version which is aware of the BOM; for the
>>> others YMMV (it depends on whether you're on 3.8 or later - it *should* be
>>> okay for those but I haven't tested).
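>>>
>>> (For reference, the UTF-8 BOM is the byte sequence 16rEF 16rBB 16rBF,
>>> written once before the encoded source. A writer sketch, selector name
>>> made up:)
>>>
>>>        writeBOMOn: aStream
>>>                "Emit the UTF-8 byte order mark"
>>>                aStream
>>>                        nextPut: (Character value: 16rEF);
>>>                        nextPut: (Character value: 16rBB);
>>>                        nextPut: (Character value: 16rBF)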
>>>
>>> 2) Assume all source is UTF8 all the time and allow conversion errors to
>>> pass through assuming Latin-1. This will work both ways (older Monticellos
>>> would get multiple characters in some situations but be otherwise
>>> unaffected) at the cost of not detecting possibly incorrect encodings in the
>>> file (which isn't a terrible choice since the zip file has a CRC).
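>>>
>>> (Option #2 amounts to something like this on the reading side - a
>>> hypothetical sketch; String>>#utf8ToSqueak exists in Squeak, but the
>>> fallback wrapper here is made up:)
>>>
>>>        decode: aByteString
>>>                ^ [aByteString utf8ToSqueak]
>>>                        on: Error "ideally a more specific conversion error"
>>>                        do: [:e | aByteString "fall back: treat the bytes as Latin-1"]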
>>>
>>> 3) Write two versions of the source, one in snapshot/source one in
>>> snapshot.utf8/source. Works both ways too, at the cost of doubling the disk
>>> space requirements.
>>>
>>> One thing to keep in mind here is that MCDs may only work with #2 unless
>>> the servers get updated. I think we should also consult with other MC users
>>> to ensure future compatibility. FWIW, my vote is with option #2.
>>>
>>> Cheers,
>>> - Andreas
>>
>>
>> Yes, go UTF-8. This is precisely one of the backwards compatibility
>> problems UTF-8 was designed to work around. In fact I had thought we did
>> this already; it must be an omission in our MC version.
>>
>> - Bert -
>
>
> Looking closer into this I understand what you mean and why you didn't fix
> it right away. It's a mess.
>
> I started by writing tests for MCStReader and MCStWriter but later realized
> it's testing the wrong thing. The stream to file out and in is created in
> the test, and the stream class used is actually what we need to change.
>
> So I tried to change
>
>        RWBinaryOrTextStream on: String new.
> to
>        MultiByteBinaryOrTextStream on: String new encoding: 'utf-8'
>
> in MCStWriterTest>>setUp, but it's not a drop-in replacement; I get 7 test
> failures from that change alone.
>
> E.g.:
>        (RWBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents
> gives
>        'Hi'
> whereas
>        (MultiByteBinaryOrTextStream on: String new) nextPutAll: 'Hi';
> contents
> answers
>        ''
>
> Giving up for now.
>
> - Bert -


