[squeak-dev] Re: Zero bytes in Multilingual package
Andreas Raab
andreas.raab at gmx.de
Wed Sep 2 05:28:49 UTC 2009
Hi Bert -
I figured it out, but you won't like it. The problem comes from a
combination of things going wrong. First, you are right, there are
non-Latin characters in the source. This causes the MCWriter to silently
go WideString when it writes source.st. The resulting WideString gets
passed into ZipArchive, which compresses it in chunks of 4k. The funny
thing is that when you pull 4k chunks out of a WideString, each chunk is
reduced to ByteString again if it can fit into Latin-1. Meaning that only
those definitions that happen to fall into the same 4k chunk as a
non-Latin character get screwed up (excuse me for a second
while I walk out and shoot myself).
Ah, feeling better now. This is why nobody ever noticed it: it doesn't
affect everything, and since MC is reasonably smart and doesn't need the
source too often, screw-ups in the source go unnoticed.
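To make the mechanism concrete, here is the effect mimicked in Python rather than in the actual String/ZipArchive code (the chunk size is real; the choice of latin-1 vs. utf-32-be merely stands in for ByteString vs. WideString, one byte vs. four bytes per character):

```python
def write_chunked(text, chunk_size=4096):
    """Mimic the bug: each 4k chunk is written narrow (1 byte/char)
    if it happens to fit Latin-1, and wide (4 bytes/char) otherwise."""
    out = bytearray()
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        try:
            out += chunk.encode('latin-1')    # chunk narrowed to ByteString
        except UnicodeEncodeError:
            out += chunk.encode('utf-32-be')  # chunk stays WideString
    return bytes(out)

# A full chunk of plain ASCII, then a chunk containing one non-Latin char.
source = 'a' * 4096 + 'b' * 100 + '\u3042' + 'c' * 100
data = write_chunked(source)

# First chunk: 1 byte per char. Second chunk: 4 bytes per char, so the
# ASCII characters sharing it pick up three null bytes each -- exactly
# the "three null bytes and the char code" pattern in the broken files.
assert len(data) == 4096 + 4 * 201
assert data[4096:4100] == b'\x00\x00\x00b'
```

Only the chunk that happens to contain the non-Latin character comes out wide; every other chunk is written normally, which is why the corruption starts and stops mid-method.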
I think there is a solution though, namely having the writer check
whether the source is wide and if so use UTF-8 instead. The big
issue is backwards compatibility though. I can see three approaches:
1) Write a BOM marker in front of any UTF8 encoded source.st file. This
will work for any Monticello version which is aware of the BOM; for the
others YMMV (it depends on whether you're on 3.8 or later - it *should*
be okay for those but I haven't tested).
2) Treat all source as UTF-8 all the time and let conversion errors
pass through, falling back to Latin-1. This will work both ways (older
Monticellos would get multiple characters in some situations but
be otherwise unaffected) at the cost of not detecting possibly incorrect
encodings in the file (which isn't a terrible loss since the zip file
has a CRC).
3) Write two versions of the source, one in snapshot/source and one in
snapshot.utf8/source. Works both ways too, at the cost of doubling the
disk space requirements.
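For illustration, options #1 and #2 can be sketched like this (Python as neutral pseudocode; the function names are mine, not Monticello's):

```python
# UTF-8 byte order mark, the signature option #1 would prepend.
BOM = b'\xef\xbb\xbf'

def encode_source(text):
    """Option #1 on the writing side: plain Latin-1 while the source is
    narrow, BOM + UTF-8 once it goes wide."""
    try:
        return text.encode('latin-1')
    except UnicodeEncodeError:
        return BOM + text.encode('utf-8')

def decode_source(data):
    """Option #2 on the reading side: strip a BOM if present, try UTF-8,
    and let conversion errors fall back to Latin-1. Note this cannot
    detect a Latin-1 file that happens to be valid UTF-8 -- the
    'incorrect encodings' caveat above."""
    if data.startswith(BOM):
        return data[len(BOM):].decode('utf-8')
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('latin-1')

assert decode_source(encode_source('ASCII only')) == 'ASCII only'
assert decode_source(encode_source('wide: \u3042')) == 'wide: \u3042'
assert decode_source('caf\xe9'.encode('latin-1')) == 'caf\xe9'
```

The round trip works for narrow source, wide source, and legacy Latin-1 files written by an older Monticello.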
One thing to keep in mind here is that MCDs may only work with #2 unless
the servers get updated. I think we should also consult with other MC
users to ensure future compatibility. FWIW, my vote is with option #2.
Cheers,
- Andreas
Bert Freudenberg wrote:
> On 16.08.2009, at 17:07, Bert Freudenberg wrote:
>
>> On 16.08.2009, at 05:15, Andreas Raab wrote:
>>
>>> Ian Trudel wrote:
>>>> Another issue but with the trunk. I have tried to update code from the
>>>> trunk into my image but there's a proxy error with source.squeak.org
>>>> right at this minute, which causes Squeak to freeze for a minute or
>>>> two trying to reach the server.
>>>
>>> It seems to be fine now. Probably just a temporary issue.
>>
>> There were three debuggers open in the squeaksource image when I
>> looked today. The problem comes from the source server trying to parse
>> Multilingual-ar.38 and Multilingual-sn.38. They contain sections of
>> code where each character is stored as a long instead of a byte (that
>> is, three null bytes and the char code). I've copied the relevant
>> portion out of the .mcz's source.st, see attachment. If you try to
>> open a changelist browser on that file, you get the same parse error.
>>
>> I have no idea how these widened characters made it into the mcz's
>> source.st file. In particular since this starts in the middle of a
>> method (and of a class comment). It extends over a few chunks, then
>> reverts back to a regular encoding. Strange.
>>
>> JapaneseEnvironment class>>isBreakableAt:in: looks suspicious though
>> I'm not sure if it is actually broken or not.
>>
>> I then looked into the trunk's changes file. It has this problem too,
>> though apparently only in the class comment of LanguageEnvironment.
>>
>> "LanguageEnvironment comment string asByteArray" contains this:
>>
>> 116 104 114 101 101 32 99 97 110 32 104 97 118 101 32 40 97 110 100 32
>> 100 111 101 115 32 104 97 118 101 41 32 100 105 102 102 101 114 101
>> 110 116 32 101 110 99 111 100 105 110 103 115 46 32 32 83 0 0 0 111 0
>> 0 0 32 0 0 0 119 0 0 0 101 0 0 0 32 0 0 0 110 0 0 0 101 0 0 0 101 0 0
>> 0 100 0 0 0 32 0 0 0 116 0 0 0 111 0 0 0 32 0 0 0 109 0 0 0 97 0 0 0
>> 110 0 0 0 97 0 0 0 103 0 0 0 101 0 0 0 32 0 0 0 116 0 0 0 104 0 0 0
>> 101 0 0 0 109 0 0 0 32 0 0 0 115 0 0 0 101 0 0 0 112 0 0 0 97 0 0 0
>> 114 0 0 0 97 0 0 0 116 0 0 0 101 0 0 0 108 0 0 0 121 0 0 0 46 0 0 0 32
>> 0 0 0 32 0 0 0 78 0 0 0 111 0 0 0 116 0 0 0 101 0 0 0 32 0 0 0 116 0 0
>> 0 104 0 0 0 97 0 0 0 116 0 0 0 32 0 0 0 116 0 0 0 104 0 0 0 101 0 0 0
>> 32 0 0 0 101 0 0 0 110 0 0 0 99 0 0 0 111 0 0 0 100 0 0 0 105 0 0 0
>> 110 0 0 0 103 0 0 0 32 0 0 0 105 0 0 0 110 0 0 0 32 0 0 0 97 0 0 0 32
>> 0 0 0 102 0 0 0 105 0 0 0 108 0 0 0 101 0 0 0 32 0 0 0 99 0 0 0 97 0 0
>> 0 110 0 0 0 32 0 0 0 98 0 0 0
>>
>> Increasingly strange. So I removed the null bytes from the class
>> comment and published as Multilingual-bf.39. After updating they are
>> indeed gone from the comment. But looking at the source.st in that mcz
>> shows the encoding problem again. Bummer.
>>
>> Something very strange is going on. I'm out of ideas (short of
>> debugging into the MCZ save process).
>>
>> - Bert -
>
>
> *ping*
>
> Problem occurred again today. Will likely happen every time someone
> touches the Multilingual package?
>
> - Bert -
More information about the Squeak-dev mailing list