[squeak-dev] Re: Zero bytes in Multilingual package
Andreas Raab
andreas.raab at gmx.de
Wed Sep 2 05:28:49 UTC 2009
Hi Bert -
I figured it out, but you won't like it. The problem comes from a
combination of things going wrong. First, you are right, there are
non-Latin characters in the source. This causes the MCWriter to silently
go WideString when it writes source.st. The resulting WideString gets
passed into ZipArchive, which compresses it in chunks of 4k. The funny
thing is that when you pull 4k chunks out of a WideString, each chunk is
reduced to ByteString again if it can fit into Latin-1. Meaning that only
those definitions that happen to fall into the same 4k chunk as a
non-Latin character get screwed up (excuse me for a second
while I walk out and shoot myself).
Ah, feeling better now. This is why nobody ever noticed it: it doesn't
affect everything, and since MC is reasonably smart and doesn't need the
source too often, screw-ups in the source go unnoticed.
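To make the mechanism concrete, here is the effect mimicked in Python rather than in the actual String/ZipArchive code (the chunk size is real; the choice of latin-1 vs. utf-32-be merely stands in for ByteString vs. WideString, one byte vs. four bytes per character):

```python
def write_chunked(text, chunk_size=4096):
    """Mimic the bug: each 4k chunk is written narrow (1 byte/char)
    if it happens to fit Latin-1, and wide (4 bytes/char) otherwise."""
    out = bytearray()
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        try:
            out += chunk.encode('latin-1')    # chunk narrowed to ByteString
        except UnicodeEncodeError:
            out += chunk.encode('utf-32-be')  # chunk stays WideString
    return bytes(out)

# A full chunk of plain ASCII, then a chunk containing one non-Latin char.
source = 'a' * 4096 + 'b' * 100 + '\u3042' + 'c' * 100
data = write_chunked(source)

# First chunk: 1 byte per char. Second chunk: 4 bytes per char, so the
# ASCII characters sharing it pick up three null bytes each -- exactly
# the "three null bytes and the char code" pattern in the broken files.
assert len(data) == 4096 + 4 * 201
assert data[4096:4100] == b'\x00\x00\x00b'
```

Only the chunk that happens to contain the non-Latin character comes out wide; every other chunk is written normally, which is why the corruption starts and stops mid-method.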
I think there is a solution though, namely having the writer check
whether the source is wide and if so use UTF-8 instead. The big
issue is backwards compatibility though. I can see three approaches:
1) Write a BOM marker in front of any UTF8 encoded source.st file. This
will work for any Monticello version which is aware of the BOM; for the
others YMMV (it depends on whether you're on 3.8 or later - it *should*
be okay for those but I haven't tested).
2) Treat all source as UTF-8 all the time and let conversion errors
pass through, falling back to Latin-1. This will work both ways (older
Monticellos would get multiple characters in some situations but
be otherwise unaffected) at the cost of not detecting possibly incorrect
encodings in the file (which isn't a terrible loss since the zip file
has a CRC).
3) Write two versions of the source, one in snapshot/source and one in
snapshot.utf8/source. Works both ways too, at the cost of doubling the
disk space requirements.
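For illustration, options #1 and #2 can be sketched like this (Python as neutral pseudocode; the function names are mine, not Monticello's):

```python
# UTF-8 byte order mark, the signature option #1 would prepend.
BOM = b'\xef\xbb\xbf'

def encode_source(text):
    """Option #1 on the writing side: plain Latin-1 while the source is
    narrow, BOM + UTF-8 once it goes wide."""
    try:
        return text.encode('latin-1')
    except UnicodeEncodeError:
        return BOM + text.encode('utf-8')

def decode_source(data):
    """Option #2 on the reading side: strip a BOM if present, try UTF-8,
    and let conversion errors fall back to Latin-1. Note this cannot
    detect a Latin-1 file that happens to be valid UTF-8 -- the
    'incorrect encodings' caveat above."""
    if data.startswith(BOM):
        return data[len(BOM):].decode('utf-8')
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('latin-1')

assert decode_source(encode_source('ASCII only')) == 'ASCII only'
assert decode_source(encode_source('wide: \u3042')) == 'wide: \u3042'
assert decode_source('caf\xe9'.encode('latin-1')) == 'caf\xe9'
```

The round trip works for narrow source, wide source, and legacy Latin-1 files written by an older Monticello.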
One thing to keep in mind here is that MCDs may only work with #2 unless
the servers get updated. I think we should also consult with other MC
users to ensure future compatibility. FWIW, my vote is with option #2.
Cheers,
- Andreas
Bert Freudenberg wrote:
> On 16.08.2009, at 17:07, Bert Freudenberg wrote:
>
>> On 16.08.2009, at 05:15, Andreas Raab wrote:
>>
>>> Ian Trudel wrote:
>>>> Another issue but with the trunk. I have tried to update code from the
>>>> trunk into my image but there's a proxy error with source.squeak.org
>>>> right at this minute, which causes Squeak to freeze for a minute or
>>>> two trying to reach the server.
>>>
>>> It seems to be fine now. Probably just a temporary issue.
>>
>> There were three debuggers open in the squeaksource image when I
>> looked today. The problem comes from the source server trying to parse
>> Multilingual-ar.38 and Multilingual-sn.38. They contain sections of
>> code where each character is stored as a long instead of a byte (that
>> is, three null bytes and the char code). I've copied the relevant
>> portion out of the .mcz's source.st, see attachment. If you try to
>> open a changelist browser on that file, you get the same parse error.
>>
>> I have no idea how these widened characters made it into the mcz's
>> source.st file. In particular since this starts in the middle of a
>> method (and of a class comment). It extends over a few chunks, then
>> reverts back to a regular encoding. Strange.
>>
>> JapaneseEnvironment class>>isBreakableAt:in: looks suspicious though
>> I'm not sure if it is actually broken or not.
>>
>> I then looked into the trunk's changes file. It has this problem too,
>> though apparently only in the class comment of LanguageEnvironment.
>>
>> "LanguageEnvironment comment string asByteArray" contains this:
>>
>> 116 104 114 101 101 32 99 97 110 32 104 97 118 101 32 40 97 110 100 32
>> 100 111 101 115 32 104 97 118 101 41 32 100 105 102 102 101 114 101
>> 110 116 32 101 110 99 111 100 105 110 103 115 46 32 32 83 0 0 0 111 0
>> 0 0 32 0 0 0 119 0 0 0 101 0 0 0 32 0 0 0 110 0 0 0 101 0 0 0 101 0 0
>> 0 100 0 0 0 32 0 0 0 116 0 0 0 111 0 0 0 32 0 0 0 109 0 0 0 97 0 0 0
>> 110 0 0 0 97 0 0 0 103 0 0 0 101 0 0 0 32 0 0 0 116 0 0 0 104 0 0 0
>> 101 0 0 0 109 0 0 0 32 0 0 0 115 0 0 0 101 0 0 0 112 0 0 0 97 0 0 0
>> 114 0 0 0 97 0 0 0 116 0 0 0 101 0 0 0 108 0 0 0 121 0 0 0 46 0 0 0 32
>> 0 0 0 32 0 0 0 78 0 0 0 111 0 0 0 116 0 0 0 101 0 0 0 32 0 0 0 116 0 0
>> 0 104 0 0 0 97 0 0 0 116 0 0 0 32 0 0 0 116 0 0 0 104 0 0 0 101 0 0 0
>> 32 0 0 0 101 0 0 0 110 0 0 0 99 0 0 0 111 0 0 0 100 0 0 0 105 0 0 0
>> 110 0 0 0 103 0 0 0 32 0 0 0 105 0 0 0 110 0 0 0 32 0 0 0 97 0 0 0 32
>> 0 0 0 102 0 0 0 105 0 0 0 108 0 0 0 101 0 0 0 32 0 0 0 99 0 0 0 97 0 0
>> 0 110 0 0 0 32 0 0 0 98 0 0 0
>>
>> Increasingly strange. So I removed the null bytes from the class
>> comment and published as Multilingual-bf.39. After updating they are
>> indeed gone from the comment. But looking at the source.st in that mcz
>> shows the encoding problem again. Bummer.
>>
>> Something very strange is going on. I'm out of ideas (short of
>> debugging into the MCZ save process).
>>
>> - Bert -
>
>
> *ping*
>
> Problem occurred again today. Will likely happen every time someone
> touches the Multilingual package?
>
> - Bert -
More information about the Squeak-dev mailing list