[squeak-dev] Re: Zero bytes in Multilingual package

Wed Sep 2 11:45:45 UTC 2009

On 02.09.2009, at 07:28, Andreas Raab wrote:

> Hi Bert -
>
> I figured it out, but you won't like it. The problem comes from a  
> combination of things going wrong. First, you are right, there are  
> non-Latin characters in the source. This causes the MCWriter to  
> silently go WideString when it writes source.st. The resulting  
> WideString gets passed into ZipArchive which compresses it in chunks  
> of 4k. The funny thing is that when you pull 4k chunks out of a  
> WideString it reduces the result to ByteString again if it can fit  
> into Latin-1. Meaning that only those definitions that happen to fall  
> into the same 4k chunk that contains a non-Latin character get  
> screwed up (excuse me for a second while I walk out and shoot myself).
>
> Ah, feeling better now. This is why nobody ever noticed it, because  
> it won't affect all of the stuff and since MC is reasonably smart  
> and doesn't need the source too often, screw-ups of the source do  
> not get noticed.
>
> I think there is a solution though, namely having the writer check  
> whether the source is wide and, if so, use UTF-8 instead. The  
> big issue is backwards compatibility though. I can see three  
> approaches:
>
> 1) Write a BOM in front of any UTF-8 encoded source.st file.  
> This will work for any Monticello version which is aware of the BOM;  
> for the others YMMV (it depends on whether you're on 3.8 or later -  
> it *should* be okay for those but I haven't tested).
>
> 2) Assume all source is UTF-8 all the time and allow conversion  
> errors to pass through assuming Latin-1. This will work both ways  
> (older Monticellos would get multiple characters in some  
> situations but be otherwise unaffected) at the cost of not detecting  
> possibly incorrect encodings in the file (which isn't a terrible  
> choice since the zip file has a CRC).
>
> 3) Write two versions of the source, one in snapshot/source, one in  
> snapshot.utf8/source. Works both ways too, at the cost of doubling  
> disk space requirements.
>
> One thing to keep in mind here is that MCDs may only work with #2  
> unless the servers get updated. I think we should also consult with  
> other MC users to ensure future compatibility. FWIW, my vote is with  
> option #2.
>
> Cheers,
>  - Andreas
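
To make the chunking problem described above concrete, here is a rough Python model of it (not Squeak code). The names encode_chunk and write_in_chunks are hypothetical, and the sketch assumes that a wide chunk is written as four bytes per character while a chunk that fits into Latin-1 is written as one byte per character. It also uses a tiny chunk size so the effect is easy to see.

```python
# Hypothetical model of the behaviour described in the mail; not the
# actual MCWriter / ZipArchive code.
CHUNK = 4096  # chunk size in characters (the "4k" from the description)


def encode_chunk(chunk: str) -> bytes:
    """Write one chunk: narrow (1 byte/char) if it fits Latin-1,
    otherwise wide (assumed here: 4 bytes/char, big-endian)."""
    if all(ord(c) < 256 for c in chunk):
        return chunk.encode("latin-1")
    return chunk.encode("utf-32-be")


def write_in_chunks(source: str, chunk_size: int = CHUNK) -> bytes:
    """Cut the source into chunks and encode each chunk on its own."""
    pieces = []
    for start in range(0, len(source), chunk_size):
        pieces.append(encode_chunk(source[start:start + chunk_size]))
    return b"".join(pieces)


# Twenty characters, one of them non-Latin, cut into chunks of ten.
source = "a" * 10 + "\u3042" + "b" * 9
out = write_in_chunks(source, chunk_size=10)

first, second = out[:10], out[10:]
print(len(first), len(second))
print(b"\x00" in first, b"\x00" in second)
```

Only the chunk holding the non-Latin character comes out wide, which is where the zero bytes come from; the chunk before it is plain single-byte text.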

Yes, go UTF-8. This is precisely one of the backwards compatibility  
problems UTF-8 was designed to work around. In fact I had thought we  
did this already; it must be an omission in our MC version.
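
For illustration, option #2 could look roughly like the following Python sketch (again not Monticello code). The names write_source and read_source are hypothetical, and the fallback here applies to the whole file rather than to individual characters, which is a simplification of "allow conversion errors to pass through".

```python
# A minimal sketch of option #2, under the assumptions stated above.

def write_source(text: str) -> bytes:
    """Always write source as UTF-8."""
    return text.encode("utf-8")


def read_source(data: bytes) -> str:
    """Assume UTF-8; if the bytes are not valid UTF-8, treat the
    file as Latin-1, which is what older writers would have produced."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


new_file = write_source("x := '\u3042'")
old_file = "caf\u00e9".encode("latin-1")

print(read_source(new_file) == "x := '\u3042'")
print(read_source(old_file) == "caf\u00e9")

# An older reader that only knows Latin-1 sees two characters
# where the UTF-8 writer stored one non-ASCII character.
print(len(write_source("\u00e9").decode("latin-1")))
```

Pure ASCII source reads the same either way, so only definitions containing non-ASCII characters look different to an older reader.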

- Bert -