[squeak-dev] Re: Zero bytes in Multilingual package
Bert Freudenberg
bert at freudenbergs.de
Wed Sep 2 11:45:45 UTC 2009
On 02.09.2009, at 07:28, Andreas Raab wrote:
> Hi Bert -
>
> I figured it out, but you won't like it. The problem comes from a
> combination of things going wrong. First, you are right, there are
> non-Latin characters in the source. This causes the MCWriter to
> silently go WideString when it writes source.st. The resulting
> WideString gets passed into ZipArchive which compresses it in chunks
> of 4k. The funny thing is that when you pull 4k chunks out of a
> WideString it reduces the result to ByteString again if it can fit
> into Latin1. Meaning that only those definitions that happen to fall
> into the same 4k chunk that contains a non-Latin character get
> screwed up (excuse me for a second while I walk out and shoot myself).
>
> Ah, feeling better now. This is why nobody ever noticed it, because
> it won't affect all of the stuff and since MC is reasonably smart
> and doesn't need the source too often, screw-ups of the source do
> not get noticed.
>
> I think there is a solution though, namely having the writer check
> whether the source is wide and if so use UTF-8 instead. The
> big issue is backwards compatibility though. I can see three
> approaches:
>
> 1) Write a BOM marker in front of any UTF8 encoded source.st file.
> This will work for any Monticello version which is aware of the BOM;
> for the others YMMV (it depends on whether you're on 3.8 or later -
> it *should* be okay for those but I haven't tested).
>
> 2) Assume all source as UTF8 all the time and allow conversion
> errors to pass through assuming Latin-1. This will work both ways
> (older Monticellos would get multiple characters in some
> situations but be otherwise unaffected) at the cost of not detecting
> possibly incorrect encodings in the file (which isn't a terrible
> choice since the zip file has a CRC).
>
> 3) Write two versions of the source, one in snapshot/source, one in
> snapshot.utf8/source. Works both ways, too, at the cost of doubling
> disk space requirements.
>
> One thing to keep in mind here is that MCDs may only work with #2
> unless the servers get updated. I think we should also consult with
> other MC users to ensure future compatibility. FWIW, my vote is with
> option #2.
>
> Cheers,
> - Andreas
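
The failure mode described in the quoted message can be reproduced in miniature. The following is a rough Python sketch, not Monticello or ZipArchive code: the function name write_in_chunks is hypothetical, and it assumes that a chunk containing a wide character is dumped as raw 32-bit code points while a chunk that fits in Latin-1 is written as one byte per character.

```python
CHUNK_SIZE = 4096  # characters per chunk, matching the "4k" figure in the message


def write_in_chunks(source: str, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Serialize `source` chunk by chunk, mimicking the described behaviour."""
    out = bytearray()
    for start in range(0, len(source), chunk_size):
        chunk = source[start:start + chunk_size]
        if all(ord(ch) < 256 for ch in chunk):
            # Chunk fits in Latin-1: written as one byte per character.
            out += chunk.encode("latin-1")
        else:
            # Assumption: a wide chunk is written as raw 32-bit code points,
            # which fills it with zero bytes for every ASCII character.
            out += chunk.encode("utf-32-be")
    return bytes(out)


# One non-Latin character in the middle of otherwise plain ASCII source.
source = "a" * 5000 + "\u3042" + "b" * 5000
data = write_in_chunks(source)

clean_part = data[:CHUNK_SIZE]                               # first chunk
damaged_part = data[CHUNK_SIZE:CHUNK_SIZE + 4 * CHUNK_SIZE]  # second chunk
```

Under these assumptions only the second chunk, the one holding the non-Latin character, contains zero bytes; the chunks before and after it come out as ordinary single-byte text, which would explain why the damage is easy to miss.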
Yes, go UTF-8. This is precisely one of the backwards compatibility
problems UTF-8 was designed to work around. In fact I had thought we
did this already, must be an omission in our MC version.
- Bert -
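
For illustration, option #2 from the quoted message (read everything as UTF-8, fall back to Latin-1 on conversion errors) might look roughly like the Python sketch below. This is one possible reading of that option, not the Monticello implementation: decode_source is a hypothetical name, and falling back for the whole file rather than per character is an assumption.

```python
def decode_source(data: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1 if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the whole file is treated as Latin-1 on any error.
        return data.decode("latin-1")


new_file = "caf\u00e9 \u3042".encode("utf-8")  # as an updated writer would store it
old_file = "caf\u00e9".encode("latin-1")       # as an older writer would store it

from_new = decode_source(new_file)  # valid UTF-8, decoded as such
from_old = decode_source(old_file)  # invalid UTF-8, so the Latin-1 fallback applies

# What a reader that only knows Latin-1 sees when given the UTF-8 file:
# each non-ASCII character turns into several characters, ASCII is untouched.
legacy_view = new_file.decode("latin-1")
```

The trade-off is the one named in the quoted message: a Latin-1 file whose bytes happen to form valid UTF-8 would be decoded as UTF-8 without any error being raised.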