[squeak-dev] Re: Zero bytes in Multilingual package

Bert Freudenberg bert at freudenbergs.de
Thu Sep 3 14:30:17 UTC 2009


On 02.09.2009, at 13:45, Bert Freudenberg wrote:

>
> On 02.09.2009, at 07:28, Andreas Raab wrote:
>
>> Hi Bert -
>>
>> I figured it out, but you won't like it. The problem comes from a  
>> combination of things going wrong. First, you are right, there are  
>> non-Latin characters in the source. This causes the MCWriter to  
>> silently go WideString when it writes source.st. The resulting  
>> WideString gets passed into ZipArchive, which compresses it in  
>> chunks of 4k. The funny thing is that when you pull 4k chunks out  
>> of a WideString, the result is reduced to ByteString again if it  
>> fits into Latin-1. Meaning that only those definitions that happen  
>> to fall into the same 4k chunk as a non-Latin character get  
>> screwed up (excuse me for a second while I walk out and shoot  
>> myself).
>>
>> Ah, feeling better now. This is also why nobody ever noticed it:  
>> it doesn't affect everything, and since MC is reasonably smart and  
>> doesn't need the source too often, screwed-up source simply goes  
>> unnoticed.
>>
>> I think there is a solution, namely having the writer check  
>> whether the source is wide and, if so, use UTF-8 instead. The big  
>> issue, though, is backwards compatibility. I can see three  
>> approaches:
>>
>> 1) Write a BOM marker in front of any UTF-8-encoded source.st  
>> file. This will work for any Monticello version that is aware of  
>> the BOM; for the others YMMV (it depends on whether you're on 3.8  
>> or later - it *should* be okay for those, but I haven't tested).
>>
>> 2) Treat all source as UTF-8 all the time and let conversion  
>> errors pass through by assuming Latin-1. This will work both ways  
>> (older Monticellos would get multiple characters in some  
>> situations but be otherwise unaffected) at the cost of not  
>> detecting possibly incorrect encodings in the file (which isn't a  
>> terrible trade-off, since the zip file has a CRC).
>>
>> 3) Write two versions of the source, one in snapshot/source and  
>> one in snapshot.utf8/source. Works both ways, too, at the cost of  
>> doubling disk space requirements.
>>
>> One thing to keep in mind here is that MCDs may only work with #2  
>> unless the servers get updated. I think we should also consult with  
>> other MC users to ensure future compatibility. FWIW, my vote is  
>> with option #2.
>>
>> Cheers,
>> - Andreas
>
>
> Yes, go UTF-8. This is precisely one of the backwards-compatibility  
> problems UTF-8 was designed to work around. In fact I had thought  
> we did this already; it must be an omission in our MC version.
>
> - Bert -


Looking more closely into this, I understand what you mean and why  
you didn't fix it right away. It's a mess.
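
To make the failure mode concrete, here is roughly what your  
description translates to - a sketch, untested, and I did not check  
in which method of the #next:/#copyFrom:to: chain the reduction to  
ByteString actually happens:

	| source stream chunk1 chunk2 |
	"5001 characters; the single non-Latin character sits in the
	second 4k chunk"
	source := WideString new: 5001 withAll: $a.
	source at: 5001 put: (Character value: 16r4E2D).
	stream := ReadStream on: source.
	chunk1 := stream next: 4096.	"all Latin-1 - comes back as ByteString"
	chunk2 := stream upToEnd.	"holds the wide char - stays WideString"

If the ByteString chunks deflate as clean Latin-1 bytes while the  
WideString chunks go in as their raw four-bytes-per-character  
representation, that would also explain the zero bytes that started  
this thread.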
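
And for the record, option #2 would boil down to something like the  
following - hypothetical method names, untested, and assuming  
#utf8ToSqueak signals an error on malformed input:

	writeSource: aString on: aStream
		"Wide source goes out as UTF-8; plain ByteStrings are
		written unchanged"
		aString class == WideString
			ifTrue: [aStream nextPutAll: aString squeakToUtf8]
			ifFalse: [aStream nextPutAll: aString]

	readSourceFrom: aStream
		"Try UTF-8 first; on a conversion error fall back to
		Latin-1, i.e. take the bytes as-is"
		| raw |
		raw := aStream upToEnd.
		^[raw utf8ToSqueak]
			on: Error
			do: [:ex | raw]

(Catching Error is too broad for real code, of course, but it shows  
the idea.)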

I started by writing tests for MCStReader and MCStWriter but later  
realized they were testing the wrong thing. The stream used to file  
out and in is created in the test itself, and that stream class is  
exactly what we need to change.

So I tried to change

	RWBinaryOrTextStream on: String new.
to
	MultiByteBinaryOrTextStream on: String new encoding: 'utf-8'

in MCStWriterTest>>setUp, but it's not a drop-in replacement: I get  
7 test failures from that change alone.

E.g.:
	(RWBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents
gives
	'Hi'
whereas
	(MultiByteBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents
answers
	''
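
A possible interim workaround - untested, and side-stepping the  
stream class rather than fixing it - might be to keep  
RWBinaryOrTextStream and convert explicitly at the boundary  
(wideSource standing in for whatever the writer produces):

	| stream |
	stream := RWBinaryOrTextStream on: String new.
	stream nextPutAll: wideSource squeakToUtf8.	"encode before writing"
	stream reset.
	stream upToEnd utf8ToSqueak	"decode after reading back"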

Giving up for now.

- Bert -




