On 03.09.2009, at 22:41, Nicolas Cellier wrote:
That reminds me http://bugs.squeak.org/view.php?id=5996
Ah, thanks! That's implementing Andreas' suggestion #1 below.
Does someone know if this was integrated in any MC version? The ticket doesn't say.
- Bert -
There are some other bugs sleeping there like this one:
http://lists.gforge.inria.fr/pipermail/pharo-project/2009-May/008994.html
http://code.google.com/p/pharo/issues/detail?id=830
SystemDictionary>>#condenseChanges uses StandardFileStream when it really shouldn't...
Nicolas
2009/9/3 Bert Freudenberg bert@freudenbergs.de:
On 02.09.2009, at 13:45, Bert Freudenberg wrote:
On 02.09.2009, at 07:28, Andreas Raab wrote:
Hi Bert -
I figured it out, but you won't like it. The problem comes from a combination of things going wrong. First, you are right: there are non-Latin characters in the source. This causes the MCWriter to silently go WideString when it writes source.st. The resulting WideString gets passed into ZipArchive, which compresses it in chunks of 4k. The funny thing is that when you pull 4k chunks out of a WideString, the result is reduced to ByteString again if it fits into Latin-1. Meaning that only those definitions that happen to fall into the same 4k chunk as a non-Latin character get screwed up (excuse me for a second while I walk out and shoot myself).
Ah, feeling better now. This is why nobody ever noticed it: it doesn't affect everything, and since MC is reasonably smart and doesn't need the source too often, screw-ups in the source go unnoticed.
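To make the failure mode concrete, here is a minimal sketch of the WideString reduction Andreas describes (Squeak class names; whether and exactly where the chunking reduces back to ByteString depends on the image version, so treat this as illustrative, not definitive):

```smalltalk
| wide chunk |
"Mixing in a single non-Latin character forces the whole string wide."
wide := 'abc' , (String with: (Character value: 1025)).   "Cyrillic Ё"
wide class.                  "WideString"

"Pulling out a chunk that happens to contain only Latin-1
 characters can come back as a ByteString again."
chunk := wide copyFrom: 1 to: 3.
chunk class.                 "ByteString in the affected images"
```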
I think there is a solution though, namely having the writer check whether the source is wide and, if so, use UTF-8 instead. The big issue is backwards compatibility, though. I can see three approaches:
- Write a BOM marker in front of any UTF-8-encoded source.st file. This will work for any Monticello version that is aware of the BOM; for the others YMMV (it depends on whether you're on 3.8 or later - it *should* be okay for those, but I haven't tested).
- Assume all source is UTF-8 all the time and allow conversion errors to pass through, assuming Latin-1. This will work both ways (older Monticellos would get multiple characters in some situations but be otherwise unaffected) at the cost of not detecting possibly incorrect encodings in the file (which isn't a terrible choice since the zip file has a CRC).
- Write two versions of the source, one in snapshot/source and one in snapshot.utf8/source. Works both ways, too, at the cost of doubling disk space requirements.
One thing to keep in mind here is that MCDs may only work with #2 unless the servers get updated. I think we should also consult with other MC users to ensure future compatibility. FWIW, my vote is with option #2.
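The reading side of option #2 might look roughly like this - a hypothetical helper, where the selector name is made up and the fallback relies on Latin-1 bytes mapping 1:1 onto the first 256 character values (the #utf8ToSqueak selector is assumed from Squeak's multilingual support):

```smalltalk
decodedSourceFrom: aByteString
    "Try UTF-8 first; if conversion fails, assume the bytes are
     Latin-1, which maps directly onto Squeak's character values."
    ^ [aByteString utf8ToSqueak]
        on: Error
        do: [:ex | ex return: aByteString]
```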
Cheers,
- Andreas
Yes, go UTF-8. This is precisely one of the backwards compatibility problems UTF-8 was designed to work around. In fact I had thought we did this already; it must be an omission in our MC version.
- Bert -
Looking closer into this I understand what you mean and why you didn't fix it right away. It's a mess.
I started by writing tests for MCStReader and MCStWriter but later realized it's testing the wrong thing. The stream to file out and in is created in the test, and the stream class used is actually what we need to change.
So I tried to change

    RWBinaryOrTextStream on: String new

to

    MultiByteBinaryOrTextStream on: String new encoding: 'utf-8'

in MCStWriterTest>>setUp, but it's not a drop-in replacement: I get 7 test failures from that change alone.
E.g., (RWBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents gives 'Hi', whereas (MultiByteBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents answers ''.
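For what it's worth, one difference that might explain the empty result: MultiByteBinaryOrTextStream keeps a converter and distinguishes text from binary mode, so #contents may not answer what was just written. An untested sketch of things worth poking at before giving up (selectors assumed from Squeak's stream and multilingual support; this may well behave differently in your image):

```smalltalk
| stream |
stream := MultiByteBinaryOrTextStream on: String new encoding: 'utf-8'.
stream text.                 "make sure we are in text, not binary, mode"
stream nextPutAll: 'Hi'.
stream reset.
stream upToEnd               "read back rather than relying on #contents"
```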
Giving up for now.
- Bert -