On 03.09.2009, at 22:41, Nicolas Cellier wrote:
That reminds me http://bugs.squeak.org/view.php?id=5996
Ah, thanks! That's implementing Andreas' suggestion #1 below.
Does someone know if this was integrated in any MC version? The ticket doesn't say.
- Bert -
There are some other bugs sleeping there like this one:
http://lists.gforge.inria.fr/pipermail/pharo-project/2009-May/008994.html
http://code.google.com/p/pharo/issues/detail?id=830
SystemDictionary>>#condenseChanges uses StandardFileStream when it really shouldn't...
Nicolas
2009/9/3 Bert Freudenberg bert@freudenbergs.de:
On 02.09.2009, at 13:45, Bert Freudenberg wrote:
On 02.09.2009, at 07:28, Andreas Raab wrote:
Hi Bert -
I figured it out, but you won't like it. The problem comes from a combination of things going wrong. First, you are right: there are non-Latin characters in the source. This causes the MCWriter to silently go WideString when it writes source.st. The resulting WideString gets passed into ZipArchive, which compresses it in chunks of 4k. The funny thing is that when you pull 4k chunks out of a WideString, the result is reduced to ByteString again if it fits into Latin-1. Meaning that only those definitions that happen to fall into the same 4k chunk as a non-Latin character get screwed up (excuse me for a second while I walk out and shoot myself).
Ah, feeling better now. This is why nobody ever noticed it: it doesn't affect everything, and since MC is reasonably smart and doesn't need the source too often, screw-ups in the source go unnoticed.
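To make the failure mode concrete, here is a minimal sketch of the WideString reduction Andreas describes (Squeak class names; whether and exactly where the chunking reduces back to ByteString depends on the image version, so treat this as illustrative, not definitive):

```smalltalk
| wide chunk |
"Mixing in a single non-Latin character forces the whole string wide."
wide := 'abc' , (String with: (Character value: 1025)).   "Cyrillic Ё"
wide class.                  "WideString"

"Pulling out a chunk that happens to contain only Latin-1
 characters can come back as a ByteString again."
chunk := wide copyFrom: 1 to: 3.
chunk class.                 "ByteString in the affected images"
```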
I think there is a solution though, namely having the writer check whether the source is wide and, if so, use UTF-8 instead. The big issue is backwards compatibility, though. I can see three approaches:
- Write a BOM marker in front of any UTF-8-encoded source.st file. This will work for any Monticello version that is aware of the BOM; for the others YMMV (it depends on whether you're on 3.8 or later - it *should* be okay for those, but I haven't tested).
- Assume all source is UTF-8 all the time and allow conversion errors to pass through, assuming Latin-1. This will work both ways (older Monticellos would get multiple characters in some situations but be otherwise unaffected) at the cost of not detecting possibly incorrect encodings in the file (which isn't a terrible choice since the zip file has a CRC).
- Write two versions of the source, one in snapshot/source and one in snapshot.utf8/source. Works both ways, too, at the cost of doubling disk space requirements.
One thing to keep in mind here is that MCDs may only work with #2 unless the servers get updated. I think we should also consult with other MC users to ensure future compatibility. FWIW, my vote is with option #2.
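The reading side of option #2 might look roughly like this - a hypothetical helper, where the selector name is made up and the fallback relies on Latin-1 bytes mapping 1:1 onto the first 256 character values (the #utf8ToSqueak selector is assumed from Squeak's multilingual support):

```smalltalk
decodedSourceFrom: aByteString
    "Try UTF-8 first; if conversion fails, assume the bytes are
     Latin-1, which maps directly onto Squeak's character values."
    ^ [aByteString utf8ToSqueak]
        on: Error
        do: [:ex | ex return: aByteString]
```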
Cheers,
- Andreas
Yes, go UTF-8. This is precisely one of the backwards compatibility problems UTF-8 was designed to work around. In fact I had thought we did this already; it must be an omission in our MC version.
- Bert -
Looking closer into this I understand what you mean and why you didn't fix it right away. It's a mess.
I started by writing tests for MCStReader and MCStWriter but later realized it's testing the wrong thing. The stream to file out and in is created in the test, and the stream class used is actually what we need to change.
So I tried to change

    RWBinaryOrTextStream on: String new

to

    MultiByteBinaryOrTextStream on: String new encoding: 'utf-8'

in MCStWriterTest>>setUp, but it's not a drop-in replacement: I get 7 test failures from that change alone.
E.g., (RWBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents gives 'Hi', whereas (MultiByteBinaryOrTextStream on: String new) nextPutAll: 'Hi'; contents answers ''.
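For what it's worth, one difference that might explain the empty result: MultiByteBinaryOrTextStream keeps a converter and distinguishes text from binary mode, so #contents may not answer what was just written. An untested sketch of things worth poking at before giving up (selectors assumed from Squeak's stream and multilingual support; this may well behave differently in your image):

```smalltalk
| stream |
stream := MultiByteBinaryOrTextStream on: String new encoding: 'utf-8'.
stream text.                 "make sure we are in text, not binary, mode"
stream nextPutAll: 'Hi'.
stream reset.
stream upToEnd               "read back rather than relying on #contents"
```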
Giving up for now.
- Bert -