MultiStrings and ZipArchive

Andreas Raab andreas.raab at gmx.de
Sun Apr 17 01:06:14 UTC 2005


Colin,

You are exactly right. ZipArchive (as well as any of the compression 
streams, e.g. GZipReadStream and GZipWriteStream) can and should accept 
only bytes and must raise an error for anything else. It is the client's 
responsibility to ensure that the proper conversion happens before the 
data is written out, simply because a consumer of said data must be 
able to find out the proper encoding too. ZipArchive and the compression 
streams therefore cannot make any assumption about what the "right" 
encoding is.
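
To illustrate the bytes-only rule outside of Squeak, here is a minimal 
sketch in Python using the standard zipfile module. The helper name 
add_member is hypothetical and is not part of ZipArchive or of zipfile. 
It only shows the shape of the check: text is refused, and the client 
encodes before handing data over.

```python
import io
import zipfile


def add_member(archive, name, data):
    """Add one member to an open zipfile.ZipFile, accepting bytes only.

    Hypothetical helper for illustration; not an existing API.
    """
    # Refuse text outright instead of guessing an encoding for it.
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError(
            "archive members must be bytes, got %s" % type(data).__name__
        )
    archive.writestr(name, bytes(data))


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # The client chooses the encoding before the data reaches the archive.
    add_member(archive, "hello.txt", "grüße".encode("utf-8"))
    try:
        add_member(archive, "bad.txt", "grüße")  # text, not bytes
        rejected = False
    except TypeError:
        rejected = True
```

The archive never sees a string, so it never has to pick an encoding on 
the client's behalf.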

Put differently: We have a problem if there is any difference between 
writing something to a zip archive and reading it back in vs. writing 
something to a zip archive, *extracting the zip archive using some other 
tool* and reading it back in. The use of the zip archive must not affect 
how the stream is read or written and therefore zip archive and friends 
really cannot deal with wide strings (to use the new terminology) and 
must raise an error.
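
The round-trip requirement can be sketched as a small check, again in 
Python rather than Squeak. Extracting to a temporary directory stands in 
for "some other tool" here, which is an assumption of this sketch; a real 
test would use an independent unzip program. The member name and payload 
are made up for the example.

```python
import io
import os
import tempfile
import zipfile

payload = "naïve text".encode("utf-8")  # the client's encoding choice

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("member.txt", payload)

# Path 1: read the member back directly from the archive.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    direct = archive.read("member.txt")

# Path 2: extract to disk first, then read the extracted file.
with tempfile.TemporaryDirectory() as folder:
    with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
        archive.extractall(folder)
    with open(os.path.join(folder, "member.txt"), "rb") as handle:
        extracted = handle.read()

# Both paths must yield exactly the bytes that were written.
same_bytes = direct == payload and extracted == payload
```

Because only bytes go in and only bytes come out, both paths agree, and 
decoding stays the reader's job in either case.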

For Monticello, you will have to decide on an encoding to use (obviously 
this should be UTF-8 to keep compatibility with existing fileOuts) and 
you probably need to make sure that you use the same mechanisms (BOM 
etc.) to stay consistent with regular fileOuts.
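
As a rough sketch of what "UTF-8 plus BOM" means at the byte level, here 
is the idea in Python. The "utf-8-sig" codec is Python's name for UTF-8 
with a leading byte order mark; whether regular fileOuts write exactly 
this layout is an assumption of the sketch and should be checked against 
the fileOut code. The sample source string is made up.

```python
import codecs

source = "Transcript show: 'héllo wörld'"

# Encode explicitly before the data goes anywhere near the archive.
encoded = source.encode("utf-8-sig")   # UTF-8 bytes with a leading BOM
has_bom = encoded.startswith(codecs.BOM_UTF8)

# A reader that knows the convention gets the original text back.
decoded = encoded.decode("utf-8-sig")  # strips the BOM again
```

The bytes in encoded are what would be handed to the archive; the BOM 
lets a reader recognize the encoding without any help from the zip layer.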

Cheers,
   - Andreas


Colin Putney wrote:
> Hi Folks,
> 
> I'm investigating a Monticello bug report, and have discovered that it's 
> actually a problem with ZipArchive. The problem arises if you add a 
> MultiString as a member of a Zip archive, and then try to read it back. 
> ZipArchive builds a String out of the bytes in the archive, with the 
> result that it includes a lot of null characters, and any non-ASCII 
> characters get mangled.
> 
> Looking at how to fix this, it seems that, in general, we need to be 
> more careful about using ByteArrays for binary data. The basic 
> functionality of ZipArchive is byte-oriented, so its interface ought to 
> accept and produce ByteArrays, though we can certainly put some helper 
> methods on top of it to handle strings and encodings.
> 
> On the other hand, I don't know much about m17n, and there might be a 
> better way to go. Anybody have thoughts on this?
> 
> Thanks,
> 
> Colin
> 
> 
> 
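
The symptom Colin describes (null characters and mangled non-ASCII 
characters) can be reproduced in miniature. This Python sketch assumes 
the wide string is stored as four bytes per character, modelled here 
with big-endian UTF-32; that storage layout is an assumption of the 
sketch, not something stated in the thread.

```python
text = "a€b"

# Assumption: four bytes per character, big-endian.
raw = text.encode("utf-32-be")

# Rebuild a byte string from those bytes, one character per byte,
# which is roughly what building a String from raw bytes does.
mangled = raw.decode("latin-1")

null_count = mangled.count("\x00")
euro_survived = "€" in mangled
```

Three characters turn into twelve, most of them nulls, and the euro sign 
is split into unrelated characters.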



