MultiStrings and ZipArchive
Andreas Raab
andreas.raab at gmx.de
Sun Apr 17 01:06:14 UTC 2005
Colin,
You are exactly right. ZipArchive (like any of the compression
streams, e.g., GZip[Read|Write]Stream) can and should accept only
bytes and must raise an error for anything else. It is the client's
responsibility to ensure that the proper conversion happens before the
data is written out, simply because a consumer of that data must be
able to find out the proper encoding too. ZipArchive and the
compression streams therefore cannot make any assumption about what
the "right" encoding is.
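As an illustration only, here is a minimal Python sketch of a bytes-only archive interface. It is not Squeak code and not ZipArchive's actual API; the helper name `add_member` and the member names are made up for the example. Python's own `ZipFile.writestr` silently encodes strings as UTF-8, which is exactly the kind of assumption the wrapper refuses to make.

```python
import io
import zipfile


def add_member(archive: zipfile.ZipFile, name: str, data) -> None:
    """Add a member, refusing anything that is not raw bytes (hypothetical helper)."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError(
            f"archive members must be bytes, got {type(data).__name__}; "
            "encode the text explicitly before writing"
        )
    archive.writestr(name, bytes(data))


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # The client picks the encoding, not the archive.
    add_member(archive, "notes.txt", "Grüße".encode("utf-8"))
    try:
        add_member(archive, "bad.txt", "Grüße")  # a string, not bytes
        rejected = False
    except TypeError:
        rejected = True
```

String-and-encoding helpers can still be layered on top, as long as the encoding is chosen by the caller and the archive itself only ever sees bytes.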
Put differently: We have a problem if there is any difference between
writing something to a zip archive and reading it back in vs. writing
something to a zip archive, *extracting the zip archive using some other
tool* and reading it back in. The use of the zip archive must not affect
how the stream is read or written and therefore zip archive and friends
really cannot deal with wide strings (to use the new terminology) and
must raise an error.
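The invariant can be sketched as a small check, again in Python rather than Squeak and only under stated assumptions: extraction to a temporary directory with the same library stands in for "some other tool", and the member name and payload are invented for the example.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# The client has already chosen an encoding; the archive only sees bytes.
payload = "wide: \u4e2d\u6587".encode("utf-8")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("member.txt", payload)

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    # Path 1: read the member straight out of the archive.
    read_directly = archive.read("member.txt")
    # Path 2: extract to disk first, then read the plain file.
    with tempfile.TemporaryDirectory() as tmp:
        extracted_path = archive.extract("member.txt", path=tmp)
        read_after_extract = Path(extracted_path).read_bytes()

# Both paths must yield the very same bytes that were written.
same_bytes = read_directly == read_after_extract == payload
```

Because both paths return identical bytes, decoding is the same step in either case and stays entirely in the client's hands.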
For Monticello, you will have to decide on an encoding to use (obviously
this should be UTF-8 to keep compatibility with existing fileOuts) and
you probably need to make sure that you use the same mechanisms (BOM,
etc.) to stay consistent with regular fileOuts.
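For illustration, a short Python sketch of the "encode as UTF-8 with a BOM, then hand bytes to the archive" step. The exact conventions used by Squeak fileOuts are not spelled out here, so treat the BOM handling as an assumption; the sample source text is invented.

```python
import codecs

source_text = "Object subclass: #Grüße"

# "utf-8-sig" prepends the UTF-8 byte order mark when encoding ...
encoded = source_text.encode("utf-8-sig")
has_bom = encoded.startswith(codecs.BOM_UTF8)

# ... and strips it again (if present) when decoding.
decoded = encoded.decode("utf-8-sig")
```

The resulting `encoded` value is plain bytes, so it can be written to a zip member and read back, by any tool, without the archive knowing anything about the text it carries.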
Cheers,
- Andreas
Colin Putney wrote:
> Hi Folks,
>
> I'm investigating a Monticello bug report, and have discovered that it's
> actually a problem with ZipArchive. The problem arises if you add a
> MultiString as a member of a Zip archive, and then try to read it back.
> ZipArchive builds a String out of the bytes in the archive, with the
> result that it includes a lot of null characters, and any non-ASCII
> characters get mangled.
>
> Looking at how to fix this, it seems that, in general, we need to be
> more careful about using ByteArrays for binary data. The basic
> functionality of ZipArchive is byte-oriented, so its interface ought to
> accept and produce ByteArrays, though we can certainly put some helper
> methods on top of it to handle strings and encodings.
>
> On the other hand, I don't know much about m17n, and there might be a
> better way to go. Anybody have thoughts on this?
>
> Thanks,
>
> Colin