[squeak-dev] Re: MC should really write snapshot/source.st in UTF8

Wed May 22 22:57:18 UTC 2013

MC never wrote a BOM, so we don't have to be compatible with BOM.

If we can simplify the process, let's simplify. Maintaining useless
compatibility has a cost: the code is really crooked by now, which leads to
misunderstanding, and soon to broken features and noise. Currently,
snapshot/source.st IS broken.

If a legacy file contains codes > 127, the UTF8TextConverter will most likely
fail on it, and I like Norbert's idea of retrying with a legacy encoding. This
way, we confine the crooked compatibility layer to exception handling.
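
The retry idea can be sketched as follows. This is Python, not Smalltalk, and
it is not taken from the Monticello code base; the function name
decode_source and the choice of latin-1 as the default legacy encoding are
assumptions made purely for illustration.

```python
from typing import Tuple


def decode_source(data: bytes, legacy_encoding: str = "latin-1") -> Tuple[str, str]:
    """Decode source bytes as UTF-8, retrying with a legacy encoding on failure.

    Returns the decoded text together with the name of the encoding that
    actually worked. Both the function name and the default legacy encoding
    are hypothetical.
    """
    try:
        # Normal path: the file is assumed to be UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Exceptional path: the compatibility layer lives only here.
        return data.decode(legacy_encoding), legacy_encoding
```

Pure ASCII files decode identically either way, so only files that contain
codes > 127 and are not valid UTF-8 ever reach the fallback branch.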

This will also simplify the MC readers/writers in VW, gst, Gemstone, ...

Even for the legacy code, I wonder if MacRoman would be the right choice.
MC never encoded the strings and always wrote the codes as is.

So, setConverterForCode is there to maintain compatibility with MC
snapshot/source.st files written from an old image where the internal String
encoding was MacRoman. When was that, 3.7? Are there really many of these?

I bet 99% of MC-files are encoded in latin-1 but decoded with MacRoman if
we go through a MczInstaller...
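
To illustrate the mismatch, here is a small Python sketch (not Squeak code)
that decodes one byte above 127 both ways. The byte value 0xE9 is just an
example chosen for illustration.

```python
# One byte above 127, written to the file "as is" with no encoding step.
raw = bytes([0xE9])

# Interpreted as latin-1, the byte is U+00E9 (e with acute accent).
as_latin1 = raw.decode("latin-1")

# Interpreted as MacRoman, the same byte is U+00C8 (E with grave accent).
as_macroman = raw.decode("mac_roman")

print(repr(as_latin1), repr(as_macroman))
```

The same byte yields two different characters, so a file written from a
latin-1 image and read back through a MacRoman converter silently changes
its accented characters.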

Of course, MC now uses snapshot.bin rather than snapshot/source.st.
Did old versions of MC fail to write snapshot.bin?

If needed, we could add a Preference in Squeak for the ultra-old legacy
encoding (not in Pharo; I guess Pharo should not care at all).

2013/5/23 Yoshiki Ohshima <Yoshiki.Ohshima at acm.org>

> On Wed, May 22, 2013 at 2:16 PM, Nicolas Cellier
> <nicolas.cellier.aka.nice at gmail.com> wrote:
> > First thing would be to simplify #setConverterForCode and
> > #selectTextConverterForCode.
> > Do we still want to use a MacRomanTextConverter, seriously? I'm not even
> > sure I've got that many files with that encoding on my Mac-OSX...
> > Do we really need to put a ByteOrderMark for UTF-8, seriously? See
> > http://en.wikipedia.org/wiki/Byte_order_mark, it's valueless, and not
> > recommended. It was a Squeak way to specify that a Squeak source file
> > would use UTF-8 rather than MacRoman, but now this should be obsolescent.
>
> Old code was certainly in MacRoman, and quite a few used middle dot,
> accented chars and other characters in the right half of the character
> chart.
>
> Monticello surely should use UTF-8.  I'd think, though, it should keep
> BOM; did you encounter any problems?  (it is not recommended, but it
> is permitted.)
>
> --
> -- Yoshiki
>
>