[squeak-dev] Re: The Trunk: TrueType-nice.13.mcz

Thu Jan 21 02:55:36 UTC 2010

On 2010-01-19, at 10:04 AM, Andreas Raab wrote:

> Regarding the "oh, this would be a problem if the string encoding changes", let's keep in mind that the string encoding *hasn't* changed for ASCII. So, no, it's specifically incorrect to say that the past change would have affected or invalidated that code. I don't think you understand how fundamental the impact of an encoding change is - if you did that *all* string literals would look like gobbly gook. The only reason we could do it for non-ascii was that we're not using any non-ascii characters and we were only switching the characters > 127 from Mac Roman to Unicode/Latin-1. If you don't believe me, grab your favorite non-ascii piece of text and throw it at "yourText squeakToMac" and have a look at it.
> 
> So the argument that the dependency on literal string encoding is an issue is bogus. If you change literal string encoding there are so many other places that break it's not even funny. And at least I don't design for the implausible (what reason do we have to expect the encoding to change in the next fifty years?)

Here's another way to think about it: #asByteArray violates the abstraction provided by String. A string is a sequence of characters, right? How that sequence is represented in memory is internal to the implementation, and should not be relied upon by users of the string. Historically this abstraction has been very leaky in Squeak, I suppose because of performance issues. (VisualWorks does a much better job of this - to convert strings to bytes, you have to specify an encoding, and when you re-encode a string you get a ByteArray and not another string. Immediate characters are a win.)
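
To make the stricter model concrete, here is a minimal sketch. It is in Python rather than Smalltalk, purely as an assumption of convenience: Python 3 happens to enforce the same rule, in that turning a string into bytes requires naming an encoding and the result is a byte sequence, not another string. The sample text and variable names are illustrative only.

```python
# A string is a sequence of characters; its length counts characters.
text = "caf\u00e9"  # "café", four characters

# Getting bytes out requires choosing an encoding explicitly,
# and the result is a bytes object, not another string.
as_utf8 = text.encode("utf-8")      # the accented letter takes two bytes
as_latin1 = text.encode("latin-1")  # the accented letter takes one byte

print(len(text), len(as_utf8), len(as_latin1))
print(type(as_utf8).__name__)

# Decoding with the matching encoding recovers the same characters,
# whichever byte representation was used in between.
assert as_utf8.decode("utf-8") == text
assert as_latin1.decode("latin-1") == text
```

The same four characters come out as five bytes in one encoding and four in the other, which is why code that reaches for the bytes directly ends up depending on whichever representation the image uses.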

As a result, Squeak is rife with code that assumes a particular encoding and treats strings as byte sequences rather than character sequences. That doesn't mean this instance is a good idea, though; it just means we have a lot of bad code floating around. Consider: if we *had* a tighter abstraction in String, switching the encoding would be much easier. Not trivial, because of string literals, source code encodings, etc., but easier.

BTW, I'm pretty familiar with issues arising from the encoding of Strings. I had code break under Squeak 3.8 because of m17n, and have long wrestled with encoding issues in web apps - strings go over the network in UTF-8, so either they get transcoded to MacRoman/Latin-1 inside the image and transcoded back to UTF-8 on the way out again, or you leave them in UTF-8 and live with the fact that you can't safely manipulate them inside the image.
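
A rough illustration of that trade-off, again sketched in Python rather than Squeak (an assumption of convenience; the names `wire`, `raw` and `is_valid_utf8` are made up for the example). Decoding the bytes as Latin-1 stands in for "leave them in UTF-8 inside the image", since it gives a string with one element per byte:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Bytes as they would arrive from the network: "naïve" in UTF-8.
wire = "na\u00efve".encode("utf-8")

# Option 1: transcode at the boundary, work on characters, encode on the way out.
text = wire.decode("utf-8")
first_three = text[:3].encode("utf-8")

# Option 2: keep the UTF-8 bytes but treat each byte as a character.
raw = wire.decode("latin-1")
first_three_raw = raw[:3].encode("latin-1")

print(len(text), len(raw))
print(is_valid_utf8(first_three), is_valid_utf8(first_three_raw))
```

Under option 2 the "string" is one element longer than the text it represents, and taking the first three elements cuts the two-byte character in half, so what goes back out is no longer valid UTF-8.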

This particular case doesn't matter that much, but in general I'd like to see us moving in the direction of cleaner abstractions. Sure, there's lots of legacy code out there that makes assumptions about the internal structure of strings, but that's not an excuse to write more of it.

Colin