UTF8 Squeak

Wed Jun 13 02:32:17 UTC 2007

Colin Putney wrote:
> If a String were a flat array of Unicode code points, it would be 
> implemented in Smalltalk as an array of Characters wouldn't it? The fact 
> that you've chosen to hide the internal representation of the string and 
> use a "variable byte" or "variable word" subclass to store bytes, rather 
> than objects, is an indication that the strings *are* encoded. In fact, 
> the encodings have names: ISO 8859-1 and UCS-4. Janko is proposing to 
> add a string class that internally stores strings encoded in UCS-2 to 
> the mix.
> 
> So what's so holy about these particular encodings, besides the fact 
> that they're especially efficient on the VisualWorks VM?

Indeed. That is effectively the point I was trying to make in taking a 
more "encoding-centered" perspective on the problem. In which case there 
is nothing holy about particular encodings (and nothing confusing about 
the choice of names); some people use one encoding, some people use 
another, and at the end of the day there is no need to be religious about 
what exactly a string must contain (EBCDIC anyone? :-)
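
To make the "byte or word storage is already an encoding" point concrete, 
here is a minimal sketch. It is written in Python purely for illustration 
and says nothing about how Squeak or VisualWorks implement their string 
classes; the helper name pack_code_points is made up for this example. It 
stores the same code points as fixed-width units of 1, 2, and 4 bytes and 
checks that the results match the named encodings (big-endian byte order 
is an arbitrary choice here).

```python
import struct


def pack_code_points(text, width):
    """Store each code point of `text` as one fixed-width big-endian unit.

    `width` is the number of bytes per character: 1, 2 or 4.
    Hypothetical helper for illustration only. struct.pack raises
    struct.error if a code point does not fit in the chosen width.
    """
    unit = {1: "B", 2: "H", 4: "I"}[width]
    code_points = [ord(ch) for ch in text]
    return struct.pack(">%d%s" % (len(code_points), unit), *code_points)


text = "caf\u00e9"  # four code points, all below 256

one_byte = pack_code_points(text, 1)    # "variable byte" storage
two_byte = pack_code_points(text, 2)    # 16-bit storage
four_byte = pack_code_points(text, 4)   # "variable word" storage

# The flat storage is byte-for-byte the same as a named encoding.
assert one_byte == text.encode("latin-1")     # ISO 8859-1
# UCS-2 and UTF-16 coincide for characters in the Basic Multilingual Plane.
assert two_byte == text.encode("utf-16-be")
assert four_byte == text.encode("utf-32-be")  # UCS-4
```

Nothing in the sketch makes one width more "correct" than another; the 
only difference is which code points fit and how much space they take.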

Cheers,
   - Andreas