UTF8 Squeak

Nicolas Cellier ncellier at ifrance.com
Wed Jun 13 10:22:58 UTC 2007


Andreas Raab <andreas.raab <at> gmx.de> writes:

> 
> Colin Putney wrote:
> > If a String were a flat array of Unicode code points, it would be 
> > implemented in Smalltalk as an array of Characters wouldn't it? The fact 
> > that you've chosen to hide the internal representation of the string and 
> > use a "variable byte" or "variable word" subclass to store bytes, rather 
> > than objects, is an indication that the strings *are* encoded. In fact, 
> > the encodings have names: ISO 8859-1 and UCS-4. Janko is proposing to 
> > add a string class that internally stores strings encoded in UCS-2 to 
> > the mix.
> > 
> > So what's so holy about these particular encodings, besides the fact 
> > that they're especially efficient on the VisualWorks VM?
> 
> Indeed. That is effectively the point I was trying to make in taking a 
> more "encoding-centered" perspective on the problem. In which case there 
> is nothing holy about particular encodings (and nothing confusing about 
> the choice of names); some people use one encoding, some people use 
> another and by the end of the day there is no need to be religious about 
> what exactly a string must contain (EBCDIC anyone?)
> 
> Cheers,
>    - Andreas
> 
> 

As long as there is a Character class, either:
- there must be a canonical encoding in the system,
- or each Character must also carry encoding information.

A canonical encoding must be able to encode all characters, not a subset, so
using a Universal Character Set (UCS) is required, and the standard ISO-10646
seems the best candidate, unless you are ready to invent your own.

Having strings encoded in this canonical encoding seems an efficient strategy
for String-Character conversions. For now, our strings are encoded with an
encoder that is neutral with respect to the canonical encoding...

Anyway, even if Characters carried encoding information, we would still need a
universal canonical encoding in order to compare them...
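A minimal sketch of that comparison problem, again in Python and under stated
assumptions: EncodedChar and same_character are hypothetical names invented for
this example, and the codec names are Python's, not anything from Squeak or VW.
Two characters tagged with different code pages can only be compared after both
are mapped to a common code point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncodedChar:
    """Hypothetical character that carries its own (single-byte) encoding."""
    value: int      # byte value in `encoding`
    encoding: str   # Python codec name, e.g. "latin-1" or "cp1252"

    def code_point(self) -> int:
        # Map to the canonical (Unicode) code point by decoding the single byte.
        return ord(bytes([self.value]).decode(self.encoding))

def same_character(a: EncodedChar, b: EncodedChar) -> bool:
    # Comparing raw values is meaningless across encodings;
    # compare the canonical code points instead.
    return a.code_point() == b.code_point()

euro_cp1252 = EncodedChar(0x80, "cp1252")        # euro sign in Windows-1252
euro_latin9 = EncodedChar(0xA4, "iso8859_15")    # euro sign in ISO 8859-15
currency_latin1 = EncodedChar(0xA4, "latin-1")   # currency sign in ISO 8859-1

print(same_character(euro_cp1252, euro_latin9))      # different bytes, same character
print(same_character(euro_latin9, currency_latin1))  # same byte, different characters
```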

So, the only religion here is:
- to have the simplest implementation that enables multilingual support,
- to conform to the widely used UCS standard.

Of course, you need to deal with other encodings to interface with the external
world. Several strategies for that have been put forward in this thread:
- doing a canonical conversion at each input/output (by way of a stream)
- not converting at all, but storing a collection of bytes uninterpreted by the
image (a ByteArray solution)
- having an internal representation of external objects (strings encoded in
another code page) that can be manipulated inside the image like any other String.
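The three strategies can be contrasted with a small sketch. It is written in
Python rather than Smalltalk, the cp1252 code page is an arbitrary choice, and
ForeignString is a hypothetical class made up for this example; it is not how
VW or Squeak implement anything.

```python
# Bytes as they arrive from outside the image (code page assumed to be cp1252).
external = "Grüße".encode("cp1252")

# 1. Canonical conversion at the boundary: decode on input, encode on output.
internal = external.decode("cp1252")
roundtrip = internal.encode("cp1252")

# 2. No conversion: keep the bytes opaque, the image never interprets them.
opaque = bytes(external)

# 3. Keep the external representation, but make it behave like a string.
class ForeignString:
    """Hypothetical string stored in a single-byte code page."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding

    def __len__(self) -> int:
        return len(self.data)  # valid only for single-byte code pages

    def __getitem__(self, index: int) -> str:
        # Decode on access; index is assumed to be non-negative.
        return self.data[index:index + 1].decode(self.encoding)

    def __str__(self) -> str:
        return self.data.decode(self.encoding)

    def __eq__(self, other) -> bool:
        # Equality goes through the canonical form, whatever the storage.
        return str(self) == str(other)

foreign = ForeignString(external, "cp1252")
print(internal, len(foreign), foreign[2], foreign == internal)
```

The third option is the only one where the image has to understand every
supported code page at access time, which is where the extra richness and the
extra implementation cost both come from.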

The latter is the VW solution, and it is what you seem to propose. It is also
the richest solution. The current implementation is the simplest. There lies a
religious choice, maybe...

Nicolas



