UTC-8 (was Re: Celeste encoding (was: Duplicate messages in Celeste))

Henrik Gedenryd Henrik.Gedenryd at lucs.lu.se
Fri Mar 17 16:09:26 UTC 2000


Richard A. O'Keefe wrote:

> Then there is string comparison.  Probably the letter A appears in
> oogles of places throughout unicode, and all of them should be
> considered equal in a case-insensitive comparison.

I just *bet* there's a C implementation of that out there. Why not just
compile that as a plugin/primitive/whatever for encoding support? Gits ya the
speed too and saves lotsa work. You could still of course write (a new) one
in Smalltalk.
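
To make the comparison Richard describes concrete, here is a minimal sketch in Python rather than C or Smalltalk. It leans on Python's standard unicodedata module; the function names loose_key and loose_equal are made up for illustration, and the approach is language-independent, so it does nothing about the language-dependent cases discussed next.

```python
import unicodedata


def loose_key(text):
    # NFKC folds compatibility variants (e.g. fullwidth letters) onto their
    # ordinary counterparts; casefold() then removes case distinctions.
    return unicodedata.normalize("NFKC", text).casefold()


def loose_equal(a, b):
    # Hypothetical helper: equal if the folded keys match.
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    # U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A, one of the "other" A's.
    print(loose_equal("a", "\uff21"))
```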

> Of course, the really serious problem is that case equivalence is itself
> language dependent.

You got that right. In German, Ä (a-umlaut) sorts with A and Ö (o-umlaut)
with O, like in English. In Swedish, these are appended to the alphabet,
...XYZÅÄÖ and sort accordingly.
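
A rough sketch of that difference, in Python for convenience. The two key functions and the sample words are invented for illustration and cover only the letters mentioned above; real collation rules have many more cases.

```python
SWEDISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"

# German-style folding: the umlauted letters sort with their base letters.
GERMAN_FOLD = {ord("Ä"): "A", ord("Ö"): "O"}


def german_key(word):
    return word.upper().translate(GERMAN_FOLD)


def swedish_key(word):
    # Position in the Swedish alphabet, where Å, Ä, Ö come after Z.
    # Raises ValueError for characters outside the alphabet (sketch only).
    return [SWEDISH_ALPHABET.index(ch) for ch in word.upper()]


if __name__ == "__main__":
    words = ["Öl", "Zon", "Apa", "Äpple"]
    print(sorted(words, key=german_key))
    print(sorted(words, key=swedish_key))
```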

However, it's not a flaw if the new implementation doesn't handle this--as
the existing one doesn't. When I said skip the hieroglyphics version, I
meant, don't aim too high, or the result will be nada. Plus, we're quite
used to this problem already. And programmers are usually deeply impressed
if they can use these characters at all. (But you can't have them in names
in Squeak, can you?)

> When you created a string, if all the characters fitted into 8
> bits, you got a thin string, otherwise you got a fat string.
...
> The thing we _do_ need is to have '16-bit byte' arrays supported just like
> '8-bit byte arrays', to serve as substrate for the String implementation.

When I suggested a general facility for creating new mappings via a simple
dictionary, this is what I had in mind among other things. A simple such
facility makes it easy for the first space-conscious Thai squeaker to set up
a simple mapping table from 8bit Thai chars, say, (I've just exposed my
ignorance) to 16bit Unicode (etc.)
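
Such a table could look roughly like the sketch below (Python, not Squeak). It assumes the TIS-620 layout, in which the Thai letters sit at a fixed offset from their Unicode code points; that layout is my assumption here, not something stated above, and the names are hypothetical.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, used for byte values with no assigned character


def build_thai_table():
    # 256 entries: ASCII maps to itself, everything else starts unassigned.
    table = list(range(128)) + [REPLACEMENT] * 128
    # Assumed TIS-620 layout: 0xA1..0xDA and 0xDF..0xFB hold Thai characters
    # at a constant offset of 0x0D60 from Unicode (0xA1 -> U+0E01).
    for byte in list(range(0xA1, 0xDB)) + list(range(0xDF, 0xFC)):
        table[byte] = byte + 0x0D60
    return table


THAI_TO_UNICODE = build_thai_table()


def widen(narrow):
    # Turn a "thin" 8-bit string into a wide one by table lookup.
    return "".join(chr(THAI_TO_UNICODE[b]) for b in narrow)
```

The 8-bit form stays compact in memory, and the table is only consulted when a wide string is actually needed.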

Note that these "maps" make the story very similar to ColorForms, which have
256-entry mapping tables from 8bit color indices to the corresponding full
Color objects. Among other things, as long as you're staying within one
encoding, you can ignore the mapping in e.g. equality comparisons like in
searches.
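
A small sketch of that shortcut, again in Python with invented names: when two strings share the same table, the raw codes are compared directly, and the table is only consulted across encodings. This assumes each table maps distinct codes to distinct characters.

```python
class EncodedString:
    """Hypothetical string: raw 8-bit codes plus a 256-entry mapping table."""

    def __init__(self, codes, table):
        self.codes = bytes(codes)
        self.table = table

    def decoded(self):
        return [self.table[c] for c in self.codes]

    def __eq__(self, other):
        if not isinstance(other, EncodedString):
            return NotImplemented
        if self.table is other.table:
            # Same encoding: no need to look at the mapping at all.
            return self.codes == other.codes
        # Different encodings: compare the mapped code points instead.
        return self.decoded() == other.decoded()

    def __hash__(self):
        return hash(tuple(self.decoded()))

    def find(self, other):
        # Substring search on raw codes, valid only within one encoding.
        if self.table is not other.table:
            raise ValueError("search across encodings not sketched here")
        return self.codes.find(other.codes)


LATIN = list(range(256))                        # identity table
SHIFTED = [(i + 1) % 256 for i in range(256)]   # made-up second encoding
```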

Now I don't know if a plain dictionary (or rather 256-entry xxxArray I
guess) would suffice, but the mapping table would also contain the sorting
order definition, either implicitly or explicitly.

So I guess I'm suggesting an EncodingString which has an instvar containing
a mapping object (like a ColorForm has), and/or an EncodedString (bad names)
hierarchy where the subclass definition hardcodes the mapping to use (but
still the mapping is another object/class). The latter saves you an instvar
in the object, if it's worth it. (There might be a variableByteSubclass
problem, eg.)
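
The two variants, and a mapping object that also carries the sort order, might be sketched like this. It is Python rather than Smalltalk, and every class name and method here is hypothetical, following the (bad) names above.

```python
class Mapping:
    """Hypothetical mapping object: code -> code point, plus a sort rank per code."""

    def __init__(self, code_points, sort_ranks=None):
        self.code_points = list(code_points)
        if sort_ranks is None:
            # Implicit sort order: the code value itself.
            sort_ranks = range(len(self.code_points))
        self.sort_ranks = list(sort_ranks)

    def decode(self, codes):
        return "".join(chr(self.code_points[c]) for c in codes)

    def sort_key(self, codes):
        return [self.sort_ranks[c] for c in codes]


class EncodingString:
    """Variant 1: each instance holds its mapping, like a ColorForm."""

    def __init__(self, codes, mapping):
        self.codes = bytes(codes)
        self.mapping = mapping

    def __str__(self):
        return self.mapping.decode(self.codes)

    def sort_key(self):
        return self.mapping.sort_key(self.codes)


class EncodedString:
    """Variant 2: the subclass hardcodes the mapping; instances hold only codes."""

    mapping = None  # set by each concrete subclass

    def __init__(self, codes):
        self.codes = bytes(codes)

    def __str__(self):
        return self.mapping.decode(self.codes)

    def sort_key(self):
        return self.mapping.sort_key(self.codes)


LATIN1 = Mapping(range(256))


class Latin1String(EncodedString):
    mapping = LATIN1
```

Either way the mapping stays a separate object, so a new encoding only needs a new table (and, for variant 2, a one-line subclass).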

Henrik





