UTF8 Squeak

Alan Lovejoy squeak-dev.sourcery at forum-mail.net
Sat Jun 9 20:54:53 UTC 2007


<Philippe>Well, there is what the evil language with J does: UCS2
everywhere, no excuses. This is a bit awkward for characters outside the BMP
(which are rarer than unicorns) but IIRC the astral planes didn't exist
when it was created. So you could argue for UCS4. Yes, it's twice the size,
but who really cares? If you could get rid of all the size hacks in Squeak
that were cool in the '70s, would you?</Philippe>

Note: UTF-32 and UCS-4 are different names for the same thing [Reference:
http://en.wikipedia.org/wiki/UTF-32]

There is no one solution that is good enough for all use cases.

UTF-32 is fast for indexed character reading/writing. It also comprehensively
covers the entire Unicode Universal Character Set--not just the characters in
the Basic Multilingual Plane. But it is not very space efficient.
[Reference: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters.]
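
To put rough numbers on the space trade-off, here is a small Python sketch
that counts the bytes needed for the same text in UTF-8, UTF-16 and UTF-32.
The sample strings and the function name encoded_sizes are mine, chosen only
for illustration.

```python
# Byte cost of the same text under three Unicode encodings (no BOM).
def encoded_sizes(text):
    """Return the encoded length in bytes for UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }

samples = {
    "ascii": "Squeak",
    # Greek word, entirely inside the Basic Multilingual Plane
    "bmp": "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",
    # MUSICAL SYMBOL G CLEF, outside the BMP
    "astral": "\U0001D11E",
}

for name, text in samples.items():
    print(name, len(text), encoded_sizes(text))
```

For plain ASCII text, UTF-32 takes four times the space of UTF-8; for the
single character outside the BMP, all three encodings take four bytes.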

Although you could have a different subclass of String for each encoding,
that's a poor use of inheritance.  It's better to have a single String class
with two instance variables: a ByteArray holding the String's data, and an
associated Strategy object (an instance of CharacterEncoding) that
interprets the character content of those bytes.  The CharacterEncoding
class would have a subclass for each different encoding.
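
A minimal sketch of that structure, written in Python rather than Smalltalk
so that it can be run anywhere.  The names EncodedString, Latin1Encoding and
UTF32Encoding are hypothetical, indices are 0-based (unlike Smalltalk), and
there is no bounds checking; this only illustrates the shape of the design
and is not a proposal for actual Squeak code.

```python
class CharacterEncoding:
    """Strategy: knows how to interpret the bytes held by an EncodedString."""

    def size(self, data):
        raise NotImplementedError

    def code_point_at(self, data, index):
        raise NotImplementedError


class Latin1Encoding(CharacterEncoding):
    def size(self, data):
        return len(data)

    def code_point_at(self, data, index):
        # Latin-1 byte values coincide with Unicode code points 0..255.
        return data[index]


class UTF32Encoding(CharacterEncoding):
    def size(self, data):
        return len(data) // 4

    def code_point_at(self, data, index):
        # Big-endian, four bytes per character.
        return int.from_bytes(data[4 * index:4 * index + 4], "big")


class EncodedString:
    """One string class; the encoding is a collaborator, not a subclass."""

    def __init__(self, data, encoding):
        self.data = bytearray(data)   # plays the role of the ByteArray
        self.encoding = encoding      # the Strategy object

    def __len__(self):
        return self.encoding.size(self.data)

    def at(self, index):
        return chr(self.encoding.code_point_at(self.data, index))


narrow = EncodedString("caf\u00e9".encode("latin-1"), Latin1Encoding())
wide = EncodedString("caf\u00e9".encode("utf-32-be"), UTF32Encoding())
print(len(narrow), len(wide), narrow.at(3) == wide.at(3))
```

Both strings answer the same characters even though one stores 4 bytes and
the other 16.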

To achieve semantic unification across any and all character encodings, the
rule would be that when a Character object is reified from a String, it
always uses the Unicode code point (its "integer code value").  And when a
Character is "put:" into a String, its canonical (Unicode) code point is
translated to be correct for that String's encoding.  Both conversions would
be the responsibility of the String's Strategy object (an instance of
CharacterEncoding.)

This implementation architecture lets each application (or
package/module/code-library) choose the encoding that best suits its use
case, but prevents character code mapping errors when characters are copied
between Strings whose encodings are not the same.
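
Here is a hedged Python sketch of the at/put rule, showing why copying by
Character is safe where copying raw bytes is not.  All class and method names
are hypothetical; the single-byte strategy borrows Python's own cp1252 codec
for the mapping tables, and indices are 0-based.

```python
class SingleByteEncoding:
    """Strategy for a one-byte-per-character codec such as cp1252."""

    def __init__(self, codec):
        self.codec = codec

    def read(self, data, index):
        # Encoded byte -> Unicode code point.
        return ord(data[index:index + 1].decode(self.codec))

    def write(self, data, index, code_point):
        # Unicode code point -> encoded byte; raises UnicodeEncodeError
        # if the character has no representation in this codec.
        data[index:index + 1] = chr(code_point).encode(self.codec)


class UTF32Encoding:
    """Strategy for big-endian UTF-32, four bytes per character."""

    def read(self, data, index):
        return int.from_bytes(data[4 * index:4 * index + 4], "big")

    def write(self, data, index, code_point):
        data[4 * index:4 * index + 4] = code_point.to_bytes(4, "big")


class EncodedString:
    def __init__(self, data, encoding):
        self.data = bytearray(data)
        self.encoding = encoding

    def at(self, index):
        # A reified Character always carries the Unicode code point.
        return chr(self.encoding.read(self.data, index))

    def put(self, index, character):
        # The strategy translates the code point into this string's encoding.
        self.encoding.write(self.data, index, ord(character))


# In cp1252 the byte 0x80 is the euro sign, U+20AC.
source = EncodedString(b"\x80", SingleByteEncoding("cp1252"))
target = EncodedString(bytes(4), UTF32Encoding())
target.put(0, source.at(0))
print(source.at(0) == target.at(0), target.data.hex())
```

Copying the raw byte 0x80 into the UTF-32 string would have produced U+0080,
a control character; going through the Character yields U+20AC in both.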

In the case of the variable-byte encodings, it might be possible to achieve
significant performance improvements by having the CharacterEncoding
instance store information that helps to more quickly translate between
logical character indices and physical byte indices within the String's
ByteArray (the RunArray of a Text is a good analogy for what I have in mind
here.)
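
One way the RunArray idea might look, sketched in Python for UTF-8.  Runs of
consecutive characters that share the same byte width are stored as
[count, width] pairs, so finding the byte offset of a character index walks
the runs instead of every byte.  The function names are hypothetical and the
sketch assumes the data is already valid UTF-8.

```python
def utf8_width(lead_byte):
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte < 0xE0:
        return 2
    if lead_byte < 0xF0:
        return 3
    return 4


def build_runs(data):
    """Return [char_count, byte_width] runs describing valid UTF-8 data."""
    runs = []
    pos = 0
    while pos < len(data):
        width = utf8_width(data[pos])
        if runs and runs[-1][1] == width:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, width])   # start a new run
        pos += width
    return runs


def byte_offset(runs, char_index):
    """Translate a 0-based character index into a byte offset."""
    offset = 0
    remaining = char_index
    for count, width in runs:
        if remaining < count:
            return offset + remaining * width
        offset += count * width
        remaining -= count
    raise IndexError(char_index)


def char_at(data, runs, char_index):
    start = byte_offset(runs, char_index)
    return data[start:start + utf8_width(data[start])].decode("utf-8")


text = "ab\u00e9\u00e9\u20ac\U0001D11Ez"
data = text.encode("utf-8")
runs = build_runs(data)
print(runs, byte_offset(runs, 4), char_at(data, runs, 5) == text[5])
```

For text that is mostly one width (plain ASCII, say) the run list stays very
short, so lookups are close to constant time.  The runs would of course have
to be kept up to date whenever the String is modified.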

--Alan




