UTF8 Squeak

Matthew Fulmer tapplek at gmail.com
Fri Jun 8 04:25:52 UTC 2007


On Thu, Jun 07, 2007 at 08:16:21PM -0700, Alan Lovejoy wrote:
> It is already the case that accessing individual characters from a String
> results in the reification of a Character object.  So, leveraging what is
> already the case, conversion to/from the internal encoding to the canonical
> (Unicode) encoding should occur when a Character object is reified from an
> encoded character in a String (or in a Stream.)  Character objects that are
> "put:" into a String would be converted from the Unicode code point to the
> encoding native to that String.  Using Character reification to/from Unicode
> as the unification mechanism provides the illusion that all Strings use the
> same code points for their characters, even though they in fact do not.
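
To make the quoted proposal concrete for myself, here is a rough
Python sketch of the idea. It is only my reading of it, not
Squeak code: the class EncodedString and its at/put methods are
hypothetical names, and it assumes single-byte native encodings
to keep things short.

```python
class EncodedString:
    """Hypothetical string that stores bytes in its own native encoding."""

    def __init__(self, raw, encoding):
        self.raw = bytearray(raw)
        self.encoding = encoding  # assumed single-byte, e.g. "latin-1" or "koi8-r"

    def at(self, index):
        # Reify a character: decode one stored byte to its Unicode character.
        return bytes([self.raw[index]]).decode(self.encoding)

    def put(self, index, char):
        # Convert from the Unicode code point to this string's native encoding.
        encoded = char.encode(self.encoding)
        if len(encoded) != 1:
            raise ValueError("this sketch only handles single-byte encodings")
        self.raw[index] = encoded[0]


# Two strings with different internal encodings.
latin = EncodedString("caf\u00e9".encode("latin-1"), "latin-1")
cyrillic = EncodedString("\u043a\u043e\u0442".encode("koi8-r"), "koi8-r")
```

Callers of at and put only ever see Unicode characters, whichever
encoding the string stores internally.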

Someone already mentioned the way Plan 9 did this and provided
a link, which I read; it sounded pretty logical. What follows
is my assessment of what I read.

The key realization the Plan 9 authors made is that random
access into strings is the exception rather than the rule.
Stream access is much more common, and much more in need of
optimization. This seems logical to me. UTF-8 is a
stream-oriented encoding of Unicode that was invented for
Plan 9 to solve this optimization issue. UTF-8 is
self-synchronizing and byte-oriented, which allows a reader to
be nearly stateless while still consuming much less memory than
UTF-32 (UCS-4). The Plan 9 authors also argued that, contrary
to what some expect, very few programs do better with UTF-32,
because very few programs really need to process a string in a
non-linear way. Regular expressions and sorting are the two
main exceptions.
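
As a small illustration of the self-synchronizing property, here
is a Python sketch (the function names are mine, and it assumes
the input is valid UTF-8). Continuation bytes always look like
10xxxxxx, so a reader dropped at an arbitrary byte offset can
find the next character boundary without any other state. The
last two lines compare storage against a fixed-width 4-byte
encoding (UTF-32).

```python
def is_continuation(byte):
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def resync(data, offset):
    # Skip forward to the next byte that starts a character.
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


def iter_chars(data, start=0):
    # Stream characters from any byte offset; assumes valid UTF-8 input.
    pos = resync(data, start)
    while pos < len(data):
        end = pos + 1
        while end < len(data) and is_continuation(data[end]):
            end += 1
        yield data[pos:end].decode("utf-8")
        pos = end


text = "na\u00efve \u65e5\u672c"   # mixes 1-, 2- and 3-byte characters
data = text.encode("utf-8")
utf8_size = len(data)
utf32_size = len(text.encode("utf-32-le"))
```

Starting iter_chars in the middle of a multi-byte character just
skips the partial character and carries on with the next one.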

UTF-8 also makes the transition smoother, since every ASCII
string is already valid UTF-8 and many ASCII programs will work
with UTF-8 unchanged.
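
A quick Python sketch of why that works (split_fields is a
made-up helper standing in for any byte-oriented ASCII program):
bytes below 0x80 never occur inside a multi-byte UTF-8 sequence,
so code that only looks for ASCII delimiters cannot cut a
character in half.

```python
def split_fields(line, sep=b","):
    # Plain byte-level split, with no knowledge of UTF-8 at all.
    return line.split(sep)


# "Zürich,東京,Oslo" written with escapes, then encoded as UTF-8.
record = "Z\u00fcrich,\u6771\u4eac,Oslo".encode("utf-8")
fields = [f.decode("utf-8") for f in split_fields(record)]

# Pure ASCII text has the same bytes in ASCII and in UTF-8.
ascii_same = "Oslo".encode("ascii") == "Oslo".encode("utf-8")

# Every byte of a multi-byte character has its high bit set.
multibyte_high = all(b >= 0x80 for b in "\u6771\u4eac".encode("utf-8"))
```

Each field still decodes cleanly after the byte-level split,
because the comma byte can only ever mean a comma.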

This is a synopsis of what I read. I am not as familiar with
this issue as you are.

-- 
Matthew Fulmer -- http://mtfulmer.wordpress.com/
Help improve Squeak Documentation: http://wiki.squeak.org/squeak/808


