Unicode support

Jarvis, Robert P. Jarvisb at timken.com
Thu Sep 23 12:22:44 UTC 1999


Your code added methods to String which then reflected back to the stored
objects.  IMO this would add considerably to the protocol which String (or
GeneralString, or whatever) would have to support.  Many of the comments
made about, e.g., the bi-directional nature of Hebrew (which is, I think, more
an issue of display than of storage), the differing word-break conventions
in other languages, etc., indicate to me that String isn't as badly broken
as some may want to see it, but that subclasses are needed to handle these
special cases.  I don't think one single, all-purpose,
FinalUltimateSuperString class which can handle all the possible special
cases is desirable or doable.  But until someone actually sits down and
starts cutting code it's just talk anyway.

Here's some interesting stats, gleaned from a more-or-less base Squeak 2.4
image:

	Number of String instances:   15080
	Number of characters in all strings:  874522

This means that if we switch from a single-byte character encoding to
Unicode, in the form where Unicode 'characters' are 16 bits wide, we add
roughly 4/5 of a meg to the image size.  If we convert to something where
each 'character' takes up 4 bytes we add about 2.5 megs to the image size.
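
For anyone who wants to check the arithmetic, here is a rough sketch in
Python.  It assumes every character currently occupies exactly one byte,
ignores per-object headers and alignment padding, and takes a "meg" to be
2^20 bytes; the function names are made up for illustration only.

```python
# Figures quoted above from a more-or-less base Squeak 2.4 image.
STRING_COUNT = 15080
TOTAL_CHARS = 874522


def added_bytes(total_chars, new_width, old_width=1):
    """Extra bytes needed when each character grows from old_width to new_width bytes.

    Assumption: only character storage changes; object headers are ignored.
    """
    return total_chars * (new_width - old_width)


def as_megs(n_bytes):
    """Convert a byte count to megs, assuming 1 meg = 2**20 bytes."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    for width in (2, 4):
        extra = added_bytes(TOTAL_CHARS, width)
        print(f"{width} bytes/char: +{extra} bytes (~{as_megs(extra):.2f} megs)")
    print(f"average string length: {TOTAL_CHARS / STRING_COUNT:.1f} characters")
```

Going to 2 bytes per character adds 874522 bytes, and going to 4 bytes adds
2623566 bytes, which is where the "4/5 of a meg" and "2.5 megs" estimates
come from.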

Bob Jarvis
The Timken Company

> -----Original Message-----
> From:	Peter Smet [SMTP:peter.smet at flinders.edu.au]
> Sent:	Wednesday, September 22, 1999 7:42 PM
> To:	squeak at cs.uiuc.edu
> Subject:	Re: Unicode support
> 
> 
> From: Jarvis, Robert P. <Jarvisb at timken.com>
> 
> 
> >It seems that what's being attempted here is to create a monster String
> >class which can do anything.  I don't think that's what String is intended
> >to be.  Let's review the class comment for String:
> 
> >String is not intended to be a collection of DNA base pairs (DnaSequence?),
> >or a collection of musical notes (Score?), or a collection of other
> >arbitrary objects (OrderedCollection?  Array?  Dictionary?).  If you need a
> >collection of DNA base pairs with specific new behavior, bite the bullet and
> >subclass the appropriate Collection class, add your specific behavior, and
> >move on.  Ditto for musical notes.  Arguably ditto for collections of
> >Unicode/hieroglyphic/whatever characters.  Just my opinion.
> 
> 
> I was trying to show that the operations that make
> sense for a String are determined by its components. The music and
> DNA samples were supposed to show this. My code was taking the
> specialized behaviour out of String, and into its components.
> 
> The last thing I want is a monster String class that knows the protocol
> of all languages. That's why I suggested the protocol for string should
> be restricted to conversion:
> 
> String as: aCharacterSet
> String asUnicode
> etc
> 
> This implies "biting the bullet" and having each encoding as a separate
> class.
> 
> The only commonality that seems to have emerged from comparing all possible
> languages and Strings is that a GeneralString has collection-like behaviour.
> 
> I can't help but think of the credo:
> 
> "do the simplest thing that could possibly work"
> 
> The immediate need appears to be for Squeak to support Unicode so that
> we can parse XML - maybe we should just concentrate on that,
> instead of a universal string format to encode all possible languages...
> 
> Peter