Unicode support

agree at carltonfields.com
Tue Sep 21 15:23:59 UTC 1999


Smalltalk-80 proved that we don't need to conceptually special-case the most common scenario to have full functionality AND efficiency: the Number hierarchy, and particularly the Integer hierarchy, is a case in point.  Smalltalk seamlessly integrates the special case (SmallInteger) in a breathtakingly fast and almost cost-free operation, while providing broad flexibility for the more general case.  The Number architecture provides the essential functionality upon which the rest is based.

Marcel seems to have this right by focusing on the essential question: "What is a string?"  Once we have the essential protocol, the rest revolves around making an intelligent hierarchy, with an eye toward making the special case or cases (ASCII/UTF or whatever) efficient as hell and cost-free in terms of function.
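
One way to picture the protocol-first approach is the rough sketch below, written in Python rather than Smalltalk.  The names AbstractString, ByteString, WideString and make_string are hypothetical; they are not taken from Squeak or from Marcel's proposal.  Shared behavior is written once against the protocol, the byte-backed class is the compact special case with a faster search, and a factory picks the representation, much as Smalltalk arithmetic answers a SmallInteger whenever the value fits.

```python
from abc import ABC, abstractmethod


class AbstractString(ABC):
    """Hypothetical 'essential protocol': a sequence of code points."""

    @abstractmethod
    def __len__(self):
        ...

    @abstractmethod
    def code_point_at(self, index):
        ...

    # Everything below is written once, against the protocol only.
    def code_points(self):
        return [self.code_point_at(i) for i in range(len(self))]

    def index_of(self, code_point):
        for i in range(len(self)):
            if self.code_point_at(i) == code_point:
                return i
        return -1

    def __eq__(self, other):
        if not isinstance(other, AbstractString):
            return NotImplemented
        return self.code_points() == other.code_points()

    def __hash__(self):
        return hash(tuple(self.code_points()))


class ByteString(AbstractString):
    """Special case: one byte per character (code points 0..255)."""

    def __init__(self, code_points):
        self._bytes = bytes(code_points)  # ValueError outside 0..255

    def __len__(self):
        return len(self._bytes)

    def code_point_at(self, index):
        return self._bytes[index]

    def index_of(self, code_point):
        # Fast path: delegate the search to the native byte search.
        if not 0 <= code_point <= 255:
            return -1
        return self._bytes.find(code_point)


class WideString(AbstractString):
    """General case: any code point, stored as a list of integers."""

    def __init__(self, code_points):
        self._points = list(code_points)

    def __len__(self):
        return len(self._points)

    def code_point_at(self, index):
        return self._points[index]


def make_string(text):
    """Pick the most compact representation that can hold the text."""
    points = [ord(ch) for ch in text]
    if all(p < 256 for p in points):
        return ByteString(points)
    return WideString(points)
```

Clients only ever see the protocol: make_string("abc") and WideString([97, 98, 99]) compare equal even though they are stored differently.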

We have already seen a number of extensions of String (Text) and the experiment was worth watching.  OpenDocs provides another model.  What we need to do is think bigger first -- what is the essence of the String -- and then how do we provide all the encodings (and then conversions between them) within that framework -- hopefully seamlessly and very, very fast whenever that is possible and desired.
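
A rough sketch of "conversions within the framework", assuming Python's built-in codecs stand in for the encoding classes: every conversion goes through one neutral form (a sequence of code points), with a shortcut for the common ASCII-only case.  The function names transcode and transcode_fast are illustrative only.

```python
import codecs

# Encodings in which pure-ASCII bytes have the identical representation.
ASCII_COMPATIBLE = {
    codecs.lookup(name).name for name in ("ascii", "latin-1", "utf-8")
}


def transcode(data, source, target):
    """General path: source encoding -> code points -> target encoding."""
    text = data.decode(source)   # neutral form: a sequence of code points
    return text.encode(target)


def transcode_fast(data, source, target):
    """Same result as transcode, but skips the work in the common case."""
    src = codecs.lookup(source).name   # normalize spellings such as "UTF8"
    dst = codecs.lookup(target).name
    if src in ASCII_COMPATIBLE and dst in ASCII_COMPATIBLE and data.isascii():
        return data                    # special case: nothing to convert
    return transcode(data, source, target)
```

For example, transcode(b"\xe9", "latin-1", "utf-8") takes the general path, while transcode_fast(b"hello", "ascii", "utf-8") hands back the input bytes unchanged.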

> -----Original Message-----
> From: PC :marcel at metaobject.com
> Sent: Tuesday, September 21, 1999 9:24 AM
> To: peter at smalltalk.org
> Cc: squeak at cs.uiuc.edu
> Subject: Re: Unicode support
>
> > From: "Peter William Lount" <peter at smalltalk.org>
> >
> > I agree with you that we shouldn't be concerned with how strings store their characters if that's all that is to be stored in a string.
>
> I don't see why the restriction.  How a string stores whatever it stores is never anybody's business, as with any other object.  Whether it stores character objects, LZW-compressed variable strings, UTF-8, whatever shouldn't matter to its clients.
>
> > It does mean that strings are based on "byte/double byte encodings" and not on general "object oriented" concepts. So we end up with many "encoding types" of strings. This is probably necessary given the reality of different encoding systems. However, it's not very general. Having a GeneralString that is entirely independent of ANY encoding system while being able to convert to any encoding system is a very powerful idea.
>
> Yes, having 'GeneralString' as an additional 'encoding' that any string is required to be able to convert itself to seems useful.  Once again, how this is actually stored is simply none of anybody's business.  Adding a class that uses this as its native encoding is also good.  Making this the *only* implementation would be suicide for many applications.
>
> > Also the GeneralString could hold more than just "characters" if characters are actual objects instead of bytes. Any object, like an icon or graphic, could be put into the string as long as it responds to the "character protocol". For example, a HTMLink object might respond with the "characters" that make up the link info. An icon would display itself. An accounting total object would show the "total" as numbers. Any of these "character objects" would be able to be linked back to their original object - a plain character or a htmlink or an accounting total object - so you can easily create "hyper links" in text.
>
> These shouldn't actually be character objects, but simply formatting objects (more like words than characters, even better would be lists of words). I recently did some experiments with the NSText systems, and found that for many cases the implementation of embedded objects as special characters is not good enough.  One problem is that single objects may represent multiple words in the output, which would have to be line-wrapped etc.  While it is possible to fake this with NSText, it is a lot more convoluted than it should be.
>
> Equating "Text" with a series of characters is the fundamental problem.  It is a series of objects, some of which may be represented words which may actually consist of characters (rough approximation).  Introducing "SuperCharacters" doesn't solve the fundamental problem of treating text as a sequence of characters.  That doesn't mean that it isn't appropriate in many situations.
>
> > NeXTStep/OpenStep (now Apple) has an amazing Text and Character system. There is no doubt that they have done their homework very well.  They have an Attributed String object that performs some of the above functions. Any professional text system should have at least the capabilities of the OpenStep text system.
>
> Yes, that is definitely a minimum standard.  However, there are many points where it needs to be improved.  Another example where Apple's text system is poor is the handling of very large texts.  For these sorts of situations, it should provide a much more simplified and less resource intensive configuration.
>
> > In conclusion, an object oriented text system should be based upon an object oriented string class that stores characters and other objects not bytes.
>
> No.  It should contain various implementations of the "string" concept that have different tradeoffs where size, generality and speed are concerned.  However, all of these should conform to a generic string protocol, which includes accessing the contents as GeneralCharacters.
>
> > The objects stored in this general string must conform to the "character protocol". A set of "conversion" objects that know how to convert between "character byte encodings" and "the general object characters" are required and are a very powerful notion.
>
> > This is a valid design just as the design you are promoting is a valid design.
>
> The crucial difference, IMHO, is that my proposal includes yours.
>
> > The key point is to make the "string" object totally object oriented in its implementation instead of basing it upon a "byte encoding".
>
> This is fine for one particular string object with a specific set of requirements.  It is not OK for others.  A "one size fits all" implementation simply is not appropriate for all situations.
>
> Marcel




