Unicode support

Marcel Weiher marcel at metaobject.com
Tue Sep 21 14:20:28 UTC 1999


> From: "Peter William Lount" <peter at smalltalk.org>
>
> I agree with you that we shouldn't be concerned with how strings store 
> their characters if that's all that is too be stored in a string.

I don't see the need for that restriction.  How a string stores
whatever it stores is never anybody's business, as with any other
object.  Whether it stores character objects, LZW-compressed variable
strings, UTF-8, or whatever else shouldn't matter to its clients.

> It does
> mean that strings are based on "byte/double byte encodings" and not on
> general "object oriented" concepts. So we end up with many "encoding types"
> of strings. This is probably necessary given the reality of different
> encoding systems. However, it's not very general. Having an GeneralString
> that is entirely independent of ANY encoding system while being able to
> convert to any encoding system is a very powerful idea.

Yes, having 'GeneralString' as an additional 'encoding' that every
string is required to be able to convert itself to seems useful.  Once
again, how this is actually stored is simply none of anybody's
business.  Adding a class that uses this as its native encoding is
also good.  Making this the *only* implementation would be suicide
for many applications.

> Also the GeneralString could hold more than just "characters" if characters
> are actual objects instead of bytes. Any object, like a icon or graphic,
> could be put into the string as long as they respond to the "character
> protocol". For example, a HTMLink object might respond with the
> "characters" that make up the link info. An icon would display itself. An
> accounting total object would show the "total" as numbers. Any of these
> "character objects" would be able to be linked back to their original
> object - a plain character or a htmlink or an accounting total object - so
> you can easily create "hyper links" in text.

These shouldn't actually be character objects, but simply formatting  
objects (more like words than characters, even better would be lists  
of words). I recently did some experiments with the NSText systems,  
and found that for many cases the implementation of embedded objects  
as special characters is not good enough.  One problem is that single  
objects may represent multiple words in the output, which would have  
to be line-wrapped etc.  While it is possible to fake this with  
NSText, it is a lot more convoluted than it should be.

Equating "Text" with a series of characters is the fundamental
problem.  It is a series of objects, some of which may be represented
as words, which in turn may consist of characters (rough
approximation).  Introducing "SuperCharacters" doesn't solve the
fundamental problem of treating text as a sequence of characters.
That doesn't mean that treating text as a sequence of characters
isn't appropriate in many situations.
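
A minimal sketch of text as a series of objects that each contribute
words, written in Python purely for illustration.  The names
TextElement, Word, AccountingTotal and wrap are hypothetical and come
from neither Squeak nor NSText.  The point it tries to show is that an
embedded object can yield several words, so the line breaker can wrap
in the middle of it:

```python
from dataclasses import dataclass
from typing import List, Protocol


class TextElement(Protocol):
    """Hypothetical protocol: anything that can render itself as words."""

    def words(self) -> List[str]: ...


@dataclass
class Word:
    text: str

    def words(self) -> List[str]:
        return [self.text]


@dataclass
class AccountingTotal:
    """An embedded object that renders as several words, not one character."""

    label: str
    amount: float

    def words(self) -> List[str]:
        return self.label.split() + [f"{self.amount:.2f}"]


def wrap(elements: List[TextElement], width: int) -> List[str]:
    """Greedy line breaking over the words of every element."""
    lines: List[str] = []
    current = ""
    for element in elements:
        for word in element.words():
            if current and len(current) + 1 + len(word) > width:
                lines.append(current)  # line is full, start a new one
                current = word
            else:
                current = f"{current} {word}" if current else word
    if current:
        lines.append(current)
    return lines


text = [Word("The"), Word("balance"), Word("is"),
        AccountingTotal("grand total:", 1234.5)]
lines = wrap(text, width=16)
```

With a width of 16 the total object ends up split across two lines,
which a one-object-is-one-character model cannot express directly.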

> NeXTStep/OpenStep (now Apple) has an amazing Text and Character system.
> There is no doubt that they have done their homework very well. They have
> an Attributed String object that performs some of the above functions. Any
> professional text system should have at least the capabilities of the
> OpenStep text system.

Yes, that is definitely a minimum standard.  However, there are many
points where it needs to be improved.  Another example where Apple's
text system is poor is the handling of very large texts.  For these
sorts of situations, it should provide a much simpler and less
resource-intensive configuration.

> In conclusion, an object oriented text system should be based upon an
> object oriented string class that stores characters and other objects not
> bytes.

No.  It should contain various implementations of the "string"
concept that have different tradeoffs where size, generality and
speed are concerned.  However, all of these should conform to a
generic string protocol, which includes accessing the contents as
GeneralCharacters.
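
A rough sketch of what such a protocol could look like, in Python for
illustration only.  The class names (StringProtocol, Latin1String,
Utf8String, ObjectString) are assumptions made up for this example,
and Python's one-character str stands in for a "GeneralCharacter".
Each class keeps its storage private and only promises to hand out
its contents as general characters:

```python
from abc import ABC, abstractmethod
from typing import Iterator


class StringProtocol(ABC):
    """Hypothetical generic string protocol; storage is each class's business."""

    @abstractmethod
    def characters(self) -> Iterator[str]:
        """Yield the contents one general character at a time."""

    def __len__(self) -> int:
        return sum(1 for _ in self.characters())

    def __eq__(self, other: object) -> bool:
        # Equality is defined on the characters, never on the storage.
        if not isinstance(other, StringProtocol):
            return NotImplemented
        return list(self.characters()) == list(other.characters())


class Latin1String(StringProtocol):
    """Compact: one byte per character, but limited to Latin-1."""

    def __init__(self, text: str) -> None:
        self._bytes = text.encode("latin-1")

    def characters(self) -> Iterator[str]:
        return (chr(b) for b in self._bytes)


class Utf8String(StringProtocol):
    """Covers all of Unicode with variable-width storage."""

    def __init__(self, text: str) -> None:
        self._bytes = text.encode("utf-8")

    def characters(self) -> Iterator[str]:
        return iter(self._bytes.decode("utf-8"))


class ObjectString(StringProtocol):
    """Most general and most expensive: one object per character."""

    def __init__(self, text: str) -> None:
        self._chars = list(text)

    def characters(self) -> Iterator[str]:
        return iter(self._chars)


a = Latin1String("caf\u00e9")
b = Utf8String("caf\u00e9")
c = ObjectString("caf\u00e9")
same = a == b == c  # clients compare strings without knowing the storage
```

The three objects compare equal even though one holds four bytes, one
holds five, and one holds a list of character objects.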

> The objects stored this general string must conform to the
> "character protocol". A set of "conversion" objects that know how to 
> convert between "character byte encodings" and "the general object 
> characters" are required and are a very powerful notion.

> This is a valid
> design just as the design you are promoting is a valid design.

The crucial difference, IMHO, is that my proposal includes yours.

> The key
> point is to make the "string" object totally object oriented in it's 
> implementation instead of basing it upon a "byte encoding".

This is fine for one particular string object with a specific set of  
requirements.  It is not OK for others.  A "one size fits all"  
implementation simply is not appropriate for all situations.

Marcel





More information about the Squeak-dev mailing list