String hierarchy (was: UTF-8 (was ...))

Maurice Rabb m3rabb at stono.com
Fri Mar 17 01:19:15 UTC 2000


At 2:12 PM -0800 3/16/00, Dan Ingalls wrote:
>AGREE at CarltonFields.com wrote...
> >Of course it ain't trivial, but perhaps there's an interim, if not
> >ad hoc solution that serves every relevant purpose?  It seems to me
> >that the Number hierarchy is proof positive that widely disparate,
> >differently sized and incomparable models with similar features can
> >be resolved into a seamless whole.
> >
> >In a sense, isn't a pure ASCII string just a subset of UTF-8?
> >Can't a hierarchy with built-in coercion be used to preserve ALL of
> >the efficiencies of the status quo, while still permitting (or at
> >least paving the way) toward the full generality of UTF-8 and
> >Unicode?
> >
> >Why can't the ASCII string be the SmallInteger of a new
> >STRINGTHING hierarchy, where operations within the string world be
> >seamless?  Every time I raise this point, there were countless
> >objections about things Squeak so configured could not do (the
> >biggest deal was auto-reversing Hebrew/Anglo-Numeric text), but it
> >seems that we could still accommodate many of the advantages of
> >Unicode, integrate the whole into Squeak, while preserving ALL of
> >the efficiencies of the present ASCII world for unmixed ASCII and
> >Character stuff.
>
>I agree with this approach entirely.  It's a great Squeak Samurai 
>project (I would do it tonight, but I've got a hot date ;-).  Just 
>put StringThing between ArrayedCollection and String, move all of 
>String's methods up a level, leaving only those that have to do with 
>String's primitive behavior.  It shouldn't take more than an hour, 
>and everything should still work.
>
>Then... define, say, String16 (*) that uses 16 bits and produces 
>characters with codes up to 65535.  Make one up like 'Squ<999>eak', 
>and see if it prints.  Then see if it displays.  Etc.  Lots of 
>things will break, but that's half the fun.  You'll find out if text 
>display handles characters that are not in the font, and you'll have 
>to decide whether all characters will still be unique, but this is 
>what life on the frontier is all about.
>
>When in doubt, try it out.
>
>	- Dan
>
>(*) It's probably worth starting with the most general expansion 
>first.  Then from there on, it's only optimization and engineering 
>to do the others -- the interfaces will have all been worked out.
>
>PS:  I'm not saying SqC will embrace Unicode, I'm just saying that 
>it may only take a couple of days to understand most of what is 
>involved.


I know that this may be viewed as blasphemy, but this is another 
compelling reason that String should be removed from the Collection 
hierarchy.  IMHO, the continued inclusion of String in the Collection 
hierarchy is a serious mistake that continues to beget problems.

Including String in the Collection hierarchy not only reveals its 
implementation but forces its type.  It forces an "is-a" relationship 
instead of a more appropriate "has-a" relationship.  Though strings 
often _act_ as collections, they are more than just collections.  All 
that should matter is that a string can answer 
aSequenceableCollection of its contents when the appropriate message 
is sent; e.g. #characters|#elements|#contents.

(Does Kent Beck have any thoughts on this?)

I first began to wonder about the location of String in the class 
hierarchy when considering all of the special methods used to prevent 
accidentally enumerating a string instead of treating it as a 
singular object.  I became convinced of the problem when trying to 
expand the behavior of string types.

String <indexed bytes>
     Symbol
         Selector

The implementation optimizations that motivated this layout are 
obvious; however, the current implementation's rigidity complicates 
appropriate design in other aspects of string use.  Appropriate use 
of protocol is tantamount to good design.

The current implementation makes it difficult to:
- Allow strings to use self managing compression;
- Allow symbols to remove or obscure their contents;
- Allow symbols to cache a (better) hash value;
- Allow selectors to have direct references to synonyms or other 
related selectors.

(BTW, I am aware that the Squeak VM allows you to use SmallIntegers 
in place of selectors.  That is orthogonal to my intentions.)

(The last item is useful for efficiently implementing multi-dispatch 
messaging in deep class subhierarchies.  Each selector can hold a 
reference to the next most general selector with a supertype name 
embedded in it.  This avoids string character manipulation or 
concatenation, and symbol identity/existence table lookup.)
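
Here is a minimal sketch of how I read that selector chain, written in Python only so it runs standalone.  The names Selector, more_general and dispatch are hypothetical, as are the example selector names; none of this is existing Squeak code.

```python
class Selector:
    """Hypothetical selector that knows its next most general sibling."""

    def __init__(self, name, more_general=None):
        self.name = name
        # Direct reference, so no name has to be built or looked up later.
        self.more_general = more_general


def dispatch(methods, selector):
    """Walk from the most specific selector toward more general ones
    and answer the first method found in the given method dictionary."""
    while selector is not None:
        method = methods.get(selector)  # keyed by object identity
        if method is not None:
            return method
        selector = selector.more_general
    raise LookupError("message not understood")


# Assumed example chain: SmallInteger -> Integer -> Number.
add_number = Selector("add:Number")
add_integer = Selector("add:Integer", more_general=add_number)
add_small = Selector("add:SmallInteger", more_general=add_integer)

methods = {add_number: lambda: "generic add"}
found = dispatch(methods, add_small)
```

The lookup follows object references only; it never concatenates a supertype name onto a selector string and never consults a symbol table.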

Ideally String would have one ivar 'contents' which would delegate 
its representation to an arrayed or encoding object.
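
A rough sketch of that "has-a" arrangement follows, in Python only so it can be run as-is.  ByteContents, WideContents and the method names are hypothetical; the real split of responsibilities in Squeak would need to be worked out.

```python
class ByteContents:
    """Hypothetical storage object: one byte per character, ASCII only."""

    def __init__(self, text):
        # Raises UnicodeEncodeError if any character is outside ASCII.
        self._bytes = bytearray(text.encode("ascii"))

    def characters(self):
        return [chr(b) for b in self._bytes]


class WideContents:
    """Hypothetical storage object: one integer code per character."""

    def __init__(self, text):
        self._codes = [ord(c) for c in text]

    def characters(self):
        return [chr(c) for c in self._codes]


class String:
    """Not a collection itself; it holds one and answers it on request."""

    def __init__(self, text):
        # Pick the cheapest representation; callers never see which one.
        try:
            self._contents = ByteContents(text)
        except UnicodeEncodeError:
            self._contents = WideContents(text)

    def characters(self):
        return self._contents.characters()

    def size(self):
        return len(self.characters())


plain = String("Squeak")
wide = String("Squ" + chr(999) + "eak")  # Dan's 'Squ<999>eak' test string
```

Both objects answer the same protocol, and the representation can change (compression, another encoding) without touching String's clients.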

I recognize the importance, and high degree of interdependence, of 
String in Smalltalk.  Moving String from under Collection is not that 
difficult; however, finding every place that a string is used as a 
collection is non-trivial.  (What is the best/easiest way to do 
this?)  Initially the collection protocols that are used by strings 
could be copied to String in its new place in the hierarchy.  All such 
methods would be commented as discouraged, and would recommend the 
use of the idioms: 'aString contents someCollectionMethod', or 
'aString contentsDo: aBlock'. (#contents, #characters, whatever!) 
After a few revs of being weaned, perhaps we could eliminate the 
direct collection protocols from String.
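
One possible shape for the weaning period, sketched in Python so it runs standalone.  The method names contents and contents_do stand in for #contents and #contentsDo:, and using a runtime warning to flag the old protocol is my assumption, not something Squeak provides in this form.

```python
import warnings


class String:
    """Hypothetical String that is no longer a collection subclass."""

    def __init__(self, text):
        self._contents = list(text)

    # Preferred idioms.
    def contents(self):
        return list(self._contents)

    def contents_do(self, block):
        for ch in self._contents:
            block(ch)

    # Legacy collection protocol, copied over and marked as discouraged.
    def __iter__(self):
        warnings.warn(
            "enumerating a String directly is discouraged; "
            "use contents() or contents_do()",
            DeprecationWarning,
            stacklevel=2,  # report the caller, not this method
        )
        return iter(self._contents)


s = String("abc")
seen = []
s.contents_do(seen.append)          # preferred: no warning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = list(s)                # still works, but is flagged
```

A warning that reports the caller also gives one cheap way to find the places where a string is still being used as a collection: run the system and collect the flagged call sites.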


In the meantime, I agree that changing String within the Collection 
hierarchy will be the easiest way to solve the element encoding 
problem.

At 5:46 PM -0500 3/16/00, Doug Way wrote:
>Sounds great to me, too.  Except maybe call the new class something other
>than StringThing... maybe "AbstractString" might be most appropriate?
>(Naming is important, y'know... :-))


I agree.  Naming is very important.  Arguably, the _most_ important thing.

Whatever you do, please, please, please!!! name the abstract string 
class String.  I know that it involves extra steps, but IMHO it would 
be best to keep the name pure.

Perhaps:

String
     UnicodeString
         Utf8String
             AsciiString

Whatever intermediate classes you use or don't use, push the 
primitive string calls into AsciiString.
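
To make the Number-style coercion concrete, here is a small sketch in Python (chosen only so it runs standalone).  It leaves out the intermediate class, and the generality ranking, from_text, concat and as_text are all hypothetical names of mine.

```python
class String:
    """Abstract root: shared protocol only, no storage assumptions."""

    generality = 0

    def __init__(self, codes):
        self.codes = list(codes)

    @staticmethod
    def from_text(text):
        # Answer the least general class that can hold the text.
        codes = [ord(c) for c in text]
        cls = AsciiString if all(c < 128 for c in codes) else UnicodeString
        return cls(codes)

    def concat(self, other):
        # Like mixed-mode arithmetic: the result takes the more
        # general of the two receiver classes.
        cls = max(type(self), type(other), key=lambda c: c.generality)
        return cls(self.codes + other.codes)

    def as_text(self):
        return "".join(chr(c) for c in self.codes)


class UnicodeString(String):
    generality = 2


class AsciiString(UnicodeString):
    # The byte-oriented primitives would live here and nowhere else.
    generality = 1


ascii_part = String.from_text("Squ")
wide_part = String.from_text(chr(999) + "eak")
mixed = ascii_part.concat(wide_part)
```

Unmixed ASCII stays in AsciiString, so it keeps whatever fast paths exist today; only a mixed operation pays for the wider class.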

Good luck!

--Maurice


---------------------------------------------------------------------------
   Maurice Rabb    773.281.6003    Stono Technologies, LLC    Chicago, USA





More information about the Squeak-dev mailing list