String hierarchy (was: UTC-8 (was ...))

Andrew P. Black black at
Mon Mar 20 23:31:42 UTC 2000

About a year ago I implemented immutable strings in Java.  But,  I 
hear you say, immutable strings are part of the standard Java 
library.  Yes, but not like this.

Cedar Mesa work at PARC included a data structure called a Rope (an 
"industrial strength" String).  A Rope is an immutable string 
structured as a tree.  The leaves are "flat strings", simple arrays 
of characters.  The interior nodes are

	(1) concat nodes, which represent a rope that is the 
concatenation of the left and right child ropes; or

	(2) substring nodes, which represent a substring of a child 
rope, indicated by starting offset and length; or

	(3) functional nodes, which compute the characters of their 
contents by executing code.  For example, a functional node might 
represent a rope as a file name, length and offset within that file.

In an object-oriented language, (3) would of course be replaced by 
"any object that implements the core methods that define 

Ropes appeared again in PARC's Portable Common Runtime (under the 
name "cords", because they were better than strings, but not as tough 
as ropes), as a sample application for Hans Boehm's garbage 
collector, and as part of SGI's C++ standard template library.  All I 
did was re-implement Ropes in Java.

Ropes have a number of nice properties.

	(1) They are immutable.  We have already talked about some of 
the advantages of this, primarily the fact that immutable strings can 
be shared freely without the need to protect oneself by making a copy.

	(2) Concatenation, Substring, and fetching the i th character 
can all be implemented in time proportional to the log of the size of 
the argument.  In practice, we delay re-balancing in order to make 
concatenation take unit time, because concatenation is such an 
important operation.

	(3) Smalltalk style iterators (do:, collect:, etc.) take time 
proportional to the length of the rope.  Loops of the form

	for i := 1 to length(str) do
		{   code involving str[i] .... }

might take time n log n  (where n is the length of the string), 
unless we do some careful caching.  It is hard to get C and Java 
programmers to stop writing code like this.  Fortunately, in 
Smalltalk this should not be a problem.

	(4) String operations scale to long strings, with good performance.

	(5) Strings stored in files and strings available over the 
network, for example, can be treated in exactly the same way as 
strings in main memory.  Most of the string manipulation functions 
can be written once in an abstract superclass and reused.

	(6) In Java, all strings are Unicode strings, but there is 
really no need to allocate the same number of bytes for all 
characters in a String.  A long ascii string with a few 2-byte 
characters in the middle could be represented efficiently as the 
concatenation of 1-byte per character flat strings and a 2-byte 
character flat string.

The main advantage of immutablity is that once the programmer is 
_assured_ that a string is immutable, she is freed from concerns 
about whether it is necessary to synchronize updates, make copies, 
update caches, etc.  If mutable strings exist, it of frequently 
necessary to copy them, just to guard against the possibility that 
they might change in the future.

One possibility would  be to have a subclass "InitiallyMutableString" 
which can be destructively updated, and then "frozen".  Once an 
InitiallyMutableString is frozen, it would become a String 
(immutable) and the update operations would fail.  This could be 
implemented with becomes: or by setting a bit in the string.

I had been hoping to find time to translate my Java implementation of 
Ropes into Smalltalk.  There are several "gotchas" in Java that 
prevent them working as one would like.  In particular, since String 
is a class, not an interface, it is in general impossible to replace 
String by something better, without recompiling the whole world.  It 
is not even possible to subclass String to introduce other variants, 
because String is a _final_ class.   And then there is the problem 
with numeric indexing loops mentioned above.

In Smalltalk these problems don't arise.   I had really looked 
forward, for example, to using file strings in the directory browser, 
instead of the current hack of reading only the first 5k of the file.

I have a student working on the Java implementation right now.  My 
hope is to have her run some comparative performance measures on my 
version (cord.String) and  java.lang.String, and see if it makes 
sense to lobby the Java world for a wholesale replacement.  Once 
these measurements are done, I think that I could offer to do a 
Smalltalk port.   Most of the code is straightforward, with the 
exception of tree balancing.  However, there is some performance 
tuning that needs to be done on a per-platform basis.    For example, 
concatenating "ab" and "cd" is most efficiently done by  copying them 
into a new four character string, but concatenating a two 1k strings 
is best done by allocating a concat node.  Up to what length should 
one copy?  The answer depends of the relative costs of copying, and 
allocation, as well as the pattern of usage; it needs to be tuned on 
each system.

If someone thinks that "Smalltalk Ropes" are the right way forward, 
and can't wait to get started, I would be happy to provide them with 
the Java version.


References:  On the SGI C++ implementation and

Background and Performance data: Boehm, Atkinson, and Plass, "Ropes: 
An Alternative to Strings", Software-Practice and Experience 25, 12 
(Dec 1995), pp. 1315-1330.

More information about the Squeak-dev mailing list