String hierarchy (was: UTC-8 (was ...))
Andrew P. Black
black at cse.ogi.edu
Mon Mar 20 23:31:42 UTC 2000
About a year ago I implemented immutable strings in Java. But, I
hear you say, immutable strings are part of the standard Java
library. Yes, but not like this.
Cedar Mesa work at PARC included a data structure called a Rope (an
"industrial strength" String). A Rope is an immutable string
structured as a tree. The leaves are "flat strings", simple arrays
of characters. The interior nodes are
(1) concat nodes, which represent a rope that is the
concatenation of the left and right child ropes; or
(2) substring nodes, which represent a substring of a child
rope, indicated by starting offset and length; or
(3) functional nodes, which compute the characters of their
contents by executing code. For example, a functional node might
represent a rope as a file name, length and offset within that file.
In an object-oriented language, (3) would of course be replaced by
"any object that implements the core methods that define
ropefullness".
Ropes appeared again in PARC's Portable Common Runtime (under the
name "cords", because they were better than strings, but not as tough
as ropes), as a sample application for Hans Boehm's garbage
collector, and as part of SGI's C++ standard template library. All I
did was re-implement Ropes in Java.
Ropes have a number of nice properties.
(1) They are immutable. We have already talked about some of
the advantages of this, primarily the fact that immutable strings can
be shared freely without the need to protect oneself by making a copy.
(2) Concatenation, Substring, and fetching the i th character
can all be implemented in time proportional to the log of the size of
the argument. In practice, we delay re-balancing in order to make
concatenation take unit time, because concatenation is such an
important operation.
(3) Smalltalk style iterators (do:, collect:, etc.) take time
proportional to the length of the rope. Loops of the form
for i := 1 to length(str) do
{ code involving str[i] .... }
might take time n log n (where n is the length of the string),
unless we do some careful caching. It is hard to get C and Java
programmers to stop writing code like this. Fortunately, in
Smalltalk this should not be a problem.
(4) String operations scale to long strings, with good performance.
(5) Strings stored in files and strings available over the
network, for example, can be treated in exactly the same way as
strings in main memory. Most of the string manipulation functions
can be written once in an abstract superclass and reused.
(6) In Java, all strings are Unicode strings, but there is
really no need to allocate the same number of bytes for all
characters in a String. A long ascii string with a few 2-byte
characters in the middle could be represented efficiently as the
concatenation of 1-byte per character flat strings and a 2-byte
character flat string.
The main advantage of immutablity is that once the programmer is
_assured_ that a string is immutable, she is freed from concerns
about whether it is necessary to synchronize updates, make copies,
update caches, etc. If mutable strings exist, it of frequently
necessary to copy them, just to guard against the possibility that
they might change in the future.
One possibility would be to have a subclass "InitiallyMutableString"
which can be destructively updated, and then "frozen". Once an
InitiallyMutableString is frozen, it would become a String
(immutable) and the update operations would fail. This could be
implemented with becomes: or by setting a bit in the string.
I had been hoping to find time to translate my Java implementation of
Ropes into Smalltalk. There are several "gotchas" in Java that
prevent them working as one would like. In particular, since String
is a class, not an interface, it is in general impossible to replace
String by something better, without recompiling the whole world. It
is not even possible to subclass String to introduce other variants,
because String is a _final_ class. And then there is the problem
with numeric indexing loops mentioned above.
In Smalltalk these problems don't arise. I had really looked
forward, for example, to using file strings in the directory browser,
instead of the current hack of reading only the first 5k of the file.
I have a student working on the Java implementation right now. My
hope is to have her run some comparative performance measures on my
version (cord.String) and java.lang.String, and see if it makes
sense to lobby the Java world for a wholesale replacement. Once
these measurements are done, I think that I could offer to do a
Smalltalk port. Most of the code is straightforward, with the
exception of tree balancing. However, there is some performance
tuning that needs to be done on a per-platform basis. For example,
concatenating "ab" and "cd" is most efficiently done by copying them
into a new four character string, but concatenating a two 1k strings
is best done by allocating a concat node. Up to what length should
one copy? The answer depends of the relative costs of copying, and
allocation, as well as the pattern of usage; it needs to be tuned on
each system.
If someone thinks that "Smalltalk Ropes" are the right way forward,
and can't wait to get started, I would be happy to provide them with
the Java version.
Andrew
References: On the SGI C++ implementation
http://www.sgi.com/Technology/STL/Rope.html and
http://reality.sgi.com/boehm/ropeimpl.html.
Background and Performance data: Boehm, Atkinson, and Plass, "Ropes:
An Alternative to Strings", Software-Practice and Experience 25, 12
(Dec 1995), pp. 1315-1330.
More information about the Squeak-dev
mailing list
|