String hierarchy (was: UTC-8 (was ...))

Richard A. O'Keefe ok at atlas.otago.ac.nz
Thu Mar 23 02:08:49 UTC 2000


Maurice Rabb wrote:
	When using immutable strings, the obvious cost is to make any change 
	you must use copying.  This can be very expensive in term of memory 
	use, and processing time.  If you want to make a single char change, 
	and entirely new string of size N must be allocated, and copied.  The 
	obvious advantage is safety.
	
	Using mutable strings has the advantage of speed and efficient use of 
	memory.

1.  The serious problem with mutable strings is that you cannot take a
    substring of (the current contents of) a substring without making a
    copy.  Reverting to Lisp/Scheme for a moment,
	(substring String Beginning End)
    where 0 <= Beginning <= End <= |Source|
    is an O(End-Beginning) operation.  With immutable strings, it is
    constant time.

    This matters a lot for parsing.  If you are given a method to compile,
    in the form of an immutable string, your tokens can be substrings of
    that string, without copying or allocating room for any more bytes.

    This means that it is READ-ONLY strings that have the advantage of
    efficient use of memory.

2.  The O(N) update cost alleged for read-only strings applies to one
    very specific implementation; it is NOT generally true.  An obvious
    implementation of strings would give you O(lgN) time and space to
    create an updated version of a read-only string of length N.  There
    are plenty of others.  Some of them give you rather better costs for
    concatenation than mutable strings do (for example, concatenating
    M strings to yield a result with N characters might take O(M.lgN)
    time, regardless of the order the concatenations are done in)

3.  Smalltalk programmers already know how to make lots of little changes
    to a string efficiently:  create a WriteStream, march along the string
    copying the bits that should not change and writing out the new versions
    of the bits that should, then pick up the result from the WriteStream.
    A small revision of that (where the string being built is private to
    the WriteStream and eventually an immutable string is delivered) would
    work nicely for immutable strings.

4.  String = mutable array of byte _does_ have a speed advantage, IF all
    you are doing is looking at the string one character at a time from
    the outside.  That's often not a good approach, and in Unicode it is
    a very poor approach.  Even in ASCII, it's in the official standard
    that (A,BackSpace,") may be three ASCII codes, but they code for ONE
    character (A with umlaut).  Read the standard if you don't believe
    me.  That was dropped from ISO 8859 because they added lots of accented
    letters (but alas, no replacement for (=,BS,/)).  Floating diacriticals
    in Unicode make this issue come back and come back fighting mad.
    To the best of my knowledge, there is *no* upper bound on the length
    of the block of adjacent Unicode codes representing a single character.
    Accessing the *characters* of a string is a *different* operation from
    accessing the *codes* of a string, and making the latter fast doesn't
    really help much with the former.





More information about the Squeak-dev mailing list