String hierarchy (was: UTC-8 (was ...))
Richard A. O'Keefe
ok at atlas.otago.ac.nz
Thu Mar 23 02:08:49 UTC 2000
Maurice Rabb wrote:
When using immutable strings, the obvious cost is to make any change
you must use copying. This can be very expensive in term of memory
use, and processing time. If you want to make a single char change,
and entirely new string of size N must be allocated, and copied. The
obvious advantage is safety.
Using mutable strings has the advantage of speed and efficient use of
memory.
1. The serious problem with mutable strings is that you cannot take a
substring of (the current contents of) a substring without making a
copy. Reverting to Lisp/Scheme for a moment,
(substring String Beginning End)
where 0 <= Beginning <= End <= |Source|
is an O(End-Beginning) operation. With immutable strings, it is
constant time.
This matters a lot for parsing. If you are given a method to compile,
in the form of an immutable string, your tokens can be substrings of
that string, without copying or allocating room for any more bytes.
This means that it is READ-ONLY strings that have the advantage of
efficient use of memory.
2. The O(N) update cost alleged for read-only strings applies to one
very specific implementation; it is NOT generally true. An obvious
implementation of strings would give you O(lgN) time and space to
create an updated version of a read-only string of length N. There
are plenty of others. Some of them give you rather better costs for
concatenation than mutable strings do (for example, concatenating
M strings to yield a result with N characters might take O(M.lgN)
time, regardless of the order the concatenations are done in)
3. Smalltalk programmers already know how to make lots of little changes
to a string efficiently: create a WriteStream, march along the string
copying the bits that should not change and writing out the new versions
of the bits that should, then pick up the result from the WriteStream.
A small revision of that (where the string being built is private to
the WriteStream and eventually an immutable string is delivered) would
work nicely for immutable strings.
4. String = mutable array of byte _does_ have a speed advantage, IF all
you are doing is looking at the string one character at a time from
the outside. That's often not a good approach, and in Unicode it is
a very poor approach. Even in ASCII, it's in the official standard
that (A,BackSpace,") may be three ASCII codes, but they code for ONE
character (A with umlaut). Read the standard if you don't believe
me. That was dropped from ISO 8859 because they added lots of accented
letters (but alas, no replacement for (=,BS,/)). Floating diacriticals
in Unicode make this issue come back and come back fighting mad.
To the best of my knowledge, there is *no* upper bound on the length
of the block of adjacent Unicode codes representing a single character.
Accessing the *characters* of a string is a *different* operation from
accessing the *codes* of a string, and making the latter fast doesn't
really help much with the former.
More information about the Squeak-dev
mailing list
|