 ## [ENH] CharacterTimes-rsb

Richard A. O'Keefe ok at cs.otago.ac.nz
Thu Jun 24 03:40:47 UTC 2004

Stephan Rudlof <sr at evolgo.de> wrote:
I second this.

It takes time to read your mails, but it takes a lot more time
for you to write them!

Thank you.

For me recently the Set/hash discussion was illuminating,
especially the point, that a Set shouldn't contain itself in the
world of actual programming languages.  My first thought about
this has been mathematically (where a Set may include itself, if
I remember correctly).

I really want to take all the sting out of this, but I'm afraid you haven't
remembered correctly.  If you're fairly comfortable with mathematics, a
book that I found extremely helpful is
The Joy of Sets : Fundamentals of Contemporary Set Theory
Keith Devlin
Springer-Verlag 1993
ISBN 0387940944
List price, about USD50, you can get a used copy over the web for USD35.

A core feature of "standard" (Zermelo-Frankel or Goedel-Bernays) set theory
is that the universe of sets is WELL-ORDERED.  There are no infinite chains
x0 \in x1 \in x2 \in .....
The way Devlin explains this in The Joy of Sets is by constructing the ZF
universe as a transfinite sequence of "layers", such that a set in some
layer may only contain elements from strictly earlier layers.  The bottom
layer is the empty set.

The fact that there are no infinite descending chains is incredibly
important, because it means that you get to use proof by induction.

The programming language analogue of this is that you get to use loops
and recursion.  Take this little loop:

x := aSet.
[x isKindOf: Set] whileTrue: [x := x anyOne].

When can we expect this loop to terminate?  When aSet is well-founded
with respect to #includes:.

Now, Peter Aczel has written a book "Non-well-founded Set Theory".
I even have a copy of it.  To be honest, past chapter 1 I didn't really
understand much of it, only enough to decide I didn't really need it.
The Joy of Sets has a discussion of non-well-founded set theory near the
end.  But it's still true that when mathematicians talk about "sets"
(as opposed to "non-well-founded sets") they mean ZF or GB -- which
describe the same sets using different axioms -- or either of those
extended with atoms that are not sets.

Now there are excellent reasons to construct graph structures in
programming languages.  Mathematicians talk about graphs as pairs of
sets (Nodes, Edges), but in a programming language it may be more
convenient to think of a set of Node objects (this set does not point
to itself) with each Node having a set of Nodes that it is connected
to, so that the pointer graph of node -> neighbour set -> node DOES
have cycles, and the collection of Nodes is NOT well-ordered under
#hasSuccessor: .  The loop

x := aNode.
[x neighbours isEmpty] whileFalse: [x := x neighbours anyOne].

may well not terminate.  "Loop terminates" if and only if "stepping
relationship is well founded", so knowing when a relationship is
well-founded is of great practical interest to programmers.

One point did come up in this discussion which no-one seems to have
picked up, and it does have important practical consequences for Squeak
(or any other Smalltalk-like language) and we might need to think about
a change to #= and #hash to deal with it.

I'm going to run this example in Ambrai Smalltalk.  I _like_ Ambrai
Smalltalk, although I must say that the main thing I have learned from
it is how much I prefer the Squeak IDE.  Create a Workspace window, type

|x|
x := Array with: 1.
x at: 1 put: x.
x hash

then Cmd-A Cmd-D (this would be Cmd-P in Squeak) and wait for a while.
Ka-BOOM!  Ambrai Smalltalk softly and suddently vanishes away.  The
reason is

IndexedCollection>>
hash
|size hash|
size := self size.
hash := "doesn't matter"
^size <= 5
ifTrue: [
1 to: size do: [:i| hash := "a function of hash and (self at: i)"].
hash]
ifFalse: [ "doesn't matter" ]

So to compute x hash, it tries to compute x hash, to compute x hash, ...
and when the stack overflows, Ambrai dies.

Squeak has a different and in some ways better hash function:

SequenceableCollection>>
hash
|hash|
hash := self species hash.
1 to: self size do: [:i | hash := "a function of hash and (self at: i)"].
^hash

It has the same problem.  The important difference is that Squeak catches
and can recover from stack overflows.

I spent a long time preparing a Lisp example to show that (sxhash xobject)
should work for cyclic objects, but in the two Common Lisp implementations
I tried, it didn't.  (I had to get a technician to show me how to turn my
new eMac back on again.  This was rather humiliating.)  If you think about
it, this is not altogether unreasonable:  the point of hash functions is that

(equal x y) ===> (= (sxhash x) (sxhash y))

so that if (equal x y) -- or for us, x = y -- isn't going to terminate,
it doesn't matter that (sxhash x) -- or for us, x hash -- isn't going to
terminate either.  But suppose we do this:

|s|
s := Set new.
1 to: 10 do: [:i| |a|
a := Array with: i with: nil.
a at: 2 put: a.

Here we have a case where x = y is going to terminate quickly, but
x hash will not terminate and neither will y hash.

There are several ways to work around the problem.  One of the simplest
is this.

Object>>
hash
^self hashToDepth: 8
hashToDepth: d
^0 < d ifTrue: [userHashToDepth: d-1] ifFalse: [self class identityHash]
userHashToDepth: d
^self identityHash

Then replace all

hash
.... x hash ....

methods with

userHashToDepth: d
.... (x hashToDepth: d) ....

methods, or, if the class is such that objects _cannot_ be cyclic,
such as numbers, bitmaps, strings, unboxed arrays, and so on, retain
the existing #hash methods and add

userHashToDepth: d
^self hash

This doesn't stop hashing taking an unreasonably long time, but it does
mean that it will terminate (if I haven't made any silly mistakes).

This would handle my example of a Set of arrays each of which has a
distinct first element (but is cyclic via its second element).

Even without cyclic structures, a depth-bounded hash can be useful
because it reduces the amount of work done for large trees.

That leaves us with the problem of defining #= for cyclic data structures.
This has actually been solved by Prolog implementors, or at least,
implementors of the ISO Prolog standard.  In Prolog,
L = [1|L]
makes L a cyclic list of ones.  In "classic" Prolog,
L = L
could then loop.  In ISO Prolog,
L = L
must terminate (and should of course say "yes").  The method goes
something like this:

to determine whether x and y are equal:
determine whether x and y are equal assuming {}

to determine whether x and y are equal assuming equalities
if x or y is an immediate value,
determine equality by that way.
if x and y do not have the same constructor,
report that they are not equal.
if (x = y) is amongst the assumed equalities,
report True.
otherwise,
let equalities' = equalities U {x = y}
for each corresponding component xi of x, yi of y
determine whether xi and yi are equal using equalities'
if they aren't, report False.

Something like this could obviously be programmed in Smalltalk, with
= anObject calling equals: anObject assuming: equations and
that calling userEquals: anObject assuming: equations.
I'd rather not think about it.

What this means for us as Smalltalk programmers is that

(a) Smalltalk *could* have been designed so that #hash and #=
supported cyclic objects.  It wasn't.

This isn't surprising.  The original Smalltalk-80 used reference
counting, so you really were NOT supposed to have cyclic objects at
all anyway.  Trying to put a Set inside itself would have been a big
No-NO on garbage collection grounds.

(b) We could change Squeak so that it does handle #hash and #= for
cyclic objects.  That would be a lot of work, and a lot of examples
in Smalltalk books would need changing.  But unless and until #hash
and #= _are_ changed suitably,

(c) If we design some classes so that object cycles may occur, we
should be very careful about #= and \$hash.

In the case of such a cyclic graph of objects, the default implementations
of #= and #hash inherited from Object may be just right.  I don't mean
just "pragmatically useful", I mean "correct".  In such a case the object
identity often really is part of what it represents.

But sometimes we have objects that belong to (classes that inherit from)
system classes with _other_ definitions of #= and #hash, and then we
must be very careful to break the cycles.

(d) Even if we fix #= and #hash, the assumption "there are no cycles" is
deeply embedded in a lot of Smalltalk code.  Such as #deepCopy and
#printOn:.  Putting objects inside themselves is dangerous...