[ENH] CharacterTimes-rsb

Stephan Rudlof sr at evolgo.de
Thu Jun 24 23:32:06 UTC 2004

Richard A. O'Keefe wrote:
> Stephan Rudlof <sr at evolgo.de> wrote:
> 	I second this.
> 	It takes time to read your mails, but it takes a lot more time
> 	for you to write them!
> Thank you.
> 	For me recently the Set/hash discussion was illuminating,
> 	especially the point, that a Set shouldn't contain itself in the
> 	world of actual programming languages.  My first thought about
> 	this has been mathematically (where a Set may include itself, if
> 	I remember correctly).
> I really want to take all the sting out of this, but I'm afraid you haven't
> remembered correctly.  If you're fairly comfortable with mathematics, a
> book that I found extremely helpful is
>     The Joy of Sets : Fundamentals of Contemporary Set Theory
>     Keith Devlin
>     Springer-Verlag 1993
>     ISBN 0387940944
>     List price, about USD50, you can get a used copy over the web for USD35.
> A core feature of "standard" (Zermelo-Frankel or Goedel-Bernays) set theory
> is that the universe of sets is WELL-ORDERED.  There are no infinite chains
> x0 \in x1 \in x2 \in .....
> The way Devlin explains this in The Joy of Sets is by constructing the ZF
> universe as a transfinite sequence of "layers", such that a set in some
> layer may only contain elements from strictly earlier layers.  The bottom
> layer is the empty set.

> The fact that there are no infinite descending chains is incredibly
> important, because it means that you get to use proof by induction.


> The programming language analogue of this is that you get to use loops
> and recursion.  Take this little loop:
>     x := aSet.
>     [x isKindOf: Set] whileTrue: [x := x anyOne].
> When can we expect this loop to terminate?  When aSet is well-founded
> with respect to #includes:.

> Now, Peter Aczel has written a book "Non-well-founded Set Theory".
> I even have a copy of it.  To be honest, past chapter 1 I didn't really
> understand much of it, only enough to decide I didn't really need it.
> The Joy of Sets has a discussion of non-well-founded set theory near the
> end.

After googling for "sets with itself" I would say I've thought of
"non-well-founded sets" (see the first page of
http://odur.let.rug.nl/~sjaak/semantiek3/coursenotes/3-2.pdf) ...

> But it's still true that when mathematicians talk about "sets"
> (as opposed to "non-well-founded sets") they mean ZF or GB -- which
> describe the same sets using different axioms -- or either of those
> extended with atoms that are not sets.

... so I would say mathematicians think about *both* "well-founded sets"
and "non-well-founded sets". But in the context of programming languages
the well-founded ones seem to be much more relevant.

> Now there are excellent reasons to construct graph structures in
> programming languages.  Mathematicians talk about graphs as pairs of
> sets (Nodes, Edges), but in a programming language it may be more
> convenient to think of a set of Node objects (this set does not point
> to itself) with each Node having a set of Nodes that it is connected
> to, so that the pointer graph of node -> neighbour set -> node DOES
> have cycles, and the collection of Nodes is NOT well-ordered under
> #hasSuccessor: .  The loop
>     x := aNode.
>     [x neighbours isEmpty] whileFalse: [x := x neighbours anyOne].
> may well not terminate.

> "Loop terminates" if and only if "stepping
> relationship is well founded", so knowing when a relationship is
> well-founded is of great practical interest to programmers.


> One point did come up in this discussion which no-one seems to have
> picked up, and it does have important practical consequences for Squeak
> (or any other Smalltalk-like language) and we might need to think about
> a change to #= and #hash to deal with it.
> I'm going to run this example in Ambrai Smalltalk.  I _like_ Ambrai
> Smalltalk, although I must say that the main thing I have learned from
> it is how much I prefer the Squeak IDE.  Create a Workspace window, type
>     |x|
>     x := Array with: 1.
>     x at: 1 put: x.
>     x hash
> then Cmd-A Cmd-D (this would be Cmd-P in Squeak) and wait for a while.
> Ka-BOOM!  Ambrai Smalltalk softly and suddently vanishes away.  The
> reason is
>     IndexedCollection>>
>     hash
>       |size hash|
>       size := self size.
>       hash := "doesn't matter"
>       ^size <= 5
>         ifTrue: [
>           1 to: size do: [:i| hash := "a function of hash and (self at: i)"].
> 	  hash]
>         ifFalse: [ "doesn't matter" ]
> So to compute x hash, it tries to compute x hash, to compute x hash, ...
> and when the stack overflows, Ambrai dies.
> Squeak has a different and in some ways better hash function:
>     SequenceableCollection>>
>     hash
>       |hash|
>       hash := self species hash.
>       1 to: self size do: [:i | hash := "a function of hash and (self at: i)"].
>       ^hash
> It has the same problem.  The important difference is that Squeak catches
> and can recover from stack overflows.
> I spent a long time preparing a Lisp example to show that (sxhash xobject)
> should work for cyclic objects, but in the two Common Lisp implementations
> I tried, it didn't.

> (I had to get a technician to show me how to turn my
> new eMac back on again.  This was rather humiliating.)

What about simply switching it off and then on? ;-)

> If you think about
> it, this is not altogether unreasonable:  the point of hash functions is that
>     (equal x y) ===> (= (sxhash x) (sxhash y))
> so that if (equal x y) -- or for us, x = y -- isn't going to terminate,
> it doesn't matter that (sxhash x) -- or for us, x hash -- isn't going to
> terminate either.  But suppose we do this:
>     |s|
>     s := Set new.
>     1 to: 10 do: [:i| |a|
>       a := Array with: i with: nil.
>       a at: 2 put: a.
>       s add: a].
> Here we have a case where x = y is going to terminate quickly, but
> x hash will not terminate and neither will y hash.

I'm not able to follow you here:
     |s1 s2|
     s1 := Set new.
     1 to: 10 do: [:i| |a|
       a := Array with: i with: nil.
       a at: 2 put: a.
       s1 add: a].
     s2 := Set new.
     1 to: 10 do: [:i| |a|
       a := Array with: i with: nil.
       a at: 2 put: a.
       s2 add: a].
     s1 = s2
does *not* terminate in my Squeak (probably later by detecting stack
overflow, but I have much swap space).

> There are several ways to work around the problem.  One of the simplest
> is this.
>     Object>>
>     hash
>       ^self hashToDepth: 8
>     hashToDepth: d
>       ^0 < d ifTrue: [userHashToDepth: d-1] ifFalse: [self class identityHash]
                        ^<self missing>
>     userHashToDepth: d
>       ^self identityHash

I have had this idea myself: I think this could be a good way to
implement hash for collections.

But I would suggest:

  ^self identityHash
  hashToDepth: d
  ^self hash
  ^self hashToDepth: 4
  hashToDepth: d
  | hash |
  hash _ self species hash.
  d = 0 ifTrue: [^hash].
  self size <= 10 ifTrue:
    [self do: [:elem | hash _ hash bitXor: elem hashToDepth: d-1]].
  ^hash bitXor: self size hash
SequenceableCollection>> "it overrides Collection>>hash"
  hashToDepth: d
  | hash |
  hash _ self species hash.
  d = 0 ifTrue: [^hash].
  1 to: self size do: [:i | hash _ (hash + (self at: i) hashToDepth:
d-1) hashMultiply].

Note: I haven't tested this, but I think this could avoid to replace all
>>hash methods by >>userHashToDepth methods in Collections and to add
>>userHashToDepth methods in non-cyclic objects. But some more thinking
and *testing* would be necessary for a real solution IMO.

> Then replace all
>     hash
>       .... x hash ....
> methods with
>     userHashToDepth: d
>       .... (x hashToDepth: d) ....
> methods, or, if the class is such that objects _cannot_ be cyclic,
> such as numbers, bitmaps, strings, unboxed arrays, and so on, retain
> the existing #hash methods and add
>     userHashToDepth: d
>       ^self hash
> This doesn't stop hashing taking an unreasonably long time, but it does
> mean that it will terminate (if I haven't made any silly mistakes).

The idea is good.

> This would handle my example of a Set of arrays each of which has a
> distinct first element (but is cyclic via its second element).


> Even without cyclic structures, a depth-bounded hash can be useful
> because it reduces the amount of work done for large trees.


> That leaves us with the problem of defining #= for cyclic data structures.
> This has actually been solved by Prolog implementors, or at least,
> implementors of the ISO Prolog standard.  In Prolog,
>     L = [1|L]
> makes L a cyclic list of ones.  In "classic" Prolog,
>     L = L
> could then loop.  In ISO Prolog,
>     L = L
> must terminate (and should of course say "yes").  The method goes
> something like this:
>     to determine whether x and y are equal:
>         determine whether x and y are equal assuming {}
>     to determine whether x and y are equal assuming equalities
>         if x or y is an immediate value,
>            determine equality by that way.
>         if x and y do not have the same constructor,
>            report that they are not equal.
>         if (x = y) is amongst the assumed equalities,
>            report True.
>         otherwise,
>           let equalities' = equalities U {x = y}
>           for each corresponding component xi of x, yi of y
>               determine whether xi and yi are equal using equalities'
> 	      if they aren't, report False.
> Something like this could obviously be programmed in Smalltalk, with
> = anObject calling equals: anObject assuming: equations and
> that calling userEquals: anObject assuming: equations.
> I'd rather not think about it.

OK. :-)

> What this means for us as Smalltalk programmers is that
> (a) Smalltalk *could* have been designed so that #hash and #=
>     supported cyclic objects.  It wasn't.

>     This isn't surprising.  The original Smalltalk-80 used reference
>     counting, so you really were NOT supposed to have cyclic objects at
>     all anyway.  Trying to put a Set inside itself would have been a big
>     No-NO on garbage collection grounds.

Actually I think we have root objects: therefore this part of the
problem should be a NOP (no problem).

> (b) We could change Squeak so that it does handle #hash and #= for
>     cyclic objects.  That would be a lot of work,

Are you sure that it is that much work?

> and a lot of examples
>     in Smalltalk books would need changing.  But unless and until #hash
>     and #= _are_ changed suitably,

Nevertheless it's important to keep in mind that even with these changes
 a programmer shouldn't change an Object inside a hashed container...

> (c) If we design some classes so that object cycles may occur, we
>     should be very careful about #= and $hash.


>     In the case of such a cyclic graph of objects, the default implementations
>     of #= and #hash inherited from Object may be just right.  I don't mean
>     just "pragmatically useful", I mean "correct".  In such a case the object
>     identity often really is part of what it represents.

I have to put some water into the wine here: it depends.
Imagine a graph stored serialized into some file and now you want to
compare the stored graph with incidentally the same one kept in memory
after filing the stored one in again...

>     But sometimes we have objects that belong to (classes that inherit from)
>     system classes with _other_ definitions of #= and #hash, and then we
>     must be very careful to break the cycles.


> (d) Even if we fix #= and #hash, the assumption "there are no cycles" is
>     deeply embedded in a lot of Smalltalk code.  Such as #deepCopy and
>     #printOn:.  Putting objects inside themselves is dangerous...




Stephan Rudlof (sr at evolgo.de)
   "Genius doesn't work on an assembly line basis.
    You can't simply say, 'Today I will be brilliant.'"
    -- Kirk, "The Ultimate Computer", stardate 4731.3

More information about the Squeak-dev mailing list