[ENH] CharacterTimes-rsb
Stephan Rudlof
sr at evolgo.de
Thu Jun 24 23:32:06 UTC 2004
Richard A. O'Keefe wrote:
> Stephan Rudlof <sr at evolgo.de> wrote:
> I second this.
>
> It takes time to read your mails, but it takes a lot more time
> for you to write them!
>
> Thank you.
>
> For me recently the Set/hash discussion was illuminating,
> especially the point, that a Set shouldn't contain itself in the
> world of actual programming languages. My first thought about
> this has been mathematically (where a Set may include itself, if
> I remember correctly).
>
> I really want to take all the sting out of this, but I'm afraid you haven't
> remembered correctly. If you're fairly comfortable with mathematics, a
> book that I found extremely helpful is
> The Joy of Sets : Fundamentals of Contemporary Set Theory
> Keith Devlin
> Springer-Verlag 1993
> ISBN 0387940944
> List price, about USD50, you can get a used copy over the web for USD35.
>
> A core feature of "standard" (Zermelo-Frankel or Goedel-Bernays) set theory
> is that the universe of sets is WELL-ORDERED. There are no infinite chains
> x0 \in x1 \in x2 \in .....
> The way Devlin explains this in The Joy of Sets is by constructing the ZF
> universe as a transfinite sequence of "layers", such that a set in some
> layer may only contain elements from strictly earlier layers. The bottom
> layer is the empty set.
>
> The fact that there are no infinite descending chains is incredibly
> important, because it means that you get to use proof by induction.
Agreed.
>
> The programming language analogue of this is that you get to use loops
> and recursion. Take this little loop:
>
> x := aSet.
> [x isKindOf: Set] whileTrue: [x := x anyOne].
>
> When can we expect this loop to terminate? When aSet is well-founded
> with respect to #includes:.
>
> Now, Peter Aczel has written a book "Non-well-founded Set Theory".
> I even have a copy of it. To be honest, past chapter 1 I didn't really
> understand much of it, only enough to decide I didn't really need it.
> The Joy of Sets has a discussion of non-well-founded set theory near the
> end.
After googling for "sets with itself" I would say I've thought of
"non-well-founded sets" (see the first page of
http://odur.let.rug.nl/~sjaak/semantiek3/coursenotes/3-2.pdf) ...
> But it's still true that when mathematicians talk about "sets"
> (as opposed to "non-well-founded sets") they mean ZF or GB -- which
> describe the same sets using different axioms -- or either of those
> extended with atoms that are not sets.
... so I would say mathematicians think about *both* "well-founded sets"
and "non-well-founded sets". But in the context of programming languages
the well-founded ones seem to be much more relevant.
>
> Now there are excellent reasons to construct graph structures in
> programming languages. Mathematicians talk about graphs as pairs of
> sets (Nodes, Edges), but in a programming language it may be more
> convenient to think of a set of Node objects (this set does not point
> to itself) with each Node having a set of Nodes that it is connected
> to, so that the pointer graph of node -> neighbour set -> node DOES
> have cycles, and the collection of Nodes is NOT well-ordered under
> #hasSuccessor: . The loop
>
> x := aNode.
> [x neighbours isEmpty] whileFalse: [x := x neighbours anyOne].
>
> may well not terminate.
> "Loop terminates" if and only if "stepping
> relationship is well founded", so knowing when a relationship is
> well-founded is of great practical interest to programmers.
Agreed.
>
>
> One point did come up in this discussion which no-one seems to have
> picked up, and it does have important practical consequences for Squeak
> (or any other Smalltalk-like language) and we might need to think about
> a change to #= and #hash to deal with it.
>
> I'm going to run this example in Ambrai Smalltalk. I _like_ Ambrai
> Smalltalk, although I must say that the main thing I have learned from
> it is how much I prefer the Squeak IDE. Create a Workspace window, type
>
> |x|
> x := Array with: 1.
> x at: 1 put: x.
> x hash
>
> then Cmd-A Cmd-D (this would be Cmd-P in Squeak) and wait for a while.
> Ka-BOOM! Ambrai Smalltalk softly and suddently vanishes away. The
> reason is
>
> IndexedCollection>>
> hash
> |size hash|
> size := self size.
> hash := "doesn't matter"
> ^size <= 5
> ifTrue: [
> 1 to: size do: [:i| hash := "a function of hash and (self at: i)"].
> hash]
> ifFalse: [ "doesn't matter" ]
>
> So to compute x hash, it tries to compute x hash, to compute x hash, ...
> and when the stack overflows, Ambrai dies.
>
> Squeak has a different and in some ways better hash function:
>
> SequenceableCollection>>
> hash
> |hash|
> hash := self species hash.
> 1 to: self size do: [:i | hash := "a function of hash and (self at: i)"].
> ^hash
>
> It has the same problem. The important difference is that Squeak catches
> and can recover from stack overflows.
>
> I spent a long time preparing a Lisp example to show that (sxhash xobject)
> should work for cyclic objects, but in the two Common Lisp implementations
> I tried, it didn't.
> (I had to get a technician to show me how to turn my
> new eMac back on again. This was rather humiliating.)
What about simply switching it off and then on? ;-)
> If you think about
> it, this is not altogether unreasonable: the point of hash functions is that
>
> (equal x y) ===> (= (sxhash x) (sxhash y))
>
> so that if (equal x y) -- or for us, x = y -- isn't going to terminate,
> it doesn't matter that (sxhash x) -- or for us, x hash -- isn't going to
> terminate either. But suppose we do this:
>
> |s|
> s := Set new.
> 1 to: 10 do: [:i| |a|
> a := Array with: i with: nil.
> a at: 2 put: a.
> s add: a].
>
> Here we have a case where x = y is going to terminate quickly, but
> x hash will not terminate and neither will y hash.
I'm not able to follow you here:
|s1 s2|
s1 := Set new.
1 to: 10 do: [:i| |a|
a := Array with: i with: nil.
a at: 2 put: a.
s1 add: a].
s2 := Set new.
1 to: 10 do: [:i| |a|
a := Array with: i with: nil.
a at: 2 put: a.
s2 add: a].
s1 = s2
does *not* terminate in my Squeak (probably later by detecting stack
overflow, but I have much swap space).
>
> There are several ways to work around the problem. One of the simplest
> is this.
>
> Object>>
> hash
> ^self hashToDepth: 8
> hashToDepth: d
> ^0 < d ifTrue: [userHashToDepth: d-1] ifFalse: [self class identityHash]
^<self missing>
> userHashToDepth: d
> ^self identityHash
I have had this idea myself: I think this could be a good way to
implement hash for collections.
But I would suggest:
Object>>
hash
^self identityHash
Object>>
hashToDepth: d
^self hash
Collection>>
hash
^self hashToDepth: 4
Collection>>
hashToDepth: d
| hash |
hash _ self species hash.
d = 0 ifTrue: [^hash].
self size <= 10 ifTrue:
[self do: [:elem | hash _ hash bitXor: elem hashToDepth: d-1]].
^hash bitXor: self size hash
SequenceableCollection>> "it overrides Collection>>hash"
hashToDepth: d
| hash |
hash _ self species hash.
d = 0 ifTrue: [^hash].
1 to: self size do: [:i | hash _ (hash + (self at: i) hashToDepth:
d-1) hashMultiply].
^hash
Note: I haven't tested this, but I think this could avoid to replace all
>>hash methods by >>userHashToDepth methods in Collections and to add
>>userHashToDepth methods in non-cyclic objects. But some more thinking
and *testing* would be necessary for a real solution IMO.
>
> Then replace all
>
> hash
> .... x hash ....
>
> methods with
>
> userHashToDepth: d
> .... (x hashToDepth: d) ....
>
> methods, or, if the class is such that objects _cannot_ be cyclic,
> such as numbers, bitmaps, strings, unboxed arrays, and so on, retain
> the existing #hash methods and add
>
> userHashToDepth: d
> ^self hash
>
> This doesn't stop hashing taking an unreasonably long time, but it does
> mean that it will terminate (if I haven't made any silly mistakes).
The idea is good.
>
> This would handle my example of a Set of arrays each of which has a
> distinct first element (but is cyclic via its second element).
OK.
>
> Even without cyclic structures, a depth-bounded hash can be useful
> because it reduces the amount of work done for large trees.
Agreed.
>
> That leaves us with the problem of defining #= for cyclic data structures.
> This has actually been solved by Prolog implementors, or at least,
> implementors of the ISO Prolog standard. In Prolog,
> L = [1|L]
> makes L a cyclic list of ones. In "classic" Prolog,
> L = L
> could then loop. In ISO Prolog,
> L = L
> must terminate (and should of course say "yes"). The method goes
> something like this:
>
> to determine whether x and y are equal:
> determine whether x and y are equal assuming {}
>
> to determine whether x and y are equal assuming equalities
> if x or y is an immediate value,
> determine equality by that way.
> if x and y do not have the same constructor,
> report that they are not equal.
> if (x = y) is amongst the assumed equalities,
> report True.
> otherwise,
> let equalities' = equalities U {x = y}
> for each corresponding component xi of x, yi of y
> determine whether xi and yi are equal using equalities'
> if they aren't, report False.
>
> Something like this could obviously be programmed in Smalltalk, with
> = anObject calling equals: anObject assuming: equations and
> that calling userEquals: anObject assuming: equations.
> I'd rather not think about it.
OK. :-)
>
>
> What this means for us as Smalltalk programmers is that
>
> (a) Smalltalk *could* have been designed so that #hash and #=
> supported cyclic objects. It wasn't.
>
> This isn't surprising. The original Smalltalk-80 used reference
> counting, so you really were NOT supposed to have cyclic objects at
> all anyway. Trying to put a Set inside itself would have been a big
> No-NO on garbage collection grounds.
Actually I think we have root objects: therefore this part of the
problem should be a NOP (no problem).
>
> (b) We could change Squeak so that it does handle #hash and #= for
> cyclic objects. That would be a lot of work,
Are you sure that it is that much work?
> and a lot of examples
> in Smalltalk books would need changing. But unless and until #hash
> and #= _are_ changed suitably,
Nevertheless it's important to keep in mind that even with these changes
a programmer shouldn't change an Object inside a hashed container...
>
> (c) If we design some classes so that object cycles may occur, we
> should be very careful about #= and $hash.
Agreed.
>
> In the case of such a cyclic graph of objects, the default implementations
> of #= and #hash inherited from Object may be just right. I don't mean
> just "pragmatically useful", I mean "correct". In such a case the object
> identity often really is part of what it represents.
I have to put some water into the wine here: it depends.
Imagine a graph stored serialized into some file and now you want to
compare the stored graph with incidentally the same one kept in memory
after filing the stored one in again...
>
> But sometimes we have objects that belong to (classes that inherit from)
> system classes with _other_ definitions of #= and #hash, and then we
> must be very careful to break the cycles.
Agreed.
>
> (d) Even if we fix #= and #hash, the assumption "there are no cycles" is
> deeply embedded in a lot of Smalltalk code. Such as #deepCopy and
> #printOn:. Putting objects inside themselves is dangerous...
Agreed.
Greetings
Stephan
>
>
--
Stephan Rudlof (sr at evolgo.de)
"Genius doesn't work on an assembly line basis.
You can't simply say, 'Today I will be brilliant.'"
-- Kirk, "The Ultimate Computer", stardate 4731.3
More information about the Squeak-dev
mailing list
|