MarkingKey

Richard A. O'Keefe ok at cs.otago.ac.nz
Thu Jun 13 01:35:16 UTC 2002


	well, i had no problem to use my markingkey to implement a heap. the 
	theoretical problem that a < b XOR a >= b must be true, is easily avoidable 
	by making sure that the "sentinel" is on the right side of the comparison.
	im not experiencing any trouble.
	
Something one uses in a private project of one's own may have
rough edges, need not work in all cases.

Things that go in the Squeak image are supposed to be fit for reuse.

Amongst other things, that means that if you use a "well known" selector,
the semantics of your operation had better agree with the expected
semantics of that selector.

	> 	They're quite useful, each one, like an element
	> 	z in a linked list.
	>
	> Well, no, an element z in a linked list is NOT particularly useful,
	> and if any books are candidates for burning, books that make simple
	> linked lists look difficult are on the top of that pile.
	
	The very same sedgewick algorithms book is such one, and it
	simply makes sense.  this book is the classical book for that
	matter, no doubt about that.

Surely the classic is Knuth?
Sedgewick's book *sells* well, no doubt about that.
But there are plenty of data structure books with better explanations
and cleaner code.

	and i gladly see how simple the z
	element makes linked lists.

Eh?

OK, let's see how simple "z" *does* make linked lists.

    /* page 20 */
    struct node
      { int key; struct node *next; };
    struct node *head, *z, *t;
    listinitialize()
      {
        head = (struct node *) malloc(sizeof *head);
        z = (struct node *) malloc(sizoe *z);
	head->next = z; z->next = z;
      }
    deletenext(struct node *t)
      { t->next = t->next->next; }
    struct node *insertafter(int v, struct node *t)
      {
        struct node *x;
        x = (struct node *) malloc(sizeof *x);
        x->key = v; x->next = t->next;
        t->next = x;
        return x;
      }
    /* page 27 */
    push(int v)
      {
        t = (struct node *) malloc(sizeof *t);
        t->key = v; t->next = head->next;
        head->next = t;
      }
    int pop()      
      {
        int x;
        t = head->next; head->next = t->next;
        x = t->key;
        free(t);
        return x;
      }
    int stackempty()
      { return head->next == z; }

    /* I couldn't find a listsearch, so
       this is adapted from treesearch on page 203 */
    struct node* listsearch(int v) {
      {
        struct node *x = head->next;
        z->key = v;
        while (v != x->key) x = x->next;
	return x;
      }

If you think that is dreadful C code, you are right.
Poor naming style, six missing 'void's, and of course the classic
beginner's bug: forgetting to check malloc().

I wasn't going to do anything about the malloc() bug, but no,
I *can't* bring myself to publish code I know to be buggy.
Let's see what this looks like WITHOUT the bogus header and z objects.

    #include <assert.h>
    #include <stdlib.h>

    typedef struct Node *List;
    struct Node {List next; int item; };
    #define nil (List)0

    List cons(int item, List next) {  /* EXTREMELY useful */
        List result = malloc(sizeof *result);
	assert(result != nil);
	result->item = item, result->next = next;
	return result;
    }

    void delete_after(List t) {    /* almost never useful */
	if (t->next != nil) {
	    List to_go = t->next;
	    t->next = to_go->next;
	    free(to_go);
	}
    }

    List insert_after(int item, List t) {  /* almost never useful */
	return t->next = cons(item, t->next);
    }

    List stack = nil;

    void push(int item) {
        stack = cons(item, stack);
    }

    int pop(void) {
	int result;
	List t = stack;

	assert(t != nil);
	result = t->item, stack = t->next;
	free(t);
	return result;
    }

    int stack_is_empty(void) {
	return stack == nil;
    }

    List list_search(int item, List x) {
        while (x != nil && x->item != item) x = x->next;
    }

	if you disagree with the classical
	book, then youll have a hard time to win the reputation you need
	to seriously fight it.

"The Craft of Prolog" made my reputation, thanks.
I've been using and implementing list-processing languages since the
mid-70's.  I am fully aware of the simplicity and power of singly
linked lists in a way that Sedgewick, who made his reputation with
a thesis and article about Quicksort, apparently was not when he wrote
that book.

I have the code from the Third Edition of "Algorithms in C".
It came out in 1998, and Sedgewick's C still hasn't improved.
The first thing to greet you is main() falling off the end without
returning a value, which is both technically and pragmatically an error.
The first code example to use sqrt() is this one:

    #include <stdlib.h>
    typedef int numType;
    numType randNum()
      { return rand(); }
    main(int argc, char *argv[])
      { int i, N = atoi(argv[1]);
	float m1 = 0.0, m2 = 0.0;
	numType x;
	for (i = 0; i < N; i++)
	  {
	    x = randNum();
	    m1 += ((float) x)/N;
	    m2 += ((float) x*x)/N;
	  }
	printf("       Average: %f\n", m1);
	printf("Std. deviation: %f\n", sqrt(m2-m1*m1));
     }

What's missing from this file (apart from the 'return' statement)?
More precisely, why is it not legal C99, and why is it going to
print the wrong answer?

I don't have to believe ANYTHING this guy says about algorithms
and data structures.

	most of all since its really obvious:
	such elements make implementations easier, faster and plainly:
	more fun.
	
As demonstrated above, it does NOT make implementations easier.
Compare the simplicity of
    x = nil;
with the complexity of listinitialize().  I can assure you from a long
practical experience that it DOESN'T make implementations faster, it
makes them slower.  As for fun, maybe you're into weight-lifting or
bondage.  I'm not.

I invite you to think about concurrent threads.
The trick of having a boundary node where you stuff the thing you're
looking for means that you cannot have two threads searching the
same list (or tree) at the same time, even though they aren't changing
it.  That's not my idea of fun.  It's my idea of pain.

Oddly enough, looking at the list code from Algorithms in C, third
edition, I notice that the "z" nodes have disappeared!

I guess Sedgewick doesn't think they are a good idea any more!

	> As I have demonstrated, your "marking key" didn't work with numbers,
	> and while that could be fixed, it still didn't work with Strings,
	
	it _did_ work.  your problem is entirely theoretical:  ive got
	_no_ problem to use it.
	
My problem is not entirely theoretical, it is entirely practical.
Obvious ways of using your object DO NOT WORK.
It really shouldn't make any difference whether I write
    x < y
or  y < x.
If you are content to deal with collections of numbers only, then fine,
but the name of the class should make it clear that it only works with
numbers.

	> One day it dawned on me that I COULDN'T use the sentinel technique in
	> Pascal, because if I had to sort an entire array, there just wasn't
	> any place to PUT the sentinel. 
	
	well, specially with ocs, there usually is space.  and you
	neednt use it all the time, when theres no space, you can simply
	search for one from inside the array:  i dont see a problem.

The problem is that the method ONLY works when you KNOW there is unused
space in an array or when you are using a kind of array that can stretch
at whichever end you want to insert the thing.  In most programming
languages, there aren't stretchy arrays.  And in languages where there
are, you still have trouble sorting *part* of an array.

When I learned how to sort, I wanted to learn techniques that I could
reliably use in *all* the languages I dealt with: Fortran, Algol,
Pascal, C, Lisp, Pop, ...  And I found that once I stopped bothering
about sentinels, everything got simple.  Sorting, merging, heaps, you
name it.

	> One day someone will innocently include maxInt as a value in the array,
	> and your algorithm will stop working.
	
	well, i cant recall any situation ive come to see where maxInt was an 
	unavoidable legitimate value,

Remember, we're not talking about Squeak here, because Squeak doesn't HAVE
any maximum integer.  We're talking about C.  My test harness for sorting
algorithms (mine in the sense that I "own" it, a student wrote it for me
based on the outline in Bentley & McIlroy) includes filling an array with
random numbers, and if you do a few million tests INT_MAX is practically
*bound* to come up.  Perhaps more interestingly, if you are using integers
to represent sets, then sorting them as (unsigned) integers gives you an
order that is compatible with subset order (if x subset-of y, then
x as-number <= y as-number), and in that case too, UINT_MAX is *bound* to
come up.  Or if you are sorting a collection of intervals, some of the
intervals may be half-open and an endpoint may very easily be INT_MAX.

	but if there ever might be one, the cool thing 
	bout squeak is that you can easily fix the situation.
	but again: if your calculations go any near maxInt, and you do not switch to 
	longs, then its really your fault...
	
There is no maxInt or 'longs' in Squeak.
As for the idea that "switching to longs" is the programmer's responsibility,
decades of Lisp and Smalltalk experience say "no way".  (C longs are usually
only 32 bits anyway.  Only yesterday I had reason to compute a 95-bit
number in C.)

Didn't Smalltalk get a surge in popularity when that stock market thing
happened and a lot of C and Fortran programs overflowed their stupid 32-bit
limits and started going crazy, while Smalltalk just soldiered on?

	while for b) we can argue what is useful or not, the need isnt big to fight so 
	hard there... soon as i have a benchmark for my heap, ill compare it to the 
	one shipped with squeak, and then the discussion will be odd, right?
	(hay... *dreamingBoutFindingMyCodeInSqueak....*)
	
Remember that to compete with Squeak Heaps yours must allow the client
to provide the comparison method.  Heaps that are restricted to numbers
aren't of that much interest.  (In that case there's a method that does
much MUCH better.)

	and for a) I have a copy of that algorithms book, and i want to
	practice with it.

Get a better book.  Get Knuth.  Get Cormen, Leiserson, & Rivest.
Or if you want to stick with Sedgewick, get the current edition.

	> Not as common as you might think.  Dates, Times, and/or Timestamps, yes.
	
	stockings, sales, numbers, priorities.... sorting by numbers is extremely 
	common.
	
Er, stockings are things women put on their legs.

Sorting *BY* numbers is common.  Sorting *OF* numbers is not.
When you talk about sorting 'sales', you are talking about sorting
a collection of sales *objects*, which record what was sold, how many,
at what price, and so on.  A sales object is not a number.  It contains
a number, but it isn't the identical same object as that number.

This is very very important.  It means that if you do

    salesRecords sort

then SalesRecord must implement the Magnitude protocol.
When the sorting algorithm does x < y, this will NOT be an inline
integer comparison, it will be a full method send to a SalesRecord
object.

My point is that this full method send is so much more expensive
than the inline integer comparison that sentinels save you that the
actual *proportion* of time saved by using a sentinel is tiny.

Alternatively, if you write

    salesRecords sortBy: [:x :y| x date <= y date]

then every time the sorting method wants to compare two objects, it
will do
    aSortBlock value: x value: y
which is a message send, then there will be two more message
sendes to get the keys (x date, y date), then there will be
one more message send (to the #<= of Date), which will finally
do an inline integer comparison.  This time there will be four
message sends.

	> You can only use your MarkingKey class when you KNOW you are sorting
	> numbers.
	
	nonsense..
	
Simple truth.  I demonstrated that 0 < MarkingKey biggest didn't work,
showed how to fix it so that it did, and then pointed out that
'a' < MarkingKey biggest doesn't work and would be hard to fix.
Nor does $a < MarkingKey biggest work, Date today < MarkingKey biggest,
&c.  In particular, if someone hands you an OrderedCollection and asks
you to sort it, you have NO reason to believe that it doesn't already
contain a MarkingKey.  If *you* constructed the array, fine.  The
problem arises when someone else constructs the array.

Look, this sound theoretical, but I have had too many bad experiences
where someone said "oh, we can use this special value, nobody will
ever come across that by accident" and then I did.  Perhaps the
simplest example was '$VAR' in Prolog.  Yes, normal code never ran
across it, but then one day I had to debug the code that used it,
and you should have seen the mess!

	> And if you know you are sorting numbers, you should use a radix
	> sort or a bucket sort, not insertion sort.
	
	radix sort and straight radix sort both speed up when you mix them with 
	insertion sort, for random distributions.
	
Not true, in my tests.  Not unless you are thinking of a really dumb
radix sort that works one bit at a time.



More information about the Squeak-dev mailing list