comparing Collections. Can we do better!?

Sun Dec 14 08:18:07 UTC 2003

If you are intersted in making Smalltalk faster then bear with me.
This gets interesting.

I am building a parser generator tool and as part of that I have to
search sets of arrays to find if an array is in the set.  This causes 
lots of array comparisons which can be very expensive since two arrays
(in my application, building finite state machines) have a reasonable
chance of containing the same elements or even a lot of common elements.

I reasoned that two arrays should compare hash values before doing a
full comparison.  However in Smalltalk this doesn't help a whole lot
because the hash value of an array is the bitXor:  of its elements
(and some minor stuff which I will ignore) which costs O(n) time for an
  n element array.

So I implemented a Class  HashArray whose superclass is Array and which
stores the exclusive or of its elements in an instance variable,
  hashValue.  Note now that adding/removing single elements costs
O(1) additional time.  Also creating new arrays of any size also costs
O(1) additional time because the bitXor of n nils can be computed
in O(1) time.

The down side is that you need the extra instance variable but for an
array this shouldn't be too serious an issue.

Computing the hash value of a HashArray object is about 4 times as fast
as for an Array element,  per element of the array.  For an array of
10,000 elements HashArray hash is  40,000 times faster.  On a
performance improvement per amount of effort basis I think
this is pretty good.

I did a similar comparison for sets (actually I should have created a
  HashSet class for this but I haven't yet so I used HashArray for the
comparison). In this case  HashArray is about 4 times faster than
Set  to generate a hash value, this time independant of the Set size.

I tried to figure out how Set did this.  Set appears to inherit method
  hash from Collection.  Collection does a self species hash.
Set answers that its species is itself so set hash must be called
which of course is inherited from Collection.
Though a Set hash is 4 time slower than
a HashArray hash it isn't bad for an infinite loop!!!  (explainations
here are welcome and don't put me down too too much for being stupid;
I just increased the performance of a method by, say, 40,000 times).

Anyway, after completing the infinite loop, Collection tests if the
(Set) has 10 or fewer elements.
If it does then it computes the bitXor of its
elements and combines this with the hash value.  As a result,
for example, for 8 element sets,  HashArray hash is about 27 times
  faster than Set hash.  Thus hashing small sets is slower than hashing
large sets where small is 10 or fewer and large is 11 or larger.

I assume that Set hash does not store a hash value and thus in my
  application a  aSetObject = aSetObject would be expected to be slower
than aHashSetObject = aHashSetObject, possibly by a large amount.

Unfortunately, this does not appear to be the case which I assume is a
result of bugs to be found.

Finally, a question.
Would it not be a good idea for all (or at least most) Smalltalk
  Collections (Sequenceable or otherwise) to store the hash value?

I realize that this means that each at: or at:put: or similar operations
then have to do additional work and perhaps since these operations are
very common it might not actually be a good idea.
If this is the case, should there not at least be hashed versions of
  collection classes so that people like me don't have to implement their
  own???

Comments are most welcome but be quick;  I am moving in a couple of days
  and the computer has to be packed.

Ralph Boland