[BUG][FIX?]Symbol hashing

Daniel Vainsencher danielv at netvision.net.il
Sat Jul 10 16:36:58 UTC 1999


Content-Type: text/plain; charset=us-ascii
Hi guys.

The short version -
using a 12-bit hash space with a set size >> 4095 elements works badly with
current Set>>scanFor: implementation. A better implementation is attached,
though it assumes 12 bits are always used, which may not be correct.

If this looks like a good idea, I'll fix it in the other implementors of
scanFor:.

The whole story -

While playing around with Squeak's navel gazing facilities,
I noticed that it can find out interesting stuff like "all message sent but
not implemented".

Then I found "allImplementedMessages", and that was really slow.
So slow I stopped waiting. Since this seems like a message that'll be very
useful to me in something I'm prototyping, I started looking into it.

It's code is simple -
| aSet |
	aSet _ Set new: Symbol instanceCount * 2.
	Cursor wait showWhile:
		[self allBehaviorsDo: [:cl | cl selectorsDo: 
[:aSelector | aSet add:
aSelector]]].
	^aSet

Having seen Squeak do the same iteration in well under a second, I started
suspecting the Set. Either it was getting too big and couldn't deal with it
(seemed unlikely), or the hashing wasn't working.

Changed the set to an array (fed through a stream), and I was back to well
under a second. So it appears that Squeak eats 20-somethingK element arrays
for breakfast, but something's wrong with the hashing/sets.

(resultArray collect: [:e | e hash]) asBag showed that the values were indeed
too tight (all used up, most showing up several time), but no particular
problem.

"hashesArray max" clued me right in - it was 4095. min was 0. 12 bits of hash
really aint all that much, for symbols.

Looking at Symbol>>hash, the comment leads me to believe that there are
supposed to be at least 16 bits involved. However, I didn't manage to get from
<primitive: 75> to the VM code implementing it.

A recent message on the list says that the hash value provided by asOop is 12
bits by definition. This puts a ceiling on the utility of sets, dictionaries
and such for large datasets.

Looking for a quick fix, I noticed that Set>>scanFor: goes scanning linearly
when the array slot is taken at the hash. This means that as soon as the first
4000~ places are taken, adding each new element requires lots of scanning.

The change set below makes the scan start position be more or less evenly
distributed in the array.

timings _ OrderedCollection new.
3000 to: 9000 by: 500 do: [:i |
shortArray _ result contents copyFrom: 1 to: i.
timings add: (Time millisecondsToRun: [shortArray asSet])].
timings

resulted in -
55 64 74 88 110 149 297 1793 4556 7239 10249 13757 17571
changing to -
56 74 85 94 110 122 124 137  152  164  180   186   191

This look right to you?

Daniel


Content-Type: text/plain ; name="Set-scanFor.st"
Content-Description: Set-scanFor.st
Content-Disposition: attachment; filename="Set-scanFor.st"

Attachment converted: Anon:Set-scanFor.st (TEXT/MSIE) (0000B0EB)





More information about the Squeak-dev mailing list