Sets, hashing and #fullCheck

Ned Konz ned at squeakland.org
Thu Mar 11 02:38:29 UTC 2004


On Wednesday 10 March 2004 3:01 pm, Andreas Raab wrote:
> The questions I have are basically: Is there a reasonable explanation for
> the behavior I'm seeing when changing it to 1/2 or is this just something
> that worked for this specific experiment? More generally, is it possible to
> establish a reasonable upper bound for what the "expected fill rate" of an
> IDSet should be (considering things like number of hash bits, hash function
> used etc)? And finally, would it be worthwhile to change the identity based
> sets/dictionaries to grow more agressively?

Oddly enough, I was just thinking about this in the shower this morning.

Unfortunately, I've not been exposed to the Computer Science line on this, but 
it seems to me that the number of free slots should be large enough that 
every group of items that hash together should be separated by at least one 
empty slot.

Doing a quick search turned up:
http://planetmath.org/encyclopedia/GoodHashTablePrimes.html

Perhaps the growth strategy (and the default set/dictionary size) should be 
re-thought to try to use primes.

-- 
Ned Konz
http://bike-nomad.com/squeak/
-------------- next part --------------
'From Squeak3.7alpha of ''11 September 2003'' [latest update: #5657] on 10 March 2004 at 6:37:47 pm'!
"Change Set:		SetClumpinessAnalysis-nk
Date:			10 March 2004
Author:			Ned Konz

Quick analysis of worst-case hash collisions in keyed collections. Analyzes ratio of worst-case number of slots scanned to average number.

Set clumpinessReport

"!


!Set methodsFor: 'private' stamp: 'nk 3/10/2004 18:35'!
clumpiness
	"Answer the maximum number of hash collisions in my table divided by the average number."
	| b |
	self isEmpty ifTrue: [ ^0 ].
	b _ IdentityBag new.
	self do: [ :k | b add: (self slotsScannedToFind: k) ].
	^b max / b average asFloat! !

!Set methodsFor: 'private' stamp: 'nk 3/10/2004 16:40'!
slotsScannedToFind: anObject
	"Scan the key array for the first slot containing either a nil (indicating an empty slot) or an element that matches anObject.
	Answer the number of slots of the array scanned, or 0 if not found."
	| element start finish scanned |
	start _ (anObject hash \\ array size) + 1.
	finish _ array size.
	scanned _ 1.

	"Search from (hash mod size) to the end."
	start to: finish do:
		[:index | ((element _ array at: index) == nil or: [element = anObject])
			ifTrue: [^ scanned ].
			scanned _ scanned + 1 ].

	"Search from 1 to where we started."
	1 to: start-1 do:
		[:index | ((element _ array at: index) == nil or: [element = anObject])
			ifTrue: [^ scanned ].
			scanned _ scanned + 1 ].

	^0! !


!Dictionary methodsFor: 'testing' stamp: 'nk 3/10/2004 18:35'!
clumpiness
	"Answer the maximum number of hash collisions in my table divided by the average number."
	| b |
	self isEmpty ifTrue: [ ^0 ].
	b _ IdentityBag new.
	self keysDo: [ :k | b add: (self slotsScannedToFind: k) ].
	^b max / b average asFloat! !


!Set class methodsFor: 'testing' stamp: 'nk 3/10/2004 18:31'!
clumpinessReport
	"Set clumpinessReport"
	| b |
	b _ Bag new.
	self allSubInstances do: [ :ea | ea isEmpty ifFalse: [ b add: (ea clumpiness roundTo: 0.1) ] ].
	Workspace new
		acceptContents: (String streamContents: [ :s |
			s nextPutAll: 'sorted counts:'; cr; print: b sortedCounts; cr;
			nextPutAll: 'frequency distribution;'; cr; print: b frequencyDistribution; cr;
			nextPutAll: 'cumulative counts:'; cr; print: b cumulativeCounts; cr ]);
		openLabel: 'Clumpiness of ', self name
! !


!Symbol class methodsFor: 'class initialization' stamp: 'nk 3/10/2004 16:19'!
compareTiming
	" 
	Symbol compareTiming
	"
	| answer t selectorList implementorLists flattenedList md |
	answer _ WriteStream on: String new.
	Smalltalk timeStamp: answer.
	answer cr; cr.
	answer nextPutAll: MethodDictionary instanceCount printString , ' method dictionaries';
		 cr;
		 cr.
	answer nextPutAll: (MethodDictionary allInstances
			inject: 0
			into: [:sum :each | sum + each size]) printString , ' method dictionary entries';
		 cr;
		 cr.
	md _ MethodDictionary allInstances.
	t _ [100
				timesRepeat: [md
						do: [:each | each includesKey: #majorShrink]]] timeToRun.
	answer nextPutAll: t printString , ' ms to check all method dictionaries for #majorShrink 1000 times';
		 cr;
		 cr.
	selectorList _ Symbol selectorsContaining: 'help'.
	t _ [3
				timesRepeat: [selectorList
						collect: [:each | self systemNavigation allImplementorsOf: each]]] timeToRun.
	answer nextPutAll: t printString , ' ms to do #allImplementorsOf: for ' , selectorList size printString , ' selectors like *help* 3 times';
		 cr;
		 cr.
	t _ [3
				timesRepeat: [selectorList
						do: [:eachSel | md
								do: [:eachMd | eachMd includesKey: eachSel]]]] timeToRun.
	answer nextPutAll: t printString , ' ms to do #includesKey: for ' , md size printString , ' methodDicts for ' , selectorList size printString , ' selectors like *help* 3 times';
		 cr;
		 cr.
	#('help' 'majorShrink' )
		do: [:substr | 
			answer nextPutAll: (Symbol selectorsContaining: substr) size printString , ' selectors containing "' , substr , '"';
				 cr.
			t _ [3
						timesRepeat: [selectorList _ Symbol selectorsContaining: substr]] timeToRun.
			answer nextPutAll: t printString , ' ms to find Symbols containing *' , substr , '* 3 times';
				 cr.
			t _ [3
						timesRepeat: [selectorList _ Symbol selectorsContaining: substr.
							implementorLists _ selectorList
										collect: [:each | SystemNavigation default allImplementorsOf: each].
							flattenedList _ SortedCollection new.
							implementorLists
								do: [:each | flattenedList addAll: each]]] timeToRun.
			answer nextPutAll: t printString , ' ms to find implementors of *' , substr , '* 3 times';
				 cr;
				 cr].
	StringHolder new contents: answer contents;
		 openLabel: 'timing'! !



More information about the Squeak-dev mailing list