[squeak-dev] The Inbox: Collections-cbc.812.mcz

Levente Uzonyi leves at caesar.elte.hu
Fri Nov 16 10:42:11 UTC 2018


On Fri, 16 Nov 2018, commits at source.squeak.org wrote:

> Chris Cunningham uploaded a new version of Collections to project The Inbox:
> http://source.squeak.org/inbox/Collections-cbc.812.mcz
>
> Item was changed:
>  ----- Method: Interval>>hash (in category 'comparing') -----
>  hash
>  	"Hash is reimplemented because = is implemented."
>
>  	^(((start hash bitShift: 2)
> + 		bitOr: self last hash)
> - 		bitOr: stop hash)
>  		bitShift: 1)
>  		bitOr: self size!

As Bert pointed out, #bitOr: is not a good choice for hashing, because 
it loses entropy. Also, I'd use #hashMultiply instead of ad-hoc bit 
shifts to maximize the entropy of the hash value. E.g.:

hash

 	^((start hash hashMultiply bitXor: self last hash) hashMultiply
 		bitXor: self size) hashMultiply

Below is a snippet comparing various hash functions:

hashes := Set new. "Original #hash"
hashes2 := Set new. "#hash from Collections-cbc.812"
hashes3 := Set new. "The above proposed #hash using #hashMultiply and #bitXor:"
intervals := Set new. "The maximum number of different #hash values"

Interval allInstancesDo: [ :each |
 	hashes add: each hash.
 	hashes2 add: each hash2.
 	hashes3 add: each hash3.
 	intervals add: each ].
{
 	hashes size.
 	hashes2 size.
 	hashes3 size.
 	intervals size }

"==> #(584 584 1113 1113)"

The sample is too small, but it suggests that the original #hash and the 
#hash in Collections-cbc.812 are suboptimal.
If we create a few more intervals in a workspace:

newIntervals := (1 to: 10000) collect: [ :each | (1 to: each) ].
newIntervals2 := (1 to: 10000) collect: [ :each | (1 to: each by: 2) ].
newIntervals3 := (1 to: 10000) collect: [ :each | (each to: 1 by: -1) ].

then the numbers will be more accurate:

#(6122 5368 26244 27043)

Levente


More information about the Squeak-dev mailing list