DateAndTime hash was: Re: [squeak-dev] true hash

Paul DeBruicker pdebruic at gmail.com
Mon May 14 07:15:59 UTC 2012


On 05/13/2012 08:04 PM, Levente Uzonyi wrote:
> On Sat, 12 May 2012, Andres Valloud wrote:
>
>> When asserting speedups and collision rates, please also mention what
>> was the dataset used for the measurements.  In that way, others can
>> replicate the results and compare hash functions on a consistent basis.
>
> +1. I started hacking it a bit and got another ~2x speedup on my
> benchmark, but it was too simple and I didn't have the time to dig deeper.
>
>
> Levente

The benchmark is described first; the datasets and collision-rate
estimates follow below. Credit goes to Andres Valloud's work in his
book:
www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M

I've made a naive version modelled after the process he lays out in
the book. This version only tests dates and times and does not
calculate chi-squared statistics or anything like that. My code is
here:
http://ss3.gemstone.com/ss/TimeHashing
and you also need:
http://www.squeaksource.com/DHBNumerical.html



When I claimed it was 100x faster, I was relaying the results of
running this test in a workspace:

now := DateAndTime now.
[now hash] bench

and switching between the old:

DateAndTime>>#hash
	^self asUTC ticks hash

and the new:

DateAndTime>>#hash
	"Mix the Julian day number, the UTC-normalized seconds of
	the day, and the nanoseconds into a single hash."
	| totalSeconds |
	totalSeconds := seconds - offset asSeconds.
	^ ((totalSeconds // 86400 + jdn) hashMultiply
		bitXor: totalSeconds \\ 86400) bitXor: nanos

I ran it in Squeak 4.3 and Pharo 1.3 on 64-bit Ubuntu with Eliot's Cog VM.
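
Since equal DateAndTimes must have equal hashes, the UTC
normalization in the new version can be sanity-checked in a
workspace (the constructor selector is the one from Squeak's
Chronology; adjust if your image differs):

| local utc |
local := DateAndTime
	year: 2012 month: 5 day: 14
	hour: 3 minute: 0 second: 0
	offset: (Duration hours: -4).
utc := local asUTC.
"Both expressions should answer true: same instant, same hash."
local = utc.
local hash = utc hash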

Based on Levente's comment I made some more benchmarks that test the
hashing speed for the items in the datasets I used for testing
collisions, and included them in the package. The 100x speedup seems
to hold up when testing a larger set of values, but in either case I
could also be measuring something I don't intend to. I don't know.
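
For reference, the per-dataset speed test amounts to something like
this (shown here with the allInstances dataset; the actual benchmark
methods are in the TimeHashing package):

| dataSet |
dataSet := DateAndTime allInstances asSet asArray.
"Throughput of the hash over the whole dataset."
[dataSet do: [:each | each hash]] bench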


---------------------------------------------------

For the collision-rate datasets:

I built two arrays and also used 'DateAndTime allInstances asSet'.

The first array is a random array of DateAndTime instances of a
user-specified size, where the values for the jdn, seconds, offset,
and nanos are selected independently at random from valid values.
For the offsets I use the 55 UTC offsets listed on Wikipedia.
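
Roughly, each random instance is built along these lines (the JDN
range and the three stand-in offsets are illustrative assumptions,
not the exact values from the package):

| gen offsets jdn secs nanos dt |
gen := Random new.
"Stand-ins for the 55 UTC offsets taken from Wikipedia."
offsets := { Duration hours: -5. Duration hours: 0. Duration hours: 5.5 }.
jdn := 2400000 + (gen nextInt: 200000).	"Julian day number"
secs := (gen nextInt: 86400) - 1.	"second of the day: 0..86399"
nanos := (gen nextInt: 1000000000) - 1.	"nanosecond: 0..999999999"
dt := (DateAndTime julianDayNumber: jdn)
	+ (Duration seconds: secs nanoSeconds: nanos).
dt offset: (offsets atRandom: gen)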

The second array is a sequential array where I create a random
DateAndTime as above, but then for a given number of iterations add
1 second, 1 minute, 1 hour, 1 day, 1 week, 30 days, and 1 year to
the created date, respectively. So you wind up with an array that
has seven sections of roughly equal length, with the elements of
each section a second, minute, hour, day, week, month, or year
apart. Duplicates are removed before running the tests.
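
In sketch form, one way to build such an array (the section length
and the 30-day/365-day stand-ins for a month and a year are
assumptions):

| start steps data |
start := DateAndTime now.	"the real code starts from a random DateAndTime"
steps := {
	Duration seconds: 1.
	Duration minutes: 1.
	Duration hours: 1.
	Duration days: 1.
	Duration days: 7.
	Duration days: 30.
	Duration days: 365 }.
data := OrderedCollection new.
steps do: [:step | | dt |
	dt := start.
	1000 timesRepeat: [
		dt := dt + step.
		data add: dt]].
"Duplicates are removed before running the tests."
data asSet asArray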

In the DateAndTimeHashing class there are three class-side methods
(runRandomHashing, runAllInstancesHashing, runSequentialHashing)
that will run the tests and print the number of collisions / number
of iterations to the Transcript.
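
For example, to run all three from a workspace:

Transcript open.
DateAndTimeHashing runRandomHashing.
DateAndTimeHashing runSequentialHashing.
DateAndTimeHashing runAllInstancesHashing.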


The number of iterations in the version on ss3 is artificially low
and should probably be raised once the code is loaded. The data sets
can be refreshed with different random values by running:

DateAndTimeHashing refreshDataSets.
To modify the hash function that's tested, change
DateAndTimeHashing>>#hashOf:. I've left the old version and Nicolas
Cellier's version in the comments of that method.
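
The hook is just a one-liner, so trying a different hash means
editing its body; roughly this shape (not the exact code from the
package):

DateAndTimeHashing>>#hashOf: aDateAndTime
	"Answer the hash under test; swap in an alternative here."
	^ aDateAndTime hash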

I originally built this to test options for Chronos's Timepoint
hash, which in the version for Squeak is just the 'days since the
Squeak epoch'.

