DateAndTime hash was: Re: [squeak-dev] true hash
Paul DeBruicker
pdebruic at gmail.com
Mon May 14 07:15:59 UTC 2012
On 05/13/2012 08:04 PM, Levente Uzonyi wrote:
> On Sat, 12 May 2012, Andres Valloud wrote:
>
>> When asserting speedups and collision rates, please also mention what
>> was the dataset used for the measurements. In that way, others can
>> replicate the results and compare hash functions on a consistent basis.
>
> +1. I started hacking it a bit and got another ~2x speedup on my
> benchmark, but it was too simple and I didn't have the time to dig deeper.
>
>
> Levente
The benchmark is described first, and the dataset/collision-rate
estimates follow. Credit goes to Andres Valloud's work in his book:
www.amazon.com/Hashing-Smalltalk-Theory-Practice/dp/B00262SJ1M
I've just made a naive version modelled after the process he lays out in
the book. This version only tests dates and times and does not calculate
chi-squared statistics or anything like that. I put my code here:
http://ss3.gemstone.com/ss/TimeHashing
and you also need:
http://www.squeaksource.com/DHBNumerical.html
When I claimed it was 100x faster I was relaying the results of running
this test in a workspace:
now := DateAndTime now.
[now hash] bench
and changed between the old:
DateAndTime>>#hash
	^ self asUTC ticks hash
and the new:
DateAndTime>>#hash
	| totalSeconds |
	totalSeconds := seconds - offset asSeconds.
	^ ((totalSeconds // 86400 + jdn) hashMultiply
		bitXor: totalSeconds \\ 86400) bitXor: nanos
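To make the mixing easier to follow, here is a Python sketch of the new hash above. It is not the author's code; it assumes Squeak's SmallInteger>>hashMultiply is multiplication by 1664525 truncated to 28 bits, and that jdn/seconds/nanos/offset carry the same meanings as the instance variables in the Smalltalk method.

```python
# Python sketch of the new DateAndTime>>#hash (illustration only).
# Assumes hashMultiply == (n * 1664525) truncated to 28 bits, as in Squeak.

SECONDS_PER_DAY = 86400

def hash_multiply(n):
    """28-bit multiplicative hash, modelled on Squeak's hashMultiply."""
    return (n * 1664525) & 0xFFFFFFF

def date_and_time_hash(jdn, seconds, nanos, offset_seconds):
    """Mix Julian day number, second-of-day, and nanoseconds into one hash."""
    total_seconds = seconds - offset_seconds      # normalize to UTC
    day = total_seconds // SECONDS_PER_DAY + jdn  # whole days shift the JDN
    second_of_day = total_seconds % SECONDS_PER_DAY
    return (hash_multiply(day) ^ second_of_day) ^ nanos

# Two instants one second apart hash differently:
a = date_and_time_hash(2456062, 3600, 0, 0)
b = date_and_time_hash(2456062, 3601, 0, 0)
assert a != b
```

The hashMultiply on the day number spreads nearby JDNs across the 28-bit range before the XOR folds in the sub-day fields.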
I ran it in Squeak 4.3 and Pharo 1.3 on 64-bit Ubuntu with Eliot's Cog VM.
Based on Levente's comment I wrote some more benchmarks that test the
hashing speed for the items in the datasets I used for testing
collisions, and included them in the package. The 100x speedup seems to
hold up when testing a larger set of values, though in either case I
could be inadvertently measuring something I don't intend to. I don't
know.
---------------------------------------------------
For the datasets used for the collision rates:
I built two arrays and also used 'DateAndTime allInstances asSet'.
The first array is a random array of DateAndTime instances of a
user-specified size, where the values for jdn, seconds, offset, and
nanos are selected independently at random from valid values. For the
offsets I use the 55 listed on Wikipedia.
The second array is a sequential array: I create a random DateAndTime as
above, then for a given number of iterations repeatedly add 1 second, 1
minute, 1 hour, 1 day, 1 week, 30 days, and 1 year to the created date,
respectively. So you wind up with an array that has seven sections of
roughly equal length, with each element a second, minute, hour, day,
week, month, or year apart. Duplicates are removed before running the
tests.
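A rough Python sketch of how the two datasets could be built (the helper names and the offset list are hypothetical stand-ins, not code from the TimeHashing package):

```python
# Sketch of the random and sequential datasets (illustration only).
import random

# Stand-in for the 55 UTC offsets listed on Wikipedia:
OFFSETS = [3600 * h for h in range(-12, 15)]

def random_timestamp():
    """One (jdn, seconds, nanos, offset) tuple, each field drawn independently."""
    return (random.randrange(0, 3_000_000),        # Julian day number
            random.randrange(0, 86400),            # second of day
            random.randrange(0, 1_000_000_000),    # nanosecond
            random.choice(OFFSETS))

def sequential_dataset(n_per_section):
    """One random start instant, then step by second/minute/hour/day/week/30 days/year."""
    jdn, sec, nanos, off = random_timestamp()
    base = jdn * 86400 + sec                       # instant as seconds since JDN 0
    steps = [1, 60, 3600, 86400, 7 * 86400, 30 * 86400, 365 * 86400]
    data = set()                                   # a set, so duplicates are removed
    for step in steps:
        for i in range(n_per_section):
            t = base + i * step
            data.add((t // 86400, t % 86400, nanos, off))
    return data
```

Because the seven sections all share the start instant (i = 0), the set ends up slightly smaller than 7 × n_per_section.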
In the DateAndTimeHashing class there are three class-side methods
(runRandomHashing, runAllInstancesHashing, runSequentialHashing) that
will run and print the number of collisions / number of iterations to
the Transcript.
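The collision count those run* methods report can be sketched in a couple of lines of Python (my guess at the metric, not the package's code): hash every element and count how many share a hash value with an earlier element.

```python
# Sketch of a collision count over a list of hash values (illustration only).

def collision_count(hashes):
    """Number of elements whose hash duplicates an earlier element's hash."""
    return len(hashes) - len(set(hashes))

# Six values, three distinct hashes -> three collisions:
assert collision_count([1, 2, 2, 3, 3, 3]) == 3
```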
The number of iterations in the version on ss3 is artificially low and
should probably be raised once the code is loaded. The datasets can be
refreshed to different random values by running:
DateAndTimeHashing refreshDataSets.
To modify the hash function that's tested, change
DateAndTimeHashing>>#hashOf:. I've left the old version and Nicolas
Cellier's version in the comments of that method.
I originally built this to test options for Chronos's Timepoint hash
which is just the 'day since the squeak epoch' in the version for Squeak.