[squeak-dev] 4.1 - hashed collections still a problem
Levente Uzonyi
leves at elte.hu
Sun Mar 28 00:47:25 UTC 2010
On Sat, 27 Mar 2010, Chris Muller wrote:
>>> Dictionaries don't have to use associations (for example MethodDictionary
>>> doesn't use them), that's why #pointsTo: works (MethodDictionary also uses
>>> it).
>>>
>> But that means a linear scan of the whole collection, even if done primitively,
>> this is not scalable.
>
> Not necessarily. VisualAge had a LookupTable class that was identical
> to Dictionary except it used two parallel Array's (keys / values)
> rather than Association objects instaniated by its Dictionary.
> LookupTable was faster..
>
>
Linear search is not scalable in the way MethodDictionary uses it, but
that doesn't have to scale at all. MethodDictionaries rarely grow large
and #pointsTo: is faster if the capacity of the dictinary is less than
1024 (that covers all but one MethodDictionary in my image).
LookupTable is totally unrelated to this issue. It's just an
association-free dictionary implementation, but it still uses open
addressing so it doesn't solve the issue with Squeak's identityHash.
It's easier to fight the issue of few keys and lots of objects with
separate-chaining. Igor's sets/dictionaries use a linked list of
associations for that, while my implementations uses arrays and no
associations.
Using no associations and (separate key-value) arrays has several
advantages over the list of associations:
- smaller memory footprint because of less objects (this means that it's
more cache-friendly)
- better cache locality, because pointers to keys with the same hash
use adjacent memory slots (this means that it's much more cache-friendly)
- #pointsTo: can be used with IdentityDictionaries which is much
faster than a linear search implemented with a fully optimized loop.
This all means that LargeIdentityDictionary is 3.7x faster than
MaIdentityDictionary for #at:ifAbsent: and 40x faster for #includesKey:
when reaching 1000000 elements in the benchmark below, while
MaIdentityDictionary is only 30-40% faster than IdentityDictionary (if you
don't run into a really bad clustering. With better primes the chance of
that may be even lower, and with an extra check for clustering that may be
totally avoided).
Of course the above numbers are only true if the #notNil send is replaced
with == nil in MaIdentityDictionary >> #at:ifAbsent: otherwise it's
a bit worse (MaIdentityDictionary is only 20% faster than IdentityDictionary).
The benchmark (requires the latest trunk image):
objects := Array new: 1000000 streamContents: [ :stream | 1 to: 1000000 do: [ :i | stream nextPut: Object new ] ].
objects shuffleBy: (Random seed: 36rSQUEAK). "Could be anything, since the hashes are random."
results := {
LargeIdentityDictionary.
MaIdentityDictionary.
IdentityDictionary } collect: [ :dictionaryClass |
| objectSource |
objectSource := objects readStream.
dictionaryClass -> (Array streamContents: [ :results |
| dictionary newObjects firstRun oldObjects |
dictionary := dictionaryClass new.
newObjects := Array new: 10000.
oldObjects := nil.
Smalltalk garbageCollect.
[ dictionary size >= 1000000 ] whileFalse: [
1 to: 10000 do: [ :index | dictionary at: (newObjects at: index put: objectSource next) put: nil ].
oldObjects ifNil: [ oldObjects := newObjects copy ].
results nextPut: {
dictionary size.
[
1 to: 5 do: [ :run |
1 to: 10000 do: [ :index | dictionary includesKey: (oldObjects at: index) ] ].
1 to: 5 do: [ :run |
1 to: 10000 do: [ :index | dictionary includesKey: (newObjects at: index) ] ] ] timeToRun.
[
1 to: 5 do: [ :run |
1 to: 10000 do: [ :index | dictionary at: (oldObjects at: index) ifAbsent: nil ] ].
1 to: 5 do: [ :run |
1 to: 10000 do: [ :index | dictionary at: (newObjects at: index) ifAbsent: nil ] ] ] timeToRun } ] ]) ].
While reviewing MaDictionary and subclasses I found that #maxBuckets is
wrong. It's okay for MaIdentityDictionary, but it won't allow an
MaDictionary to have more than 4096 buckets.
Levente
More information about the Squeak-dev
mailing list
|