[squeak-dev] Improving WeakKeyDictionary comment

Jakob Reschke jakres+squeak at gmail.com
Mon Jun 6 10:16:07 UTC 2022


Hi,

So to sum up: if the WeakKeyDictionary gets new elements added frequently,
causing it to consider growing every once in a while, it may not be
necessary to register it with WeakArray. On the other hand, if the
dictionary is seldom modified, one can register it with WeakArray to get
stale associations collected "as soon as possible." Alternatively, I could
send #finalizeValues myself as regularly as makes sense for my
application. Is that right?
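To make the two options concrete, here is a minimal sketch (assuming Squeak trunk, where WeakArray class>>#addWeakDependent: registers an object with the finalization process and WeakKeyDictionary>>#finalizeValues removes stale associations):

```smalltalk
| d |
d := WeakKeyDictionary new.

"Option 1: register with WeakArray; the finalization process will then
 trigger cleanup soon after keys are garbage collected."
WeakArray addWeakDependent: d.

"Option 2: stay unregistered and clean up stale associations manually,
 at whatever interval suits the application."
d finalizeValues
```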

In my latest use case I wanted to use a WeakIdentityKeyDictionary to hold
certain "on-demand inst vars" for other kinds of objects (like the
DependentsFields dictionary in Object, which adds a list of dependents to
any kind of object on demand). So in general I would like each on-demand
value to be collected when the object to which it belongs gets collected.
But I expect that new keys will be added only in rare circumstances (except
during the unit tests), and that the keys will be garbage-collected only in
rare circumstances as well. The keys are expected to be PackageInfos from
the package organizer, so changes come only when the extra info is accessed
for a package for the first time, or when a package gets removed from the
system. I tried turning my dictionary from an IdentityDictionary to a
WeakIdentityKeyDictionary (with #newFrom:) without registering, and it kept
lots of garbage elements, which were previously added during repeated test
runs. Soon after registering it with WeakArray, most elements were removed.
On the other hand, now most invocations of #finalizeValues are wasted
because, as mentioned, packages do not disappear frequently under normal
circumstances. Hmm...
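For illustration, the pattern I have in mind could be sketched like this (all names here are hypothetical; only the weak-key idea mirrors what DependentsFields does for Object):

```smalltalk
"A class-side variable holds the on-demand values, keyed weakly by
 identity so entries can go away together with their key objects."
ExtraPackageInfo := WeakIdentityKeyDictionary new.

"Attach extra info lazily, the first time it is requested for a package.
 Once the PackageInfo itself is collected, the association becomes stale
 and disappears on the next cleanup (automatic or via #finalizeValues)."
extraInfoFor: aPackageInfo
	^ ExtraPackageInfo at: aPackageInfo ifAbsentPut: [ Dictionary new ]
```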

Kind regards,
Jakob


On Mon, Jun 6, 2022 at 11:00 AM Levente Uzonyi <
leves at caesar.elte.hu> wrote:

> Hi Chris,
>
> On Mon, 6 Jun 2022, Chris Muller wrote:
>
> > Hi Levente,
> >
> >>>      > - Must one ensure that #finalizeValues is sent to one's
> >>>      > WeakKeyDictionary instance regularly?
> >>>
> >>> Yes.  Either that, or you need to remove the garbage-collected
> >>> entries from its internal array some other way.
> >>>
> >>> Otherwise, the WeakKeyDictionary will only ever grow in size
> >>> internally, with a bunch of nil keys.  Here's a short script showing
> >>> the mechanics and assumptions of this:
> >>
> >> No, that is not the case. Associations with nil keys get removed
> >> automatically, which is why you cannot have nil keys in
> >> WeakKeyDictionaries (and because there was no demand to implement
> >> something like what has been done for Sets).
> >>
> >> What you mean is that associations do not get removed immediately.
> >> Here is a snippet demonstrating the situation by only sending
> >> #at:put: to the dictionary, always with a new key:
> >>
> >> | w |
> >> w := WeakKeyDictionary new.
> >> 1000 timesRepeat: [ w at: Object new put: 0 ].
> >> Transcript open; showln: { w size. w slowSize. w capacity }.
> >> w capacity - w size timesRepeat: [ w at: Object new put: 0 ].
> >> Smalltalk garbageCollectMost.
> >> w at: Object new put: 0.
> >> Smalltalk garbageCollectMost.
> >> Transcript showln: { w size. w slowSize. w capacity }.
> >>
> >>
> >> The following lines should appear on the transcript:
> >> #(1000 0 1817)
> >> #(1 0 2)
> >
> > Okay, that must be new.  In 5.3, I still get:
> >
> > #(1000 1000 2423)
> > #(2424 0 5119)
> >
> > but, in trunk, yes!  That's great, thanks for letting me know, but I'm
> > having trouble identifying _why_ it works like that in trunk -- what
> > change actually enabled that?  It's something Igor and I worked on
>
> Well, it works in 5.3 as well, but the snippet I sent assumes that
> #capacity returns the capacity - the number of elements the collection can
> hold without growing.
> For 5.3 and before the corresponding snippet is:
>
> w := WeakKeyDictionary new.
> 1000 timesRepeat: [ w at: Object new put: 0 ].
> Transcript open; showln: { w size. w slowSize. w capacity }.
> w capacity * 3 // 4 - w size timesRepeat: [ w at: Object new put: 0 ].
> Smalltalk garbageCollectMost.
> w at: Object new put: 0.
> Smalltalk garbageCollectMost.
> Transcript showln: { w size. w slowSize. w capacity }.
>
> > back in 2010 (I incorrectly said 2007, before), but it involved a VM
> > change we couldn't get pushed through.
> >
> >> The dictionary will eventually shrink to the smallest possible
> >> capacity, 2,
> >
> > Wow, even smaller than the default initial capacity, 3, now there's a
> > pleasant surprise. ;)  Does that mean if I do a "WeakKeyDictionary
> > new" and never add any entries to it, its capacity will eventually
> > shrink to 2?
>
> No, it won't. You need to trigger the automatic resizing of the
> dictionary, and it's pretty hard without adding new entries to it.
>
> >
> >> holding one association to be garbage collected and no
> >> associations with a valid key.
> >>
> >> As with all weak hashed collections, #isEmpty, #size, #notEmpty are
> >> not reliable because the entries may vanish any time, so there's no
> >> effort to even try to make those correct.
> >
> > Hm, Marcel fixed #isEmpty for Weak collections in 2020.  There are
> > contexts where those methods are useful even with the understanding of
> > the potential inaccuracy.  Wouldn't you consider #at: and
>
> Thanks, now I remember that #isEmpty and #notEmpty work as expected.
>
> > #includesKey: to also be affected (e.g., "not reliable") by this
> > logic, as well?
>
> No, #includesKey: is exact in the sense that, while the method is being
> executed, the result matches the status of the requested key.
> A GC may happen right after #includesKey: returns, though, rendering
> the result stale.
>
> >
> >> If you need accurate information, you must check against #slowSize,
> >> which means O(n) runtime. Using ephemerons may help with that.
> >
> > I hope to one day understand ephemerons and why they're useful,
> > because I sense they provide some kind of magic that could help my
> > stack.  But, I never grok'd them so far.
> >
> >>> Igor made some alternative-implementation WeakKeyDictionaries for
> >>> Magma back in 2007 which are faster than the ones built into Squeak
> >>> to this day (last I checked, anyway).  If interested, load "MaBase"
> >>> from SqueakMap and then see MaWeakIdentityKeyDictionary.
> >>
> >> Is it really faster?
> >> Here's a basic benchmark:
> >>
> >> [
> >>         | m |
> >>         Smalltalk garbageCollect.
> >>         m := MaDictionary new.
> >>         {
> >>                 #insert -> [ 1 to: 1000000 do: [ :i | m at: i hashMultiply put: i ] ] timeToRun.
> >>                 #access -> [ 1 to: 1000000 do: [ :i | m at: i hashMultiply ] ] timeToRun.
> >>                 #update -> [ 1 to: 1000000 do: [ :i | m at: i hashMultiply put: i + 1 ] ] timeToRun.
> >>                 #positiveLookup -> [ 1 to: 1000000 do: [ :i | m includesKey: i hashMultiply ] ] timeToRun.
> >>                 #negativeLookup -> [ 1000001 to: 2000000 do: [ :i | m includesKey: i hashMultiply ] ] timeToRun.
> >>                 #remove -> [ 1 to: 1000000 do: [ :i | m removeKey: i hashMultiply ] ] timeToRun
> >>         } ] value.
> >>
> >> "{#insert->535 . #access->1062 . #update->250 . #positiveLookup->967 . #negativeLookup->1055 . #remove->2353}"
> >> The same benchmark for Dictionary yields
> >> {#insert->469 . #access->945 . #update->198 . #positiveLookup->174 . #negativeLookup->301 . #remove->719} on my machine.
> >> Even with MaDictionary >> #checkForUnderflow: patched to avoid the creation of Fractions.
> >
> > Yeah, one Fraction per resize isn't going to be noticeable, but thanks
> > for pointing it out, I patched it.
>
> I noticed it by running #timeProfile and it showed up with IIRC 20% of the
> runtime. Creating a fraction is not cheap, but I think the resizing
> algorithm of MaDictionary is broken.
>
> I created MaDictionary2, which uses a different hash function, a
> different table size, and a conservative resize strategy. The
> performance difference is amazing, give it a try:
> https://leves.web.elte.hu/squeak/Ma-Collections-ul.169.mcz
>
> I think the difference comes from better cache locality, but it's
> just a wild guess. I have no idea how it became so fast even though it
> lacks the typical micro-optimizations one would apply there.
>
>
> Levente
>
> >
> > MaDictionary is actually just an abstract superclass Igor made to
> > contain his work; it wasn't meant to be used.  The real purpose of his
> > work was the replacement of Squeak's woeful WeakIdentityKeyDictionary
> > from the days when the identityHash was only 12 bits.  Your bench
> > didn't have any collisions, but that's what the Ma variant's optimized
> > for.
> >
> > But Squeak has come a long way since the 12-bit identityHash and
> > #hashMultiply, so the benefit is not nearly as pronounced.  Still, I
> > enhanced your basic bench to more closely emulate how Magma would use
> > them, and it shows that the Ma variant is still a _little_ (albeit
> > not much) better than WeakIdentityKeyDictionary with #at: and
> > #at:put:...
> >
> > { WeakIdentityKeyDictionary. MaWeakIdentityKeyDictionary }
> > collect: [ :cls |
> > [
> >        | m r o testSize |
> >        testSize := 1e6.
> >        o := Object new.
> >        r := (1 to: testSize) collect: [ :e | Object new ].
> >        r := r shuffled.  "mix up identityHash's to simulate real-world access, just in case consecutive creation above might artificially help the hash algo"
> >        m := cls new.
> >        Smalltalk garbageCollect.
> >        {        m class -> testSize.
> >                #insert -> [ r do: [ :i | m at: i put: i ] ] timeToRun.
> >                r := r shuffled.
> >                #access -> [ r do: [ :i | m at: i ] ] timeToRun.
> >                r := r shuffled.
> >                #negativeaccess -> [ r do: [ :i | m at: o ifAbsent: [ nil ] ] ] timeToRun.
> >                r := r shuffled.
> >                #update -> [ r do: [ :i | m at: i put: 0 ] ] timeToRun.
> >                r := r shuffled.
> >                #positiveLookup -> [ r do: [ :i | m includesKey: i ] ] timeToRun.
> >                r := r shuffled.
> >                #negativeLookup -> [ 1000001 to: 2000000 do: [ :i | m includesKey: o ] ] timeToRun.
> >        } ] value select: [ :e | e isVariableBinding ] ].
> >
> > {{WeakIdentityKeyDictionary->1000000 . #insert->1830 . #access->1260 . #negativeaccess->705 . #update->532 . #positiveLookup->524 . #negativeLookup->42} .
> > {MaWeakIdentityKeyDictionary->1000000 . #insert->1286 . #access->1170 . #negativeaccess->739 . #update->435 . #positiveLookup->1131 . #negativeLookup->800}}
> >
> > I never noticed #includesKey: being so slow on Ma (especially the
> > negative case, wow!).  I think Magma does use that in one place,
> > but most of the usage is #at: and #at:put:, which get even better
> > as the sizes go beyond 1M elements.
> >
> > But with 6.0's new auto-cleaning, I will certainly retire these Ma
> > dictionaries!
> >
> > I always learn useful stuff from you, thanks a lot for your help.
> >
> > - Chris
> >
>
>

