Magma design

Stephen Pair spair at acm.org
Sat Aug 3 02:43:49 UTC 2002


Chris,

Chris Muller wrote:
> Hi Stephen, let me see if I can answer your questions:
> 
> > I notice that you are assigning oids to every object that 
> is stored in
> Magma and using a WeakKeyIdentityDictionary for storing the 
> oids.  One issue I had with that approach is that I think it 
> runs into scalability issues when you approach and surpass 
> 4096 objects (due to the number of bits available for the 
> identity hash).  Is there a way to make this scheme more 
> scalable?  Or, is it possible that it will be rare to have 
> more than 4000 persistent objects cached on the client?
> 
> A:  I don't think it would be "rare" to have more than 4K 
> objects in a medium-sized program.  However, it was written 
> with the expectation that identityHash will be fast.  If 
> there is a scalability problem w/ weakIdentityDictionaries 
> greater than 4K in size, then there may be a scalability 
> issue w/ Magma.

The identity hash in Squeak is only 12 bits.  There has been a lot of
discussion in the past regarding how to improve this...you might be able
to get better scalability by spreading out the hash (and perhaps
subclassing WeakKeyIdentityDictionary)...you'd still get the same number
of collisions, but presumably, you wouldn't have to scan nearly as far
to get a match or an empty slot.
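
Something along these lines might work (just a sketch; it assumes the
keys sit in associations in the 'array' instance variable, as in
Squeak's Dictionary, and MaLargeWeakKeyIdentityDictionary is a made-up
name):

  WeakKeyIdentityDictionary subclass: #MaLargeWeakKeyIdentityDictionary
      instanceVariableNames: ''
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Magma-Collections'

  MaLargeWeakKeyIdentityDictionary>>scanFor: anObject
      "Probe for anObject's slot.  Scaling the 12-bit identityHash up
      to the table size keeps keys from clustering in the first 4096
      slots, so probe chains stay short in big tables."
      | finish start element |
      finish := array size.
      start := anObject identityHash * ((finish // 4096) max: 1)
          \\ finish + 1.
      start to: finish do: [:index |
          element := array at: index.
          (element == nil or: [element key == anObject])
              ifTrue: [^ index]].
      1 to: start - 1 do: [:index |
          element := array at: index.
          (element == nil or: [element key == anObject])
              ifTrue: [^ index]].
      ^ 0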

> ======
> 
> How are you tracking the changed objects?
> A:  Take a look at MaTransaction>>markRead:.  MaTransaction 
> maintains an IdentityDictionary whose keys are the read 
> objects and whose values are shallowCopy "backups".  When you 
> commit, it zips through the keys and values and does an 
> identity compare of each object's variables (see implementors 
> of maIsChangedFrom:).  This may seem like a lot of work, but 
> it's actually surprisingly fast.  Additionally, I prefer this 
> "copy and compare" method of change detection because it 
> offers the most transparency.

After I wrote the email, I discovered this...very clever!  Since you
only need to do the scan when you commit, it's not a horrible price to
pay.  I was baffled by how Magma was able to detect changes in brand new
instances I was creating...I even tried to fool it, but it worked like a
charm.  It looks like new objects get stored (during commit) when they are
referenced from other objects that are already in the db...when they are
stored, you then record the shallowCopy backup...that way the object is
part of the scan to detect changes.  Is that correct?
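
If I have the scheme right, it boils down to something like this (my
own sketch; markRead: is real, but the commit-side method names here
are made up, and this only compares named instance variables):

  MaTransaction>>markRead: anObject
      "Record a shallow copy of each object as it is read, so the
      commit can detect mutations by comparing against it."
      backups at: anObject put: anObject shallowCopy

  MaTransaction>>changedObjects
      "Answer the read objects whose state differs from their backup."
      ^ backups keys select: [:each |
          each maIsChangedFrom: (backups at: each)]

  Object>>maIsChangedFrom: aBackup
      "Identity-compare each named instance variable with the backup."
      1 to: self class instSize do: [:i |
          (self instVarAt: i) == (aBackup instVarAt: i)
              ifFalse: [^ true]].
      ^ false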

> ========
> 
> It takes a long time to establish a MagmaSession (especially 
> after some objects have been populated in the server)...can 
> you describe what's happening when connecting?
> A:  Hmm... I've never noticed that taking a long time.  If 
> you look at 
> MagmaSession>>connectAs:maximumNumberOfChallengers:, it 
> summarizes what happens upon connection.  Basically, we have 
> to get the classIds and definitions of all the classes that 
> are known to the repository.  The class definition stored in 
> Magma has to match what is in your image or it won't let you 
> connect.
> 
> I haven't posted the Swiki page yet about how to tell Magma 
> to "upgrade" a class to a new version, but you can see 
> examples in MagmaTestCase.
>
>
> ==========
> 
> I see that you are using a stubbing mechanism (a la GemStone) 
> that uses a ProtoObject and #doesNotUnderstand: to 
> transparently forward messages to the real object.  Are you 
> also using #become: to change these objects into their real 
> counterpart?  If so, won't this present a performance issue 
> under certain circumstances (where one or both of the objects 
> are in old space)?  Also, did you implement a "stubbing 
> level" mechanism ala GemStone?
> A:  I use becomeForward:.  I've not noticed any performance 
> issues in my "small" volume tests.  If this causes a 
> performance problem due to them being in oldspace, do you 
> have an alternative?

Unfortunately no...I've always been leery of solutions that require a
#become: (or #becomeForward:) due to the potential performance issue.
Squeak has a direct mapped memory model, which means that if you swap
the identities of two objects, you must scan all of memory (unless
you're dealing with two young objects).  You can avoid some of the
performance issues if you can bunch up a lot of objects (this is why
Squeak's #become is actually based on
Array>>elementsExchangeIdentityWith:).  I don't see a way of bunching up
a lot of stubs though.

Perhaps I've been a little too quick to dismiss #become: though...if you
examine normal usage patterns, it's probably the case that the vast
majority of your #become: calls are happening with two young
objects...which is fast.
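
For anyone following along, the stub pattern under discussion looks
roughly like this (a minimal sketch only; Magma's real stubs certainly
carry more state, and #materializeOid: is a made-up accessor):

  ProtoObject subclass: #MaStubSketch
      instanceVariableNames: 'session oid'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Magma-Sketches'

  MaStubSketch>>doesNotUnderstand: aMessage
      "Fetch the real object, forward every reference to this stub to
      it, then resend the message that tripped the stub."
      | real |
      real := session materializeOid: oid.  "made-up accessor"
      self becomeForward: real.
      ^ real perform: aMessage selector withArguments: aMessage arguments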

> ========
> 
> Is there any kind of cache control in Magma?  For example, if 
> I have a client that is running for many weeks and accessing 
> lots of objects, once they are pulled from the server to the 
> client, are they going to stay in the client indefinitely?  
> Is there some way of controlling how many objects are 
> retained in the client's object memory?
> 
> A:  The only "caching" Magma does is in weak collections, so 
> there shouldn't be anything that you don't cache yourself.

Avi Bryant wrote:
"Right, but once a stub gets replaced with a real object, the real
object will never get replaced back with a stub, correct? Which means
that, since the session has a reference to the root, and the root has a
reference to every other object in the database, once an object is
pulled into the client it will never be garbage collected unless it is
removed from the database.  This can be a problem for long running
clients.  I think what Stephen is suggesting is that, for example, you
periodically becomeForward: objects that have not been used for a
certain length of time back into stubs."

Avi summarizes it correctly...the issue I see is that you can
eventually get the whole database in memory (if your application runs
long enough and works with enough of your dataset).  One solution might
be to simply restart the image when it grows beyond a certain threshold
(I do that right now with swiki.net).  Here's a caching design I'm
working on:

The algorithm is an approximate LRU and works by counting the number of
dereferences (when an object is actually retrieved from the db) and
object stores (when an object is saved).  Each cached object (actually
an object cluster in my case) has an age.  After a threshold is reached
(say 1000 derefs), I scan through the cached objects (actually clusters)
incrementing the age of relatively young references, and flushing the
objects whose references have reached a maximum age (say 5).  When an
object is dereferenced, the age gets reset to 0.  So, the age actually
tells us how many aging scans have occurred since the last time the
cluster was dereferenced.  Thus, older objects have not been accessed in
a while (note: I chose not to implement transparent proxies...you access
your cached objects by holding onto a PdbObjectReference, which holds
the root of a cached cluster of objects...thus, you must always access
your cached objects by sending the #deref message to a
PdbObjectReference).  Also, I have to check all of the process stacks to
make sure that an object is not currently being accessed before flushing
it (probably a rare occurrence anyway).
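
In (rough) code, with all of the names hypothetical:

  PdbCache>>noteDeref
      "Count dereferences; after every 1000, run an aging scan."
      derefCount := derefCount + 1.
      derefCount >= 1000 ifTrue: [
          self ageClusters.
          derefCount := 0]

  PdbCache>>ageClusters
      "Flush clusters untouched for 5 scans; age the others by one."
      | stale |
      stale := references select: [:ref | ref age >= 5].
      stale do: [:ref | ref flush].
      references removeAll: stale.
      references do: [:ref | ref incrementAge]

  PdbObjectReference>>deref
      "Answer the cluster root, resetting the age to 'just used'."
      age := 0.
      root ifNil: [root := self materializeFromDb].  "made-up read"
      ^ root

  PdbObjectReference>>incrementAge
      age := age + 1

  PdbObjectReference>>flush
      "Drop the cached cluster; a later #deref will re-read it."
      root := nil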

I'm thinking that this type of caching might work well in practice.  It
doesn't set any hard limits on the number of cached objects...so, if you
have a very large and active working set, they should all stay young and
stay in memory.  A possible improvement might be to add a hard limit on
the number of cached objects and have a deref trigger a scan (in
addition to the deref counting) if that deref would cause the limit to
be exceeded.
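
That improvement would only change the trigger (same made-up names as
in the sketch above, plus a maxClusters limit):

  PdbCache>>noteDeref
      "Hard-limit variant: also scan early whenever the cache grows
      past maxClusters."
      derefCount := derefCount + 1.
      (derefCount >= 1000 or: [references size > maxClusters])
          ifTrue: [
              self ageClusters.
              derefCount := 0]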

- Stephen




