Chris,
Chris Muller wrote:
Hi Stephen, let me see if I can answer your questions:
I notice that you are assigning oids to every object that is stored in Magma and using a WeakKeyIdentityDictionary for storing the oids. One issue I had with that approach is that I think it runs into scalability issues when you approach and surpass 4096 objects (due to the number of bits available for the identity hash). Is there a way to make this scheme more scalable? Or, is it possible that it will be rare to have more than 4000 persistent objects cached on the client?
A: I don't think it would be "rare" to have more than 4K objects in a medium-sized program. However, it was written with the expectation that identityHash will be fast. If there is a scalability problem w/ weakIdentityDictionaries greater than 4K in size, then there may be a scalability issue w/ Magma.
The identity hash in Squeak is only 12 bits. There has been a lot of discussion in the past regarding how to improve this...you might be able to get better scalability by spreading out the hash (and perhaps subclassing WeakKeyIdentityDictionary)...you'd still get the same number of collisions, but presumably, you wouldn't have to scan nearly as far to get a match or an empty slot.
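To illustrate what I mean by spreading out the hash, here is a rough simulation in Python (not Squeak code). It assumes a plain linear-probing table and random 12-bit hashes; the function name, table size, and key count are made up for the example and say nothing about how WeakKeyIdentityDictionary is actually implemented.

```python
import random

HASH_BITS = 12  # width of the identity hash discussed above


def average_probes(num_keys, table_size, spread):
    """Insert num_keys random 12-bit hashes using linear probing.

    Returns the mean number of slots examined per insertion.
    With spread=True the hash is scaled so that the 4096 possible
    values are distributed evenly over the whole table.
    """
    rng = random.Random(42)  # fixed seed so both runs see the same hashes
    table = [None] * table_size
    scale = table_size // (1 << HASH_BITS) if spread else 1
    total = 0
    for key in range(num_keys):
        h = rng.randrange(1 << HASH_BITS)
        slot = (h * scale) % table_size
        probes = 1
        while table[slot] is not None:  # walk forward to the next empty slot
            slot = (slot + 1) % table_size
            probes += 1
        table[slot] = key
        total += probes
    return total / num_keys


packed_probes = average_probes(5000, 16384, spread=False)
spread_probes = average_probes(5000, 16384, spread=True)
print("unscaled hash:", packed_probes, "probes per insert")
print("spread hash:  ", spread_probes, "probes per insert")
```

Both runs see exactly the same hash values, so the number of colliding keys is identical; only the distance you have to scan changes, because the unscaled version piles everything into the first 4096 slots.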
======
How are you tracking the changed objects?
A: Take a look at MaTransaction>>markRead:. MaTransaction maintains an IdentityDictionary whose keys are the read objects and whose values are shallowCopy "backups". When you commit, it zips through the keys and values and does an identity compare of each object's variables (see implementors of maIsChangedFrom:). This may seem like a lot of work, but it's actually surprisingly fast. Additionally, I prefer this "copy and compare" method of change detection because it offers the most transparency.
After I wrote the email, I discovered this...very clever! Since you only need to do the scan when you commit, it's not a horrible price to pay. I was baffled by how Magma was able to detect changes in brand new instances I was creating...I even tried to fool it, but it worked like a charm. It looks like new objects get stored (during commit) when they are referenced from other objects that are already in the db...when they are stored, you then record the shallowCopy backup...that way the object is part of the scan to detect changes. Is that correct?
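Here is how I understand the copy-and-compare idea, sketched in Python rather than Smalltalk. The class and method names (Transaction, mark_read, changed_objects, Account) are invented for illustration and are not Magma's API; the sketch only covers objects that were explicitly marked as read.

```python
import copy


class Transaction:
    """Keeps a shallow-copy backup of every object that was read."""

    def __init__(self):
        # id(obj) -> (obj, backup); keyed by identity, not equality
        self._backups = {}

    def mark_read(self, obj):
        # Record the backup only the first time the object is seen.
        if id(obj) not in self._backups:
            self._backups[id(obj)] = (obj, copy.copy(obj))

    def changed_objects(self):
        """Return objects whose instance variables no longer point
        at the same objects as in their backup (identity compare)."""
        changed = []
        for obj, backup in self._backups.values():
            current, original = vars(obj), vars(backup)
            if current.keys() != original.keys() or any(
                current[name] is not original[name] for name in current
            ):
                changed.append(obj)
        return changed


class Account:  # hypothetical domain class for the example
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance


tx = Transaction()
alice = Account("alice", 100)
bob = Account("bob", 50)
tx.mark_read(alice)
tx.mark_read(bob)
alice.balance = 150          # only alice is modified
dirty = tx.changed_objects()
print([account.owner for account in dirty])
```

Because the compare is by identity on each variable, a collection that is mutated in place would only show up in this sketch if the collection itself had been marked as read.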
========
It takes a long time to establish a MagmaSession (especially after some objects have been populated in the server)...can you describe what's happening when connecting?
A: Hmm... I've never noticed that taking a long time. If you look at MagmaSession>>connectAs:maximumNumberOfChallengers:, it summarizes what happens upon connection. Basically, we have to get the classIds and definitions of all the classes that are known to the repository. The current class definition defined in Magma has to match what is in your image or it won't let you connect.
I haven't posted the Swiki page yet about how to tell Magma to "upgrade" a class to a new version, but you can see examples in MagmaTestCase.
==========
I see that you are using a stubbing mechanism (ala GemStone) that uses a ProtoObject and #doesNotUnderstand: to transparently forward messages to the real object. Are you also using #become: to change these objects into their real counterpart? If so, won't this present a performance issue under certain circumstances (where one or both of the objects are in old space)? Also, did you implement a "stubbing level" mechanism ala GemStone?
A: I use becomeForward:. I've not noticed any performance issues in my "small" volume tests. If this causes a performance problem due to them being in oldspace, do you have an alternative?
Unfortunately no...I've always been leery about solutions that require a #become: (or #becomeForward:) due to the potential performance issue. Squeak has a direct mapped memory model, which means that if you swap the identities of two objects, you must scan all of memory (unless you're dealing with two young objects). You can avoid some of the performance issues if you can bunch up a lot of objects (this is why Squeak's #become: is actually based on Array>>elementsExchangeIdentityWith:). I don't see a way of bunching up a lot of stubs though.
Perhaps I've been a little too quick to dismiss #become: though...if you examine normal usage patterns, it's probably the case that the vast majority of your #become: calls are happening with two young objects...which is fast.
========
Is there any kind of cache control in Magma? For example, if I have a client that is running for many weeks and accessing lots of objects, once they are pulled from the server to the client, are they going to stay in the client indefinitely? Is there some way of controlling how many objects are retained in the client's object memory?
A: The only "caching" Magma does is in weak collections, so there shouldn't be anything that you don't cache yourself.
Avi Bryant wrote: "Right, but once a stub gets replaced with a real object, the real object will never get replaced back with a stub, correct? Which means that, since the session has a reference to the root, and the root has a reference to every other object in the database, once an object is pulled into the client it will never be garbage collected unless it is removed from the database. This can be a problem for long running clients. I think what Stephen is suggesting is that, for example, you periodically becomeForward: objects back into stubs that have not been used for a certain length of time."
Avi summarizes it correctly...the issue I see is that you can eventually get the whole database in memory (if your application runs long enough and works with enough of your dataset). One solution might be to simply restart the image when it grows beyond a certain threshold (I do that right now with swiki.net). Here's a caching design I'm working on:
The algorithm is an approximate LRU and works by counting the number of dereferences (when an object is actually retrieved from the db) and object stores (when an object is saved). Each cached object (actually an object cluster in my case) has an age. After a threshold is reached (say 1000 derefs), I scan through the cached objects (actually clusters), incrementing the age of relatively young references and flushing the objects whose references have reached a maximum age (say 5). When an object is dereferenced, the age gets reset to 0. So, the age actually tells us how many aging scans have occurred since the last time the cluster was dereferenced. Thus, older objects have not been accessed in a while (note: I chose not to implement transparent proxies...you access your cached objects by holding onto a PdbObjectReference, which holds the root of a cached cluster of objects...thus, you must always access your cached objects by sending the #deref message to a PdbObjectReference). Also, I have to check all of the process stacks to make sure that an object is not currently being accessed before flushing it (probably a rare occurrence anyway).
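A stripped-down sketch of the aging scan, in Python for brevity. The names (AgingCache, loader, scan_threshold, max_age) are invented for the example; it counts every deref call rather than only real db retrievals, and it leaves out object stores, clusters, and the process-stack check, so treat it as an outline of the idea and not the actual design.

```python
class AgingCache:
    """Approximate LRU: entries age on periodic scans and are
    flushed once they reach max_age without being dereferenced."""

    def __init__(self, loader, scan_threshold=1000, max_age=5):
        self._loader = loader              # hypothetical fetch-from-db callable
        self._scan_threshold = scan_threshold
        self._max_age = max_age
        self._entries = {}                 # key -> [value, age]
        self._derefs = 0

    def deref(self, key):
        entry = self._entries.get(key)
        if entry is None:                  # cache miss: fetch and cache
            entry = [self._loader(key), 0]
            self._entries[key] = entry
        entry[1] = 0                       # any access resets the age
        self._derefs += 1
        if self._derefs >= self._scan_threshold:
            self._derefs = 0
            self._age_scan()
        return entry[0]

    def _age_scan(self):
        # Age every entry by one scan; flush those that got too old.
        for key in list(self._entries):
            entry = self._entries[key]
            entry[1] += 1
            if entry[1] >= self._max_age:
                del self._entries[key]

    def __contains__(self, key):
        return key in self._entries


cache = AgingCache(loader=lambda key: key * 2, scan_threshold=2, max_age=2)
cache.deref(1)   # loaded
cache.deref(2)   # loaded, triggers first scan: both entries now age 1
cache.deref(2)   # resets the age of 2
cache.deref(2)   # triggers second scan: 1 reaches age 2 and is flushed
print(1 in cache, 2 in cache)
```

With the tiny thresholds used here, the entry that keeps getting dereferenced stays cached while the idle one is flushed after two scans.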
I'm thinking that this type of caching might work well in practice. It doesn't set any hard limits on the number of cached objects...so, if you have a very large and active working set, they should all stay young and stay in memory. A possible improvement might be to add a hard limit on the number of cached objects and have a deref trigger a scan (in addition to the deref counting) if that deref would cause the limit to be exceeded.
- Stephen