MagmaCollections

Chris Muller afunkyobject at yahoo.com
Fri Aug 15 20:41:52 UTC 2003


Hi Jimmie!  I'm happy you're exploring MagmaCollections.  They can be fun and
useful, but also require fair consideration.  Let me see if I can help.

> I made two attempts with Magma. The first committing the 32,201 item 
> Ordered Collection and the 32,201 item Dictionary to the MagmaDB. Took 
> forever. I left, went home, came back next day. The image was at 200mb, 
> the MagmaDB was at 25mb for a 5mb text file. And it hosed my computer and
> I had to restart. I may have done something wrong. :)

Well, sort of.  Commits need to be smaller than that.  However, how can they
be if you need 32K elements in a single OrderedCollection?  You could add 10 at
a time and commit after every 10, but as the collection grows, Magma has to
serialize the entire collection on every commit, so that doesn't really help.
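For concreteness, the "commit every 10" idea would look something like this
sketch.  (mySession, kjvVerses and verses are placeholder names for your own
session, your parsed data and the persistent OrderedCollection, and I'm
assuming the block form of commit: here.)

  | batch |
  batch := OrderedCollection new.
  kjvVerses do: [:each |
      batch add: each.
      batch size = 10 ifTrue:
          ["Each commit still re-serializes the whole, growing collection."
          mySession commit: [verses addAll: batch].
          batch := OrderedCollection new]].
  batch isEmpty ifFalse: [mySession commit: [verses addAll: batch]]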

Depending on what kind of objects you had in this OrderedCollection, keep in
mind that it can amount to a lot MORE objects for Magma.  If they're Strings,
then no problem.  But if they're complex object-graphs, 32000 of them can
amount to far more than 32000 objects that must be serialized, sent over the
network, assigned permanent oids, and written to disk by the server.

Typically, a 32K-element OrderedCollection is going to be expensive to move
around in a multi-user database no matter what.  In Magma, each object is read
as a single buffer, with 8 bytes per oid, so that's 32000*8 bytes just for the
root collection's objectBuffer.  Then it has to do 32000 reads to give you just
the first layer of objects beyond that.  After sending all of that over the
network, your client has to materialize those buffers into objects, which means
mapping objects to oids in Squeak's Dictionaries, which are, unfortunately,
quite slow.  If you profile your commit, I'm sure you'll see 80%-90% of the
time spent in Squeak's WeakKeyIdentityDictionary code.
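If you want to try that profile yourself, MessageTally from the base image is
the easiest way.  (myCollection and aNewVerse are placeholders, and again I'm
assuming the block form of commit:.)

  MessageTally spyOn:
      [mySession commit: [myCollection add: aNewVerse]]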

Thankfully, a future version of Squeak appears likely to address this issue,
and Magma performance will benefit tremendously.  In the meantime, there are
things you can do to speed it up; check out the Magma performance-tuning page
on the Swiki.

For a collection of this size, I think a MagmaCollection would be more
suitable.  The regular part of the object-model is meant for richer, deeper
object graphs.  For example, a model such as:

  Book
    Chapter
      Verse
         Text
      Verse
         Text
      Verse
         Text
    Chapter
      Verse
         Text
      Verse
         Text

This way, no single collection has too many entries in it.
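
For the flat list of 32,201 verses, a rough sketch of loading them into a
MagmaCollection in small committed batches might look like this.  (As before,
mySession and kjvVerses are placeholders, I'm assuming a Dictionary root and
the block form of commit:, and the idea is that each commit should only have
to write the newly added verses rather than the whole collection.)

  | verses batch |
  mySession commit:
      [mySession root at: 'verses' put: MagmaCollection new].
  verses := mySession root at: 'verses'.
  batch := OrderedCollection new.
  kjvVerses do: [:each |
      batch add: each.
      batch size = 100 ifTrue:
          [mySession commit: [batch do: [:v | verses add: v]].
          batch := OrderedCollection new]].
  batch isEmpty ifFalse:
      [mySession commit: [batch do: [:v | verses add: v]]]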

> Last night I did a MagmaCollection for the 32,201 verses. The kjv.magma 
> file is 6267kb the hdx file 14,081. Wow, is it normal for the hdx file 
> to be so much larger than the other file? This time the image never grew 
> past 40mb.

Q:  Why is the .magma file larger than my text file?
A:  Keep in mind that, though your text file is only 5MB, an object
representation of it takes quite a bit more space to store.  Objects carry
additional "fields" such as their oids, a class designation, and even a little
filler information reserved for future use (e.g., security).

Q:  Why is the .hdx file so ridiculously large compared to my text file?
A:  The MaHashIndex (.hdx) file structure pre-allocates "free space" in the
file based on the recordSize you specified when creating the collection (or
the default, if you didn't specify one).  Initially you will see very rapid
file growth, because records sized to hold 100 objects (or however many your
recordSize allows) start out with just one entry; growth then tapers off
considerably as those same records fill up with up to 99 additional objects.
You can tune this by adjusting the keySize and recordSize parameters.

Read the pages about MagmaCollections on the Swiki and let me know if you have
more questions.

 
> I stored KJVverse objects which only had a #verse and #verseIndex 
> attributes.
> 
> I attempted to add an Index based on the #verseIndex with this:
> KJVls is a local session to the database stored in the KJVls class
> variable.
> 
> createVerseIndex
> (self KJVls root at: 'verses') addIndex:
>        (MaSearchStringIndexDefinition
>           attribute: #verseIndex
>           keySize: 64)
> 
> This method appears to succeed but does nothing. No error, no index. :(
> I copied it from the swiki and modified it.

You did *commit* this, right?   :)

You can tell whether an index was created by looking for a new file in the
same directory as the other Magma database files, named
"12345678verseIndex.hdx", where 12345678 is the oid of the collection.

You need to take care not to index high-frequency, unimportant words such as
"the", "of", "to", etc.  You can do it, of course, but updating indexes is one
of the slowest operations in Magma (reading them, however, is very fast).

When you insert the 1000th instance of "the" into a keyword index, it ends up
performing roughly 1000 / yourRecordSize writes to the disk.  So if your
recordSize was 100, you could have up to 100 "the's" before it would have to
create a new record to hold the next batch of "the's".  Worse, once the second
record is created, every new "the" added causes the first record to be updated
(with child-count information, to preserve the at: anIndex functionality) as
well as the second record.  So if you have 1000 "the's", inserting the very
next "the" will require a minimum of 11 writes to the disk.
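
Spelled out as arithmetic (just a throwaway workspace expression, not actual
Magma code):

  | duplicates recordSize |
  duplicates := 1000.
  recordSize := 100.
  (duplicates // recordSize) + 1
      "=> 11: one write per existing record for the child-count updates,
       plus one for the record receiving the new entry"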

Note that this performance issue only occurs for duplicates.  If you have a
good dispersion of hash-keys, you should only need as many writes as it takes
to go straight to each key, which is pretty low.  With a good key-dispersion,
you should see anywhere from 10-100 insertions per second, depending on...
many factors.

Let me know how it goes!

 - Chris


PS - one experiment I'll be attempting soon is a full keyword index of the
entire Squeak mailing list into a MagmaCollection.  My first two attempts at
this were unsuccessful, which is how I discovered the cost of duplicates.  :) 
When I get back around to it, I'll let you know how it went.

PPS - during this experiment, another thing I remember having to balance was
the cost of letting my session's oidMaps (which are invisible to you) get too
big vs. the cost of using stubOut:.  stubOut: requires a
"Dictionary>>removeKey:", which is exceptionally slow in Squeak, as well as a
become:, while not stubbing out lets the IdentityDictionaries grow beyond the
12-bit identityHash range before the garbage-collector kicks in to keep them
small.  It's a balancing act, but you can also use mySession>>finalizeOids to
help with that.
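
Roughly, the two options look like this (aBranchIAmDoneWith stands for
whatever part of your model you no longer need in memory; the exact behavior
may vary with your Magma version):

  mySession stubOut: aBranchIAmDoneWith.  "frees client memory, but costs a
                                           removeKey: and a become:"
  mySession finalizeOids                  "or let the session tidy its oidMaps"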
