MagmaCollections
Jimmie Houchin
jhouchin at texoma.net
Fri Aug 15 22:01:46 UTC 2003
Hello Chris,
Thanks for the reply.
Chris Muller wrote:
> Hi Jimmie! I'm happy you're exploring MagmaCollections. They can be fun and
> useful, but also require fair consideration. Let me see if I can help.
>
>>I made two attempts with Magma. The first committed the 32,201-item
>>OrderedCollection and the 32,201-item Dictionary to the MagmaDB. It took
>>forever. I left, went home, and came back the next day. The image was at 200mb,
>>the MagmaDB was at 25mb for a 5mb text file. And it hosed my computer and
>>I had to restart. I may have done something wrong. :)
>
> Well, sort of. Commits need to be smaller than that. However, how can they be
> if you need 32K elements in a single OrderedCollection? You could do 10 at a
> time and commit every 10. But as it grows, it has to serialize the entire
> collection every time, so no good there.
I'm sorry I wasn't clear. I did a commit on every #add:. So it was along
the lines of 64467 commits. :)
I really wasn't trying to be abusive, honest. :)
> Depending on what kind of objects you had in this OrderedCollection, keep in
> mind that it can amount to a lot MORE objects for Magma. If they're Strings,
> then no problem. But 32000 complex object-graphs can amount to a lot more than
> 32000 objects serialized, sent over the network, assigned permanent oids, and
> written to disk by the server.
>
> Typically, a 32K element OrderedCollection is going to be expensive to move
> around in a multi-user database no matter what. In Magma, all single objects
> are read as a single buffer, with 8-bytes per oid, so that's 32000*8 just for
> the root collection objectBuffer. Then, it has to do 32000 reads just to give
> you the first layer of objects beyond that. After sending all that over
> the network, your client then has to materialize those buffers into objects,
> which means mapping objects to oids in Squeak's Dictionaries, which are,
> unfortunately, quite slow. If you profile your commit, I'm sure you'll see
> 80%-90% of the time spent in Squeak's WeakKeyIdentityDictionary code.
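> If you want to see that for yourself, Squeak's MessageTally can profile an
> expression. A sketch (session, verses, and aVerse here are just illustrative
> names for your connected MagmaSession, your collection, and one element):
>
>   MessageTally spyOn: [session commit: [verses add: aVerse]]
>
> The resulting tally tree will show where the time goes.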
>
> Thankfully, a future version of Squeak appears likely to address this issue,
> and Magma performance will benefit tremendously. In the meantime, there are
> things you can do to speed it up. Check out the Magma performance-tuning page
> on the Swiki.
>
> For this size of collection, I think a MagmaCollection would be more suitable.
> The regular part of the object-model is meant for richer, deeper object graphs.
> For example, a model such as:
>
> Book
>   Chapter
>     Verse
>       Text
>     Verse
>       Text
>     Verse
>       Text
>   Chapter
>     Verse
>       Text
>     Verse
>       Text
> This way, no single collection has too many entries in it.
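> A minimal sketch of that model in plain Squeak (class and variable names here
> are just illustrative, not anything Magma requires):
>
>   Object subclass: #Book
>     instanceVariableNames: 'title chapters'
>     classVariableNames: ''
>     poolDictionaries: ''
>     category: 'KJV-Model'.
>
>   Object subclass: #Chapter
>     instanceVariableNames: 'number verses'
>     classVariableNames: ''
>     poolDictionaries: ''
>     category: 'KJV-Model'.
>
>   Object subclass: #Verse
>     instanceVariableNames: 'number text'
>     classVariableNames: ''
>     poolDictionaries: ''
>     category: 'KJV-Model'.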
Gotcha, I'll give it a go.
From reading the Swiki it seemed a MagmaCollection would offer
improvements. But I would have to do things differently since what I had
was merely a collection of Strings and not really verse objects.
I'll redo the code to reflect your above model.
After that I'll commit the index. :)
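This time I'll also try adding in batches instead of committing on every
#add:. A rough sketch of what I have in mind (assuming session is my
MagmaSession, verses is the MagmaCollection, the session understands #commit:
with a block as in the Swiki examples, and the collection understands #addAll: --
otherwise #add: in a loop inside the commit block):

  | batch |
  batch := OrderedCollection new.
  allVerses do: [:each |
    batch add: each.
    batch size = 100 ifTrue: [
      session commit: [verses addAll: batch].
      batch := OrderedCollection new]].
  batch isEmpty ifFalse: [session commit: [verses addAll: batch]]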
>>Last night I did a MagmaCollection for the 32,201 verses. The kjv.magma
>>file is 6,267kb and the hdx file 14,081kb. Wow, is it normal for the hdx file
>>to be so much larger than the other file? This time the image never grew
>>past 40mb.
>
> Q: Why is the .magma file larger than my text file?
> A: Keep in mind that though your text file is only 5MB, it takes quite a bit
> more storage to store an object representation of that text file. Objects have
> additional "fields" such as the oids, class designation, and even a little
> filler information for future use (e.g., security).
>
> Q: Why is the .hdx file so ridiculously large compared to my text file?
> A: The structure of the MaHashIndex (.hdx) file has "free-space" available in
> the file based on the recordSize you specified when creating the collection (or
> perhaps you used the default). Initially, you will see very rapid file
> growth as records designed to hold 100 objects (or however many your
> recordSize defines) start out with just one entry; growth then tapers off
> considerably as those same records are reused to hold up to 99 additional
> objects. You can tune this by adjusting your keySize and recordSize
> parameters.
>
> Read the pages about MagmaCollections on the Swiki and let me know if you have
> more questions.
I have been reading over and over and over.
Every time I want to do something I read and try to go by the instructions.
>>I stored KJVverse objects which only had a #verse and #verseIndex
>>attributes.
>>
>>I attempted to add an Index based on the #verseIndex with this:
>>KJVls is a local session to the database stored in the KJVls class
>>variable.
>>
>>createVerseIndex
>>(self KJVls root at: 'verses') addIndex:
>> (MaSearchStringIndexDefinition
>> attribute: #verseIndex
>> keySize: 64)
>>
>>This method appears to succeed but does nothing. No error, no index. :(
>>I copied it from the swiki and modified it.
>
> You did *commit* this, right? :)
Well uh, hmm, (shuffles feet and looks around sheepishly), uh, no. :)
I guess that will make a difference. I didn't realize that needed to be
committed.
I'll give that a try tomorrow.
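Something like this, I assume (with KJVls answering my session, and assuming
the session understands #commit: with a block, as the Swiki examples show):

createVerseIndex
	self KJVls commit: [
		(self KJVls root at: 'verses') addIndex:
			(MaSearchStringIndexDefinition
				attribute: #verseIndex
				keySize: 64)]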
> You can know if an index was created by looking for a new file in your
> filesystem in the same place as the other Magma database files, called
> "12345678verseIndex.hdx" where 12345678 is the oid of the collection.
>
> You need to take care not to index highly-frequent, unimportant words such as
> "the", "of", "to", etc. You can do it, of course, but updating indexes is one
> of the slowest operations for Magma (reading them is very fast, however).
How do you go about limiting words being indexed?
I do have a list of stop words.
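Maybe I can strip them out before they ever reach the indexed text. A rough
sketch using my list (stopWords is assumed to be a Set of lowercase Strings,
and aString one verse's text):

  | stopWords keepWords |
  stopWords := Set withAll: #('the' 'of' 'to' 'and' 'in' 'a').
  keepWords := aString substrings
    reject: [:w | stopWords includes: w asLowercase].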
> When you insert the 1000th instance of "the" into a keyword index, it ends up
> performing 1000 / yourRecordSize number of writes to the disk. So if your
> recordSize was 100, you could have up to 100 "the's" before it would have to
> create a new record to hold the next batch of the's. Worse, once the second
> record is created, every new "the" added causes the first record to need to be
> updated (with child-count information, so as to preserve the at: anIndex
> functionality) as well as the second record. So if you have 1000 "the's", to
> insert the very next "the" will require a minimum of 11 writes to the disk.
>
> Note that this performance issue only occurs for duplicates. If you have a
> good dispersion of hash-keys, you should only need as many writes as it takes
> to go straight to that key, which is pretty low. With a good key-dispersion,
> you should see anywhere from 10-100 insertions per second, depending on many
> factors.
I don't know what good dispersion means or how to perform or ensure
such. Help.
> Let me know how it goes!
Will do.
> - Chris
>
> PS - one experiment I'll be attempting soon is a full keyword index of the
> entire Squeak mailing list into a MagmaCollection. My first two attempts at
> this were unsuccessful, which is how I discovered the cost of duplicates. :)
> When I get back around to it, I'll let you know how it went.
Great, I'll definitely be interested in how this goes and how to do it.
> PPS - during this experiment, another thing I remember having to balance was
> the cost of letting my sessions oidMaps (which are invisible to you) get too
> big vs. the cost of using stubOut:. stubOut: requires use of
> "Dictionary>>removeKey:" which is exceptionally slow in Squeak as well as a
> become:, while not stubbing out causes the IdentityDictionaries to grow beyond
> 12-bits before the garbage-collector kicks in to keep them small. It's a
> balancing act, but you can also use mySession>>finalizeOids to help with that.
Thanks for the information and help.
Jimmie Houchin