MagmaCollections
Jimmie Houchin
jhouchin at texoma.net
Fri Aug 15 22:01:46 UTC 2003
Hello Chris,
Thanks for the reply.
Chris Muller wrote:
> Hi Jimmie! I'm happy you're exploring MagmaCollections. They can be fun and
> useful, but also require fair consideration. Let me see if I can help.
>
>>I made two attempts with Magma. The first committed the 32,201-item
>>OrderedCollection and the 32,201-item Dictionary to the MagmaDB. It took
>>forever. I left, went home, and came back the next day. The image was at 200mb,
>>the MagmaDB was at 25mb for a 5mb text file. And it hosed my computer and
>>I had to restart. I may have done something wrong. :)
>
> Well, sort of. Commits need to be smaller than that. However, how can they be
> if you need 32K elements in a single OrderedCollection? You could do 10 at a
> time and commit every 10. But as it grows, it has to serialize the entire
> collection every time, so no good there.
I'm sorry I wasn't clear. I did a commit on every #add:. So it was along
the lines of 64467 commits. :)
I really wasn't trying to be abusive, honest. :)
> Depending on what kind of objects you had in this OrderedCollection, keep in
> mind that it can amount to a lot MORE objects for Magma. If they're Strings,
> then no problem. But 32000 complex object-graphs can amount to a lot more than
> 32000 objects serialized, sent over the network, assigned permanent oids, and
> written to disk by the server.
>
> Typically, a 32K element OrderedCollection is going to be expensive to move
> around in a multi-user database no matter what. In Magma, all single objects
> are read as a single buffer, with 8-bytes per oid, so that's 32000*8 just for
> the root collection objectBuffer. Then, it has to do 32000 reads just to give
> you the first layer of objects beyond that. After sending all that over
> the network, your client then has to materialize those buffers into objects,
> which means mapping objects to oids in Squeak's Dictionaries, which are,
> unfortunately, quite slow. If you profile your commit, I'm sure you'll see
> 80%-90% of the time spent in Squeak's WeakKeyIdentityDictionary code.
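> If you want to see that for yourself, Squeak's MessageTally can profile an
> expression. A sketch (session, verses, and aVerse here are just illustrative
> names for your connected MagmaSession, your collection, and one element):
>
>   MessageTally spyOn: [session commit: [verses add: aVerse]]
>
> The resulting tally tree will show where the time goes.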
>
> Thankfully, a future version of Squeak appears likely to address this issue,
> and Magma performance will benefit tremendously. In the meantime, there are
> things you can do to speed it up. Check out the Magma performance-tuning page
> on the Swiki.
>
> For this size of collection, I think a MagmaCollection would be more suitable.
> The regular part of the object-model is meant for richer, deeper object graphs.
> For example, a model such as:
>
> Book
>   Chapter
>     Verse
>       Text
>     Verse
>       Text
>     Verse
>       Text
>   Chapter
>     Verse
>       Text
>     Verse
>       Text
> This way, no single collection has too many entries in it.
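> A minimal sketch of that model in plain Squeak (class and variable names here
> are just illustrative, not anything Magma requires):
>
>   Object subclass: #Book
>     instanceVariableNames: 'title chapters'
>     classVariableNames: ''
>     poolDictionaries: ''
>     category: 'KJV-Model'.
>
>   Object subclass: #Chapter
>     instanceVariableNames: 'number verses'
>     classVariableNames: ''
>     poolDictionaries: ''
>     category: 'KJV-Model'.
>
>   Object subclass: #Verse
>     instanceVariableNames: 'number text'
>     classVariableNames: ''
>     poolDictionaries: ''
>     category: 'KJV-Model'.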
Gotcha, I'll give it a go.
From reading the Swiki it seemed a MagmaCollection would offer
improvements. But I would have to do things differently since what I had
was merely a collection of Strings and not really verse objects.
I'll redo the code to reflect your above model.
After that I'll commit the index. :)
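This time I'll also try adding in batches instead of committing on every
#add:. A rough sketch of what I have in mind (assuming session is my
MagmaSession, verses is the MagmaCollection, the session understands #commit:
with a block as in the Swiki examples, and the collection understands #addAll: --
otherwise #add: in a loop inside the commit block):

  | batch |
  batch := OrderedCollection new.
  allVerses do: [:each |
    batch add: each.
    batch size = 100 ifTrue: [
      session commit: [verses addAll: batch].
      batch := OrderedCollection new]].
  batch isEmpty ifFalse: [session commit: [verses addAll: batch]]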
>>Last night I did a MagmaCollection for the 32,201 verses. The kjv.magma
>>file is 6,267kb and the hdx file 14,081kb. Wow, is it normal for the hdx file
>>to be so much larger than the other file? This time the image never grew
>>past 40mb.
>
> Q: Why is the .magma file larger than my text file?
> A: Keep in mind that though your text file is only 5MB, it takes quite a bit
> more storage to store an object representation of that text file. Objects have
> additional "fields" such as the oids, class designation, and even a little
> filler information for future use (e.g., security).
>
> Q: Why is the .hdx file so ridiculously large compared to my text file?
> A: The structure of the MaHashIndex (.hdx) file has "free-space" available in
> the file based on the recordSize you specified when creating the collection (or
> perhaps you used the default). Initially, you will see very rapid file
> growth as records designed to hold 100 objects (or however many your
> recordSize defines) start out with just one entry; growth then tapers off
> considerably as those same records are reused to hold up to 99 additional
> objects. You can tune this by adjusting your keySize and recordSize
> parameters.
>
> Read the pages about MagmaCollections on the Swiki and let me know if you have
> more questions.
I have been reading over and over and over.
Every time I want to do something I read and try to go by the instructions.
>>I stored KJVverse objects which only had a #verse and #verseIndex
>>attributes.
>>
>>I attempted to add an Index based on the #verseIndex with this:
>>KJVls is a local session to the database stored in the KJVls class
>>variable.
>>
>>createVerseIndex
>>(self KJVls root at: 'verses') addIndex:
>> (MaSearchStringIndexDefinition
>> attribute: #verseIndex
>> keySize: 64)
>>
>>This method appears to succeed but does nothing. No error, no index. :(
>>I copied it from the swiki and modified it.
>
> You did *commit* this, right? :)
Well uh, hmm, (shuffles feet and looks around sheepishly), uh, no. :)
I guess that will make a difference. I didn't realize that needed to be
committed.
I'll give that a try tomorrow.
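Something like this, I assume (with KJVls answering my session, and assuming
the session understands #commit: with a block, as the Swiki examples show):

createVerseIndex
	self KJVls commit: [
		(self KJVls root at: 'verses') addIndex:
			(MaSearchStringIndexDefinition
				attribute: #verseIndex
				keySize: 64)]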
> You can know if an index was created by looking for a new file in your
> filesystem in the same place as the other Magma database files, called
> "12345678verseIndex.hdx" where 12345678 is the oid of the collection.
>
> You need to take care not to index highly-frequent, unimportant words such as
> "the", "of", "to", etc. You can do it, of course, but updating indexes is one
> of the slowest operations for Magma (reading them is very fast, however).
How do you go about limiting words being indexed?
I do have a list of stop words.
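Maybe I can strip them out before they ever reach the indexed text. A rough
sketch using my list (stopWords is assumed to be a Set of lowercase Strings,
and aString one verse's text):

  | stopWords keepWords |
  stopWords := Set withAll: #('the' 'of' 'to' 'and' 'in' 'a').
  keepWords := aString substrings
    reject: [:w | stopWords includes: w asLowercase].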
> When you insert the 1000th instance of "the" into a keyword index, it ends up
> performing 1000 / yourRecordSize number of writes to the disk. So if your
> recordSize was 100, you could have up to 100 "the's" before it would have to
> create a new record to hold the next batch of the's. Worse, once the second
> record is created, every new "the" added causes the first record to need to be
> updated (with child-count information, so as to preserve the at: anIndex
> functionality) as well as the second record. So if you have 1000 "the's", to
> insert the very next "the" will require a minimum of 11 writes to the disk.
>
> Note that this performance issue only occurs for duplicates. If you have a
> good dispersion of hash-keys, you should only need as many writes as it takes
> to go straight to that key, which is pretty low. With a good key-dispersion,
> you should see anywhere from 10-100 insertions per second, depending on many
> factors.
I don't know what good dispersion means or how to perform or ensure
such. Help.
> Let me know how it goes!
Will do.
> - Chris
>
> PS - one experiment I'll be attempting soon is a full keyword index of the
> entire Squeak mailing list into a MagmaCollection. My first two attempts at
> this were unsuccessful, which is how I discovered the cost of duplicates. :)
> When I get back around to it, I'll let you know how it went.
Great, I'll definitely be interested in how this goes and how to do it.
> PPS - during this experiment, another thing I remember having to balance was
> the cost of letting my sessions oidMaps (which are invisible to you) get too
> big vs. the cost of using stubOut:. stubOut: requires use of
> "Dictionary>>removeKey:" which is exceptionally slow in Squeak as well as a
> become:, while not stubbing out causes the IdentityDictionaries to grow beyond
> 12-bits before the garbage-collector kicks in to keep them small. It's a
> balancing act, but you can also use mySession>>finalizeOids to help with that.
Thanks for the information and help.
Jimmie Houchin