hard-drive read-performance

Mon Nov 29 15:42:52 UTC 2010

Yep, it's called MagmaCompressor.

But still, if you have a multi-gigabyte repository and an HD buffer of,
what, a few K? or even a client-cache of 100MB, there is no way
around needing to read off the HD.

On Sun, Nov 28, 2010 at 8:31 PM, Elliot Finley <efinley.lists at gmail.com> wrote:
> maybe a defrag utility for Magma that places all objects in a collection
> close together on the disk?
>
> On Wed, Nov 24, 2010 at 10:00 AM, Chris Muller <ma.chris.m at gmail.com> wrote:
>>
>> When reading any object off the hard drive (represented as the
>> 'byteArray' of a single MaObjectBuffer), Magma always reads 280 bytes.
>>  Since the #physicalSize is in the object header, it is then able to
>> check the contents of the buffer to determine the size of the whole
>> object and, if necessary, read more bytes in order to get the whole
>> object.  See MaObjectFiler>>#read:bytesInto:and:startingAt:filePosition:
>> for this behavior.
>>
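A minimal sketch of that two-step read, written in Python rather than Squeak so it runs without a Magma image. The header layout here (a 4-byte big-endian physical size at the start of each buffer) and the names `read_object` and `make_object` are assumptions for illustration only; Magma's real header and MaObjectFiler's real logic may well differ.

```python
import io
import struct

TRACK_SIZE = 280              # size of the first read, as quoted in the message
HEADER = struct.Struct(">I")  # ASSUMPTION: 4-byte physical size, not Magma's real header


def read_object(stream, position, track_size=TRACK_SIZE):
    """Return (object_bytes, number_of_reads) for the buffer at `position`."""
    stream.seek(position)
    buf = stream.read(track_size)            # first read: always track_size bytes
    if len(buf) < HEADER.size:
        raise EOFError("no complete header at this position")
    (physical_size,) = HEADER.unpack_from(buf)
    reads = 1
    if physical_size > len(buf):             # object did not fit in the first read
        buf += stream.read(physical_size - len(buf))  # second read: just the remainder
        reads = 2
    if len(buf) < physical_size:
        raise EOFError("file ends before the object does")
    return buf[:physical_size], reads


def make_object(payload):
    """Build a hypothetical buffer: size header followed by the payload."""
    return HEADER.pack(HEADER.size + len(payload)) + payload


small = make_object(b"a" * 100)     # fits inside one 280-byte read
large = make_object(b"b" * 1000)    # needs a second read
f = io.BytesIO(small + large)
small_bytes, small_reads = read_object(f, 0)
large_bytes, large_reads = read_object(f, len(small))
```

The first read deliberately over-reads small objects; the extra bytes are simply discarded once the header says where the object ends.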
>> 280 bytes is enough for about 40 pointer references, allowing most
>> objects to be read in just one disk access.  I refer to it as the
>> #trackSize, to remind me it is supposed to be how many bytes I think
>> the HD can read in one operation without overrunning its own internal
>> buffers and becoming inefficient.  I was curious whether this number
>> is optimized in 2010, so I ran the following script:
>>
>> -----------
>> |stats random| stats:=OrderedCollection new. random := Random new.
>> (FileDirectory on: '/home/cmm/test3/cube.001.magma') fileNamed:
>> 'objects.2.dat' do:
>>        [ : stream | | ba fileSize | ba := ByteArray new: 10000.
>>        fileSize := stream size.
>>        100 to: 10000 by: 100 do:
>>                [ : n |
>>                stream position: 0.
>>                Transcript cr; show: (stats add: n->([stream
>>                                maRead: n "bytes"
>>                                bytesFromPosition: 1
>>                                of: ba
>>                                atFilePosition: (random nextInt: fileSize) ]
>> bench)) ]].
>> stats
>> ------------
>>
>> Note that "objects.2.dat" is a real Magma file, 1.8GB in size.  The
>> goal of the script is to bench how fast Squeak can read object buffers
>> off the hard-drive when we obviously won't get many (if any) HD cache
>> hits.
>>
>> I have a cheap Western Digital Caviar HD, which produced the following
>> output:
>>
>> 100->'119 per second.'
>> 200->'98.5 per second.'
>> 300->'106 per second.'
>> 400->'106 per second.'
>> 500->'101 per second.'
>> 600->'102 per second.'
>> 700->'99.9 per second.'
>> 800->'103 per second.'
>> 900->'104 per second.'
>> 1000->'99 per second.'
>> 1100->'97.9 per second.'
>> 1200->'104 per second.'
>> 1300->'111 per second.'
>> 1400->'99.8 per second.'
>> 1500->'107 per second.'
>> 1600->'108 per second.'
>> 1700->'95.6 per second.'
>> 1800->'103 per second.'
>> 1900->'108 per second.'
>> 2000->'102 per second.'
>> 2100->'103 per second.'
>> 2200->'107 per second.'
>> ...
>> 3000->'98.7 per second.'
>> 4000->'102 per second.'
>> 5000->'106 per second.'
>> 6000->'104 per second.'
>> 7000->'101 per second.'
>> 8000->'102 per second.'
>> 9000->'102 per second.'
>> 10000->'107 per second.'
>>
>> Out of curiosity, I also modified the script to read very small buffers
>> from the HD.  Here are the results:
>>
>> 4->'137 per second.'
>> 12->'146 per second.'
>> 20->'154 per second.'
>> 28->'143 per second.'
>>
>> (The HD busy light was solid ON during the test).
>>
>> At first I was puzzled, because Magma has demonstrated much faster
>> objects-per-second read rates than these, even including
>> materialization.  What gives?
>>
>> It's the HD buffering.  Most of the time, objects are "clustered"
>> closely together, so that reading one object brings the "next" object
>> to be read into the HD's buffer as well.  Here's the same
>> script, except reading mostly "sequentially" through the file instead
>> of from a random location:
>>
>> |stats random nextPos| stats:=OrderedCollection new. random := Random new.
>> nextPos:=100.
>> (FileDirectory on: '/home/cmm/test3/cube.001.magma') fileNamed:
>> 'objects.2.dat' do:
>>        [ : stream | | ba fileSize | ba := ByteArray new: 10000.
>>        fileSize := stream size.
>>        #(4 12 20 28 100 200 300 400 500) do:
>>                [ : n |
>>                stream position: 0.
>>                Transcript cr; show: (stats add: n->([stream
>>                                maRead: n "bytes"
>>                                bytesFromPosition: 1
>>                                of: ba
>>                                atFilePosition: ("random nextInt: fileSize"
>> (nextPos :=
>> nextPos+n+10)) ] bench)) ]].
>> stats
>>
>> Now look at the results:
>>
>> "Reading sequentially rather than at a random position."
>> 4->'1,160,000 per second.'
>> 12->'1,210,000 per second.'
>> 20->'1,100,000 per second.'
>> 28->'973,000 per second.'
>> ...
>> 100->'1,030,000 per second.'
>> 200->'321,000 per second.'
>> 300->'215,000 per second.'
>> 400->'160,000 per second.'
>> 500->'227,000 per second.'
>>
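For anyone wanting to try the same random-versus-sequential comparison outside Squeak, here is a rough Python sketch. It is not the Magma script: `reads_per_second`, `random_positions` and `sequential_positions` are hypothetical helpers, and the 1MB temporary file is only there so the example runs anywhere. A file that small sits entirely in the OS cache, so the two rates will be close; to see the seek cost you would presumably have to point it at a file much larger than RAM, like the 1.8GB one above.

```python
import os
import random
import tempfile
import time


def reads_per_second(path, n_bytes, next_position, duration=0.05):
    """Count how many n_bytes reads complete in roughly `duration` seconds."""
    count = 0
    with open(path, "rb", buffering=0) as f:   # unbuffered: each read is an OS call
        start = time.perf_counter()
        deadline = start + duration
        while time.perf_counter() < deadline:
            f.seek(next_position())
            f.read(n_bytes)
            count += 1
        elapsed = time.perf_counter() - start
    return count / elapsed


def random_positions(file_size, n_bytes, rng):
    """Positions drawn uniformly from the whole file."""
    return lambda: rng.randrange(file_size - n_bytes)


def sequential_positions(file_size, n_bytes, start=100, gap=10):
    """Positions that advance by n_bytes + gap, like nextPos + n + 10 in the script."""
    state = {"pos": start}

    def next_position():
        state["pos"] += n_bytes + gap
        if state["pos"] + n_bytes > file_size:   # wrap around on a small test file
            state["pos"] = start
        return state["pos"]

    return next_position


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1 << 20))               # 1MB stand-in for objects.2.dat
    path = tmp.name
file_size = os.path.getsize(path)
rng = random.Random(0)

results = {}
for n in (4, 100, 500):
    random_rate = reads_per_second(path, n, random_positions(file_size, n, rng))
    sequential_rate = reads_per_second(path, n, sequential_positions(file_size, n))
    results[n] = (random_rate, sequential_rate)
os.remove(path)
```

`results` maps each read size to a (random, sequential) pair of reads per second, the same shape as the two tables above.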
>> Conclusions:
>>
>>  - Hard-disk seek is definitely a bottleneck with Magma, or any
>> Squeak application that requires random access to a file.
>>  - When objects are clustered closely together, read performance can
>> be dramatically better.
>>  - HDs with fast seek times, such as newer solid-state drives, might
>> perform dramatically better.
>>  - I should consider reducing the trackSize from 280 bytes to ~100
>> bytes (or making it customizable), because the rate drops really fast
>> after that, and even when a second read is required it could still be
>> faster than an initial read.
>>
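The trackSize trade-off in the last conclusion can be sketched with a little arithmetic: a smaller first read pulls fewer bytes per object, but more objects need a second read. The object sizes below are made up for illustration, not measured from any Magma repository, and `summarize` is a hypothetical helper; real numbers would have to come from the #physicalSize values in an actual file.

```python
def reads_needed(object_size, track_size):
    """One read if the object fits in the first read, otherwise two."""
    return 1 if object_size <= track_size else 2


def summarize(object_sizes, track_size):
    """Tally reads and first-read bytes for a given trackSize."""
    return {
        "track_size": track_size,
        "total_reads": sum(reads_needed(s, track_size) for s in object_sizes),
        "second_reads": sum(1 for s in object_sizes if s > track_size),
        "first_read_bytes": track_size * len(object_sizes),
    }


# HYPOTHETICAL object sizes in bytes, chosen only to show the calculation.
sizes = [40, 64, 96, 120, 200, 260, 300, 1200]
summary_100 = summarize(sizes, 100)
summary_280 = summarize(sizes, 280)
```

Whether the extra second reads are cheap enough depends on how often they land in the HD's buffer, which is what the sequential numbers above suggest but do not prove.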
>>  - Chris
>> _______________________________________________
>> Magma mailing list
>> Magma at lists.squeakfoundation.org
>> http://lists.squeakfoundation.org/mailman/listinfo/magma
>
>