hard-drive read-performance
Chris Muller
ma.chris.m at gmail.com
Mon Nov 29 15:42:52 UTC 2010
Yep, it's called MagmaCompressor.
But still, if you have a multi-gigabyte repository and an HD buffer of,
what, a few K, or even a client cache of 100MB, there is no way
around needing to read off the HD..
On Sun, Nov 28, 2010 at 8:31 PM, Elliot Finley <efinley.lists at gmail.com> wrote:
> maybe a defrag utility for Magma that places all objects in a collection
> close together on the disk?
>
> On Wed, Nov 24, 2010 at 10:00 AM, Chris Muller <ma.chris.m at gmail.com> wrote:
>>
>> When reading any object off the hard drive (represented as the
>> 'byteArray' of a single MaObjectBuffer), Magma always reads 280 bytes.
>> Since the #physicalSize is in the object header, it is then able to
>> check the contents of the buffer to determine the size of the whole
>> object and, if necessary, read more bytes in order to get the whole
>> object. See MaObjectFiler>>#read:bytesInto:and:startingAt:filePosition:
>> for this behavior.
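
For readers outside Squeak, the two-phase read described above can be sketched in Python. This is an illustration only, not Magma's code: the 4-byte big-endian physicalSize prefix is a hypothetical stand-in for Magma's real object-header layout, and `os.pread` stands in for the stream read primitives.

```python
import os
import struct

TRACK_SIZE = 280  # fixed first-read size, as in Magma

def read_object(fd, file_position):
    """Read one object using at most two reads: a fixed-size first read,
    then a follow-up read only if the object is larger than TRACK_SIZE.
    Assumes a hypothetical header: a 4-byte big-endian physicalSize
    prefix counting the whole object, header included."""
    buf = os.pread(fd, TRACK_SIZE, file_position)
    physical_size = struct.unpack('>I', buf[:4])[0]
    if physical_size > len(buf):
        # One more read fetches the remainder of the object.
        buf += os.pread(fd, physical_size - len(buf),
                        file_position + len(buf))
    return buf[:physical_size]
```

Since most objects fit inside the first fixed-size read, the second pread is rarely needed, which is exactly the property the trackSize is tuned for.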
>>
>> 280 bytes is enough for about 40 pointer references, allowing most
>> objects to be read in just one disk access. I refer to it as the
>> #trackSize, to remind me it is supposed to be how many bytes I think
>> the HD can read in one operation without overrunning its own internal
>> buffers and becoming inefficient. I was curious whether this number
>> is still optimal in 2010, so I ran the following script:
>>
>> -----------
>> |stats random| stats := OrderedCollection new. random := Random new.
>> (FileDirectory on: '/home/cmm/test3/cube.001.magma') fileNamed:
>> 'objects.2.dat' do:
>> [ : stream | | ba fileSize | ba := ByteArray new: 10000.
>> fileSize := stream size.
>> 100 to: 10000 by: 100 do:
>> [ : n |
>> stream position: 0.
>> Transcript cr; show: (stats add: n->([stream
>> maRead: n "bytes"
>> bytesFromPosition: 1
>> of: ba
>> atFilePosition: (random nextInt: fileSize) ]
>> bench)) ]].
>> stats
>> ------------
>>
>> Note that "objects.2.dat" is a real Magma file, 1.8GB in size. The
>> goal of the script is to bench how fast Squeak can read object buffers
>> off the hard drive when we obviously won't get many (if any) HD cache
>> hits.
>>
>> I have a cheap Western Digital Caviar HD, which produced the following
>> output:
>>
>> 100->'119 per second.'
>> 200->'98.5 per second.'
>> 300->'106 per second.'
>> 400->'106 per second.'
>> 500->'101 per second.'
>> 600->'102 per second.'
>> 700->'99.9 per second.'
>> 800->'103 per second.'
>> 900->'104 per second.'
>> 1000->'99 per second.'
>> 1100->'97.9 per second.'
>> 1200->'104 per second.'
>> 1300->'111 per second.'
>> 1400->'99.8 per second.'
>> 1500->'107 per second.'
>> 1600->'108 per second.'
>> 1700->'95.6 per second.'
>> 1800->'103 per second.'
>> 1900->'108 per second.'
>> 2000->'102 per second.'
>> 2100->'103 per second.'
>> 2200->'107 per second.'
>> ...
>> 3000->'98.7 per second.'
>> 4000->'102 per second.'
>> 5000->'106 per second.'
>> 6000->'104 per second.'
>> 7000->'101 per second.'
>> 8000->'102 per second.'
>> 9000->'102 per second.'
>> 10000->'107 per second.'
>>
>> Out of curiosity, I also modified the script to read very small
>> buffers from the HD; here are the results:
>>
>> 4->'137 per second.'
>> 12->'146 per second.'
>> 20->'154 per second.'
>> 28->'143 per second.'
>>
>> (The HD busy light was solid ON during the test).
>>
>> At first I was puzzled, because Magma has demonstrated much faster
>> objects-per-second read rates than these, even including
>> materialization. What gives?
>>
>> It's the HD buffering. Most of the time, objects are "clustered"
>> closely together, so that reading one object leaves the "next" object
>> to be read already sitting in the HD's buffer. Here's the same
>> script, except reading mostly "sequentially" through the file instead
>> of from a random location:
>>
>> |stats random nextPos| stats:=OrderedCollection new. random := Random new.
>> nextPos:=100.
>> (FileDirectory on: '/home/cmm/test3/cube.001.magma') fileNamed:
>> 'objects.2.dat' do:
>> [ : stream | | ba fileSize | ba := ByteArray new: 10000.
>> fileSize := stream size.
>> #(4 12 20 28 100 200 300 400 500) do:
>> [ : n |
>> stream position: 0.
>> Transcript cr; show: (stats add: n->([stream
>> maRead: n "bytes"
>> bytesFromPosition: 1
>> of: ba
>> atFilePosition: ("random nextInt: fileSize"
>> (nextPos :=
>> nextPos+n+10)) ] bench)) ]].
>> stats
>>
>> Now look at the results:
>>
>> "Reading sequentially rather than at a random position."
>> 4->'1,160,000 per second.'
>> 12->'1,210,000 per second.'
>> 20->'1,100,000 per second.'
>> 28->'973,000 per second.'
>> ...
>> 100->'1,030,000 per second.'
>> 200->'321,000 per second.'
>> 300->'215,000 per second.'
>> 400->'160,000 per second.'
>> 500->'227,000 per second.'
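
The random-versus-sequential contrast can be reproduced outside Squeak as well. Below is a rough Python sketch, not the author's code: it assumes a path to some existing large file (ideally much larger than RAM, otherwise the OS page cache hides the seek cost), uses `os.pread` in place of the Squeak stream primitives, and mimics the two scripts' access patterns, including the second script's `nextPos := nextPos + n + 10` stepping.

```python
import os
import random
import time

def reads_per_second(fd, n_bytes, positions, duration=0.5):
    """Time pread() calls of n_bytes each at the given file positions;
    return the achieved reads-per-second rate."""
    count = 0
    start = time.monotonic()
    deadline = start + duration
    for pos in positions:
        os.pread(fd, n_bytes, pos)
        count += 1
        if time.monotonic() >= deadline:
            break
    return count / (time.monotonic() - start)

def bench(path, n_bytes=280, samples=5000):
    """Compare random-position reads against mostly-sequential reads,
    mirroring the two Squeak scripts above."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        rng = random.Random(42)
        random_rate = reads_per_second(
            fd, n_bytes,
            (rng.randrange(size - n_bytes) for _ in range(samples)))
        # Step forward n_bytes + 10 each read, wrapping at end of file.
        sequential_rate = reads_per_second(
            fd, n_bytes,
            ((i * (n_bytes + 10)) % (size - n_bytes)
             for i in range(samples)))
        return random_rate, sequential_rate
    finally:
        os.close(fd)
```

On a spinning disk and an uncached multi-gigabyte file, the random rate should land near the ~100 reads/second figures above, while the sequential rate benefits from drive and OS readahead.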
>>
>> Conclusions:
>>
>> - Hard-disk seek is definitely a bottleneck for Magma, or for any
>> Squeak application that requires random access to a file.
>> - When objects are clustered closely together, read performance can
>> be dramatically better.
>> - HDs with fast seek times, such as newer solid-state drives, might
>> perform dramatically better.
>> - I should consider reducing the trackSize from 280 bytes to ~100
>> bytes (or make it customizable), because the read rate drops off
>> quickly above that size, and even when a second read is required, it
>> could still be faster than a single larger initial read.
>>
>> - Chris
>> _______________________________________________
>> Magma mailing list
>> Magma at lists.squeakfoundation.org
>> http://lists.squeakfoundation.org/mailman/listinfo/magma
>
>