[Newbies] Re: Binary file I/O performance problems
Klaus D. Witzel
klaus.witzel at cobss.com
Wed Sep 3 09:53:58 UTC 2008
Hi David,
let me respond in "reverse" order of your points:
> I find it troubling that I am having to write code below the
> abstraction level of C to read and write data from a file. I thought
> Smalltalk was supposed to free me from this kind of drudgery? Right
> now, Java looks good and Python/Ruby look fantastic by comparison.
The difference in Squeak/Smalltalk is that intermediate-level routines
like #uint32 are available at the Smalltalk language level, where users
can see them, use them, and modify them. Smalltalk users regard this
openness as an invaluable resource. It has a price, yes.
But Squeak/Smalltalk can be faster, dramatically faster, than what you
observed. The .image file (10s - 100s of MB) is read from disk and has its
endianness corrected in a second or so. Of course this is possible only
because the file is in a ready-to-use format, but it may be a clue if you
want to consider alternative input methods.
> This (I think) cleans up some of the code smell, but for only marginal
> performance improvements. It seems that I may need to implement a
> buffer on the binary stream. Is there a good example on how this
> should be done in the image or elsewhere?
I don't know of a particular example (one specialized for your problem,
buffered reading of arbitrary "struct"s), but reading a large block in one
call is easy to do in Squeak:
  byteArray := ByteArray new: 2 << 20.
  actuallyTransferred :=
      binaryStream readInto: byteArray startingAt: 1 count: byteArray size
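To process a whole file this way, the same call can be repeated until it
transfers no more bytes. The following is only a sketch for a workspace: it
reads the sources file (as a stand-in for your data file), and the names
chunk, count and totalBytes are made up for illustration.

```smalltalk
| binaryStream chunk count totalBytes |
"Assumption: the sources file stands in for the real data file"
binaryStream := (SourceFiles at: 1) readOnlyCopy.
binaryStream binary.
chunk := ByteArray new: 2 << 20.    "one 2 MB buffer, reused for every read"
totalBytes := 0.
[[count := binaryStream readInto: chunk startingAt: 1 count: chunk size.
  count > 0] whileTrue: [
      "bytes 1..count of chunk are valid here; decode them before the next read"
      totalBytes := totalBytes + count]]
    ensure: [binaryStream close].
totalBytes
```

A record that straddles a chunk boundary needs extra handling, which this
sketch leaves out.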
You may want to verify that gigabytes can be brought into Squeak's memory
in a matter of seconds; just #printIt the following in a workspace (it reads
2 MB, 1024 times over):
  [1024 timesRepeat: [
      [(binaryStream := (SourceFiles at: 1) readOnlyCopy) binary.
       byteArray := ByteArray new: 2 << 20.
       actuallyTransferred :=
           binaryStream reset; readInto: byteArray startingAt: 1 count: byteArray size]
          ensure: [binaryStream close]]] timeToRun
Compared with reading from disk 4 bytes at a time, this makes a huge
difference. From here on you would use the ByteArray protocol (#byteAt:*,
#shortAt:*, #longAt:*, #doubleAt:*), although, as mentioned earlier, these
methods may not be optimal for you (compared to other languages and their
libraries).
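As an illustration of that protocol, here is a small workspace sketch that
decodes two unsigned 32-bit values from a ByteArray with
#unsignedLongAt:bigEndian:. The sample bytes and the variable names are
invented for the example; in your case the bytes would come from the file
buffer and the flag from your own isBigEndian.

```smalltalk
| bytes isBigEndian first second |
"Eight sample bytes standing in for data read from the file"
bytes := #(1 0 0 0 0 0 1 0) asByteArray.
isBigEndian := false.
"The index is the 1-based position of the value's first byte"
first := bytes unsignedLongAt: 1 bigEndian: isBigEndian.     "1"
second := bytes unsignedLongAt: 5 bigEndian: isBigEndian.    "65536"
```

Stepping the index by 4 walks through consecutive uint32 fields without
creating an intermediate collection for each value.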
Last but not least, for performance-critical I/O or conversions, Squeak
users sometimes write a Squeak plugin (which extends the Squeak VM). The
plugin is still written at the Smalltalk/Slang language level, but it can
call any hardware-oriented routine to speed things up dramatically, and
this does compare well with other languages and their libraries :)
HTH.
/Klaus
On Wed, 03 Sep 2008 08:00:54 +0200, David Finlayson wrote:
> OK - I made some of the suggested changes. I broke the readers into two
> parts:
>
> uint32
>     "returns the next unsigned, 32-bit integer from the binary stream"
>     isBigEndian
>         ifTrue: [^ self nextBigEndianNumber: 4]
>         ifFalse: [^ self nextLittleEndianNumber: 4]
>
> Where nextLittleEndianNumber looks like this:
>
> nextLittleEndianNumber: n
>     "Answer the next n bytes as a positive Integer or
>     LargePositiveInteger, where the bytes are ordered from least
>     significant to most significant.
>     Copied from PositionableStream"
>     | bytes s |
>     [bytes := stream next: n.
>     s := 0.
>     n to: 1 by: -1 do: [:i |
>         s := (s bitShift: 8) bitOr: (bytes at: i)].
>     ^ s]
>         on: Error
>         do: [^ nil]
>
>
>
> David