[Newbies] Re: Binary file I/O performance problems

Klaus D. Witzel klaus.witzel at cobss.com
Wed Sep 3 09:53:58 UTC 2008


Hi David,

Let me respond to your points in reverse order:

> I find it troubling that I am having to write code below the
> abstraction level of C to read and write data from a file.  I thought
> Smalltalk was supposed to free me from this kind of drudgery? Right
> now, Java looks good and Python/Ruby look fantastic by comparison.

The difference with Squeak/Smalltalk is that intermediate-level
routines like #uint32 are available at the Smalltalk language level,
where users can see, use and modify them. Smalltalk users regard this
openness as an invaluable resource. It has a price, yes.

But Squeak/Smalltalk can be faster, dramatically faster, than what you
observed. The .image file (tens to hundreds of MB) is read from disk and
converted to the host's byte order where needed, all in a second or so.
Of course this is possible only because the file is in a ready-to-use
format, but it may be a clue if you want to consider alternative input
methods.

> This (I think) cleans up some of the code smell, but for only marginal
> performance improvements. It seems that I may need to implement a
> buffer on the binary stream. Is there a good example on how this
> should be done in the image or elsewhere?

I don't know of a particular example (one specialized for your problem
at hand, i.e. buffered reading of arbitrary "struct"s), but reading a
large block in a single call is easy to do in Squeak:

   byteArray := ByteArray new: 2 << 20.
   actuallyTransferred :=
	binaryStream readInto: byteArray startingAt: 1 count: byteArray size
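
Since #readInto:startingAt:count: answers the number of bytes actually
transferred, the same buffer can be refilled until the file is exhausted.
A minimal sketch, assuming binaryStream is already open and in binary
mode, and that an answer of 0 means end of file (the names buffer, count
and total are only illustrative):

   buffer := ByteArray new: 2 << 20.	"2 MB, reused for every chunk"
   total := 0.
   [(count := binaryStream
	readInto: buffer startingAt: 1 count: buffer size) > 0]
		whileTrue: [
			"only the first count bytes of buffer are valid here"
			total := total + count].
   total

The last chunk is usually shorter than the buffer, so any decoding
should stop at count rather than at buffer size.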

You may want to verify that gigabytes can be brought into Squeak's memory
in a matter of seconds (the loop below reads 2 MB 1024 times, i.e. 2 GB
in total, and answers the elapsed milliseconds); just #printIt in a
workspace:

[1024 timesRepeat: [
	[(binaryStream := (SourceFiles at: 1) readOnlyCopy) binary.
	byteArray := ByteArray new: 2 << 20.
	actuallyTransferred := binaryStream
		reset;
		readInto: byteArray startingAt: 1 count: byteArray size]
			ensure: [binaryStream close]]] timeToRun

Compared with reading from disk 4 bytes at a time, this makes a huge
difference. From here on you would use the ByteArray protocol (#byteAt:*,
#shortAt:*, #longAt:*, #doubleAt:*), but as mentioned earlier, these
methods may not be optimal for you when compared to other languages and
their implementation libraries.
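
As an illustration of that protocol, here is a small workspace sketch
that decodes consecutive unsigned 32-bit integers from a ByteArray with
#unsignedLongAt:bigEndian:. The eight literal bytes are made up for the
example (they encode the little-endian values 1 and 2), and the names
bytes, isBigEndian and values are only illustrative:

   "bytes stands in for a buffer already filled by readInto:startingAt:count:"
   bytes := #(1 0 0 0 2 0 0 0) asByteArray.
   isBigEndian := false.
   values := OrderedCollection new.
   1 to: bytes size - 3 by: 4 do: [:index |
	values add: (bytes unsignedLongAt: index bigEndian: isBigEndian)].
   values

With a real buffer you would loop up to the number of bytes actually
transferred instead of bytes size, and pick #shortAt:bigEndian:,
#longAt:bigEndian: and so on according to the field being decoded.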

Last but not least, for performance-critical I/O or conversions, Squeak
users sometimes write a Squeak plugin, which extends the Squeak VM. A
plugin is still written at the Smalltalk/Slang language level, but it can
call any hardware-oriented routine to speed things up dramatically, and
this does compare well to other languages and their implementation
libraries :)

HTH.

/Klaus


On Wed, 03 Sep 2008 08:00:54 +0200, David Finlayson wrote:

> OK - I made some of the suggested changes. I broke the readers into two  
> parts:
>
> uint32
> 	"returns the next unsigned, 32-bit integer from the binary
> 	stream"
> 	isBigEndian
> 		ifTrue: [^ self nextBigEndianNumber: 4]
> 		ifFalse: [^ self nextLittleEndianNumber: 4]
>
> Where nextLittleEndianNumber looks like this:
>
> nextLittleEndianNumber: n
> 	"Answer the next n bytes as a positive Integer or
> 	LargePositiveInteger, where the bytes are ordered from least
> 	significant to most significant.
> 	Copied from PositionableStream"
> 	| bytes s |
> 	[bytes := stream next: n.
> 	s := 0.
> 	n
> 		to: 1
> 		by: -1
> 		do: [:i | s := (s bitShift: 8)
> 						bitOr: (bytes at: i)].
> 	^ s]
> 		on: Error
> 		do: [^ nil]
>
>
>
> David



