Ramblings on how to optimize Squeak for modern CPU bit manipulation (Was Re: I was wondering ...

Tue Jun 29 20:58:31 UTC 1999

On Tue, Jun 29, 1999 at 12:38:54PM -0700, Lawson English wrote:

> I said:
>
> [refinements to my idea include]
>
> >1) a set of quick and dirty primitives that can be called as standalones;
> >2) a full-blown (more or less) AltiVec simulator that uses a single set of
> >32 "registers" that are accessed by index, rather than by providing a
> >Smalltalk byteArray every time you use it.

> Further refinements to my pixel-handling primitives idea:
>
> In addition to the 128-bit AltiVec simulation/primitives, it would be useful
> to implement:
>
> A set of quick and dirty AltiVec-like primitives that only operate on
> single pixels at the various standard screen depths: 32, 16, 8.
>
> A full-blown single-pixel equivalent of the AltiVec simulator that would
> allow one to manipulate a single pixel's color channels
> one-pixel-at-a-time.
>
> E.g., C-based methods to separate color channels into 16 or 32-bit values,
> and manipulate them simultaneously or separately, as well as methods to
> convert them back to 8/5/whatever bits per channel and repack them into a
> single pixel.
>
> The idea is to create a bunch of primitives that can manipulate pixels
> efficiently for color-handling, as well as to speed up DSP-like operations.
> The simulator (32 or 128-bit) would store intermediate values in virtual
> registers so that no conversion or other data-related overhead would be
> incurred until you needed to manipulate the data using standard Smalltalk.
>
> The 128-bit simulator/primitives would be suitable for long streams of
> pixels or other data. The 32-bit simulator/primitives would be suitable for
> short segments of data (or for handling edge conditions in a long stream in
> a machine that has an AltiVec-like device handy).
>
> For MMX, perhaps an intermediate 64-bit version could be implemented as
> well, or facilities created to handle 64-bit segments from within the
> 128-bit version?
>
> Comments? Criticisms? I haven't benchmarked any of this, but my intuition
> says that the time-savings from doing this could be quite good, especially
> during the prototyping phase of pixel/DSP algorithms.

Disclaimer: I haven't been keeping up with this list or this thread in
particular, so I may be reiterating things that have already been
hashed out or just generally making clueless statements.  If that's
the case, sorry.

However, as my day job involves writing and maintaining development
tools for a SIMD DSP, I have some experience with the technology and
the related philosophy.

The basic idea behind vector processing is to do more in one clock
cycle.  As such, my particular DSP will do 4 or 8 arithmetic
operations on a swath of data in one cycle and calculate the next
address at the same time.  The ideal construct for using this kind of
thing is a big linear sequence of instructions--an unrolled tight
loop, as it were.  This gives you a 4- or 8-fold increase in
performance over a scalar processor and allows realtime video
manipulation at 25MHz.

Note that you only get a significant performance boost because you're
doing an enormous number of successive basic arithmetic operations.
That is, almost every clock cycle is used to do arithmetic rather than
flow control or other housekeeping.

In contrast, executing a Squeak expression like:

        aByteArray doBy128Bits: [ :chunk | chunk someDSPOperation ]

involves several hundred (at least) clock cycles in between each piece
of arithmetic done.  Doing 4 or 8 or 16 bytes' worth of arithmetic in
one clock cycle rather than in 4, 8 or 16 cycles isn't a significant
gain, because the arithmetic itself uses so few clock cycles compared
to the overhead of the Squeak VM.
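
To put rough numbers on that argument, here is a small C sketch.  The
function name and the cycle counts fed to it are assumptions made up
for illustration, not measurements of any real VM or CPU.

```c
/* Rough per-chunk speedup estimate (illustrative only).
 *   overhead    - cycles the interpreter spends per chunk (assumed figure)
 *   scalar_work - cycles to do the arithmetic one element at a time
 *   vector_work - cycles to do the same arithmetic on a vector unit
 * Returns how many times faster the vector version of the whole
 * iteration is, overhead included.
 */
double chunk_speedup(double overhead, double scalar_work, double vector_work)
{
    return (overhead + scalar_work) / (overhead + vector_work);
}
```

With an assumed 300 cycles of overhead, 16 cycles of scalar arithmetic
and 1 cycle of vector arithmetic, chunk_speedup(300, 16, 1) comes to
about 1.05, i.e. roughly a 5% gain.  With no overhead at all, the same
arithmetic would be 16 times faster.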

Adding extended bit-manipulation primitives to Squeak may well be a
good idea (I haven't needed them so I don't have an informed opinion
here) but you may as well go and code them in efficient C--vector
processing won't help enough.
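
As a sketch of what such a plain C primitive might look like, the
following splits a 16-bit pixel into channels and repacks it.  The
5-5-5 layout with red in the high bits, and all of the names, are
assumptions for illustration; check them against the actual pixel
format before relying on this.

```c
#include <stdint.h>

/* Hypothetical holder for one pixel's channels, widened to 16 bits each
 * so intermediate results have room to grow before repacking. */
typedef struct { uint16_t r, g, b; } Channels16;

/* Assumed layout: x RRRRR GGGGG BBBBB (top bit unused). */
static Channels16 unpack555(uint16_t pixel)
{
    Channels16 c;
    c.r = (pixel >> 10) & 0x1F;
    c.g = (pixel >> 5) & 0x1F;
    c.b = pixel & 0x1F;
    return c;
}

/* Clamp a widened channel value back into the 5-bit range 0..31. */
static uint16_t clamp5(uint16_t v)
{
    return v > 31 ? 31 : v;
}

/* Repack the channels into one pixel, saturating each channel. */
static uint16_t pack555(Channels16 c)
{
    return (uint16_t)((clamp5(c.r) << 10) | (clamp5(c.g) << 5) | clamp5(c.b));
}
```

Unpacking, adjusting a channel and repacking is then three calls, and
an out-of-range channel saturates at 31 instead of spilling into its
neighbour.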

I _can_ think of several places where vector instructions might
improve performance:  BitBLT could probably benefit from AltiVec or
MMX-based replacements for a few of the workhorse primitives.
ByteArray, FloatArray and IntArray might also benefit from
vector-based "Do-this-to-all-elements"-type primitives.  Maybe
you can think of a few others.
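
A "do this to all elements" primitive could be sketched in C along
these lines.  The function name and the particular operation are
hypothetical; the point is only that the loop runs inside the
primitive, so the interpreter overhead is paid once per array rather
than once per chunk.

```c
#include <stddef.h>

/* Hypothetical body of a FloatArray-style primitive:
 * dst[i] = dst[i] * scale + src[i] for every element.
 * A vector unit, or just a good compiler, can then work on the
 * tight loop without the interpreter getting in the way. */
void float_array_scale_add(float *dst, const float *src, float scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] * scale + src[i];
}
```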

In any case, if you do add such primitives, please make sure that
there are portable alternates for all of them.
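
One way to keep a portable alternate next to a vector version is
sketched below.  It uses SSE2 intrinsics as a stand-in (my choice for
the example, since they are easy to test on current x86 compilers;
AltiVec or MMX code would slot into the same place), and the function
names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics, only where available */
#endif

/* Portable version: saturating add of unsigned bytes, one at a time. */
void add_bytes_saturate_portable(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}

/* Entry point a primitive would call.  Uses 16-byte vector adds when
 * the compiler targets SSE2, and the portable loop for the tail or
 * for everything on other machines. */
void add_bytes_saturate(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(a, b));
    }
#endif
    add_bytes_saturate_portable(dst + i, src + i, n - i);
}
```

Both paths should give identical results, so the portable version also
serves as a reference to test the vector one against.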

                                --Chris