64 bit images (was: A plan for 3.8/4.0...)

Richard A. O'Keefe ok at cs.otago.ac.nz
Fri Apr 30 01:59:38 UTC 2004


"Daniel Poon" <dan at chenvillage.com> wrote,
and <Yoshiki.Ohshima at acm.org> forwarded:
    I am a Smalltalk programmer who works on numerical programs in Smalltalk
    [...] at least half of what we do is not linear, and involves a lot
    of old fashioned floating point ops to solve.

It does not follow that there are lots of floats outside arrays.

    Personally, I don't mind sacrificing some speed for Smalltalk's other
    benefits.  But two orders of magnitude is really stretching the
    case.  When people knock Smalltalk for its speed, they can easily
    point to floating point ops.

    Please consider implementing immediate floats in Smalltalk!!

Immediate (32-bit) floats will do NOTHING to improve the performance
of code working on 64-bit floats.  In fact, they will make it worse.

People are jumping from
(A) we'd really like much better floating-point performance
to
(B) immediate 32-bit (or 62-bit) floats are the answer.
But this is NOT a valid inference and the conclusion is almost certainly false.

There are all sorts of other things one could do, none of which involve
fundamental changes to the Squeak method dispatch code.

Squeak floating-point performance is affected by several things:
- the time it takes to discover that something IS a float
- the time it takes to select the right implementation of a method for a float
- the time required to fetch floats out of boxes
- the time required to allocate boxes for intermediate results
- the time required to reclaim that storage
- the lost opportunities to keep things in floating point registers

Aside from the fact that you can't have 64-bit immediate floats at all,
there's *still* going to be an order-of-magnitude cost relative to
optimised C or Fortran 90, because each float must be fetched from
memory twice: once to check that it IS a float -- a load into an
integer register -- and again to get the value into a floating-point
register.  There will still be the lost opportunities to keep things
in FP registers.
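As a concrete illustration of that double fetch, here is a minimal sketch
in C of a hypothetical tagged-reference scheme.  The tag layout and box
format are invented for illustration; Squeak's actual object headers differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tagging scheme, for illustration only: the low two
 * bits of an object reference hold a tag, and boxed floats live on
 * the heap behind such a reference. */
enum { TAG_MASK = 3, TAG_FLOAT = 2 };

typedef uintptr_t oop;                 /* an object reference */

typedef struct { double value; } FloatBox;

static oop box_float(double d) {
    FloatBox *b = malloc(sizeof *b);   /* allocation cost, per result */
    b->value = d;
    return (oop)b | TAG_FLOAT;         /* malloc is 8-aligned, tag bits free */
}

static double unbox_float(oop x) {
    /* First fetch: the tag check runs in an integer register. */
    assert((x & TAG_MASK) == TAG_FLOAT);
    /* Second fetch: only now can the value reach an FP register. */
    return ((FloatBox *)(x & ~(oop)TAG_MASK))->value;
}
```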

There are several possible approaches.

One is to try to reduce the cost of allocation and GC for floats by having
a specially managed region just for boxed floats.  (Shades of BiBoP.)
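A minimal sketch of what such a region might look like, assuming a
fixed-size page holding nothing but double-sized cells, so allocation is
a pointer bump and the collector can treat the whole page uniformly.
The capacity and the overflow policy are assumptions, not Squeak's design.

```c
#include <stddef.h>

/* BiBoP-style region: every cell in the page is a boxed double. */
#define FLOAT_REGION_CAP 4096

static double float_region[FLOAT_REGION_CAP];
static size_t float_region_top = 0;

static double *alloc_float(double d) {
    if (float_region_top == FLOAT_REGION_CAP)
        return NULL;                   /* caller falls back to the general heap */
    double *box = &float_region[float_region_top++];
    *box = d;                          /* allocation is just a pointer bump */
    return box;
}
```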

A variant of that would be to have quite a small special region, treated
as a stack, and have the arithmetic operations on floats return numbers
in that area.  Now add two new instructions:

    discard_number		if the TOS item is the float at the top
				of the special region, decrement the
				special region's sp.  Decrement the
				normal sp.
    fix_number			if the TOS item is the float at the top
				of the special region, allocate a normal
				box for it.
So
    x := a * x + b
becomes
    push a;
    push x;
    send *;		-- result left in special region
    push b;
    send +;		-- result left in special region
    fix_number;		-- final result boxed normally
    store x.
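The same scheme can be sketched in C.  The names discard_number and
fix_number come from the proposal above, but the data layout and calling
convention here are assumptions for illustration only.

```c
#include <stdlib.h>

/* Small special region, treated as a stack of unboxed doubles. */
#define FSTACK_CAP 256
static double fstack[FSTACK_CAP];
static int fsp = 0;                   /* the special region's sp */

static double *fpush(double d) {      /* float primitive leaves its result here */
    fstack[fsp] = d;
    return &fstack[fsp++];
}

static void discard_number(double *p) {
    if (p == &fstack[fsp - 1])        /* TOS item is top of the special region? */
        fsp--;                        /* intermediate result: just pop it */
}

static double *send_add(double *t, double b) {
    double r = *t + b;
    discard_number(t);                /* operand was an intermediate: pop it */
    return fpush(r);                  /* result left in special region */
}

static double *fix_number(double *p) {
    if (p == &fstack[fsp - 1]) {      /* final result escapes: box it normally */
        double *box = malloc(sizeof *box);
        *box = *p;
        fsp--;
        return box;
    }
    return p;                         /* already an ordinary box */
}

/* x := a * x + b, mirroring the bytecode sequence above */
static double *axpb(double a, double *x, double b) {
    double *t = fpush(a * *x);        /* send *: result in special region */
    double *sum = send_add(t, b);     /* send +: result in special region */
    return fix_number(sum);           /* final result boxed normally */
}
```

Note that no heap allocation happens for the intermediate a * x; only the
value that escapes into x gets an ordinary box.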

Push this idea far enough and you get something like the old MacLisp SPDL.
MacLisp was supposed to get quite good FP performance, compared with
Fortran, although the Fortran compiler was probably quite dumb compared
with today's.

Another, and long-term perhaps the most effective, approach
would be some sort of type inference, rather like Self.

What the best approach is depends to a large extent on what the
non-array-based floating-point code looks like, and some examples
would help there.

For what it's worth, computing dot products on my machine:
    30 nsec/element in optimised C
   100 nsec/element in unoptimised C

  7240 nsec/element in SCM (a Scheme interpreter), of which
  3420 nsec/element is GC overhead
  4790 nsec/element in SCM using immediate integers, of which
  1230 nsec/element is GC overhead.

  1880 nsec/element in Larceny (a Scheme compiler), of which
   880 nsec/element is GC overhead
    93 nsec/element in Larceny using immediate integers, of which
     0 nsec/element is GC overhead.

Larceny/C is somewhere between 1 and 2 orders of magnitude.
SCM/C is somewhere around 2 orders of magnitude.
If Squeak float arithmetic ends up 2 orders of magnitude slower than C,
then this is about what it looks like.
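For reference, the C kernel being timed is presumably something like the
following.  This is a reconstruction, not the actual benchmark code; note
the double-precision accumulator even for single-precision elements.

```c
#include <stddef.h>

/* A plain dot product.  Accumulating in double precision even when
 * the vector elements are floats is why 32-bit immediate floats
 * would not help this kind of code. */
static double dot(const float *x, const float *y, size_t n) {
    double sum = 0.0;                 /* accumulate in double precision */
    for (size_t i = 0; i < n; i++)
        sum += (double)x[i] * (double)y[i];
    return sum;
}
```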

I picked SCM and Larceny because I already had them.  
Immediate floats would clearly save a lot of allocation time,
except that we wouldn't be able to use them.  (Because when you
compute dot products, you want to accumulate the results in double
precision even if your vector elements are single precision.)
But in the SCM system, that wouldn't really help *enough*; the
version using immediate integers is still 2 orders of magnitude
slower than the C version using doubles.  SCM interprets carefully
tweaked parse trees; Larceny generates native code.  So I'd expect
Squeak to lie somewhere between Larceny and SCM here; using immediate
floats might cut the time in half, but I really wouldn't expect a
decimal order of magnitude improvement.



