Adding loop primitives/optimizations

Marcus Denker denker at iam.unibe.ch
Thu Dec 2 20:30:24 UTC 2004


Am 02.12.2004 um 20:34 schrieb Lyndon Tremblay:

>>
>> VI4 was an experiment of Anthony Hannan; the result is the
>> closure compiler on SqueakMap. (The changes to bytecodes, stack
>> layout... have been abandoned.)
>>
>> The status of the Jitter is unknown.
>>
>>      Marcus
>>
>>
>
> VI4 is also documented to have been meant as an image format change -
> this means either these changes are already in and the closure
> compiler is optional, or the result of the experiment was just the
> closure compiler itself. Am I close?

The result is just the closure compiler. The changes were much more than
what is minimally required for getting closures, and they had some
bad consequences for Jit compiling.

>  I find if (useJit) and #ifdef JITTER in, I believe, both the Mac VM
> and Unix VM sources (not in Win32).

Yes, these were for J3. J3 actually got quite far: it worked on both G3
and x86, on both MacOS 9 and Linux (and I think Andreas made a Win
version for testing, too).

The Jitter was done by Ian Piumarta, originally for Linux/G3. With lots
of help from Ian, I did the port to x86 and MacOS.

Of course, that was 2001. Revisiting the benchmarks is kind of 
interesting...

Interp:     '43805612 bytecodes/sec; 1325959 sends/sec'
J3:         '135665076 bytecodes/sec; 8100691 sends/sec'

Today (PowerBook G4 1.5GHz), interp:

               '114387846 bytecodes/sec; 5152891 sends/sec'
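For scale, the speedup factors implied by these numbers can be checked in
a few lines of Python (the figures are just the ones quoted in this mail):

```python
# Speedup factors implied by the 2001 measurements quoted above.
interp_2001 = {"bytecodes": 43805612, "sends": 1325959}
j3_2001     = {"bytecodes": 135665076, "sends": 8100691}

for metric in ("bytecodes", "sends"):
    factor = j3_2001[metric] / interp_2001[metric]
    print(f"J3 vs. interpreter, {metric}/sec: {factor:.1f}x")
# J3 was roughly 3.1x on bytecodes and 6.1x on sends.
```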

But the microBenchmarks don't tell the whole story: even with a speedup
of factor 6 in sends, we only saw the performance doubled on real-world
benchmarks (e.g. the MacroBenchmarks). So even being slower on sends,
I'd guess that my system today is faster than the Jit-based one of 2001.

This is because Squeak was carefully optimized to run primitives most
of the time. And then, as Tim pointed out, even if the Jit can optimize
the code to run in zero seconds, you will only see the performance
doubled when the system spends 50% of the time in the primitives.
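The arithmetic behind that point is just Amdahl's law: only the fraction
of time outside the primitives benefits from the Jit. A minimal sketch
(function name mine):

```python
def overall_speedup(primitive_fraction, jit_speedup):
    """Amdahl's law: only the non-primitive fraction benefits from the Jit."""
    return 1.0 / (primitive_fraction + (1.0 - primitive_fraction) / jit_speedup)

# 50% of the time in primitives, Jit makes the rest infinitely fast:
print(overall_speedup(0.5, float("inf")))   # -> 2.0
# With a factor-6 speedup on the non-primitive part:
print(round(overall_speedup(0.5, 6.0), 2))  # -> 1.71
```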

Another related problem was GC: with the faster VM, the percentage of
the time that the system spends in GC will grow. I'd guess that we would
have to look closely at the GC to get more leverage from a good Jit.

Of course, even if you don't get quite the performance that you'd like
to see in the end, it's worthwhile, as a good Jit allows you to convert
a lot of the Slang code to normal Smalltalk, thus making better designs
possible and much easier to change, as the code is not hardcoded in the
VM.

J5 then was an interesting design: real PICs. And all the complex
bytecode was de-optimized to a simple form where every send was really
happening. Even without *any* inlining, Ian managed to get near the
performance of the interpreter. And with the PIC data, the next step
would be to start using that to do optimizations based on the types
that are recorded in the PICs... I don't know if that ever got
implemented in J5.
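To make "real PICs" concrete: a polymorphic inline cache is a per-send-site
cache from receiver class to looked-up method, which as a side effect
records exactly the type information mentioned above. A toy sketch in
Python (names and structure are mine, not J5's):

```python
class PolymorphicInlineCache:
    """Toy PIC: one cache per send site, mapping receiver classes to the
    methods found for them, so repeated sends skip the full lookup."""
    def __init__(self, selector):
        self.selector = selector
        self.entries = {}          # receiver class -> looked-up method

    def send(self, receiver, *args):
        cls = type(receiver)
        method = self.entries.get(cls)
        if method is None:                      # cache miss: full lookup
            method = getattr(cls, self.selector)
            self.entries[cls] = method          # ...and record the type
        return method(receiver, *args)

site = PolymorphicInlineCache("__len__")
site.send([1, 2, 3])      # miss: looks up list.__len__, records list
site.send([4, 5])         # hit: no lookup needed
site.send("abc")          # miss: the site is now polymorphic
print(sorted(c.__name__ for c in site.entries))  # -> ['list', 'str']
```

The recorded `entries` are the "PIC data" an optimizer could later mine
for type-based optimizations.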

Speaking about runtime compilers for Squeak, there are two other
projects. Exupery, by Bryce Kampjes, is a runtime translator that is
written in Squeak, not C++; it is on SqueakMap, and Bryce has reported
some good speedups already.

And then there is AOStA, a project started by Eliot Miranda. The idea
here is to add TypeFeedback optimization to an existing Jit-based
Smalltalk (e.g. VisualWorks, or J5 without inlining) using a
bytecode-to-bytecode optimizer in the image (and a slightly modified VM
with a bunch of additional bytecodes and access to the PIC data). This
project was extremely successful in the sense that I got my Dipl.
Inform. (Masters degree) by hacking on it, but it has not yet resulted
in anything practically useful (and it has therefore not yet proven
that this is a working design at all).
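The core idea of such type feedback can be illustrated in a few lines:
if the PIC only ever saw one receiver class at a site, guard on that
class and take a direct (inlinable) path, falling back to a full send
otherwise. A hedged sketch (all names illustrative, nothing from AOStA's
actual code):

```python
def specialize(seen_classes, fast_case, generic_send):
    """Type feedback: if the PIC recorded exactly one receiver class,
    emit a class guard plus a direct call; otherwise leave the generic
    send alone. (Illustrative only, not AOStA's real interface.)"""
    if len(seen_classes) == 1:
        (expected,) = seen_classes
        def specialized(receiver):
            if type(receiver) is expected:   # cheap class guard
                return fast_case(receiver)   # direct, inlinable path
            return generic_send(receiver)    # uncommon case: full send
        return specialized
    return generic_send

# The PIC at this site only ever saw ints:
f = specialize({int}, lambda n: n + 1, lambda r: r.__add__(1))
print(f(41))  # -> 42, via the guarded fast path
```

A receiver of an unexpected class still works: the guard fails and the
generic send handles it, which is where a real system would deoptimize.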

     Marcus





More information about the Squeak-dev mailing list