Optimizing Squeak

Mon Feb 22 18:56:18 UTC 1999

I'm just a nosy bystander at implementing interpreters so forgive me if these are stupid comments:

As you know, Smalltalk should spend most of its time in primitives (if I remember correctly, we targeted over 70% at ParcPlace).  And, of course, primitives can run at the speed of optimized C.  If this is true, then the need to speed up the remaining ~30% is lessened.

I do notice the need for much faster method execution in math-intensive tasks.  But if we implemented math primitives that operated on entire vectors (arrays) of SmallIntegers and Floats in a single operation, and implemented appropriate matrix and vector classes (with primitive math operations on them), much of the need for fast math could be met by these classes, at primitive speed (optimized C).
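As a rough sketch of what one such whole-vector primitive might look like on the C side: the function name, the argument layout, and the convention of returning -1 for "primitive failed" are all assumptions made for illustration, not Squeak's actual primitive interface.

```c
#include <stddef.h>

/* Hypothetical primitive: element-wise sum of two Float vectors.
 * Returns 0 on success and -1 to signal "primitive failed", in which
 * case the caller would fall back to ordinary Smalltalk code.
 * Names and calling convention are illustrative assumptions. */
int vec_add_f64(const double *a, size_t n_a,
                const double *b, size_t n_b,
                double *out)
{
    if (a == NULL || b == NULL || out == NULL || n_a != n_b) {
        return -1;                /* bad arguments: fail the primitive */
    }
    for (size_t i = 0; i < n_a; i++) {
        out[i] = a[i] + b[i];     /* whole loop runs in C, no sends per element */
    }
    return 0;
}
```

The point of the design is that a single message send covers the whole array, so the per-element work never passes through the bytecode interpreter.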

Not that I want to discourage anyone from working on making method execution faster!  I don't.  It's just that with the appropriate operations implemented primitively, the size/complexity cost I'm willing to pay for faster method execution is decreased.

Before coming to ParcPlace, Eliot Miranda wrote a Smalltalk VM (BrouHaHa) built around a very efficient threaded byte-code interpreter.  I believe he has even released the code into the public domain.  Without the size/complexity cost of dynamic translation to machine code, it still managed to get 85% (if I remember correctly) of the speed of ParcPlace's dynamic translation VM!

If Squeak's VM had:
	- 85% of the speed of ParcPlace's dynamic translation VM at method execution and
	- none of the complexity of dynamic translation and
	- none of the space requirements for native code caches and
	- vector/matrix math operations with primitive implementations

...I'd be one happy puppy!  I can't imagine wanting any more speed.

>Bytecode interpreter:
>One thing costly on modern microprocessors is branch
>mispredictions.  Since a bytecode interpreter calls
>or jumps to different code when dispatching each bytecode,
>it is likely to get a misprediction per bytecode.

Not if the bytecode is used to index into a 256-element jump table.  A bytecode is a single byte, so every possible value has a table entry; there is no need for an array-bounds check, and there are no conditional branches, so no branch mispredictions can occur.
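A minimal sketch of that kind of dispatch, assuming a toy three-opcode stack machine (the opcodes, the VM struct, and the handler names are invented for illustration and are not Squeak's bytecode set). The dispatch step is one indexed indirect call through a 256-entry table, with no range check and no comparison on the opcode:

```c
#include <stdint.h>

/* Toy VM state; layout is an illustrative assumption. */
typedef struct {
    const uint8_t *ip;     /* next bytecode to fetch */
    int32_t stack[64];     /* operand stack; no overflow checks in this sketch */
    int sp;                /* number of values on the stack */
    int running;           /* cleared by the halt handler */
} VM;

typedef void (*Handler)(VM *);

/* Hypothetical opcodes. */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2 };

static void op_halt(VM *vm) { vm->running = 0; }

/* Pushes the byte that follows the opcode. */
static void op_push(VM *vm) { vm->stack[vm->sp++] = *vm->ip++; }

/* Pops two values and pushes their sum; assumes well-formed bytecode. */
static void op_add(VM *vm)
{
    int32_t b = vm->stack[--vm->sp];
    vm->stack[vm->sp - 1] += b;
}

/* One entry per possible byte value, so any uint8_t is a valid index. */
static Handler table[256];

static void init_table(void)
{
    for (int i = 0; i < 256; i++) {
        table[i] = op_halt;        /* unassigned opcodes simply stop the VM */
    }
    table[OP_PUSH] = op_push;
    table[OP_ADD]  = op_add;
}

/* Runs until a halt; returns the top of stack, or 0 if the stack is empty. */
int32_t run(const uint8_t *code)
{
    VM vm = { code, {0}, 0, 1 };
    init_table();
    while (vm.running) {
        uint8_t op = *vm.ip++;
        table[op](&vm);            /* dispatch: indexed indirect call only */
    }
    return vm.sp > 0 ? vm.stack[vm.sp - 1] : 0;
}
```

Running it on the byte sequence OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_HALT leaves 5 on top of the stack.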

Carl Watts
http://AppliedThought.com/carl/