Optimizing Squeak

Jan Bottorff janb at pmatrix.com
Mon Feb 22 20:36:01 UTC 1999


At 07:09 AM 2/22/99 -0800, Greg & Cindy Gritton wrote:
>are available. Of course, on most processors, additional
>code is required to pass the arguments and save locals.

I'd like to suggest that most of the work done by bytecodes is essentially
getting ready to do a send (parameter pushing from local, ivar, global,
constant, outer), doing the send, and then following up on the send
(possibly storing a result in a local, ivar, global). A few years ago I did
a bunch of research on alternative Smalltalk bytecode sets and concluded
the "classic" bytecode set could not be much worse from an interpreter
overhead viewpoint. 

The best alternative I came up with was what I'll call a message send
bytecode set. Basically, each bytecode fully represented the preparation,
message send, and followup in a single bytecode. About 200 bytecodes were
used for the different combinations of parameter sources and counts (as I
remember, 5 possible parameter sources and 0 to 2 parameters), with the
rest of the bytecodes used to fill in the holes. I had a little prototype
interpreter running that could get about 75% of the performance achieved by
the VisualWorks native JIT compiler on an Intel processor. It was written
in a macro language (M4) that expanded to assembler. This performance was
measured using some hand-"assembled" bytecode methods (the leap year
calculation in Time, for example). No fully running system or compiler
was ever finished, so mileage may vary in real life. 

Some of the other tricks/features were:

1) Double indirect inline caches allowed native thread/multiprocessor
safety. The same trick might be used to implement synchronization locks.
The key was being able to update the inline cache with a single 32-bit
memory write; using two writes (say, one for the cached class value and
one for the cached method) would have required a very expensive spinlock
to maintain thread safety.

2) Methods that were "simple" were specially marked and never created a
new context. Simple meant things like returning self, returning an ivar of
self, or returning a constant. When the method lookup was first done (cache
miss), a target piece of native code to do the right thing was substituted
for the actual method in the cache entry.

3) There was no difference between bytecode dispatching and primitive
dispatching. A primitive was typically a piece of native code that worked
on the passed parameters and the last thing it did was execute the tail
bytecode dispatching code. There were essentially bytecode send fragments
and "helper" code fragments. The bytecode fragments got you to specific
helper fragments. An example helper fragment would be SmallInteger +, or
method entry context setup.

I don't offhand remember if the method cache ended up being a hashed index
or if I had a dedicated "slot" index as part of the bytecode. I remember
tinkering with both. As I remember, having the bytecode index a dedicated
slot was faster (and allowed the cache to be preloaded, essentially at
compile time), but burned a lot more memory.

To try to give a flavor of this bytecode set, you might have a bytecode
that did a receiver+one parameter send:

	send local, ivar

The binary encoding would be like:

	<send local, ivar bytecode>:8,
	<index into current context for local parameter>:4,
	<index into object for ivar parameter>:4,
	<index into current context for return result>:4,
	<index into method constant table for selector>:4

Classic bytecode would have been:

	push temp
	push ivar
	send #someSelector
	pop temp

As you would get the work of 4 bytecodes collapsed into 1, the
interpretation overhead was dramatically reduced. Also notice the classic
bytecode format took 4 bytes, and the alternative format was only 3 bytes,
so it had improved code density. I actually thought maybe Huffman coding
with variable-length bit fields (based on analysis of a full system) would
give the densest code, but it was a bit slower to decode, and hand-assembling
Huffman code was also a serious pain for tinkering. I suspect that for an
interpreter in C, like Squeak, the interpreter overhead reduction may be
even more dramatic. This also would incur only 2 branch prediction misses:
once in the dispatch to the helper code, and once at the end of the helper
to dispatch the next send bytecode. The classic bytecode set incurs at least
one branch prediction miss per bytecode, 4 in this example.

In case this isn't clear as mud, imagine the following 155 bytecodes:

	send0 {temp, ivar, global, constant, outer} "5 variations for unary messages"

	send1 {temp, ivar, global, constant, outer},
	      {temp, ivar, global, constant, outer}
		"25 variations for binary messages"

	send2 {temp, ivar, global, constant, outer},
	      {temp, ivar, global, constant, outer},
	      {temp, ivar, global, constant, outer}
		"125 variations for receiver+2 parameter messages like 'anArray at:aLocal put:aConstant'"

You also needed some to fill in the less common cases:

	move temp, {temp, ivar, global, constant, outer}
	move {temp, ivar, global, constant, outer}, temp

And also for sends with more than 2 parameters:

	push {temp, ivar, global, constant, outer}
	push2 {temp, ivar, global, constant, outer},
	      {temp, ivar, global, constant, outer}

There were no special bytecodes for invoking primitives or flow control;
you just did a pseudo send which invoked a piece of native code (via send
cache magic), like this:

	send2	ivar,temp,constant	"send #compareEqual message with constant as the bytecode branch offset"

As I remember, you did need a method return bytecode. This basically stored
one of the current context's locals back into the calling context at the
index specified in the invoking bytecode (which you either peek at, or else
store on original method activation).

I eventually decided making a new Smalltalk was not exactly a profitable
idea (circa '94-'95), and I needed to go on with my life, so stopped
working on the new interpreter architecture. I must still have the code
around someplace. It would have been nice to make a paper all about this,
but I got a little busy making a living.

Using these ideas in Squeak would be very significant work, as everything
has to "mesh" for things to work. It's not clear this effort wouldn't be
better spent on a native-code JIT, which ultimately I think would have the
best performance (processor branch prediction would actually work for
sends). If small memory footprint is still a major goal, then maybe these
are useful ideas.

- Jan

___________________________________________________________________
            Paradigm Matrix Inc., San Ramon California
   "video products and development services for Win32 platforms"
Internet: Jan Bottorff janb at pmatrix.com
          WWW          http://www.pmatrix.com
Phone: voice  (925) 803-9318
       fax    (925) 803-9397
PGP: public key  <http://www-swiss.ai.mit.edu/~bal/pks-toplev.html>
     fingerprint  52 CB FF 60 91 25 F9 44  6F 87 23 C9 AB 5D 05 F6
___________________________________________________________________

More information about the Squeak-dev mailing list