Optimizing Squeak
Dan Ingalls
DanI at wdi.disney.com
Tue Feb 23 07:01:15 UTC 1999
>Travis Griggs wrote:
>> I follow these threads with rapt attention, but I have to admit my ignorance on what is meant by "translate to threaded code". I know what byte codes are, I know what native code is, and how the JIT stuff does its dynamic caching, but what the heck is "threaded code"? Please excuse my ignorance. Either I'm a big idiot, or there's a bunch of other idiots out there for not asking. :)
Dwight Hughes <dwighth at ipa.net> replied:
>Fundamentally, in threaded code, the execution tokens are the addresses
>of the interpretive routines, either directly (direct threaded code) or
>indirectly through a jump table (token threaded code). So the core of
>the interpreter dispatch loop becomes trivial -- and very fast. Threaded
>tokens are usually a machine word long. There are an assortment of
>variants on these schemes, as can be seen here:
>http://www.complang.tuwien.ac.at/forth/threaded-code.html
An obvious question is how much faster one should be able to run with wordcodes (normal threading) than with bytecodes. John Malone did an experiment on a G3 Mac and found the following results comparing the two approaches:
John Malone <johnm at wdi.disney.com> reported:
>... I went home and played with various
>dispatch schemes (writing in PPC assembler). To facilitate experiments,
>I defined a minimal (4 instruction) bytecode set, and a
>short test program for benchmarking (a loop counting down from
>10,000,000 to zero). My goal was to understand:
>
> a. what are the performance improvements of simple threading
> (where a method is a list of addresses) over bytecodes?
> b. what are the performance limits of each? how close are they
> to the speed of inline-expanded code (a la Deutsch-Shiffman)?
> c. how close is Squeak to these limits?
> d. are there any interesting intermediate points in the performance
> spectrum?
>
>The metric I focused on was "dispatches/sec" for the count-down loop
>example. Here are some numbers I got on my G3 PowerBook (which
>has a 1-MByte backside cache, so we can assume the best possible
>caching behavior):
>
>                mSecs   dispatches/sec (millions)   loop length
>bytecode VM1     9333          21.43                2 bytecodes
>bytecode VM2     7950          25.16                2 bytecodes
>bytecode VM3     7267          27.52                2 bytecodes
>wordcode VM1     6567          30.45                2 wordcodes
>wordcode VM2     4850          41.24                2 wordcodes
>wordcode VM3     4833          41.38                2 wordcodes
>Squeak          39161          22.98                9 bytecodes
>
>Observation 1:
>Squeak currently achieves a very high dispatch rate. The sample program
>is slower than the test cases mainly because its loop contains more
>bytecodes, not because its dispatch cost is high.
>(Hence, one approach to better performance would be to re-design the
>bytecode set to do several things in one instruction--a single instruction
>could compare a temporary to zero and branch if greater, for example.)
>
>Observation 2:
>The wordcoded VMs are faster because they avoid the indirection of
>looking up the bytecode in a table to get the address of the opcode
>to execute (which corresponds to the case statement in the Squeak VM).
>But the fastest wordcoded VM is only 50% faster than the fastest
>bytecoded one.
>
>Observation 3:
>The difference between the slowest and fastest versions in each category
>arises from fetching the next instruction early in the code for an opcode,
>allowing the memory access to proceed in parallel with the real work. We
>could perhaps gain 30% by doing this in the Squeak VM. (The
>benefit would be smaller on less sophisticated processors.)
>
>To get a sense of how the above performance might compare with the
>code generated by a Deutsch-Schiffman style inlining translator, I
>hand-inlined the two instructions into a total of 6 PPC instructions:
>
>               mSecs   dispatches/sec (millions)   loop length
>Hand-Inlined    2767          72.28                2 inlined bytecodes (6 PPC instructions)
>
>This certainly supports the argument that doing full translation is
>more effective than simply translating to some kind of threaded jump
>list. On the other hand, the performance improvement is bounded by
>a factor of three, and you are unlikely to see this in any real program
>(which spends most of its time doing sends, primitives, and memory
>management, not simply running useless tight loops!)
- Dan
More information about the Squeak-dev mailing list