tinyBenchmarks on PocketSmalltalk [ was Re: Interesting new target with Palm OS]

Mark van Gulik ghoul6 at home.net
Tue Oct 24 04:52:46 UTC 2000


>> Ok, here's what I get on my Palm IIIe (Palm OS 3.1):
>>    ~17,500 bytecodes/sec
>>    ~1,260 sends/sec
>
> When I saw this I had to laugh/cry, but I was not surprised.

Eric, tsk, tsk.  My Atari ST Smalltalk (circa 1988) was doing around 10,000
bytecodes/sec on an 8 MHz Atari 1040ST with 1 MB of RAM.  Have you
considered abandoning the concept of a VM coded entirely in C?  Much of the
VM I constructed was coded in good ol' 68000 assembly (using inlined asm{}
code in MegaMax C).  Instead of their pathetic switch statement I used the
following dispatch loop...

-------------------------------
step:
    bra.s   trapped           ;Dynamically overwritten with "clr.l D7"
                              ;(0x4287) if there is no inter-bytecode
                              ;interrupt pending.
stepP2:
    move.b (bytecode)+,D7
    addq.l #1,PV_odometer(A4) ;only for counting instructions
    lsl.w  #2,D7              ;times four
    dc.w   0x4EFB,0x7002      ;jmp nextline(PC,D7.w)
;bytecode 0 -- skip two bytecodes
    jmp jumpFwd
;bytecode 1 -- skip three bytecodes
    jmp jumpFwd
;bytecode 2 -- skip four bytecodes
    jmp jumpFwd
;bytecode 3 -- skip N bytecodes (N in bytecode stream)
    jmp jumpL
...
;bytecode 16 -- dup. [note: instruction coded inline here in four bytes]
    move.l  (stackp),-(stackp)
    bra.s   stepP2
;bytecode 17 -- pop. [note: instruction coded inline here in four bytes]
    addq.l  #4,stackp
    bra.s   stepP2

...etc...

jumpFwd:
    lsl.w  #6,D7   ;already shifted 2 bits left, for 8 total
    move.b (bytecode)+,D7
    adda.l D7,bytecode
    jmp    step

...etc...
-------------------------------

Note that I used real interrupt handlers which overwrote the 68000
instruction at "step:".  To trigger an inter-bytecode interrupt at the next
convenient opportunity, a "bra.s trapped" instruction was written there.
When all pending interrupts had been handled (by the handler at location
"trapped:", not shown), a "clr.l D7" would be written there.

There are much better approaches, but this kept the interpreter extremely
compact, at the expense of some inlined, self-modifying assembly code.  Does
the Palm allow the executable code to be overwritten while it's running?  If
not, just put in an explicit test instead of the self-modifying stuff.  Note
that even a lowly *68020* might not detect inter-bytecode interrupts
correctly in some cases because of its larger (i.e., actually existent)
instruction cache.
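
A minimal sketch of the explicit-test variant in portable C, for systems
where the code cannot be patched while it runs.  Everything named here
(the opcode numbers, the Interp struct, interrupt_pending,
handle_interrupts) is hypothetical and only illustrates the idea: a flag
set by the interrupt handler is tested once per bytecode, and the table of
function pointers stands in for the computed jump in the assembly listing.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Hypothetical opcode numbering, for illustration only. */
enum { OP_PUSH1 = 0, OP_DUP = 1, OP_POP = 2, OP_HALT = 3, OP_COUNT = 4 };

/* Set by an interrupt or signal handler; tested between bytecodes. */
static volatile sig_atomic_t interrupt_pending = 0;
static int interrupts_handled = 0;

typedef struct {
    const uint8_t *ip;   /* next bytecode to execute */
    long stack[16];
    int sp;              /* number of items on the stack */
    int running;
} Interp;

static void op_push1(Interp *vm) { vm->stack[vm->sp++] = 1; }
static void op_dup(Interp *vm)   { vm->stack[vm->sp] = vm->stack[vm->sp - 1]; vm->sp++; }
static void op_pop(Interp *vm)   { vm->sp--; }
static void op_halt(Interp *vm)  { vm->running = 0; }

/* One table slot per bytecode, indexed directly by the opcode byte. */
static void (*const dispatch[OP_COUNT])(Interp *) = {
    op_push1, op_dup, op_pop, op_halt
};

/* Stands in for the handler at "trapped:" (not shown in the listing). */
static void handle_interrupts(Interp *vm) {
    (void)vm;
    interrupt_pending = 0;
    interrupts_handled++;
}

static void run(Interp *vm) {
    vm->running = 1;
    while (vm->running) {
        if (interrupt_pending)          /* explicit test, no code patching */
            handle_interrupts(vm);
        uint8_t op = *vm->ip++;
        assert(op < OP_COUNT);          /* this sketch has no other bounds check */
        dispatch[op](vm);
    }
}
```

The explicit test costs a load and a branch on every bytecode, which is
the overhead the self-modifying version avoids.  Whether a given compiler
turns the table call into a tight indirect jump is worth checking in the
generated assembly.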


> I would be very interested in finding out how you found the number of
> bytecodes.  Also, the timing of the run.  It turns out that if you use the
> time function from PST, that the resolution of the clock is REALLY bad, and
> therefore you need to run a sample that exceeds 5-10 seconds.  It counts
> ticks, which are some number/sec but not fixed.  You need to get into some
> deeper guts to get good timing info.

Add a statement like "odometer++;" to your main dispatch loop.  Then run a
benchmark once to count how many bytecodes get executed for a particular
benchmark (you might need to add a primitive to get the current odometer
value).  Then recompile your system *without* the "odometer++;" line, and
see how long it actually takes (with a stopwatch).  It should execute the
same number of bytecodes as it did with the odometer enabled, but without
the cost of keeping track this time around.  Divide the bytecode count by
the elapsed time to get bytecodes/sec.

BTW, if the times with and without the odometer are close enough to each
other, you may as well just leave it always enabled.  My interpreter also
counted number of method contexts and block contexts entered, the number of
block closures created, and the number of method contexts that were
recycled.
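
A rough C sketch of the odometer idea, assuming a compile-time switch so
the same source builds both the counting run and the timed run.  The names
COUNT_BYTECODES, ODOMETER_TICK, odometer_value and the two-opcode
interpreter are made up for illustration; a real VM would put the tick in
its own dispatch loop and expose the value through a primitive.

```c
#include <assert.h>
#include <stdint.h>

#ifndef COUNT_BYTECODES
#define COUNT_BYTECODES 1   /* build with -DCOUNT_BYTECODES=0 for the timed run */
#endif

#if COUNT_BYTECODES
static unsigned long odometer = 0;
#define ODOMETER_TICK() ((void)odometer++)
#else
#define ODOMETER_TICK() ((void)0)   /* compiles to nothing in the timed build */
#endif

/* Hypothetical primitive for reading the counter from the image. */
static unsigned long odometer_value(void) {
#if COUNT_BYTECODES
    return odometer;
#else
    return 0;
#endif
}

/* Hypothetical two-opcode interpreter, just enough to show the tick. */
enum { OP_NOP = 0, OP_HALT = 1 };

static void run(const uint8_t *ip) {
    for (;;) {
        uint8_t op = *ip++;
        ODOMETER_TICK();            /* one tick per bytecode fetched */
        switch (op) {
        case OP_NOP:  break;
        case OP_HALT: return;
        default:      return;       /* unknown opcode: stop */
        }
    }
}

/* Count from the counting build, seconds from the stopwatch run. */
static double bytecodes_per_second(unsigned long count, double seconds) {
    assert(seconds > 0.0);
    return (double)count / seconds;
}
```

As a made-up example of the arithmetic, a benchmark that executes 175,000
bytecodes and takes 10 seconds on the stopwatch works out to 17,500
bytecodes/sec.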


> Also, you MUST keep in mind that the current VM does not compile very well.
> The case statement handling the bytecode dispatch does not translate to a
> good table lookup, we had been trying to figure out how to get the compiler
> to create a good tight case statement.  If you look at the code gen'ed for
> the dispatch loop, it is REALLY awful.  It does range checks on subsets,
> checks for if = to different values and all sorts of really stupid things. I
> am assuming that the Squeak VM's bytecode dispatch is much better coded and
> creates good machine language.

Bypass the compiler.  Try inline assembly -- just insert "asm{}" somewhere
in your code and see what the compiler says about it.

Alternatively, ask the compiler to produce an assembly listing, then
hand-tweak it, comment it, and discontinue using the C version entirely.




