[Enh][VM] primitiveApplyToFromTo for the heart of the enumeration of collections?

Bryce Kampjes bryce at kampjes.demon.co.uk
Sat Sep 16 12:08:09 UTC 2006


What I missed until now was the "thisContext tempAt: 2 put: index + 1"
in the shadow code. I still maintain that your version is overly
clever and very likely to cause mass confusion and maintenance issues.
do: loops are something that should be fully understandable by any
Smalltalk programmer. Everyone is going to be seeing this in any
walkback that involves a do:.
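For contrast, the straightforward do: that ships with Squeak (paraphrased
from SequenceableCollection, quoted from memory, so treat it as a sketch)
is readable at a glance:

```smalltalk
do: aBlock
	"Evaluate aBlock with each of the receiver's elements as the argument."
	1 to: self size do:
		[:index | aBlock value: (self at: index)]
```

Nothing here mutates loop state from outside the method, so a walkback
through it shows exactly what you would expect.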

Normally, arguments are immutable in Squeak. Preserving that matters
to me when reading code. Exupery doesn't care, as the bytecodes are
identical. I do care, as a programmer, that normal invariants hold,
especially in code that I'm likely to be glancing at continually while
developing.

Now, the way to optimise an expression that uses thisContext on a
system with Exupery, or without your VM patch, is to remove the use of
thisContext. On systems that do not have the VM mod, the shadow code
is going to be much slower than the current implementation. My policy
for thisContext and tempAt:put: is that neither will be compiled by
Exupery; tempAt:put: should cause Exupery to de-optimise the context
and then drop back into the interpreter (it doesn't yet). I'm trying
to optimise the common case.


Re-calling a primitive from a return is close enough to re-entering
it. That is something that hasn't been done. It is a major design
change.


Your VM mod adds work to both the common send path and the common
return path, which every send is going to have to execute.

Do you have any measurements to show that any realistic code is
spending enough time in do: loop overhead to justify slowing down
all sends?

On my machine, I can do an interpreted message send every 291
clocks. You are adding at least 11 instructions to the send/return
sequence, which would take four clocks to execute at the peak
execution rate. The worst-case time is about 42 clocks, which is the
full latency cost plus two branch mispredicts (at 15 clocks each). The
time estimates are based on an Athlon 64, though they will be very
similar for other modern desktop CPUs (nuclear reactors in silicon),
except the Pentium 4, where a mispredict costs 30 clocks.
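For the record, the arithmetic behind those two estimates, assuming the
Athlon 64's peak rate of roughly three instructions per clock and the
15-clock mispredict penalty (the ~12-clock latency figure is inferred by
working back from the 42-clock total, so treat it as an assumption):

```smalltalk
"Best case: 11 extra instructions retiring at ~3 per clock."
(11 / 3) ceiling.	"about 4 clocks"

"Worst case: dependency latency plus two branch mispredicts."
12 + (2 * 15).	"42 clocks"
```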

The performance effects of the VM mod will depend on the architecture,
the compiler, and how well the branch predictor manages on those two
extra branches. Unfortunately, to prove there is not a speed loss, you
are really going to need to test on many architectures and compilers
under many different loads. Two branch mispredicts alone could cost
10% in send performance. Just on the x86, the costs are likely to
differ between the Pentium 3, Pentium 4, Pentium M, Intel's Core, and
the Athlon (where the XP may be different to the 64).

An out-of-order CPU may be able to hide the cost of the extra
instructions behind the current flabby send and return code. An
in-order CPU will not be able to do this, so expect a greater
performance loss on slower machines such as ARMs and other chips aimed
at the hand-held and embedded market. The risk of a speed drop on a
Pentium M is also greater than on an Athlon, as the Pentium M manages
to execute more instructions per clock when interpreting.

I calculated the clocks per send from the clock speed (2.2 GHz) and
the sends/sec from tinyBenchmarks.

    232,515,894 bytecodes/sec;  7,563,509 sends/sec

Exupery's tiny benchmark numbers are:

  1,151,856,017 bytecodes/sec; 16,731,576 sends/sec
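Both clocks-per-send figures fall straight out of dividing the 2.2 GHz
clock speed by the measured send rate:

```smalltalk
"Interpreter: about 291 clocks per send."
2.2e9 / 7563509.

"Exupery: about 132 clocks per send."
2.2e9 / 16731576.
```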

So for Exupery, a common-case send costs 132 clocks. There is still
plenty of room to remove waste from that. At 300 clocks per send, the
overhead is about 15% worst case and 1% best case, without
out-of-order execution being able to hide costs behind other
delays. At 132 clocks, the numbers are much worse. There is a good
chance that with more tuning Exupery's sends may be reduced to about
60 clocks without inlining; VisualWorks sends cost 30 clocks. Once
sends are optimised, the best optimisation will be to remove
primitiveApplyToFromTo, if it isn't already a net loss in performance
now.

I vote strongly against including this patch in the VM. I've used IBM
Smalltalk and enjoy working on a system where do: is easy to
understand. Please don't trade simplicity for an optimisation that
risks slowing down more code than it speeds up. A do: that is trivial
to understand and free from magic is worth a lot.

Bryce


