[Enh][VM] primitiveApplyToFromTo for the heart of the enumeration of collections?

Bryce Kampjes bryce at kampjes.demon.co.uk
Sat Sep 16 18:48:41 UTC 2006


Klaus D. Witzel writes:
 > Hi Bryce,
 > 
 > on Sat, 16 Sep 2006 14:08:09 +0200, you wrote:
 > 
 > > What I missed until now was the "thisContext tempAt: 2 put: index + 1"
 > > in the shadow code.
 > 
 > Some clever Squeaker will present alternatives, I'm sure ;-)

The issue is that arguments are immutable.

The performance issue from the #tempAt:put: could be worked around by
using a custom bytecode compiler that allows assignment to
arguments. But that does nothing to make your implementation simpler
to understand or verify.
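For readers following along: the disputed "thisContext tempAt: 2 put:"
line works roughly like this. Below is a hypothetical sketch of shadow
code in the style under discussion (the selector, primitive name, and
temp index are illustrative, not Klaus's actual patch):

```smalltalk
from: start to: stop do: aBlock
	"Hypothetical shadow code behind a primitive like
	 primitiveApplyToFromTo. 'start' is a method argument, so the
	 compiler rejects a direct 'start := start + 1'; the loop
	 advances it reflectively through the context instead."
	<primitive: 'primitiveApplyToFromTo'>
	[start <= stop] whileTrue:
		[aBlock value: (self at: start).
		 "equivalent to 'start := start + 1'; the temp index
		  depends on the method's actual temp layout"
		 thisContext tempAt: 1 put: start + 1]
```

The reflective write is what makes the loop possible at all, and also
what makes it both slow (a full send plus bounds checking, instead of
one store bytecode) and hard to verify.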

 > > I still maintain that your version is overly
 > > clever and very likely to cause mass confusion and maintenance issues.
 > 
 > I personally know of no software developer who ever participated in a mass  
 > confusion, Bryce, and I can say that for the last 30 years. Your  
 > prediction is not believable.
 > 
 > Would you say that
 > - http://en.wikipedia.org/wiki/Duff's_device
 > caused mass confusion? It is overly clever as well.

Yes, it is overly clever for many uses. And I think Duff said so
himself when he announced it.

In this case, IBM's version of this primitive did cause me a large
amount of confusion many years ago. It was not until this argument
that I started to understand what they did and why.

 > > do: loops are something that should be fully understandable by any
 > > Smalltalk programmer. Everyone is going to be seeing this in any walk
 > > back that involves a do:.
 > 
 > Yes, that is a novelty, perhaps it should be called an innovation. But I  
 > cannot claim that I have invented it, this was done some time ago.
 > 
 > > Normally, arguments are immutable in Squeak.
 > 
 > Not at all! Pass an argument to another method which does a #become: on
 > the argument, Bryce. This is an illusion, we are talking about Smalltalk.
 > 
 > How do you handle this situation in Exupery?
 > 
 > primitiveApplyToFromTo for example is robust, it does not do anything when  
 > an argument is #become:'ed behind its back to something else.

Arguments are immutable. Try compiling "selector: a a := 1". We allow
deep access and the possibility of extending the language. That is a
good thing. However, you have stepped from normal programming into
language modification. Having the power to do that easily is a
wonderful thing; using it is almost always a mistake.
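The compiler-level rule is easy to check for yourself in a workspace.
The first expression below is rejected with a "Cannot store into
argument" style error, while the reflective version compiles (a
sketch; the exact error wording varies by Squeak version):

```smalltalk
"Rejected by the compiler: a direct store into an argument"
Object compile: 'selector: a  a := 1'.

"Accepted: the reflective write bypasses the compiler's check"
Object compile: 'selector: a  thisContext tempAt: 1 put: 1'.
```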

Exupery needs to bail out if you modify the context too much. It makes
no assumptions while execution is outside of the method, except that
neither the PC nor the stack pointer is touched. While it is
executing, it owns the context. Exupery is safe because it can always
drop back to the interpreter.

The issue with maintainability is how likely the implementation is to
remain correct and how hard it is to verify that it is correct. It was
only this morning that I became tolerably confident that it is
correct, ignoring interrupts. I have not done the work required to
have any confidence that it is correct if an interrupt occurs; it may
be.

 > > Now, the way to optimise an expression that uses thisContext on a
 > > system with Exupery or without your VM patch is to remove the use of
 > > thisContext. The shadow code is going to be much slower on systems
 > > that do not have the VM mod than the current implementation.
 > 
 > This is the case for all primitive vs. shadow code comparisons. They must
 > be slower. Have you ever seen the opposite?

Your shadow do: will be much slower than a regular do:. It will
definitely be much slower after compilation.

Now, there is no guarantee that the code after a primitive is shadow
code. It is not always. Sometimes it handles cases that the primitive
doesn't. The only way to know for sure is to study both carefully and
fully understand what both do in all cases where they're used.

Your modified VM will be slower executing sends on some architectures,
with some compilers, under some loads. It does more work. From my
calculations the slowdown should be between 1% and 15%. It is
possible that the magic of modern hardware will hide the cost in some
cases, but not all. You are doing more work in the common case to
speed up a special case.
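As a rough reconstruction of where a 1%-15% range could come from (my
own back-of-envelope figures, using the 11 extra instructions and
~300-clock interpreted send quoted later in the thread; the
per-instruction costs are assumptions):

```smalltalk
"Best case: the 11 extra instructions retire at a peak rate of
 roughly 3 per clock. Worst case: an in-order pipeline averages
 roughly 4 clocks per instruction."
bestCase := 11 / 3 / 300.	"~4 clocks extra, about 1% of a send"
worstCase := 11 * 4 / 300.	"~44 clocks extra, about 15% of a send"
```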

 > > My policy
 > > for use of thisContext and tempAt:put: is that neither will be
 > > compiled by Exupery,
 > 
 > But #tempAt:put: is used by the debugger and friends (through clients)! Why
 > blame the debugger that it *must* use #tempAt:put: - what would the
 > Smalltalker do without it, have Java? </grin>

I never said we should remove #tempAt:put:. It is a fine and useful
tool. I just object to it being abused to break a language rule in
code that will be read by many people. I also object to it because it
is much slower than the equivalent bytecodes.

Your shadow code will allow your implementation to run but it will
slow down images when running on VMs without your primitive. 

 > > Re-calling a primitive from a return is close enough to re-entering
 > > it. That is something that hasn't been done. It is a major design
 > > change.
 > 
 > When you look at #commonReturn, all you see is that another context is
 > declared to be activeContext. This is independent of my change. This
 > has happened as-is since the Blue Book.

What does the Blue Book have to do with this?

 > But okay you want it to be a design change, I can live with compliments ;-)
 > 
 > > Your VM mod adds work to both common send and common return which
 > > every send is going to have to execute.
 > 
 > Right. That's the price to pay. What do you expect when adding new  
 > functionality to the VM, that the system runs faster? If so, you must pay  
 > the price.

You are not adding new functionality. This change is purely an
optimisation. Therefore it must speed up the system as a whole, not
just part of it.

 > > Do you have any measurements to show that any realistic code is
 > > spending enough time in do: loop overhead to justify slowing down
 > > all sends?
 > 
 > No, but since you are insisting on this all the time I expect that you  
 > post that.

I'm not proposing changing the VM or do:. I'm arguing for the status
quo, so the burden of proof is on you. You're proposing an
optimisation with high maintenance costs that replaces simple code
with very clever code, and that adds cost to message sends, which are
a very common operation. Such changes should be considered guilty
until proven innocent beyond any doubt.

 > > On my machine, I can do an interpreted message send every 291
 > > clocks. You are adding at least 11 instructions to the send return
 > > sequence which would take four clocks to execute at the peak execution
 > > rate
 > ... yes, I agree it has a price.
 > 
 > > The performance effects of the VM mod will depend on the architecture,
 > > the compiler, and how well the branch predictor manages on those two
 > > extra branches. Unfortunately, to prove there is not a speed loss, you
 > > are really going to need to test on many architectures and compilers
 > > under many different loads.
 > 
 > Since the VM is used on so many platforms, I do not see any problem  
 > getting this feedback.

The VM is also a mature, slow-moving piece of software that many
people rely on. VM bugs are painful, so VM changes are made very
conservatively. It will take several years for a new VM to enter
normal use after it has been released.

What are you proposing here? That we release this change and then
discover afterwards whether it is good or not? That we add yet another
optimisation that may have no noticeable benefit but carries a high
maintenance risk? For this change it's worse: it adds work to message
sends, and all high-level code has to bear that.

We already have too many optimisations that provide negligible gain.
The ifNotNil: bug earlier was a perfect example: there were two
implementations, and they got out of sync. In a normal system both
will be used. And while we're at it, #class should be a standard
primitive, not a bytecode, so that people can override it if they
wish. The VM change for #class is tiny: just reimplement the bytecode
to perform a send, then let that send execute the primitive.

 > > An out of order CPU may be able to hide the cost of the extra
 > > instructions behind the current flabby send and return code. An in
 > > order CPU will not be able do do this. So expect a greater performance
 > > loss on slower machines such as ARMs and other chips aimed at the
 > > hand-held and embedded market. Also the risks of a speed drop on a
 > > Pentium-M are greater that those on an Athlon, the Pentium-M manages
 > > to execute more instructions per clock when interpreting.
 > 
 > Bryce, aren't you exaggerating when you blame young, innocent
 > primitiveApplyToFromTo for performance loss for all these technical
 > reasons? A simple ABC analysis reveals that A (bytecode routines), B
 > (interpreter primitives) and C (message sends) have frequency A>>B>>C,
 > with >> the usual much greater than.

Where are the numbers? Where is the analysis?

You are not optimising primitive execution. You are optimising
do:. You can only gain noticeably if most of the time is spent in
do: overhead, not in the work that is done, and not waiting on
memory.

primitiveApplyToFromTo is more subtle than most of the primitives in
the VM. The only thing I can think of that is more subtle is exception
handling, and that provides useful functionality.
primitiveApplyToFromTo does not; all it can provide is performance. So
all its costs, including performance costs, must be justified by
performance arguments.

 > So if you want to save VM's performance, get rid of performance lost in  
 > bytecode routines, thereafter in interpreter primitives and thereafter in  
 > message sends - not first C then B then A. I think that this is what you  
 > aim for with Exupery?
 > 
 > > I calculated the clocks for a send from the clock speed 2.2 GHz and
 > > the sends/sec from tinyBenchmarks.
 > >
 > >     232,515,894 bytecodes/sec;  7,563,509 sends/sec
 > >
 > > Exupery's tiny benchmark numbers are:
 > >
 > >   1,151,856,017 bytecodes/sec; 16,731,576 sends/sec
 > 
 > Fascinating. Is this with or without primitiveApplyToFromTo compiled into
 > the VM?

Without the primitiveApplyToFromTo patch applied.

 > > So for Exupery a common case send costs 132 clocks. There is still
 > > plenty of room to remove waste from that. At 300 clocks, the cost is
 > > about 15% worst case and 1% best case without out of order execution
 > > being able to hide costs behind other delays. At 132 clocks, the
 > > numbers are much worse. There is a good chance that with more tuning
 > Exupery's sends may be reduced to about 60 clocks without
 > > inlining. VisualWorks sends cost 30 clocks. Optimise sends and the
 > > best optimisation will be to remove primitiveApplyToFromTo if it's
 > > not a net loss in performance now.
 > 
 > Will be to remove, Bryce?

I don't understand what you're asking here.


Personally, I'd solve the original problem by either leaving the
current implementation of occurrencesOf: alone or just using count: as
it stands. I'm concerned about making the core more complex, both in
the image and in the VM, for no real gain and possibly a loss in
performance.
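For reference, the count:-based alternative mentioned above is a
one-liner over the existing enumeration protocol (a sketch; Squeak's
shipped occurrencesOf: may instead use an explicit tally loop):

```smalltalk
occurrencesOf: anObject
	"Answer how many of the receiver's elements are equal to
	 anObject, expressed in terms of the existing count:
	 enumeration."
	^self count: [:each | each = anObject]
```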

Bryce


