Report from a novice VM h4x0r.

Alan Grimes alangrimes at starpower.net
Thu Apr 1 00:39:41 UTC 2004


Tim Rowledge wrote:

>>Furthermore, making all integer values word-aligned (are they already?) 
>>will tremendously improve memory access times on many architectures.
>>    
>>
>I can't think of any non-aligned integers in use. 
>  
>

So much the better then...

>>2. While the construct:  
>>
>>| foo |
>>foo _ self bar: baz
>>foo = bat ifTrue: [Mars invade].
>>
>>will significantly help the inliner when bar is a non-trivial method, it 
>>becomes counterproductive when bar happens to be "integerAt:" (which it 
>>is in several places), which is actually a compiler macro.
>>    
>>
>How so? I can see it won't be much of a help, but how is it a problem?
>  
>

In some cases it is. I've been looking carefully at the generated C 
code (using a command similar to "diff oldInterp.c newinterp.c | less").

When the inliner sees that construct and the method is a one-liner, it 
will be inlined. However, if the method is non-trivial, the inliner 
won't create a construct like

[text of bar]
int temp1 = result of bar;
baz(temp1);

It will simply write:

baz(bar());

This isn't necessarily bad C, but it does slow things down sometimes.
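Roughly, the two shapes look like this (bar, baz, and the value 42 are 
just made-up stand-ins, not code from the generated interpreter):

#include <stdio.h>

/* Made-up stand-ins for generated interpreter functions. */
static int bar(void)   { return 42; }
static void baz(int x) { printf("%d\n", x); }

int main(void)
{
    /* Shape A: the callee's result is bound to a temporary first,
       which is what a textually expanded body would look like. */
    int temp1 = bar();
    baz(temp1);

    /* Shape B: the call is simply nested, which is what the code
       generator falls back to when bar is non-trivial. */
    baz(bar());
    return 0;
}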

>>5. I started experimenting with replacing long-coded block moves with 
>>calls to C's "memmove" and "memcpy" where appropriate. The motivation 
>>is that the 386 has an ungodly fast "rep movsb" compound instruction 
>>which does the operation at the speed of the FSB. It would be even 
>>better to use an inline assembly command for this, but to remain 
>>portable I started trying to insert the C library calls mentioned.
>>    
>>
>Inline assembler and portable in one sentence? The only way is "inline
>assembler is not portable". And don't forget that x86 is not the only
>architecture and not all optimisations apply even across all x86 chips.
>You'd probably be horrified at how much it can take to cope with all
>that crap.
>  
>

I know very well. I taught myself assembler back in high school, and I 
used to be in the OS development business. I *COULD* have written it in 
assembler, but I realized that it would be much too hard to maintain and 
enhance. I hit roadblocks when I couldn't figure out how to run compiled 
binaries...

Moving memory makes for a very interesting story!

On 286 and earlier machines (ones we don't care about these days), the 
fastest way to move memory was to program one of the secondary chips on 
the motherboard, the DMA controller.

The 386 added the "rep movsd" instruction, which let it move data 
across its entire 32-bit bus while the DMA controller was still 16 bits 
wide.

That was optimal until the Pentium came along; for a while it seemed 
that the FPU was the fastest way to move data across its 64-bit bus, 
and when MMX arrived it became the preferred method...

While the C library function _SHOULD_ be the fastest, it involves a 
library call... That means it doesn't start paying dividends until that 
call overhead has been amortized... There are ways to make calls very 
fast, for example by saving only the registers you actually need (and, 
even better, by passing the parameters in the registers the callee is 
most likely to want them in, such as placing the number of bytes to 
transfer in ECX). (A system/compiler issue.)

Squeak seems to use a lot of relatively short moves, so maybe the 
library call is generally not worth it... Only building both versions 
and testing them can answer that...
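Something like the following is what I have in mind -- a hypothetical 
helper with a made-up threshold, assuming the regions don't overlap, 
not Squeak's actual code:

#include <string.h>
#include <stddef.h>

/* COPY_THRESHOLD is an assumed cut-over point: below it a plain loop
   avoids the library-call overhead, above it memcpy's tuned block
   move should win.  Only measurement can pick the real value. */
#define COPY_THRESHOLD 32  /* in words; assumed */

static void copyWords(long *dst, const long *src, size_t nWords)
{
    size_t i;

    if (nWords < COPY_THRESHOLD) {
        for (i = 0; i < nWords; i++)      /* short move: stay out of libc */
            dst[i] = src[i];
    } else {
        memcpy(dst, src, nWords * sizeof(long));  /* long move: let libc do it */
    }
}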

As for JM's comment: I don't see why a natively compiled library would 
need to re-configure itself for each run... At compile time I would 
determine what chip I was targeting, use the MMX registers for the bulk 
of the move, and do whatever was left with integer ops... OTOH, I've 
seen some baaaaaaaaaad code come out of the non-Squeak community. 
Everyone seems to be focused on the kernel, which probably accounts for 
less than 10% of an average program, while the core libraries and 
compilers, which affect every damn line of code, go relatively 
unnoticed... =\
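For example, compile-time selection could be as dumb as a preprocessor 
switch (the __MMX__ macro is GCC's; picking memcpy as the fast path and 
the slowWordCopy fallback are just my assumptions here, not what the 
Squeak VM actually does):

#include <string.h>
#include <stddef.h>

/* Portable fallback: a plain byte loop, assumed here for illustration. */
static void slowWordCopy(void *dst, const void *src, size_t nBytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (nBytes--) *d++ = *s++;
}

/* Pick the bulk-move strategy when the VM is compiled instead of
   probing the CPU at run time. */
#if defined(__MMX__)
#  define BULK_MOVE(dst, src, nBytes)  memcpy((dst), (src), (nBytes))
#else
#  define BULK_MOVE(dst, src, nBytes)  slowWordCopy((dst), (src), (nBytes))
#endif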

>>7. The compiler emits many unnecessary gotos.
>>    
>>
>Part of the not-terribly neat inlining we do. I have an untested theory
>that it would be better for us to NOT textually inline but mark the
>code with _inline_ and see what CC will do for us.
>  
>

YES!!! Excellent idea!
I hadn't gotten around to looking at the cCompiler yet, but having it 
simply pass the inline/no-inline specifier already present on many of 
the class's methods through to the host compiler should be 
user-configurable behavior... (It apparently is, but I don't think it 
is currently capable of emitting "inline".) "inline" is a C99 addition 
that many earlier compilers already supported...
=)
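Something along these lines -- fetchByte and dispatchBytecode are 
made-up names, not the real generated functions -- where the 
would-be-inlined method is emitted as an ordinary function marked 
"inline" and the host compiler decides:

/* The callee comes out as a normal function marked for inlining... */
static inline int fetchByte(const unsigned char *ip)
{
    return *ip;
}

/* ...and the caller just calls it; the C compiler chooses whether to
   expand it in place. */
int dispatchBytecode(const unsigned char *ip)
{
    return fetchByte(ip);
}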

>>8. The compiler will compile:
>>
>>foo _ foo + 1.
>>
>>as foo += 1.
>>    
>>
>Again, very architecture dependent. And the idiom works for other
>values than 1..
>  
>

ALL YOUR CHIP ARE BELONG TO INTEL. (I can't afford better hardware.)

>>13. The add opcode seems to attempt the add without changing the stack 
>>pointer and change it only after succeeding. The logic operations (and 
>>some others) will change the stack pointer 3 times on success or 
>>failure! (doing two pops followed by a push)
>>    
>>
>Not so bad; look at the actual implementation of pop:thenPush: for
>example and recall that our SP is _not_ the C sp. We don't actually
>push and pop here.
>  
>

Of course...
In fact, I was thinking about how this code would run on a FORTH chip 
such as the ones at www.ultratechnology.com ;)

Even if it isn't a big improvement, every little bit helps, especially 
for the core opcode set... Furthermore, it's not good to have two 
separate sets of functions that do roughly the same thing in what 
should be a tight-loop program; the issue is cache optimization and 
bandwidth utilization.
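To sketch the shape of the difference on a simulated stack (this is not 
the C stack and not the real Squeak macros; bitAnd and the names are 
made up):

typedef long oop;
static oop *sp;   /* simulated stack pointer; points at the top element */

/* Three separate adjustments: pop, pop, push. */
static void bitAndNaive(void)
{
    oop b = *sp--;
    oop a = *sp--;
    *++sp = a & b;
}

/* One adjustment, in the spirit of pop:thenPush:: read both operands,
   overwrite the lower one with the result, then move sp once. */
static void bitAndCombined(void)
{
    oop b = sp[0];
    oop a = sp[-1];
    sp[-1] = a & b;
    sp -= 1;
}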

I consider all of these "atomic" operations hyper-time-critical, and 
absolutely anything that might improve the code, even hypothetically, 
should be tried...

Naturally, the syscall bytecodes are much more tolerant of sub-optimal 
code.

>>14. Many of the opcodes that access the stack without pushing will use 
>>the method "stackValue: 0" instead of the cleaner "stackTop". This 
>>probably won't affect the binary, but it adds a lot of constant 
>>arithmetic to the generated C code. It also indicates that the 
>>cCompiler relies on the native compiler too much to optimize out 
>>constant arithmetic.
>>    
>>
>I've cleaned out at least some of these recently.
>  
>

The best way to find them is simply to write the expression in a 
workspace and then use "find all method source with it" -- caution: 
plugin code using "InterpreterProxy" will not be found this way, even 
when the method is added to said class...
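For what it's worth, the two expansions look roughly like this 
(hypothetical shapes, not copied from interp.c): "stackValue: 0" drags 
a constant zero offset into the output for the native compiler to fold 
away, while "stackTop" is a plain dereference:

static char *stackPointer;   /* stand-in for the interpreter's simulated SP */

static long viaStackValue0(void)
{
    return *(long *)(stackPointer - 0 * sizeof(long));   /* stackValue: 0 */
}

static long viaStackTop(void)
{
    return *(long *)stackPointer;                         /* stackTop */
}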

>If we could really rely on having C99 compilers at all times this might
>be interesting.
>  
>

Do it anyway and see who screams. ;)


