VM performance discrepancy on Linux and Windows


Yoshiki Ohshima

9 Apr 2008

10:38 p.m.

Hello,

I had had some suspicions for a while, so we did a little test on a computer that dual-boots Windows XP and (Ubuntu) Linux, running tinyBenchmarks. (The computer happens to be a 1.8GHz Pentium-M Dell laptop.)

On Windows, 3.10.6 VM (pre-compiled one on the site) with etoys-dev.image, the result was:

311 million bytecodes/sec, 8.9 million sends/sec

On Linux, 3.9-8 VM (pre-compiled one on the site) with etoys-dev.image, the result was:

190 million bytecodes/sec, 5.7 million sends/sec

Have any of you experienced a similar gap? Has anybody looked at the generated code, or done any experiments recently?

-- Yoshiki


John M McIntosh

10 Apr 2008

1:21 a.m.

New subject: [squeak-dev] VM performance discrepancy on Linux and Windows

Ian was/is aware of the magic needed to make the compiler emit a more efficient set of assembler instructions for non-generic Intel CPU flavors. I'll assume it was not applied to the 3.9-8 VM you are using.


--
===========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================

Yoshiki Ohshima

11 Apr 2008

9:13 a.m.

New subject: [squeak-dev] VM performance discrepancy on Linux and Windows

Well,

So some god words came and now I'm looking at the assembly code...

Bottom line is: gcc 2.95.2 on my Linux makes the bytecode/sec count larger, but makes send/sec count smaller.

gcc 2.95.2 on Windows generates a code sequence for two bytecodes like this:

-------------------
69d8: 46                    inc    %esi
69d9: 0f b6 1e              movzbl (%esi),%ebx
69dc: 83 c7 04              add    $0x4,%edi
69df: a1 00 00 00 00        mov    0x0,%eax
69e4: 8b 40 08              mov    0x8(%eax),%eax
69e7: 89 07                 mov    %eax,(%edi)
69e9: ff 24 9d 80 27 00 00  jmp    *0x2780(,%ebx,4)
69f0: 46                    inc    %esi
69f1: 0f b6 1e              movzbl (%esi),%ebx
69f4: 83 c7 04              add    $0x4,%edi
69f7: a1 00 00 00 00        mov    0x0,%eax
69fc: 8b 40 0c              mov    0xc(%eax),%eax
69ff: 89 07                 mov    %eax,(%edi)
6a01: ff 24 9d 80 27 00 00  jmp    *0x2780(,%ebx,4)
-------------------

Apparently, %esi is used (exclusively) for IP, %ebx keeps the next byte, and "jmp *" takes you to the next location, which is stored in the table that starts at 0x2780.
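For readers who have not seen this dispatch style, here is a minimal sketch of the technique the listing appears to show: threaded dispatch through a table of label addresses, using the GNU C "labels as values" extension. It is not the Squeak interpreter. The three opcodes, the function name run_bytecodes and the stack size are made up for illustration. Each handler prefetches the next bytecode, does its work, and jumps through the table, which roughly corresponds to the inc / movzbl / add / jmp pattern above.

```c
#include <assert.h>

/* Hypothetical three-opcode instruction set, for illustration only. */
enum { OP_PUSH1 = 0, OP_ADD = 1, OP_HALT = 2 };

/* Runs `code` until OP_HALT and returns the value on top of the stack.
   Opcodes outside the table are not checked in this sketch. */
long run_bytecodes(const unsigned char *code)
{
    /* GNU C extension: &&label yields the address of a label. */
    static void *const jumpTable[] = { &&op_push1, &&op_add, &&op_halt };

    long stack[64];
    long *sp = stack;                 /* points at the current top slot */
    const unsigned char *ip = code;   /* instruction pointer */
    unsigned char currentBytecode = *ip;

    stack[0] = 0;                     /* sentinel returned for an empty stack */
    goto *jumpTable[currentBytecode];

op_push1:
    currentBytecode = *++ip;          /* prefetch the next bytecode */
    assert(sp < stack + 63);          /* guard against overflow */
    *++sp = 1;
    goto *jumpTable[currentBytecode]; /* indirect jump through the table */

op_add:
    currentBytecode = *++ip;
    assert(sp > stack + 1);           /* need two operands above the sentinel */
    sp[-1] += sp[0];
    --sp;
    goto *jumpTable[currentBytecode];

op_halt:
    return *sp;
}
```

Calling run_bytecodes on the sequence OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT should leave 2 on top of the stack. How tightly a given gcc version compiles the handlers is exactly what the two listings in this message compare.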

gcc 4.1.2 on Fedora Core 7 generates a code sequence for two bytecodes like this:

-------------------
efcf: 8d 46 01              lea    0x1(%esi),%eax
efd2: 0f b6 08              movzbl (%eax),%ecx
efd5: 89 c6                 mov    %eax,%esi
efd7: a1 40 00 00 00        mov    0x40,%eax
efdc: 8d 57 04              lea    0x4(%edi),%edx
efdf: 89 d7                 mov    %edx,%edi
efe1: 89 cb                 mov    %ecx,%ebx
efe3: 8b 40 2c              mov    0x2c(%eax),%eax
efe6: 89 02                 mov    %eax,(%edx)
efe8: 8b 04 8d 20 04 00 00  mov    0x420(,%ecx,4),%eax
efef: ff e0                 jmp    *%eax
eff1: 8d 46 01              lea    0x1(%esi),%eax
eff4: 0f b6 08              movzbl (%eax),%ecx
eff7: 89 c6                 mov    %eax,%esi
eff9: a1 40 00 00 00        mov    0x40,%eax
effe: 8d 57 04              lea    0x4(%edi),%edx
f001: 89 d7                 mov    %edx,%edi
f003: 89 cb                 mov    %ecx,%ebx
f005: 8b 40 30              mov    0x30(%eax),%eax
f008: 89 02                 mov    %eax,(%edx)
f00a: 8b 04 8d 20 04 00 00  mov    0x420(,%ecx,4),%eax
f011: ff e0                 jmp    *%eax
-------------------

Here %esi is mostly used for IP, but %eax is used for fetching the next byte. The jmp also seems to go through %eax, so right before the jump its previous value is stored and the destination address is brought into %eax.

I'd be surprised if this were optimized for a specific x86 variation. I copied the command line options from the Windows Makefile to Fedora:

-mpentium -mwindows -Werror-implicit-function-declaration -fomit-frame-pointer -funroll-loops -fschedule-insns2

and got an equally unsatisfying (slightly different) sequence.

Ok, so one thing to try was to install gcc 2.95.2 on Fedora Core 7 and compile the interpreter with it. The resulting assembly code is close to the one on Windows. The bytecode/sec count went up but send/sec went down. I have a feeling that I saw this before but of course cannot remember the exact conditions...

If somebody has a dual-boot machine and can compare 8 (or more) cases (namely, the combinations of Windows/Linux, 2.95.2/4.1.2, more options/fewer options), that would be great.

-- Yoshiki

Andreas Raab

9:22 a.m.

New subject: [squeak-dev] VM performance discrepancy on Linux and Windows

Yoshiki Ohshima wrote:

...

Apparently, %esi is used (exclusively) for IP, %ebx keeps the next byte, and "jmp *" takes you to the next location, which is stored in the table that starts at 0x2780.

All of that comes straight out of sqGnu.h:

#define BC_CASE(N) case N: _##N:
#define BC_BREAK goto *jumpTable[currentBytecode]

#if defined(__i386__)
# define IP_REG asm("%esi")
# define SP_REG asm("%edi")
# define CB_REG asm("%ebx")
#endif

You might want to check if the gnuifier got confused over time - I had to update it to deal correctly with sqInt etc. gnu-interp.c should look like this:

sqInt interpret(void) {
    sqInt localReturnValue;
    sqInt localReturnContext;
    sqInt localHomeContext;
    register char* localSP SP_REG;
    register char* localIP IP_REG;
    register sqInt currentBytecode CB_REG;
    BC_JUMP_TABLE;

    switch (currentBytecode) {
    BC_CASE(0)
        /* pushReceiverVariableBytecode */
        BC_BREAK;

...

Here %esi is mostly used for IP, but %eax is used for fetching the next byte. The jmp also seems to go through %eax, so right before the jump its previous value is stored and the destination address is brought into %eax.

Sounds more like the static register assignments get ignored.

Cheers, - Andreas
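As an illustration of the register assignments discussed in this message, the following is a small sketch of GNU C local register variables. It is not taken from sqGnu.h: the function sum_bytes is hypothetical, and the macros are stand-ins that expand to nothing on non-i386 targets so the example still compiles there. Note that the GCC manual only guarantees the requested register when such a variable is used as an operand of extended asm; elsewhere the compiler is free to keep the value in another register, which would be consistent with assignments that appear to be "ignored".

```c
#include <assert.h>

/* Stand-ins for the IP_REG / SP_REG macros quoted above (assumption:
   spelled with __asm__ so they also work in strict ISO modes). */
#if defined(__GNUC__) && defined(__i386__)
# define IP_REG __asm__("%esi")
# define SP_REG __asm__("%edi")
#else
# define IP_REG   /* no pinning on other targets */
# define SP_REG
#endif

/* Hypothetical helper: walks `count` bytes through a pinned pointer and
   accumulates them through a second pinned pointer. */
long sum_bytes(const unsigned char *bytes, int count)
{
    long total = 0;
    register const unsigned char *localIP IP_REG = bytes;
    register long *localSP SP_REG = &total;

    assert(count >= 0);
    while (count-- > 0)
        *localSP += *localIP++;   /* read via localIP, write via localSP */
    return total;
}
```

Compiling this with gcc -O2 -S on an i386 target and reading the generated assembly is one way to see whether a particular gcc version keeps localIP in %esi and localSP in %edi.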

John M McIntosh

9:42 a.m.

New subject: [squeak-dev] VM performance discrepancy on Linux and Windows

The sqGnu.h I have reads

#if defined(__i386__)
# define IP_REG asm("%esi")
# define SP_REG asm("%edi")
//# if (__GNUC__ > 2) || ((__GNUC__ == 2) && (__GNUC_MINOR__ >= 95))
# define CB_REG asm("%ebx")
//# else
//# define CB_REG /* avoid undue register pressure */
//# endif
#endif

The first two byte codes assemble to this when done right.

L10161:
    addl    $1, %esi
    movzbl  (%esi), %ebx
    addl    $4, %edi
    movl    _foo, %eax
    movl    84(%eax), %eax
    movl    4(%eax), %eax
    movl    %eax, (%edi)
    movl    512(%esp,%ebx,4), %eax
L10421:
    jmp     *%eax

L10162:
    addl    $1, %esi
    movzbl  (%esi), %ebx
    addl    $4, %edi
    movl    _foo, %eax
    movl    84(%eax), %eax
    movl    8(%eax), %eax
    movl    %eax, (%edi)
    movl    512(%esp,%ebx,4), %eax
    jmp     *%eax

sqInt interpret(void) {
#ifdef FOO_REG
    register struct foo * foo FOO_REG = &fum;
#endif
    sqInt localReturnValue;
    sqInt localReturnContext;
    sqInt localHomeContext;
    char* localSP;
    char* localIP;
    sqInt currentBytecode;
    JUMP_TABLE;

Plus use of -DUSE_INLINE_MEMORY_ACCESSORS

However, much of this also depends on the GCC version (in this case 4.01): usage of SP_REG etc. produced dreadful code with GCC 4.x, but was required for earlier versions.

I noted for PowerPC that building with GCC 3.3 produces better code than gcc 4.0, gcc 3.1 or gcc 2.95 (FYI, gcc 3.1 produces lousy code). But since you are building on Intel, your mileage will vary (lots).
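One way to act on the observation that explicit register assignments were needed for older compilers but hurt with GCC 4.x is to gate them on the compiler version. The sketch below is only an assumption about how that could be arranged. The PIN_INTERPRETER_REGS macro, the version cutoff and the push_bytes function are hypothetical and do not come from any released sqGnu.h.

```c
#include <assert.h>

/* Hypothetical policy: pin registers only for GCC older than 4 on i386.
   The cutoff should be chosen by measuring, not taken from this sketch. */
#if defined(__GNUC__) && (__GNUC__ < 4) && defined(__i386__)
# define PIN_INTERPRETER_REGS 1
# define IP_REG __asm__("%esi")
# define SP_REG __asm__("%edi")
#else
# define PIN_INTERPRETER_REGS 0
# define IP_REG   /* let the register allocator decide */
# define SP_REG
#endif

/* Reports which branch the preprocessor took. */
int interpreter_regs_pinned(void)
{
    return PIN_INTERPRETER_REGS;
}

/* Hypothetical helper: copies `count` bytes onto a stack of longs and
   returns how many slots were written. */
long push_bytes(const unsigned char *code, long *stack, int count)
{
    register const unsigned char *localIP IP_REG = code;
    register long *localSP SP_REG = stack;
    int i;

    assert(count >= 0);
    for (i = 0; i < count; i++)
        *localSP++ = *localIP++;
    return (long)(localSP - stack);
}
```

With this arrangement the same source builds with or without pinning, so the two variants can be benchmarked per compiler version before deciding which one to ship.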



vm-dev@lists.squeakfoundation.org
