Hi list, I've been playing around with Exupery, and now I have a few questions:
1) I can't get tinyBenchmarks working, in either Linux or Windows.
I downloaded everything from: http://wiki.squeak.org/squeak/Installing+Exupery
I used: http://ftp.squeak.org/Exupery/vms/exupery-vm-0.11-linux.tz in Linux and: http://ftp.squeak.org/Exupery/vms/exupery-vm-0.11-win32.zip in Windows
with the prebuilt image: http://ftp.squeak.org/Exupery/images/exupery-0.10.tz
The examples run OK, but when I try to run tinyBenchmarks I get segmentation faults.
2) I tried tinyBenchmarks in VisualWorks (NonCommercial 7.4.1) on my machine and got: '652,229,299 bytecodes/sec; 89,016,165 sends/sec'
Does anyone know why I get almost 90 million sends/sec? I think it's quite a big difference from previous versions of VW.
3) I saw that the primitives for #at: and #at:put: are getting inlined, but I think they are only implemented for variable objects (not for bytes, Characters, or anything else). Is that true?
4) In my experiments with Exupery, I get an error if I inline too many methods. I think I am running out of machine registers, for example when I try to compile Integer-#digitDiv:reg:. I get this error in the ColouringRegisterAllocator phase, but it is not a "You don't have more registers, dude" kind of error. Is the "no more registers" situation taken into consideration?
5) Is there a way to implement indirect jump tables in Exupery?
Thanks a lot. Cheers, Guille
Guillermo Adrián Molina writes:
Hi list, I've been playing around with Exupery, and now I have a few questions:
- I can't get tinyBenchmarks working, in either Linux or Windows.
I downloaded everything from: http://wiki.squeak.org/squeak/Installing+Exupery
I used: http://ftp.squeak.org/Exupery/vms/exupery-vm-0.11-linux.tz in Linux and: http://ftp.squeak.org/Exupery/vms/exupery-vm-0.11-win32.zip in Windows
with the prebuilt image: http://ftp.squeak.org/Exupery/images/exupery-0.10.tz
The examples run OK, but when I try to run tinyBenchmarks I get segmentation faults.
Try using the 0.11 Exupery VM with Exupery 0.11. Exupery VMs must match the Exupery version. The interface between Exupery and the VM is still evolving.
- I tried tinyBenchmarks in VisualWorks (NonCommercial 7.4.1) on my machine and got: '652,229,299 bytecodes/sec; 89,016,165 sends/sec'
Does anyone know why I get almost 90 million sends/sec? I think it's quite a big difference from previous versions of VW.
- I saw that the primitives for #at: and #at:put: are getting inlined, but I think they are only implemented for variable objects (not for bytes, Characters, or anything else). Is that true?
It's true. #at: and #at:put: are only implemented for variable objects. I should write primitives for other types. Good benchmarks that demonstrate the need for such primitives would be nice.
- In my experiments with Exupery, I get an error if I inline too many methods. I think I am running out of machine registers, for example when I try to compile Integer-#digitDiv:reg:. I get this error in the ColouringRegisterAllocator phase, but it is not a "You don't have more registers, dude" kind of error. Is the "no more registers" situation taken into consideration?
I'd guess that it was because a variable was live at an entry point. There's a stack tracing bug which I'm just fixing that could have caused that.
I use the liveness analyser in the register allocator to catch compiler bugs. It's much nicer to catch them there than with crashes.
- Is there a way to implement indirect jump tables in exupery?
It would be possible. I do use indirect jumps for returns to compiled methods. If you look at any method you should see at least one indirect jump in the return code. Just jump to a register.
Bryce
Hi there! Thanks for the answers, I found them very useful. I have a few more questions.
Guillermo Adrián Molina writes:
Hi list, I've been playing around with Exupery, and now I have a few questions:
- I can't get tinyBenchmarks working, in either Linux or Windows.
I downloaded everything from: http://wiki.squeak.org/squeak/Installing+Exupery
I used: http://ftp.squeak.org/Exupery/vms/exupery-vm-0.11-linux.tz in Linux and: http://ftp.squeak.org/Exupery/vms/exupery-vm-0.11-win32.zip in Windows
with the prebuilt image: http://ftp.squeak.org/Exupery/images/exupery-0.10.tz
The examples run OK, but when I try to run tinyBenchmarks I get segmentation faults.
Try using the 0.11 Exupery VM with Exupery 0.11. Exupery VMs must match the Exupery version. The interface between Exupery and the VM is still evolving.
OK, I tried that and it worked:
668407310 bytecodes/sec; 13559830 sends/sec
760772659 bytecodes/sec; 13803237 sends/sec
777524677 bytecodes/sec; 12762744 sends/sec
760772659 bytecodes/sec; 13834279 sends/sec
775757575 bytecodes/sec; 13569800 sends/sec
I read something about Intel being faster than AMD for Exupery. Do you know why that is?
- I tried tinyBenchmarks in VisualWorks (NonCommercial 7.4.1) on my machine and got: '652,229,299 bytecodes/sec; 89,016,165 sends/sec'
Does anyone know why I get almost 90 million sends/sec? I think it's quite a big difference from previous versions of VW.
- I saw that the primitives for #at: and #at:put: are getting inlined, but I think they are only implemented for variable objects (not for bytes, Characters, or anything else). Is that true?
It's true. #at: and #at:put: are only implemented for variable objects. I should write primitives for other types. Good benchmarks that demonstrate the need for such primitives would be nice.
I'll try to check that, thanks.
- In my experiments with Exupery, I get an error if I inline too many methods. I think I am running out of machine registers, for example when I try to compile Integer-#digitDiv:reg:. I get this error in the ColouringRegisterAllocator phase, but it is not a "You don't have more registers, dude" kind of error. Is the "no more registers" situation taken into consideration?
I'd guess that it was because a variable was live at an entry point. There's a stack tracing bug which I'm just fixing that could have caused that.
I use the liveness analyser in the register allocator to catch compiler bugs. It's much nicer to catch them there than with crashes.
Yes, I've seen that kind of error (variable live at entry point) and corrected it by initializing temps with nil. I think this is something different. In this method of the ColouringRegisterAllocator:
findNodeToSpill
    | spillable |
    "This is just a basic heuristic, spill the register that interferes with the most other registers. It is possible to do a lot better. The heuristic should consider how much each register is used while it is alive"
    spillable := spillWorklist select: [:each | ((self hasSpill: each register) not) and: [each register isMachineRegister not]].
    spillable := spillable asSortedCollection: [:a :b | a spillWeight > b spillWeight].
    ^ spillable first
After compiling lots of methods using Exupery, it fails with very big methods because spillable ends up empty, and spillable first throws an error. If I do less inlining (for example, not inlining divisions and multiplications), it compiles OK! Any ideas?
- Is there a way to implement indirect jump tables in exupery?
It would be possible. I do use indirect jumps for returns to compiled methods. If you look at any method you should see at least one indirect jump in the return code. Just jump to a register.
Yes, I checked that, but I still need to initialize that register with the right block, and I need to do that without using Jcc (conditional jumps) to choose the right one. Any suggestions?
Bryce _______________________________________________ Exupery mailing list Exupery@lists.squeakfoundation.org http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
Thanks a lot cheers, Guille
Guillermo Adrián Molina writes:
OK, I tried that and it worked:
668407310 bytecodes/sec; 13559830 sends/sec
760772659 bytecodes/sec; 13803237 sends/sec
777524677 bytecodes/sec; 12762744 sends/sec
760772659 bytecodes/sec; 13834279 sends/sec
775757575 bytecodes/sec; 13569800 sends/sec
I read something about Intel being faster than AMD for Exupery. Do you know why that is?
Exupery was much faster than the interpreter on Pentium 4s. That's because the Pentium 4 is an inefficient chip to run the interpreter on.
Those comparisons are rather old now. Hardware has moved on and so has Exupery. Benchmarking now with bigger suites may show different numbers.
- In my experiments with Exupery, I get an error if I inline too many methods. I think I am running out of machine registers, for example when I try to compile Integer-#digitDiv:reg:. I get this error in the ColouringRegisterAllocator phase, but it is not a "You don't have more registers, dude" kind of error. Is the "no more registers" situation taken into consideration?
I'd guess that it was because a variable was live at an entry point. There's a stack tracing bug which I'm just fixing that could have caused that.
I use the liveness analyser in the register allocator to catch compiler bugs. It's much nicer to catch them there than with crashes.
Yes, I've seen that kind of error (variable live at entry point) and corrected it by initializing temps with nil. I think this is something different. In this method of the ColouringRegisterAllocator:
findNodeToSpill
    | spillable |
    "This is just a basic heuristic, spill the register that interferes with the most other registers. It is possible to do a lot better. The heuristic should consider how much each register is used while it is alive"
    spillable := spillWorklist select: [:each | ((self hasSpill: each register) not) and: [each register isMachineRegister not]].
    spillable := spillable asSortedCollection: [:a :b | a spillWeight > b spillWeight].
    ^ spillable first
After compiling lots of methods using Exupery, it fails with very big methods because spillable ends up empty, and spillable first throws an error. If I do less inlining (for example, not inlining divisions and multiplications), it compiles OK! Any ideas?
I'd guess it's a limit with the register allocator. It is possible that it can fail to find a register to spill when it needs to spill something. Given this bug will not cause crashes or incorrect execution it's not high priority.
- Is there a way to implement indirect jump tables in exupery?
It would be possible. I do use indirect jumps for returns to compiled methods. If you look at any method you should see at least one indirect jump in the return code. Just jump to a register.
Yes, I checked that, but I still need to initialize that register with the right block, and I need to do that without using Jcc (conditional jumps) to choose the right one. Any suggestions?
Exupery also can get the address of a block. That's also done in the send code to save the compiled program counter. The compiled program counter is the address of the machine code block to return to encoded as a SmallInteger. Return blocks are aligned to 2 byte boundaries to allow for tagging. That's enough to build an indirect jump table if you wanted to do that.
Why do you need to build an indirect jump table? What are you trying to do?
Bryce
Guillermo Adrián Molina writes:
OK, I tried that and it worked:
668407310 bytecodes/sec; 13559830 sends/sec
760772659 bytecodes/sec; 13803237 sends/sec
777524677 bytecodes/sec; 12762744 sends/sec
760772659 bytecodes/sec; 13834279 sends/sec
775757575 bytecodes/sec; 13569800 sends/sec
I read something about Intel being faster than AMD for Exupery. Do you know why that is?
Exupery was much faster than the interpreter on Pentium 4s. That's because the Pentium 4 is an inefficient chip to run the interpreter on.
Those comparisons are rather old now. Hardware has moved on and so has Exupery. Benchmarking now with bigger suites may show different numbers.
- In my experiments with Exupery, I get an error if I inline too many methods. I think I am running out of machine registers, for example when I try to compile Integer-#digitDiv:reg:. I get this error in the ColouringRegisterAllocator phase, but it is not a "You don't have more registers, dude" kind of error. Is the "no more registers" situation taken into consideration?
I'd guess that it was because a variable was live at an entry point. There's a stack tracing bug which I'm just fixing that could have caused that.
I use the liveness analyser in the register allocator to catch compiler bugs. It's much nicer to catch them there than with crashes.
Yes, I've seen that kind of error (variable live at entry point) and corrected it by initializing temps with nil. I think this is something different. In this method of the ColouringRegisterAllocator:
findNodeToSpill
    | spillable |
    "This is just a basic heuristic, spill the register that interferes with the most other registers. It is possible to do a lot better. The heuristic should consider how much each register is used while it is alive"
    spillable := spillWorklist select: [:each | ((self hasSpill: each register) not) and: [each register isMachineRegister not]].
    spillable := spillable asSortedCollection: [:a :b | a spillWeight > b spillWeight].
    ^ spillable first
After compiling lots of methods using Exupery, it fails with very big methods because spillable ends up empty, and spillable first throws an error. If I do less inlining (for example, not inlining divisions and multiplications), it compiles OK! Any ideas?
I'd guess it's a limit with the register allocator. It is possible that it can fail to find a register to spill when it needs to spill something. Given this bug will not cause crashes or incorrect execution it's not high priority.
- Is there a way to implement indirect jump tables in exupery?
It would be possible. I do use indirect jumps for returns to compiled methods. If you look at any method you should see at least one indirect jump in the return code. Just jump to a register.
Yes, I checked that, but I still need to initialize that register with the right block, and I need to do that without using Jcc (conditional jumps) to choose the right one. Any suggestions?
Exupery also can get the address of a block. That's also done in the send code to save the compiled program counter. The compiled program counter is the address of the machine code block to return to encoded as a SmallInteger. Return blocks are aligned to 2 byte boundaries to allow for tagging. That's enough to build an indirect jump table if you wanted to do that.
Yes, I also noticed that, using MedAddress, right? Forgive me, but I still don't get the point. For example:
    MedMov from: (MedAddress addressOf: blockN) to: aMedReg
    MedJump type: #jmp target: aMedReg
    block1:
        do something1
        jmp end
    block2:
        do something2
        jmp end
    block3:
        do something3
    end:
This could be a jump table, but I still need to select which block to jump to. The only way of selecting the block that I can imagine is nesting compares, something with jumps like:
    MedJump type: #jc target: aLabel instruction: (MedComparision operator: #bitTest arg1: aMed arg2: (MedLiteral literal: 0))
But I want to implement a jump table to avoid conditional branching.
Why do you need to build an indirect jump table? What are you trying to do?
I am implementing a Smalltalk. It compiles directly to machine code with Exupery. The last time I asked the list something I was just starting to use Exupery. Now I am almost done with that (without many optimizations), and I am doing unit testing right now. My first mail to the list asked what would be the best way to implement a new Smalltalk, so in my implementation I use:
- 0 tagged ints.
- A simple (and a little fat) object memory.
- A very straightforward send mechanism (with the C calling convention for calling methods).
- No contexts, but BlockClosures (frames are the same as in C; the C compiler does not differentiate C code from ST code).
I compile the ST code from .st files to .s (assembler) using SmaCC, RefactoryBrowser, and then Exupery; I still need Squeak in order to run all that. I only use the bottom layer of Exupery (I do not use the IntermediateXXXXXX classes). I implemented the cmovxx instruction in Exupery because it is very useful. But I need jump tables to implement, for example, faster versions of ifTrue:ifFalse:, and a lot of other things. This could lead to faster results. Right now I am getting (on the same machine) for tinyBenchmarks:
Squeak: 172043010 bytecodes/sec; 5468700 sends/sec
Squeak/Exupery: 775757575 bytecodes/sec; 13569800 sends/sec
myST/Exupery: 1072251308 bytecodes/sec; 36056442 sends/sec
Cheers Guille
Why do you need to build an indirect jump table? What are you trying to do?
I am implementing a smalltalk. It compiles directly to machine code, with exupery. The last time I asked something to the list I was starting to use exupery. Now I am almost done with that (without many optimizations). I am doing unit testing right now. My first mail to the list asked what would be the best to implement a new st, so, in my implementation I use: 0 tagged ints. A simple (and a little fat) object memory. A very straightforward send mechanism (with C calling convention for calling methods). No contexts, but using BlockClosures (frames are the same as in C, the C compiler does not differentiate C code from ST code).
Hi Guille, I don't get something here. If you are using Exupery to generate asm code why are you talking about a C compiler?
I compile the ST code from .st files to .s (assembler) using SmaCC, RefactoryBrowser, and then exupery, I still need squeak in order to run all that. I only use the bottom layer of exupery, (does not use IntermediateXXXXXX classes) I implemented the cmovxx instruction in exupery, because it is very useful. But I need jump tables to implement for example, faster versions of ifTrue:ifFalse:, and a lot of other things. This could lead to faster results. Right Now I am getting (with the same machine), tinyBenchmarks: Squeak: 172043010 bytecodes/sec; 5468700 sends/sec Squeak/Exupery: 775757575 bytecodes/sec; 13569800 sends/sec. myST/Exupery: 1072251308 bytecodes/sec; 36056442 sends/sec
Those are numbers!
Cheers,
Sebastian
Guillermo Adrián Molina writes:
Exupery also can get the address of a block. That's also done in the send code to save the compiled program counter. The compiled program counter is the address of the machine code block to return to encoded as a SmallInteger. Return blocks are aligned to 2 byte boundaries to allow for tagging. That's enough to build an indirect jump table if you wanted to do that.
Yes, I also noticed that, using MedAddress, right? Forgive me, but I still don't get the point. For example:
MedAddress is a literal that represents the address of a block. In Exupery it gets relocated to be the block's actual address.
You could write now:
    (jmp (mem (add (MedAddress blockWithTable) (sar anIndex 2))))
The only thing missing is a way to produce a block that contains just literals. In your case, a block that contains MedAddresses.
The MedAddress should be translated into a label referring to the block.
Exupery currently does not have blocks that contain literals but it shouldn't be too hard to add.
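For comparison, the table-of-addresses idea can be sketched in plain C, outside Exupery. This is only an illustration under assumptions: the handler names, `jump_table`, and `dispatch` are hypothetical, and an array of function pointers stands in for a block that contains only MedAddress literals. The index selects the target with one bounds check and one indirect call, rather than a chain of compares.

```c
#include <stddef.h>

/* Hypothetical handlers standing in for block1, block2 and block3. */
int do_something1(int x) { return x + 1; }
int do_something2(int x) { return x * 2; }
int do_something3(int x) { return -x; }

typedef int (*handler_fn)(int);

/* The "block that contains only addresses": a table of code pointers. */
static const handler_fn jump_table[] = {
    do_something1, do_something2, do_something3
};

#define TABLE_SIZE (sizeof jump_table / sizeof jump_table[0])

/* Pick the target by index and call through the table.
   Returns 0 on success and -1 if the index is out of range. */
int dispatch(size_t index, int arg, int *result)
{
    if (index >= TABLE_SIZE)
        return -1;
    *result = jump_table[index](arg);
    return 0;
}
```

Calling `dispatch(1, 10, &r)` runs the second handler through the table; an out-of-range index is rejected instead of jumping to a bad address.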
I am implementing a smalltalk. It compiles directly to machine code, with exupery. The last time I asked something to the list I was starting to use exupery. Now I am almost done with that (without many optimizations). I am doing unit testing right now.
Interesting, what is the goal of your new Smalltalk? What are you trying to do better than the other dialects or is this purely for enjoyment?
Bryce
Guillermo Adrián Molina writes:
After compiling lots of methods using Exupery, it fails with very big methods because spillable ends up empty, and spillable first throws an error. If I do less inlining (for example, not inlining divisions and multiplications), it compiles OK! Any ideas?
I'd guess it's a limit with the register allocator. It is possible that it can fail to find a register to spill when it needs to spill something. Given this bug will not cause crashes or incorrect execution it's not high priority.
If you want to fix that limit in the register allocator I could give you some pointers. The problem is due to how the problem is broken down into stages. I'd need to dig through code to remember the details though.
I'm planning on working on the register allocator in the next release. The goal will be making it faster; it has a few serious performance problems.
Bryce
Guillermo Adrián Molina writes:
After compiling lots of methods using Exupery, it fails with very big methods because spillable ends up empty, and spillable first throws an error. If I do less inlining (for example, not inlining divisions and multiplications), it compiles OK! Any ideas?
I'd guess it's a limit with the register allocator. It is possible that it can fail to find a register to spill when it needs to spill something. Given this bug will not cause crashes or incorrect execution it's not high priority.
If you want to fix that limit in the register allocator I could give you some pointers. The problem is due to how the problem is broken down into stages. I'd need to dig through code to remember the details though.
Yes, I do. Please let me know where to start.
I'm planning on working on the register allocator in the next release. The goal will be making it faster; it has a few serious performance problems.
Exupery's compile time is not a problem for me. But maybe I have to wait for you to finish with the register allocator in order to try to fix the limit. Please let me know what you want me to do. Right now, I have already finished with unit testing. The next thing I will do is to include all the compiler classes in my project (remember that right now, that is done in Squeak); maybe it would be convenient for me to wait for 0.12 before I do that.
Another thing: do you want the code I made for cmovxx?
Cheers Guille.
Guillermo Adrián Molina writes:
If you want to fix that limit in the register allocator I could give you some pointers. The problem is due to how the problem is broken down into stages. I'd need to dig through code to remember the details though.
Yes, I do. Please let me know where to start.
If it's not an urgent problem then it may be better to wait until after 0.13. Or to look at the register allocator during 0.13 development.
Have a look at the stages of simplification. They're done in:
ColouringRegisterAllocator>>processWorkLists
    simplifyWorklist isEmpty ifFalse: [^ self simplify].
    self coalesce ifTrue: [^ self].
    self freeze ifTrue: [^ self].
    spillWorklist isEmpty ifFalse: [^ self spillRegister].
    self spillMove
That method sets the steps for processing. However, the spill worklist has some registers on it that shouldn't be spilled, so it tries to select a register to spill. It discards all registers, then fails.
I'd see if there are any moves that might be spilled afterwards, if so, then all you'd need to do is allow spillRegister to fail gracefully.
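The failure mode and a graceful fallback can be sketched in C. This is not Exupery's code: `Node`, its fields, and `find_node_to_spill` are hypothetical names that mirror findNodeToSpill, filtering out nodes that already have a spill or are machine registers and then picking the highest spill weight. When the filter discards every node it returns NULL, so a caller could move on to the next stage instead of raising an error.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry on the spill worklist. */
typedef struct {
    int  id;
    int  spill_weight;
    bool has_spill;           /* already spilled once */
    bool is_machine_register; /* precoloured, must not be spilled */
} Node;

/* Returns the spillable node with the highest weight,
   or NULL when every node on the worklist is filtered out. */
const Node *find_node_to_spill(const Node *worklist, size_t count)
{
    const Node *best = NULL;
    for (size_t i = 0; i < count; i++) {
        const Node *n = &worklist[i];
        if (n->has_spill || n->is_machine_register)
            continue; /* same filter as the select: block */
        if (best == NULL || n->spill_weight > best->spill_weight)
            best = n;
    }
    return best;
}
```

Whether falling through to the move-spilling stage is always safe is the open question raised above; the sketch only shows the shape of a selection that cannot throw.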
I'm planning on working on the register allocator in the next release. The goal will be making it faster, it has a few serious performance problems.
Exupery's compile time is not a problem for me. But maybe I have to wait for you to finish with the register allocator in order to try to fix the limit. Please let me know what you want me to do. Right now, I have already finished with unit testing. The next thing I will do is to include all the compiler classes in my project (remember that right now, that is done in Squeak); maybe it would be convenient for me to wait for 0.12 before I do that.
Another thing, Do you want the code I made for cmovxx?
I'm interested.
Does it have unit test coverage? Exupery development relies on testing so that's required.
When was cmov introduced? I know it was a long time ago but can't remember precisely when. What I'm concerned about is making Exupery incompatible with some chips that might still be in use.
Given adequate test coverage I'll add it.
Bryce
Guillermo Adrián Molina writes:
If you want to fix that limit in the register allocator I could give you some pointers. The problem is due to how the problem is broken down into stages. I'd need to dig through code to remember the details though.
Yes, I do. Please let me know where to start.
If it's not an urgent problem then it may be better to wait until after 0.13. Or to look at the register allocator during 0.13 development.
Have a look at the stages of simplification. They're done in:
ColouringRegisterAllocator>>processWorkLists
    simplifyWorklist isEmpty ifFalse: [^ self simplify].
    self coalesce ifTrue: [^ self].
    self freeze ifTrue: [^ self].
    spillWorklist isEmpty ifFalse: [^ self spillRegister].
    self spillMove
That method sets the steps for processing. However, the spill worklist has some registers on it that shouldn't be spilled, so it tries to select a register to spill. It discards all registers, then fails.
I'd see if there are any moves that might be spilled afterwards, if so, then all you'd need to do is allow spillRegister to fail gracefully.
Ok, I will try to see what is happening. Is there any hard limit (besides the number of available registers in x86 arch)?
I'm planning on working on the register allocator in the next release. The goal will be making it faster; it has a few serious performance problems.
Exupery's compile time is not a problem for me. But maybe I have to wait for you to finish with the register allocator in order to try to fix the limit. Please let me know what you want me to do. Right now, I have already finished with unit testing. The next thing I will do is to include all the compiler classes in my project (remember that right now, that is done in Squeak); maybe it would be convenient for me to wait for 0.12 before I do that.
Another thing, Do you want the code I made for cmovxx?
I'm interested.
Does it have unit test coverage? Exupery development relies on testing so that's required.
Not right now. I will work on that later, and when I have it I will send it to you.
When was cmov introduced? I know it was a long time ago but can't remember precisely when. What I'm concerned about is making Exupery incompatible with some chips that might still be in use.
Intel's optimization manual says that cmov was introduced with the Pentium Pro, and AMD's optimization manual says that cmov is available from the Athlon onward. I actually didn't investigate that thoroughly. The fact is that any modern computer should have it. I know that in the earliest implementation of cmov (Pentium Pro) using the instruction wasn't really an advantage, but now it is really faster. My tinyBenchmarks showed a speedup of 10% when I implemented cmov for SmallInteger additions. But if you are really concerned about compatibility, I think you would be better off not using it.
Given adequate test coverage I'll add it.
I also implemented the enter and leave instructions. Not because they are better (they aren't), but because I use them to signal the inclusion of additional prologue and epilogue code in a final phase added just after the allocator. I do it that way because I don't know until then which registers are used and how many additional temps are needed. I know that Exupery always pushes and pops all the registers (other than eax, edx and ecx), and that it makes room for a big context as temp space on the stack. I don't do that. I only push the used registers, and if that is not enough, I enter additional stack space. That breaks compatibility with the original Exupery, but I wanted to implement it that way. For small methods, that is really better. So, given that, I am not offering any of this to you. I think you'll understand.
Cheers, Guille
Guillermo Adrián Molina writes:
Sets the steps for processing. However the spill worklist has some registers on it that shouldn't be spilled, so it tries to select a register to spill. It discards all registers then fails.
I'd see if there are any moves that might be spilled afterwards, if so, then all you'd need to do is allow spillRegister to fail gracefully.
Ok, I will try to see what is happening. Is there any hard limit (besides the number of available registers in x86 arch)?
There should be no limit on the number of registers you can use. The worst that should happen is you end up with a lot of spill code.
Another thing, Do you want the code I made for cmovxx?
I'm interested.
Does it have unit test coverage? Exupery development relies on testing so that's required.
Not right now. I will work on that later, and when I have it I will send it to you.
OK
When was cmov introduced? I know it was a long time ago but can't remember precisely when. What I'm concerned about is making Exupery incompatible with some chips that might still be in use.
Intel's optimization manual says that cmov was introduced with the Pentium Pro, and AMD's optimization manual says that cmov is available from the Athlon onward. I actually didn't investigate that thoroughly. The fact is that any modern computer should have it. I know that in the earliest implementation of cmov (Pentium Pro) using the instruction wasn't really an advantage, but now it is really faster. My tinyBenchmarks showed a speedup of 10% when I implemented cmov for SmallInteger additions. But if you are really concerned about compatibility, I think you would be better off not using it.
I'm surprised that your SmallInteger addition code was helped.
In Exupery the SmallInteger addition sequence is:
    bitTest arg1
    jumpIfSet failureBlock
    bitTest arg2
    jumpIfSet failureBlock
    clearTagBit arg1
    add arg1 arg2
    jumpOverflow failureBlock
The failure case is a full message send.
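The shape of that sequence, two tag checks, one tag clear, one add, and an overflow check, can be sketched in C. The details are assumptions rather than Exupery's actual layout: 32-bit object pointers, a set low bit marking a SmallInteger (the usual Squeak convention), and the hypothetical names `oop`, `tag_small_int`, `untag_small_int`, and `small_int_add`. Returning false stands in for the jump to the failure block, where a full message send would happen.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a set low bit marks a SmallInteger, so the
   integer n is stored as 2n + 1 in a 32-bit word. */
typedef int32_t oop;

/* n must fit in 31 bits. */
oop tag_small_int(int32_t n) { return (oop)(n * 2 + 1); }

/* v must be a tagged SmallInteger (odd), so v - 1 is even. */
int32_t untag_small_int(oop v) { return (v - 1) / 2; }

/* Returns true and stores the tagged sum on the fast path.
   Returns false where the failure block would be taken. */
bool small_int_add(oop a, oop b, oop *result)
{
    if ((a & 1) == 0)          /* branch 1: arg1 is not a SmallInteger */
        return false;
    if ((b & 1) == 0)          /* branch 2: arg2 is not a SmallInteger */
        return false;
    /* Clear the tag on one argument, then add: (2x) + (2y + 1) = 2(x + y) + 1. */
    int64_t sum = (int64_t)(a - 1) + (int64_t)b;
    if (sum < INT32_MIN || sum > INT32_MAX)  /* branch 3: overflow */
        return false;
    *result = (oop)sum;
    return true;
}
```

The three early returns correspond to the three conditional jumps in the sequence; only the fast path produces a tagged result.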
There are code fragments where cmov would be helpful. Converting to a boolean comes to mind: the part of "a > b" where you're loading either true or false into the result register.
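A branch-free version of that boolean conversion can be sketched in C. The names `select_boolean`, `true_oop`, and `false_oop` are hypothetical, and the mask trick is a portable stand-in for what a conditional move does in one instruction. A compiler may or may not emit cmov for this or for an equivalent ternary; the sketch only shows that the selection needs no conditional jump.

```c
#include <stdint.h>

/* Returns true_oop when a > b and false_oop otherwise,
   selecting with a mask instead of a conditional jump. */
uintptr_t select_boolean(int32_t a, int32_t b,
                         uintptr_t true_oop, uintptr_t false_oop)
{
    /* (a > b) is 1 or 0, so mask is all ones or all zeros. */
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)(a > b);
    return (true_oop & mask) | (false_oop & ~mask);
}
```

Both candidate values are computed up front and one is kept, which is the same trade a cmov makes: no misprediction cost, but no skipped work either.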
Given adequate test coverage I'll add it.
I also implemented the enter and leave instructions. Not because they are better (they aren't), but because I use them to signal the inclusion of additional prologue and epilogue code in a final phase added just after the allocator. I do it that way because I don't know until then which registers are used and how many additional temps are needed. I know that Exupery always pushes and pops all the registers (other than eax, edx and ecx), and that it makes room for a big context as temp space on the stack. I don't do that. I only push the used registers, and if that is not enough, I enter additional stack space. That breaks compatibility with the original Exupery, but I wanted to implement it that way. For small methods, that is really better. So, given that, I am not offering any of this to you. I think you'll understand.
Exupery's prologue and epilogue sequences could be improved. I've been thinking about overhauling that area for a few years now. I'd like to have variables spill into their actual locations. So if a stack variable was stored, it would always be fetched from the context. Then spilled registers wouldn't need to be loaded and stored on context switches.
One thing that I might do in 0.13 is colour the isolated parts of a method separately. That should improve register allocation, as the interference graph will not be polluted by other isolated sections of code. A compiled method is often made up of completely isolated sections of code. Colouring the sections separately should also speed up register allocation.
Bryce
Guillermo Adrián Molina writes:
Sets the steps for processing. However the spill worklist has some registers on it that shouldn't be spilled, so it tries to select a register to spill. It discards all registers then fails.
I'd see if there are any moves that might be spilled afterwards, if so, then all you'd need to do is allow spillRegister to fail gracefully.
Ok, I will try to see what is happening. Is there any hard limit (besides the number of available registers in x86 arch)?
There should be no limit on the number of registers you can use. The worst that should happen is you end up with a lot of spill code.
Another thing, Do you want the code I made for cmovxx?
I'm interested.
Does it have unit test coverage? Exupery development relies on testing so that's required.
Not right now, I will work on that later. When I have it I will send it to you.
OK
When was cmov introduced? I know it was a long time ago but can't remember precisely when. What I'm concerned with is making Exupery incompatible with some chips that might still be in use.
Intel's optimization manual says that cmov was introduced in Pentium, and AMD's optimization manual says that cmov is available from Athlon. I actually didn't investigate that thoroughly. The fact is that any modern computer should have it. I know that in earlier implementations of cmov (Pentium Pro) using the instruction wasn't really an advantage. But now, it is really faster. My tinyBenchmarks showed a speedup of 10% when I implemented cmov for SmallInteger additions. But if you are really concerned about compatibility, I think you would be better off not using it.
I'm surprised that your SmallInteger addition code was helped.
In Exupery the SmallInteger addition sequence is:
    bitTest arg1
    jumpIfSet failureBlock
    bitTest arg2
    jumpIfSet failureBlock
    clearTagBit arg1
    add arg1 arg2
    jumpOverflow failureBlock
The failure case is a full message send.
The problem with the above code is that you have 3 branches. That is why I need jump tables; there are cases where cmov really doesn't help.
Before I started using Exupery, I called special methods in C that implemented faster code. Every special method (and primitive) returned 1 in case of an error, and on success returned the result object. One of these special methods was +. This is part of the code:
    if(areIntegers(rcvr,arg)) {
        int result;
        asm( "movl $1,%%edx\n\t"
             "movl %[rcvr],%[result]\n\t"
             "addl %[arg],%[result]\n\t"
             "cmovol %%edx,%[result]"
             : [result] "=&r" (result)   /* early clobber: written before arg is read */
             : [rcvr] "r" (rcvr), [arg] "r" (arg)
             : "edx" );
        return result;
    }
With this code, I got up to 10% faster code in + intensive tests.
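For readers without GCC inline assembly at hand, the same idea can be sketched in portable C. This is only a sketch: `add_or_one` is a made-up name, `__builtin_add_overflow` is a GCC/Clang builtin, and whether the compiler turns the final select into a cmov is up to its optimizer. The return value 1 is the error marker described above.

```c
#include <stdint.h>

/* Adds two words and returns 1 (the error marker) if the signed add overflowed.
   The select at the end has no side effects, so an optimizing compiler is free
   to emit cmovo instead of a conditional jump, but that is not guaranteed. */
static int32_t add_or_one(int32_t rcvr, int32_t arg) {
    int32_t sum;
    int overflowed = __builtin_add_overflow(rcvr, arg, &sum);
    return overflowed ? 1 : sum;
}
```

The caller still has to compare the result against 1, so one conditional remains somewhere; the gain is that the overflow check itself no longer branches.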
There are code fragments where cmov would be helpful. Converting to a boolean comes to mind: the part of "a > b" where you're loading either true or false into the result register.
Yes, I implemented that with exupery (code for less "<"):
    self addExpression: (MedMov from: (self literal: false) to: answer).
    trueReg := machine createTemporaryRegister.
    self addExpression: (MedMov from: (self literal: true) to: trueReg).
    self addExpression: (MedComparision operator: #cmp arg1: arg1 arg2: arg2).
    self addExpression: (MedCMov type: #cmovl from: trueReg to: answer).
This gave me an impressive improvement (up to 40-50%) when I implemented all the SmallInteger comparisons this way, because, as you know, we don't need to detag before comparing.
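The same pattern, written as a C sketch: load false, then conditionally overwrite with true after comparing the still-tagged operands. The encoding 2*v + 1 is an assumption (it is order preserving, which is why no detagging is needed), and `TRUE_OOP`, `FALSE_OOP`, `tag_int` and `small_int_less` are placeholders invented for the example.

```c
#include <stdint.h>

/* Hypothetical object pointers for the two booleans; a real VM supplies its own. */
#define TRUE_OOP  ((uintptr_t)0x2000)
#define FALSE_OOP ((uintptr_t)0x1000)

/* Assumption: a SmallInteger v is encoded as 2*v + 1. */
static inline int32_t tag_int(int32_t v) { return (int32_t)(((uint32_t)v << 1) | 1u); }

/* Both operands must already be known to be SmallIntegers.
   Because 2*v + 1 is monotonic in v, comparing the tagged words gives the same
   answer as comparing the untagged values. */
static uintptr_t small_int_less(int32_t a_tagged, int32_t b_tagged) {
    uintptr_t answer = FALSE_OOP;        /* mov false -> answer */
    if (a_tagged < b_tagged)             /* cmp arg1, arg2 */
        answer = TRUE_OOP;               /* candidate for cmovl */
    return answer;
}
```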
Given adequate test coverage I'll add it.
I also implemented enter and leave instructions. Not because they are better (they aren't), but because I use them to signal the inclusion of additional prologue and epilogue code in a final phase added just after the allocator. I do it that way because I don't know until then which registers are used and how many additional temps are needed. I know that Exupery always pushes and pops all the registers (other than eax, edx and ecx), and that it makes room for a big context as temp space on the stack. I don't do that. I only push the used regs, and if that is not enough, I reserve additional stack space. That breaks compatibility with the original Exupery, but I wanted to implement it that way. For small methods, that is really better. So, given that, I am not offering any of this to you. I think you'll understand.
Exupery's prolog and epilogue sequences could be improved. I've been thinking about overhauling that area for a few years now. I'd like to have variables spill into their actual locations. So if a stack variable was stored, it would always be fetched from the context. Then spilled registers wouldn't need to be loaded and stored on context switches.
One thing that I might do in 0.13 is colour the isolated parts of a method separately. That should improve register allocation as the interference graph will not be polluted by other isolated sections of code. A compiled method is often made up of completely isolated sections of code. Colouring the sections separately should also speed up register allocation.
Every improvement you make will help me. Cheers, Guille
Bryce _______________________________________________ Exupery mailing list Exupery@lists.squeakfoundation.org http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
Guillermo Adrián Molina writes:
In Exupery the SmallInteger addition sequence is:
    bitTest arg1
    jumpIfSet failureBlock
    bitTest arg2
    jumpIfSet failureBlock
    clearTagBit arg1
    add arg1 arg2
    jumpOverflow failureBlock
The failure case is a full message send.
The problem with the above code is that you have 3 branches. That is why I need jump tables; there are cases where cmov really doesn't help.
There are only 3 branches, and I'm hoping that they will never be taken, so they should be easy to predict. That said, the branches do use branch predictor resources, which could cause other branches not to be predicted as well.
Before I started using Exupery, I called special methods in C that implemented faster code. Every special method (and primitive) returned 1 in case of an error, and on success returned the result object. One of these special methods was +. This is part of the code:
    if(areIntegers(rcvr,arg)) {
        int result;
        asm( "movl $1,%%edx\n\t"
             "movl %[rcvr],%[result]\n\t"
             "addl %[arg],%[result]\n\t"
             "cmovol %%edx,%[result]"
             : [result] "=&r" (result)   /* early clobber: written before arg is read */
             : [rcvr] "r" (rcvr), [arg] "r" (arg)
             : "edx" );
        return result;
    }
With this code, I got up to 10% faster code in + intensive tests.
Do you have conditionals inside areIntegers and to check if the result is 1 indicating an error?
There are code fragments where cmov would be helpful. Converting to a boolean comes to mind: the part of "a > b" where you're loading either true or false into the result register.
Yes, I implemented that with exupery (code for less "<"):
    self addExpression: (MedMov from: (self literal: false) to: answer).
    trueReg := machine createTemporaryRegister.
    self addExpression: (MedMov from: (self literal: true) to: trueReg).
    self addExpression: (MedComparision operator: #cmp arg1: arg1 arg2: arg2).
    self addExpression: (MedCMov type: #cmovl from: trueReg to: answer).
This gave me an impressive improvement (up to 40-50%) when I implemented all the SmallInteger comparisons this way, because, as you know, we don't need to detag before comparing.
Exupery removes many of the boolean conversion sequences.
"a < b ifTrue: [x]"
First gets translated into:
(booleanToControlFlow (controlFlowToBoolean (a < b)))
Then Exupery removes the booleanToControlFlow controlFlowToBoolean sequence. The booleanToControlFlow sequence is moved to the failure case where either a or b are not SmallIntegers.
So I'm not sure if speeding up the general case will help Exupery as I'm not sure how often it's called.
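The removal step described above amounts to a small tree rewrite: a booleanToControlFlow node wrapped directly around a controlFlowToBoolean node cancels out. A minimal C sketch of that rewrite, with node kinds and names (`Node`, `strip_round_trip`) invented for illustration and not taken from Exupery's intermediate representation:

```c
#include <stddef.h>

/* Hypothetical node kinds for a tiny expression tree. */
typedef enum { NODE_LESS, NODE_CF_TO_BOOL, NODE_BOOL_TO_CF } NodeKind;

typedef struct Node {
    NodeKind kind;
    struct Node *child;   /* single operand; NULL for leaves in this sketch */
} Node;

/* (booleanToControlFlow (controlFlowToBoolean x)) => x
   Repeats in case several round trips are nested. */
static Node *strip_round_trip(Node *n) {
    while (n != NULL && n->kind == NODE_BOOL_TO_CF &&
           n->child != NULL && n->child->kind == NODE_CF_TO_BOOL) {
        n = n->child->child;
    }
    return n;
}
```

After the rewrite the comparison feeds the branch directly, so no true or false object is materialised on the SmallInteger fast path.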
Bryce
Guillermo Adrián Molina writes:
In Exupery the SmallInteger addition sequence is:
    bitTest arg1
    jumpIfSet failureBlock
    bitTest arg2
    jumpIfSet failureBlock
    clearTagBit arg1
    add arg1 arg2
    jumpOverflow failureBlock
The failure case is a full message send.
The problem with the above code is that you have 3 branches. That is why I need jump tables; there are cases where cmov really doesn't help.
There are only 3 branches, and I'm hoping that they will never be taken, so they should be easy to predict. That said, the branches do use branch predictor resources, which could cause other branches not to be predicted as well.
Yes, I agree. I am really not an expert in these matters, but I think it is not so uncommon to send #+ with objects other than SmallIntegers; in that case, one of the first 2 branches may be mispredicted. Maybe you could test that both of them are SmallIntegers with just one branch (I am doing that right now). But maybe I will try to do it without branching at all.
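A minimal C sketch of the single-branch tag test, assuming the low bit is set for SmallIntegers as in Squeak (with the opposite convention the test would be ((a | b) & 1) == 0 instead). `both_small_ints` and `add_one_check` are hypothetical names, and `__builtin_add_overflow` is a GCC/Clang builtin.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumption: low bit set means SmallInteger.
   ANDing the two words first lets a single test cover both operands. */
static inline bool both_small_ints(int32_t a, int32_t b) {
    return (a & b & 1) != 0;
}

/* Two branches instead of three: one combined tag check, one overflow check.
   Returns false when the full message send would be needed. */
static bool add_one_check(int32_t a, int32_t b, int32_t *out) {
    int32_t sum;
    if (!both_small_ints(a, b)) return false;
    if (__builtin_add_overflow(a - 1, b, &sum)) return false;
    *out = sum;
    return true;
}
```

The combined test costs one extra AND but frees a branch predictor slot per addition site.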
Before I started using Exupery, I called special methods in C that implemented faster code. Every special method (and primitive) returned 1 in case of an error, and on success returned the result object. One of these special methods was +. This is part of the code:
    if(areIntegers(rcvr,arg)) {
        int result;
        asm( "movl $1,%%edx\n\t"
             "movl %[rcvr],%[result]\n\t"
             "addl %[arg],%[result]\n\t"
             "cmovol %%edx,%[result]"
             : [result] "=&r" (result)   /* early clobber: written before arg is read */
             : [rcvr] "r" (rcvr), [arg] "r" (arg)
             : "edx" );
        return result;
    }
With this code, I got up to 10% faster code in + intensive tests.
Do you have conditionals inside areIntegers and to check if the result is 1 indicating an error?
As I don't use this code as often as before (because I inline that with Exupery at compile time), I don't worry about it any more. But areIntegers() is just an "or" and an "and"; the branch is represented in the C "if" statement. I wrote the addition that way because I wanted to test if cmov was really that fast. It was better, but not THAT much better.
Guille