[squeak-dev] Short-circuiting comparisons (booleanCheat) via special selectors [Was Using #= for integer comparison instead of #==]

Eliot Miranda eliot.miranda at gmail.com
Fri Nov 19 19:48:17 UTC 2010


Hi All,

    so the performance benefit of the special selectors is interesting.  One
thing the interpreter does on all relational special selectors #= #~= #< #<=
#> #>= is statically predict SmallIntegers and/or Floats and look ahead for
a following conditional branch, evaluating the branch immediately, avoiding
reifying the result of the relational as either true or false and later
testing for it, hence jumping directly on the condition codes.

e.g.
bytecodePrimLessThan
| rcvr arg aBool |
rcvr := self internalStackValue: 1.
arg := self internalStackValue: 0.
(self areIntegers: rcvr and: arg) ifTrue:
["The C code can avoid detagging since tagged integers are still signed.
 But this means the simulator must override to do detagging."
^self cCode: [self booleanCheat: rcvr < arg]
inSmalltalk: [self booleanCheat: (objectMemory integerValueOf: rcvr) <
(objectMemory integerValueOf: arg)]].

self initPrimCall.
aBool := self primitiveFloatLess: rcvr thanArg: arg.
self successful ifTrue: [^ self booleanCheat: aBool].

messageSelector := self specialSelector: 2.
argumentCount := 1.
self normalSend

booleanCheat: cond
"cheat the interpreter out of the pleasure of handling the next bytecode IFF
it is a jump-on-boolean. Which it is, often enough when the current bytecode
is something like bytecodePrimEqual"
<inline: true>

cond
ifTrue: [self booleanCheatTrue]
ifFalse: [self booleanCheatFalse]

booleanCheatFalse
"cheat the interpreter out of the pleasure of handling the next bytecode IFF
it is a jump-on-boolean. Which it is, often enough when the current bytecode
is something like bytecodePrimEqual"
| bytecode offset |
<sharedCodeNamed: 'booleanCheatFalse' inCase: 179>

bytecode := self fetchByte.  "assume next bytecode is jumpIfFalse (99%)"
self internalPop: 2.
(bytecode < 160 and: [bytecode > 151]) ifTrue:  "short jumpIfFalse"
[^self jump: bytecode - 151].

bytecode = 172 ifTrue:  "long jumpIfFalse"
[offset := self fetchByte.
^self jump: offset].

"not followed by a jumpIfFalse; undo instruction fetch and push boolean
result"
localIP := localIP - 1.
self fetchNextBytecode.
self internalPush: objectMemory falseObject

booleanCheatTrue
"cheat the interpreter out of the pleasure of handling the next bytecode IFF
it is a jump-on-boolean. Which it is, often enough when the current bytecode
is something like bytecodePrimEqual"
| bytecode |
<sharedCodeNamed: 'booleanCheatTrue' inCase: 178>

bytecode := self fetchByte.  "assume next bytecode is jumpIfFalse (99%)"
self internalPop: 2.
(bytecode < 160 and: [bytecode > 151]) ifTrue:  "short jumpIfFalse"
[^self fetchNextBytecode].

bytecode = 172 ifTrue: "long jumpIfFalse"
[self fetchByte.
^self fetchNextBytecode].

"not followed by a jumpIfFalse; undo instruction fetch and push boolean
result"
localIP := localIP - 1.
self fetchNextBytecode.
self internalPush: objectMemory trueObject


Until yesterday Cog didn't do this for jitted code.  The conversation about
using #= for testing got me motivated to implement it.  Note that
VisualWorks' HPS VM does something even but more aggressive than
booleanCheat:.  So what are we talking about?  Here's my micro-benchmark:

Time millisecondsToRun: [1 to: 100000000 do: [:i| ]]

The bytecode compiler optimizes the inner loop to code equivalent to

| i |
i := 1.
[i <= 100000000] whileTrue: [i := i + 1]

Here's the Squeak bytecode:

25 <76> pushConstant: 1
26 <68> popIntoTemp: 0
27 <10> pushTemp: 0
28 <20> pushConstant: 100000000
29 <B4> send: <=
30 <9D> jumpFalse: 37
31 <10> pushTemp: 0
32 <76> pushConstant: 1
33 <B0> send: +
34 <68> popIntoTemp: 0
35 <A3 F6> jumpTo: 27
37:

And here's the VisualWorks bytecode:
 6 <4A> push 1
 7 <4C> store local 0; pop
 8 <67> loop head
 9 <10> push local 0
10 <1C> push 100000000
11 <A4> send <=
12 <C4> jump false 18
13 <10> push local 0
14 <C8> push 1; send +
15 <4C> store local 0; pop
16 <E3 F6> jump 8
18:

Note that loopHead is a no-op that allows the HPS JIT to be one pass; it
tells the JIT that there is a backward branch to this instruction and so
knows that loopHead is a control-flow join, which means that the registers
that cache the receiver and its base pointer (VW has an indirection from the
object header to an object body) must be reloaded if needed.

So the current version of Cog generates the following code for the loop:

25 <76> pushConstant: 1
a1ea4: movl $0x00000003, %eax : B8 03 00 00 00
a1ea9: pushl %eax : 50
26 <68> popIntoTemp: 0
a1eaa: popl %eax : 58
a1eab: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
27 <10> pushTemp: 0
a1eae: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a1eb1: pushl %eax : 50
28 <20> pushConstant: 100000000
a1eb2: pushl $0x0bebc201 : 68 01 C2 EB 0B
29 <B4> send: <=
a1eb7: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a1ebb: movl $0x0045a458=#<=, %ecx : B9 58 A4 45 00
a1ec0: call .+0xfff5e5cb (0x00000490=ceSend1Args) : E8 CB E5 F5 FF
IsSendCall:
a1ec5: pushl %edx : 52
30 <9D> jumpFalse: 37
a1ec6: popl %eax : 58
a1ec7: subl $0x00172270=false, %eax : 2D 70 22 17 00
IsObjectReference:
a1ecc: jz .+0x0000003d (0x000a1f0b=16rA1E50 at BB) : 74 3D
a1ece: cmpl $0x00000008, %eax : 83 F8 08
a1ed1: jz .+0x0000000b (0x000a1ede=16rA1E50 at 8E) : 74 0B
a1ed3: addl $0x00172270=false, %eax : 05 70 22 17 00
IsObjectReference:
a1ed8: pushl %eax : 50
a1ed9: call .+0xfff5e902 (0x000007e0=ceSendMustBeBooleanTrampoline) : E8 02
E9 F5 FF
HasBytecodePC:
31 <10> pushTemp: 0
a1ede: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a1ee1: pushl %eax : 50
32 <76> pushConstant: 1
a1ee2: movl $0x00000003, %eax : B8 03 00 00 00
a1ee7: pushl %eax : 50
33 <B0> send: +
a1ee8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a1eec: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
a1ef1: call .+0xfff5e59a (0x00000490=ceSend1Args) : E8 9A E5 F5 FF
IsSendCall:
a1ef6: pushl %edx : 52
34 <68> popIntoTemp: 0
a1ef7: popl %eax : 58
a1ef8: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
35 <A3 F6> jumpTo: 27
a1efb: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
a1f00: cmpl %eax, %esp : 39 C4
a1f02: jnb .+0xffffffaa (0x000a1eae=16rA1E50 at 5E) : 73 AA
a1f04: call .+0xfff5e907 (0x00000810=ceCheckForInterruptsTrampoline) : E8 07
E9 F5 FF
HasBytecodePC:
a1f09: jmp .+0xffffffa3 (0x000a1eae=16rA1E50 at 5E) : EB A3
37: 0x000a1f0b/16rA1E50 at BB:

And here's the primitive part of the SmallInteger>#<= method; the
SmallInteger<#+ method is similar.  I'll omit it for brevity.

entry: (check that the receiver matches the inline cache)
00002498: movl %edx, %eax : 89 D0
0000249a: andl $0x00000001, %eax : 83 E0 01
0000249d: jnz .+0x00000010 (0x000024af=<=@37) : 75 10 (jump if the receiver
is a SmallInteger, which in our case it is)
0000249f: movl %ds:(%edx), %eax : 8B 02
000024a1: shrl $0x0a, %eax : C1 E8 0A
000024a4: andl $0x0000007c, %eax : 83 E0 7C
000024a7: jnz .+0x00000006 (0x000024af=<=@37) : 75 06
000024a9: movl %ds:0xfffffffc(%edx), %eax : 8B 42 FC
000024ac: andl $0xfffffffc, %eax : 83 E0 FC
000024af: cmpl %ecx, %eax : 39 C8                                  (compare
the receiver's tags against the inline cache ecx, which in our case will
match)
000024b1: jnz .+0xffffffdf (0x00002492=<=@1A) : 75 DF
noCheckEntry: (SmallInteger primitive #<=)
000024b3: movl %ss:0x4(%esp), %eax : 8B 44 24 04
000024b7: movl %eax, %ecx : 89 C1
000024b9: andl $0x00000001, %eax : 83 E0 01
000024bc: jz .+0x00000014 (0x000024d2=<=@5A) : 74 14 (is the argument a
SmallInteger, which in our case it will be)
000024be: cmpl %ecx, %edx : 39 CA
000024c0: jle .+0x00000008 (0x000024ca=<=@52) : 7E 08
000024c2: movl $0x00172270=false, %edx : BA 70 22 17 00 (answer the false
object)
IsObjectReference:
000024c7: ret $0x0008 : C2 08 00
000024ca: movl $0x00172278=true, %edx : BA 78 22 17 00 (answer the true
object)
IsObjectReference:
000024cf: ret $0x0008 : C2 08 00
000024d2:

You can see that the Cog code generator is extremely naive; it's generating
RISC code.  There are no compares against memory; only against registers,
etc.

The question I want you to ask yourself is how much faster the loop will go
if we add code to short-circuit the send of #<= and the comparison against
true and false with a tag test and a direct comparison.  Try answering it as
we go along.  I'll answer it below but try and answer it yourself.
 Basically we should speed up about a third of the loop substantially where
the loop has three main parts, a) the compare and branch [i <= 100000000]
whileTrue:, b) the increment i := i + 1, and c) the backward branch at the
end of the while loop which checks the stackLimit to break out if there is
an event (e.g. ctrl-.).

OK, so here's the code with the booleanCheat implemented in the JIT:

25 <76> pushConstant: 1
a52cc: movl $0x00000003, %eax : B8 03 00 00 00
a52d1: pushl %eax : 50
26 <68> popIntoTemp: 0
a52d2: popl %eax : 58
a52d3: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
27 <10> pushTemp: 0
a52d6: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a52d9: pushl %eax : 50
28 <20> pushConstant: 100000000
a52da: pushl $0x0bebc201 : 68 01 C2 EB 0B
29 <B4> send: <=
a52df: movl %ss:(%esp), %esi : 8B 34 24
a52e2: movl %esi, %eax : 89 F0
a52e4: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a52e8: andl %edx, %eax : 23 C2
a52ea: andl $0x00000001, %eax : 83 E0 01                          (are
receiver and arg SmallIntegers, which they are)
a52ed: jz .+0x00000009 (0x000a52f8=16rA5278 at 80) : 74 09
a52ef: addl $0x00000008, %esp : 83 C4 08
a52f2: cmpl %esi, %edx : 39 F2
(compare directly)
a52f4: jnle .+0x00000056 (0x000a534c=16rA5278 at D4) : 7F 56 (jump on condition
codes)
a52f6: jmp .+0x00000027 (0x000a531f=16rA5278 at A7) : EB 27 (jump past the send
and the jumpFalse:)
a52f8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a52fc: movl $0x0045a458, %ecx : B9 58 A4 45 00
a5301: call .+0xfff5b18a (0x00000490=ceSend1Args) : E8 8A B1 F5 FF
IsSendCall:
a5306: pushl %edx : 52
30 <9D> jumpFalse: 37
a5307: popl %eax : 58
a5308: subl $0x00172270=false, %eax : 2D 70 22 17 00
IsObjectReference:
a530d: jz .+0x0000003d (0x000a534c=16rA5278 at D4) : 74 3D
a530f: cmpl $0x00000008, %eax : 83 F8 08
a5312: jz .+0x0000000b (0x000a531f=16rA5278 at A7) : 74 0B
a5314: addl $0x00172270=false, %eax : 05 70 22 17 00
IsObjectReference:
a5319: pushl %eax : 50
a531a: call .+0xfff5b4c1 (0x000007e0=ceSendMustBeBooleanTrampoline) : E8 C1
B4 F5 FF
HasBytecodePC:
31 <10> pushTemp: 0
a531f: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a5322: pushl %eax : 50
32 <76> pushConstant: 1
a5323: movl $0x00000003, %eax : B8 03 00 00 00
a5328: pushl %eax : 50
33 <B0> send: +
a5329: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a532d: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
a5332: call .+0xfff5b159 (0x00000490=ceSend1Args) : E8 59 B1 F5 FF
IsSendCall:
a5337: pushl %edx : 52
34 <68> popIntoTemp: 0
a5338: popl %eax : 58
a5339: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
35 <A3 F6> jumpTo: 27
a533c: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
a5341: cmpl %eax, %esp : 39 C4
a5343: jnb .+0xffffff91 (0x000a52d6=16rA5278 at 5E) : 73 91
a5345: call .+0xfff5b4c6 (0x00000810=ceCheckForInterruptsTrampoline) : E8 C6
B4 F5 FF
HasBytecodePC:
a534a: jmp .+0xffffff8a (0x000a52d6=16rA5278 at 5E) : EB 8A

So roughly 1/3 of the loop has been sped up substantially.  How much
speedup?

2.7%.  2.7 measly percent.  A good 2 - 3 hours work and 2.7 measly % ?!?!?
 i.e. 691 milliseconds fell to 672 milliseconds (and the measurements are
nicely repeatable).

Well, one reason for the limited performance increase could be that the
event check at the backward branch is being taken a lot and dominating
costs.  So what happens if I change the heartbeat from 2KHz to 20KHz?  672
falls to 664, so only 9 milliseconds or so is due to the stack limit event
check.

Perhaps the x86 is so good at optimizing the code that it can't be sped up.
 Well, let's have a look at HPS's performance.  It is *4* times faster, 168
ms vs 672 (that's *exactly* 4 times faster :) ).  So what code does HPS
generate?  First its back-end is less naive; it doesn't move intermediate
results through registers all the time.  But most significantly it is able
to know that both the 100000000 and the + 1 are SmallIntegers because it
delays generating code until it gets to a send.  So its code is a /lot/
better.  It short-circuits bth the compare and branch and the increment.
 Here it is (remember that Squeak has only one tagged type, SmallInteger
with tag 1, and 32-bit VW has two tagged types SmallInteger and Character
with tags 3 and 1 respectively, so in VW 1 == 0x7):

 6 <4A> push 1
 7 <4C> store local 0; pop
nm at 039: xor rTemp,rTemp ! 33 c0
nm at 03b: movb $7,%al ! b0 07
nm at 03d: mov rTemp,-0x10(bFrame) ! 89 45 f0
 8 <67> loop head
 9 <10> push local 0
nm at 040: mov -0x10(bFrame),rReceiver ! 8b 5d f0
10 <1C> push 100000000
nm at 043: mov $0x17d78403,rArg1 ! be 03 84 d7 17
11 <A4> send <=
nm at 048: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm at 04b: jz nm at 056 ! 74 09
nm at 04d: cmp rArg1,rReceiver ! 3b de
nm at 04f: jle nm at 075 ! 7e 24
nm at 051: jmp nm at 0ab ! e9 55 00 00 00
nm at 056: mov $0x16a8ec6c,rClass ! ba 6c ec a8 16
nm at 05b: call _81080=send1Args ! e8 28 e7 6b ea
map: 0x3a: send1(26) #<=
vpc 12:
12 <C4> jump false 18
nm at 060: cmp $0x16b33da4,rReceiver ! 81 fb a4 3d b3 16
nm at 066: jz nm at 0ab ! 74 43
nm at 068: cmp $0x16da90e4,rReceiver ! 81 fb e4 90 da 16
nm at 06e: jz nm at 075 ! 74 05
nm at 070: call _159cadd8 ! e8 6b 84 00 00
nm at 075: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm at 078: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm at 07b: jz nm at 085 ! 74 08
13 <10> push local 0
14 <C8> push 1; send +
nm at 07d: add $4,rReceiver ! 83 c3 04
nm at 080: jno nm at 092 ! 71 10
nm at 082: sub $4,rReceiver ! 83 eb 04
nm at 085: push $7 ! 6a 07
nm at 087: pop rArg1 ! 5e
nm at 088: mov $0x16bbab34,rClass ! ba 34 ab bb 16
nm at 08d: call _81080=send1Args ! e8 f6 e6 6b ea
map: 0x22: send1(2) #+
15 <4C> store local 0; pop
nm at 092: mov rReceiver,-0x10(bFrame) ! 89 5d f0
16 <E3 F6> jump 8
nm at 095: cmp 0xfb190,bSP ! 3b 25 90 b1 0f 00
nm at 09b: jae nm at 040 ! 0f 83 9f ff ff ff
nm at 0a1: call _80d30 ! e8 92 e3 6b ea
vpc 18:
nm at 0a6: jmp nm at 040 ! e9 95 ff ff ff


Interesting.  VW executes the following instructions around the loop:

nm at 040: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm at 043: mov $0x17d78403,rArg1 ! be 03 84 d7 17
nm at 048: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm at 04b: jz nm at 056 ! 74 09
nm at 04d: cmp rArg1,rReceiver ! 3b de
nm at 04f: jle nm at 075 ! 7e 24
nm at 051: jmp nm at 0ab ! e9 55 00 00 00
...
nm at 07d: add $4,rReceiver ! 83 c3 04
nm at 080: jno nm at 092 ! 71 10
...
nm at 092: mov rReceiver,-0x10(bFrame) ! 89 5d f0
nm at 095: cmp 0xfb190,bSP ! 3b 25 90 b1 0f 00
nm at 09b: jae nm at 040 ! 0f 83 9f ff ff ff

12 instructions, one read and one write. Cog executes many more, quite a few
of them reads and writes.  How I itch to find the time to do delayed code
generation/stack-to-register mapping in Cog...

best,
Eliot

On Tue, Nov 16, 2010 at 3:32 PM, Levente Uzonyi <leves at elte.hu> wrote:

> On Tue, 16 Nov 2010, Juan Vuletich wrote:
>
>  Hi Levente,
>>
>> Levente Uzonyi wrote:
>>
>>> Hi,
>>>
>>> I'm mostly ready with this cleanup, only a few methods left in the image
>>> that use #== for integer comparison. Some of them must use #==, others may
>>> be changed, but I wasn't sure if the code will work if it's changed or not.
>>> If you'd like to check these methods, then evaluate the following in a
>>> workspace:
>>>
>>> ... code here...
>>>
>>> Cheers,
>>> Levente
>>>
>>
>> Thanks for the snippet! It is now a method in Cuis. I also checked a bit
>> on trunk. In all senders of #nextObject you can apply the pattern in
>> #allObjectsDo:. Instead of assuming == 0 for last #nextObject, it assumes
>> that Object new will go at the end. If that ever changed, a few places would
>> need fixing...
>>
>
> I'm not sure if it's ok to use that pattern in ImageSegment. Even if it's
> ok, the code is so messy, that it seems to be hard to apply the pattern
> without breaking the code.
>
>
>> As an experiment, in #allObjectsDo: I tried to remove the Object new stuff
>> and replace with '[ 0 = object and: [ object isMemberOf: SmallInteger ]]
>> whileFalse: ', but my test became extremely slow. I suspect #= is sending
>> the message even if it has its own bytecode... So, the only method that
>> remains with the '0 ==' pattern in Cuis is #critical:ifLocked: , as I'm not
>> sure if we can clean it.
>>
>
> #= is a special selector, so I'm sure it won't send a message if it doesn't
> have to. It's possible that #= is a bit slower than #== because it has to do
> some extra checks, but the difference is probably neglible. Here's a
> benchmark:
>
> | offset equality identity |
> Smalltalk garbageCollect.
> offset := [ 1 to: 1000000 do: [ :i | ] ] timeToRun.
> equality := [ :x | [ 1 to: 1000000 do: [ :i | 0 = x ] ] timeToRun ].
> identity := [ :x | [ 1 to: 1000000 do: [ :i | 0 == x ] ] timeToRun ].
> { 0. 1. SmallInteger maxVal. SmallInteger maxVal + 1. nil. Object new. 0.0.
> Array new. 1/2 } collect: [ :each |
>        each -> ({
>                equality value: each.
>                identity value: each.
>                equality value: each.
>                identity value: each.
>                equality value: each.
>                identity value: each } - offset) ].
>
> And my results on CogVM:
>
> {
>        0->#(3 2 3 1 3 2).
>        1->#(2 4 3 0 3 4).
>        1073741823->#(3 3 2 1 3 5).
>        1073741824->#(221 4 223 1 223 4).
>        nil->#(14 4 14 0 15 4).
>        an Object->#(14 5 13 1 14 4).
>        0.0->#(622 5 623 2 624 4).
>        #()->#(16 5 14 1 16 4).
>        (1/2)->#(260 4 259 1 258 5)
> }
>
> So the cause of the slowdown in your #allObjectsDo: implementation is that
> #= is only fast if both the receiver and the argument are SmallIntegers,
> otherwise there will be message sends.
>
> Another way to implement #allObjectsDo: without the marker object is to
> replace the loop condition with: object class == SmallInteger. It's actually
> the same as your implementation without the #= check. #isMemberOf: is
> optimized to non-real sends. This is as fast as the current implementation
> on my pc.
>
>
> Levente
>
>
>> Cheers,
>> Juan Vuletich
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20101119/b22d0a9d/attachment.htm


More information about the Squeak-dev mailing list