[Vm-dev] in-line method lookup caching of booleans

Eliot Miranda eliot.miranda at gmail.com
Fri Aug 3 23:04:31 UTC 2018


On Fri, Aug 3, 2018 at 3:54 PM, Eliot Miranda <eliot.miranda at gmail.com>
wrote:

> Hi Ben,
>
> On Thu, Aug 2, 2018 at 7:28 PM, Ben Coman <btc at openinworld.com> wrote:
>
>>
>> Just a brain twitch about something I'd like to understand better...
>>
>> At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/
>> it says... "[Monomorphic] inline caching depends on the fact that in
>> most programs at most send sites there is no polymorphism and that
>> most sends bind to *exactly_one* class of receiver over some usefully
>> long period of time. ... In Cog we implement inline caches in three
>> forms ... monomorphic inline cache ... closed polymorphic inline cache
>> ... open polymorphic cache.  What’s nice is that while these sends are
>> increasingly expensive moving from monomorphic to megamorphic they are
>> also decreasingly common in roughly a 90%, 9%, 0.9% ratio, at least for
>> typical Smalltalk programs"
>>
>> First, I'm curious what is the relative performance of the three
>> different caches ?
>>
>
> here's a measurement for Spur:
>
> | n null void homogenousImmediateA homogenousImmediateB homogenousNormal
>   polymorphic megamorphic |
> Smalltalk garbageCollect.
> n := 1000000.
> void := Array new: 1024 withAll: nil.
> homogenousImmediateA := Array new: 1024 withAll: 1.
> homogenousImmediateB := Array new: 1024 withAll: $1.
> homogenousNormal := Array new: 1024 withAll: 1 / 2.
> polymorphic := Array new: 1024.
> 1 to: polymorphic size by: 4 do:
>     [:i|
>      polymorphic
>         at: i put: i;
>         at: i + 1 put: i asFloat;
>         at: i + 2 put: i / 1025;
>         at: i + 3 put: i * 1.0s1 "scaled decimal"].
> megamorphic := Array new: 1024.
> 1 to: megamorphic size by: 8 do:
>     [:i|
>      megamorphic
>         at: i put: i;
>         at: i + 1 put: i asFloat;
>         at: i + 2 put: i / 1025;
>         at: i + 3 put: i * 1.0s1; "scaled decimal"
>         at: i + 4 put: i class;
>         at: i + 5 put: i asFloat class;
>         at: i + 6 put: (i / 1025) class;
>         at: i + 7 put: (i * 1.0s1) class].
> null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
> { homogenousImmediateA. homogenousImmediateB. homogenousNormal. polymorphic. megamorphic } collect:
>     [:a|
>      "Smalltalk voidCogVMState."
>      ([1 to: 1024 * n do: [:k| | e | e := a at: k \\ 1024 + 1. e yourself ". e yourself"]] timeToRun; timeToRun) - null]
>
> And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet state,
> so take these with a pinch of salt) are
>
> 64-bit: #(1510 1703 1612 1830 2451)
> 32-bit: #(2852 4133 2843 3534 5317)
>
> or...
> 64-bit 32-bit
> monomorphic to 1 1510 2852
> monomorphic to $1 1703 4133
> monomorphic to obj 1612 2843
> closed polymorphic 1830 3534
> open polymorphic 2451 5317
>
> The sad thing is that Spur's monomorphic sends to immediates are very slow
> because 32-bit Spur supports 31-bit SmallInteger immediates (for
> compatibility with 32-bit V3).  Were it to use a simple two-bit tagging
> scheme, sends would look more like the 64-bit ones.
>
> The difference between monomorphic sends to 1 and sends to $1 on 64-bit
> shows the variability in timing due to a heavily loaded machine.  There's
> no difference in the instructions executed.
>
> Here's the monomorphic cache checking sequence generated for the prolog
> of each method in (32-bit) V3:
>
> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
> xorl %edx, %edx
> Lfail:
> call ceMethodAbort0Args
> nop
> entry: # aligned on an 8 byte boundary
> movl %edx, %eax
> andl $1, %eax # only immediate class is SmallInteger
> jnz Lcompare
> movl (%edx), %eax
> andl $0x1f000, %eax # if non-zero, class is compact & compact class index used as cache tag
> jnz Lcompare
> movl -4(%edx), %eax # normal class plus 2 bits of header type tag
> andl $0xfffffffc, %eax # eliminate header type tag
> Lcompare:
> cmpl %ecx, %eax
> jnz Lfail
> noCheckEntry:
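The V3 cache-tag computation above can be sketched in C. This is an illustrative model, not actual VM code; `v3CacheTag` and the fake object layout are hypothetical, but the masks and the three-way immediate/compact/normal split mirror the assembly:

```c
#include <stdint.h>

/* Sketch of the 32-bit V3 cache-tag computation (illustrative only;
   the VM emits this inline in each method's prolog as machine code). */
static uint32_t v3CacheTag(uintptr_t receiver)
{
    if (receiver & 1)
        return 1;                          /* SmallInteger: the only immediate class */
    uint32_t header = *(uint32_t *)receiver;
    uint32_t compact = header & 0x1f000;
    if (compact)
        return compact;                    /* compact class index is the cache tag */
    /* normal class word precedes the header; mask off the 2 header-type tag bits */
    return *(uint32_t *)(receiver - 4) & 0xfffffffc;
}
```

The send site then compares this tag against the cached value loaded into %ecx and calls the abort routine on a mismatch.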
>
> Here's the monomorphic cache checking sequence generated for the prolog of
> each method in 32-bit Spur:
>
> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
> xorl %edx, %edx
> Lfail:
> call ceMethodAbort0Args
> nop
> Limmediate: # aligned on an 8 byte boundary
> andl $1, %eax # map SmallInteger to 1, Character to 0
> jmp Lcompare
> nop
> nop
> nop
> entry: # aligned on an 8 byte boundary
> movl %edx, %eax
> andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10
> jnz Limmediate
> movl (%edx), %eax
> andl $0x3fffff, %eax # cache tag is class index field
> Lcompare:
> cmpl %ecx, %eax
> jnz Lfail
> noCheckEntry:
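The same computation for 32-bit Spur, again as an illustrative C sketch (names made up; note how the 31-bit SmallInteger scheme forces the extra fold of tags 2r01/2r11 down to 1):

```c
#include <stdint.h>

/* Sketch of the 32-bit Spur cache-tag computation (illustrative only). */
static uint32_t spur32CacheTag(uintptr_t receiver)
{
    uint32_t tag = receiver & 3;   /* SmallInteger is 2r01 or 2r11, Character is 2r10 */
    if (tag)
        return tag & 1;            /* map SmallInteger to 1, Character to 0 */
    return *(uint32_t *)receiver & 0x3fffff;   /* cache tag is the class index field */
}
```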
>
> Here's the monomorphic cache checking sequence generated for the prolog of
> each method in 64-bit Spur:
>
> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
> xorq %rdx, %rdx
> Lfail:
> call ceMethodAbort0Args
> entry:
> movq %rdx, %rax
> andq $7, %rax # SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
> jnz Lcompare
> movq (%rdx), %rax
> andq $0x3fffff, %rax # cache tag is class index field
> Lcompare:
> cmpq %rcx, %rax
> jnz Lfail
> noCheckEntry:
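And the 64-bit Spur version in the same illustrative C style: with three disjoint one-bit tags, the tag bits themselves serve directly as the cache tag, with no folding step at all:

```c
#include <stdint.h>

/* Sketch of the 64-bit Spur cache-tag computation (illustrative only). */
static uint64_t spur64CacheTag(uintptr_t receiver)
{
    uint64_t tag = receiver & 7;   /* SmallInteger 2r001, Character 2r010, SmallFloat 2r100 */
    if (tag)
        return tag;                /* immediates: the tag bits are the cache tag */
    return *(uint64_t *)receiver & 0x3fffff;   /* cache tag is the class index field */
}
```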
>
> The 64-bit sequence is the cleanest.  But if 32-bit Spur had 30-bit
> SmallIntegers alongside its 30-bit Characters then that sequence would be
> equally simple.
>
> I suppose something like this could be quicker for 32-bit Spur with 31-bit
> SmallIntegers.  It would mean that sends to odd SmallIntegers were slower
> than sends to even ones, but would be quicker for Characters and even
> SmallIntegers:
>
> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
> xorl %edx, %edx
> xorl %eax, %eax # necessary because %eax may have some other value than %edx bitAnd: $3 when branching to LstackOverflow
> Lfail:
> cmpl $3, %eax
>

Oops!! :

jnz Limmediate

I of course meant

jz Limmediate


> call ceMethodAbort0Args
> Limmediate:
> andl $1, %eax # map odd SmallInteger (2r11) to 2r01
> jmp Lcompare
> entry: # aligned on an 8 byte boundary
> movl %edx, %eax
> andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10. SmallInteger cache tag is 2r01
> jnz Lcompare
> movl (%edx), %eax
> andl $0x3fffff, %eax # cache tag is class index field
> Lcompare:
> cmpl %ecx, %eax
> jnz Lfail
> noCheckEntry:
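The effective tag computation of this proposed scheme (with the jz correction applied) can be sketched in C; illustrative only. The point is that even SmallIntegers and Characters compare directly, and only odd SmallIntegers pay the extra fold on the failure path:

```c
#include <stdint.h>

/* Sketch of the proposed 32-bit Spur scheme: even SmallIntegers (2r01) and
   Characters (2r10) compare as-is; odd SmallIntegers (2r11) fail the first
   compare and get folded to the SmallInteger cache tag 2r01.  Illustrative
   names, not VM code. */
static uint32_t proposedSpur32CacheTag(uintptr_t receiver)
{
    uint32_t tag = receiver & 3;
    if (tag == 0)
        return *(uint32_t *)receiver & 0x3fffff;  /* class index field */
    if (tag == 3)
        tag &= 1;   /* odd SmallInteger: fold 2r11 to 2r01 (extra compare taken) */
    return tag;
}
```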
>
>
>> Second, I'm curious how Booleans are dealt with.  Boolean handling
>> must be fairly common, and at the Image level these are two different
>> classes, which pushes sends out of the monomorphic inline cache, which
>> may have a significant impact on performance.
>>
>> I started wondering if under the hood the VM could treat the two
>> booleans as one class and in the instance store "which-boolean" it
>> actually is.  Then for example the #ifTrue:ifFalse: method of each
>> class would be compiled into a single method conditioned at the start
>> on "which-boolean" as to which path is executed.  How feasible would
>> that be, and would it make much of a difference in performance of
>> booleans ?
>>
>
> You can see that the closeness in performance between monomorphic and
> closed polymorphic sends means that handling true and false as instances of
> two different classes costs relatively little in performance.
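To see why the cost is small: a closed PIC is just a short run of compare-and-branch tests against the cache tags seen so far, so a boolean send site settles into a two-case PIC and pays at most one extra compare. A hypothetical sketch (class indices and method bodies are made up for illustration):

```c
#include <stdint.h>

/* Hypothetical two-entry closed PIC for a send site that has seen both
   true and false as receivers.  Class indices are invented. */
enum { ClassIndexFalse = 0x20, ClassIndexTrue = 0x21 };

typedef int (*Method)(void);
static int falseMethod(void) { return 0; }
static int trueMethod(void)  { return 1; }

static Method closedPicLookup(uint32_t cacheTag)
{
    if (cacheTag == ClassIndexTrue)    /* first PIC case: one compare */
        return trueMethod;
    if (cacheTag == ClassIndexFalse)   /* second PIC case: two compares */
        return falseMethod;
    return 0;   /* miss: extend the PIC or fall back to an open cache */
}
```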
>
>
>> Except then I realized that this situation is already bypassed by
>> common boolean methods being inlined by the compiler.  Still curious,
>> if there are quick answers to my questions.
>>
>> cheers -ben
>>
>
> _,,,^..^,,,_
> best, Eliot
>



-- 
_,,,^..^,,,_
best, Eliot