[Vm-dev] in-line method lookup caching of booleans

Eliot Miranda eliot.miranda at gmail.com
Sat Aug 4 00:35:45 UTC 2018


BTW, I have changes for the suggested speedup that are not yet complete.
The complication is ClosedPICs, which need to check for the odd SmallInteger
in a different place.  If you're burning with curiosity and want to try and
make this work, find the changes I have so far attached...

On Fri, Aug 3, 2018 at 4:04 PM, Eliot Miranda <eliot.miranda at gmail.com>
wrote:

>
>
> On Fri, Aug 3, 2018 at 3:54 PM, Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
>> Hi Ben,
>>
>> On Thu, Aug 2, 2018 at 7:28 PM, Ben Coman <btc at openinworld.com> wrote:
>>
>>>
>>> Just a brain twitch about something I'd like to understand better...
>>>
>>> At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-ji
>>> t-as-fast-as-you-can/
>>> it says... "[Monomorphic] inline caching depends on the fact that in
>>> most programs at most send sites there is no polymorphism and that
>>> most sends bind to *exactly_one* class of receiver over some usefully
>>> long period of time. ... In Cog we implement inline caches in three
>>> forms ... monomorphic inline cache ... closed polymorphic inline cache
>>> ... open polymorphic cache.  What’s nice is that while these sends are
>>> increasingly expensive moving from monomorphic to megamorphic they are
>>> also decreasingly common in roughly a 90%, 9% 0.9% ratio, at least for
>>> typical Smalltalk programs"
>>>
>>> First, I'm curious what is the relative performance of the three
>>> different caches ?
>>>
>>
>> here's a measurement for Spur:
>>
>> | n null void homogenousImmediateA homogenousImmediateB homogenousNormal polymorphic megamorphic |
>> Smalltalk garbageCollect.
>> n := 1000000.
>> void := Array new: 1024 withAll: nil.
>> homogenousImmediateA := Array new: 1024 withAll: 1.
>> homogenousImmediateB := Array new: 1024 withAll: $1.
>> homogenousNormal := Array new: 1024 withAll: 1 / 2.
>> polymorphic := Array new: 1024.
>> 1 to: polymorphic size by: 4 do:
>> 	[:i|
>> 	 polymorphic
>> 		at: i put: i;
>> 		at: i + 1 put: i asFloat;
>> 		at: i + 2 put: i / 1025;
>> 		at: i + 3 put: i * 1.0s1 "scaled decimal"].
>> megamorphic := Array new: 1024.
>> 1 to: megamorphic size by: 8 do:
>> 	[:i|
>> 	 megamorphic
>> 		at: i put: i;
>> 		at: i + 1 put: i asFloat;
>> 		at: i + 2 put: i / 1025;
>> 		at: i + 3 put: i * 1.0s1; "scaled decimal"
>> 		at: i + 4 put: i class;
>> 		at: i + 5 put: i asFloat class;
>> 		at: i + 6 put: (i / 1025) class;
>> 		at: i + 7 put: (i * 1.0s1) class].
>> null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
>> { homogenousImmediateA. homogenousImmediateB. homogenousNormal. polymorphic. megamorphic } collect:
>> 	[:a|
>> 	 "Smalltalk voidCogVMState."
>> 	 ([1 to: 1024 * n do: [:k| | e | e := a at: k \\ 1024 + 1. e yourself". e yourself"]] timeToRun; timeToRun) - null]
>>
>> And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet
>> state, so take these with a pinch of salt) are
>>
>> 64-bit: #(1510 1703 1612 1830 2451)
>> 32-bit: #(2852 4133 2843 3534 5317)
>>
>> or...
>>
>>                        64-bit    32-bit
>> monomorphic to 1         1510      2852
>> monomorphic to $1        1703      4133
>> monomorphic to obj       1612      2843
>> closed polymorphic       1830      3534
>> open polymorphic         2451      5317
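For a rough sense of scale, taking the monomorphic-to-1 row as the baseline, the 64-bit figures put closed-polymorphic sends about 21% and open-polymorphic sends about 62% over monomorphic cost. A minimal C check of that arithmetic, using the numbers above (`overheadPercent` is just an illustrative helper, not part of the VM):

```c
/* Percentage overhead of a send time relative to a baseline time. */
static double overheadPercent(double sendTime, double baseline) {
    return (sendTime / baseline - 1.0) * 100.0;
}

/* 64-bit: overheadPercent(1830, 1510) ~= 21.2  (closed polymorphic)
           overheadPercent(2451, 1510) ~= 62.3  (open polymorphic)
   32-bit: overheadPercent(3534, 2852) ~= 23.9
           overheadPercent(5317, 2852) ~= 86.4 */
```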
>>
>> The sad thing is that Spur's monomorphic sends to immediates are very slow
>> because 32-bit Spur supports 31-bit SmallInteger immediates (for
>> compatibility with 32-bit V3).  Were it to use the simple tagging scheme of
>> two-bit tags, sends would look more like the 64-bit ones.
>>
>> The difference between monomorphic sends to 1 and sends to $1 on 64-bit
>> shows the variability in timing due to a heavily loaded machine.  There's
>> no difference in the instructions executed.
>>
>> Here's the monomorphic cache checking sequence generated for the prolog
>> of each method in (32-bit) V3:
>>
>> LstackOverflow:			# first byte of code, aligned on an 8 byte boundary
>> 	xorl %edx, %edx
>> Lfail:
>> 	call ceMethodAbort0Args
>> 	nop
>> entry:				# aligned on an 8 byte boundary
>> 	movl %edx, %eax
>> 	andl $1, %eax		# only immediate class is SmallInteger
>> 	jnz Lcompare
>> 	movl (%edx), %eax
>> 	andl $0x1f000, %eax	# if non-zero, class is compact & compact class index used as cache tag
>> 	jnz Lcompare
>> 	movl -4(%edx), %eax	# normal class plus 2 bits of header type tag; eliminate header type tag
>> 	andl $0xfffffffc, %eax
>> Lcompare:
>> 	cmpl %ecx, %eax
>> 	jnz Lfail
>> noCheckEntry:
>>
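The V3 probe can be modelled in C. This is only a sketch: `cacheTagV3` is a hypothetical name, and it assumes (per the listing) that for non-immediates the class word sits one word before the base header:

```c
#include <stdint.h>

/* Sketch of the (32-bit) V3 cache-tag computation.  oop is the receiver;
   for non-immediates, header points at the base header word and the
   class word sits immediately before it (the -4(%edx) load). */
static uint32_t cacheTagV3(uint32_t oop, const uint32_t *header) {
    if (oop & 1)
        return oop & 1;                      /* only immediate class is SmallInteger; tag is 1 */
    uint32_t compact = header[0] & 0x1f000;  /* compact class index bits */
    if (compact)
        return compact;                      /* compact class index used as cache tag */
    return header[-1] & 0xfffffffc;          /* class word, header-type tag bits masked off */
}
```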
>> Here's the monomorphic cache checking sequence generated for the prolog
>> of each method in 32-bit Spur:
>>
>> LstackOverflow:			# first byte of code, aligned on an 8 byte boundary
>> 	xorl %edx, %edx
>> Lfail:
>> 	call ceMethodAbort0Args
>> 	nop
>> Limmediate:			# aligned on an 8 byte boundary
>> 	andl $1, %eax		# map SmallInteger to 1, Character to 0
>> 	jmp Lcompare
>> 	nop
>> 	nop
>> 	nop
>> entry:				# aligned on an 8 byte boundary
>> 	movl %edx, %eax
>> 	andl $3, %eax		# SmallInteger is both 2r01 and 2r11, Character is 2r10
>> 	jnz Limmediate
>> 	movl (%edx), %eax
>> 	andl $0x3fffff, %eax	# cache tag is class index field
>> Lcompare:
>> 	cmpl %ecx, %eax
>> 	jnz Lfail
>> noCheckEntry:
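The same probe as a C sketch (`cacheTagSpur32` is an illustrative name; `header` stands for the first word of the object's header):

```c
#include <stdint.h>

/* Sketch of 32-bit Spur's cache-tag computation.  SmallInteger oops end
   in 2r01 or 2r11, Characters in 2r10; both SmallInteger patterns map
   to the single cache tag 1, Character to 0. */
static uint32_t cacheTagSpur32(uint32_t oop, const uint32_t *header) {
    uint32_t tags = oop & 3;
    if (tags != 0)
        return tags & 1;             /* SmallInteger -> 1, Character -> 0 */
    return header[0] & 0x3fffff;     /* cache tag is the class index field */
}
```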
>>
>> Here's the monomorphic cache checking sequence generated for the prolog
>> of each method in 64-bit Spur:
>>
>> LstackOverflow:			# first byte of code, aligned on an 8 byte boundary
>> 	xorq %rdx, %rdx
>> Lfail:
>> 	call ceMethodAbort0Args
>> entry:
>> 	movq %rdx, %rax
>> 	andq $7, %rax		# SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
>> 	jnz Lcompare
>> 	movq (%rdx), %rax
>> 	andq $0x3fffff, %rax	# cache tag is class index field
>> Lcompare:
>> 	cmpq %rcx, %rax
>> 	jnz Lfail
>> noCheckEntry:
>>
>> The 64-bit sequence is the cleanest.  But if 32-bit Spur had 30-bit
>> SmallIntegers alongside its 30-bit Characters then that sequence would be
>> equally simple.
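The 64-bit extraction can be sketched as a three-line C function (`cacheTagSpur64` is an illustrative name; `header` stands for the first word of the object's header):

```c
#include <stdint.h>

/* Sketch of 64-bit Spur's cache-tag computation: the three low tag bits
   are the cache tag for immediates, the 22-bit class index field otherwise. */
static uint64_t cacheTagSpur64(uint64_t oop, const uint64_t *header) {
    uint64_t tags = oop & 7;      /* SmallInteger 2r001, Character 2r010, SmallFloat 2r100 */
    if (tags != 0)
        return tags;
    return header[0] & 0x3fffff;  /* cache tag is class index field */
}
```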
>>
>> I suppose something like this could be quicker for 32-bit Spur with
>> 31-bit SmallIntegers.  It would mean that sends to odd SmallIntegers were
>> slower than sends to even ones, but would be quicker for Characters and
>> even SmallIntegers:
>>
>> LstackOverflow:			# first byte of code, aligned on an 8 byte boundary
>> 	xorl %edx, %edx
>> 	xorl %eax, %eax		# necessary because %eax may have some other value than %edx bitAnd: $3 when branching to LstackOverflow
>> Lfail:
>> 	cmpl $3, %eax
>>
>
> Oops!! :
>
> jnz Limmediate
>>
>
> I of course meant
>
> jz Limmediate
>>
>
>
>> 	call ceMethodAbort0Args
>> Limmediate:
>> 	andl $1, %eax		# map odd SmallInteger (2r11) to 2r01
>> 	jmp Lcompare
>> entry:				# aligned on an 8 byte boundary
>> 	movl %edx, %eax
>> 	andl $3, %eax		# SmallInteger is both 2r01 and 2r11, Character is 2r10.  SmallInteger cache tag is 2r01
>> 	jnz Lcompare
>> 	movl (%edx), %eax
>> 	andl $0x3fffff, %eax	# cache tag is class index field
>> Lcompare:
>> 	cmpl %ecx, %eax
>> 	jnz Lfail
>> noCheckEntry:
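With the jz correction applied, the whole proposed probe, including the one-shot retry for odd SmallIntegers, might be modelled in C like this (`hitsCache` is a hypothetical name for illustration only):

```c
#include <stdint.h>

/* Sketch of the proposed 32-bit Spur probe with 31-bit SmallIntegers.
   The fast path compares the raw two tag bits against the cached tag;
   an odd SmallInteger (2r11) misses the first compare and is remapped
   to the canonical SmallInteger tag 2r01 for one retry (Limmediate). */
static int hitsCache(uint32_t oop, const uint32_t *header, uint32_t cachedTag) {
    uint32_t tag = oop & 3;
    if (tag == 0)
        tag = header[0] & 0x3fffff;   /* non-immediate: class index field */
    if (tag == cachedTag)
        return 1;                     /* fast path hit */
    if (tag == 3 && cachedTag == 1)
        return 1;                     /* odd SmallInteger retried as 2r01 */
    return 0;                         /* genuine miss: call ceMethodAbort0Args */
}
```

This keeps even SmallIntegers and Characters on the fast path and pushes the extra mask-and-retry onto odd SmallIntegers only, matching the trade-off described above.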
>>
>>
>>> Second, I'm curious how Booleans are dealt with.  Boolean handling
>>> must be fairly common, and at the Image level these are two different
>>> classes, which pushes out of the monomorphic inline cache, which may
>>> be a significant impact on performance.
>>>
>>> I started wondering if under the hood the VM could treat the two
>>> booleans as one class and in the instance store "which-boolean" it
>>> actually is.  Then for example the #ifTrue:ifFalse: method of each
>>> class would be compiled into a single method conditioned at the start
>>> on "which-boolean" as to which path is executed.  How feasible would
>>> that be, and would it make much of a difference in performance of
>>> booleans ?
>>>
>>
>> You can see that the closeness in performance between monomorphic and
>> closed polymorphic sends means that handling true and false as instances of
>> two different classes costs relatively little in performance.
>>
>>
>>> Except then I realized that this situation is already bypassed by
>>> common boolean methods being inlined by the compiler.  Still curious,
>>> if there are quick answers to my questions.
>>>
>>> cheers -ben
>>>
>>
>> _,,,^..^,,,_
>> best, Eliot
>>
>
>
>
> --
> _,,,^..^,,,_
> best, Eliot
>



-- 
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AlternativeInlineCacheProbeFor32BitSpur.st
Type: application/octet-stream
Size: 20492 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20180803/48eb3c17/attachment-0001.obj>

