[Vm-dev] in-line method lookup caching of booleans

Eliot Miranda eliot.miranda at gmail.com
Sat Aug 4 00:37:26 UTC 2018


BTW, I have changes for the suggested speedup that are not yet complete.
The complication is closed PICs which need to check for the odd
SmallInteger receiver in a different place.  If you're burning with
curiosity and want to try and make this work then find the changes I have
so far attached...

On Fri, Aug 3, 2018 at 5:35 PM, Eliot Miranda <eliot.miranda at gmail.com>
wrote:

> BTW, I have changes for the suggested speedup that are not yet complete.
> The complication is closed PICs which need to check for the odd SmallInteger
> receiver in a different place.  If you're burning with curiosity and want to
> try and make this work then find the changes I have so far attached...
>
> On Fri, Aug 3, 2018 at 4:04 PM, Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
>
>>
>>
>> On Fri, Aug 3, 2018 at 3:54 PM, Eliot Miranda <eliot.miranda at gmail.com>
>> wrote:
>>
>>> Hi Ben,
>>>
>>> On Thu, Aug 2, 2018 at 7:28 PM, Ben Coman <btc at openinworld.com> wrote:
>>>
>>>>
>>>> Just a brain twitch about something I'd like to understand better...
>>>>
>>>> At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/
>>>> it says... "[Monomorphic] inline caching depends on the fact that in
>>>> most programs at most send sites there is no polymorphism and that
>>>> most sends bind to *exactly_one* class of receiver over some usefully
>>>> long period of time. ... In Cog we implement inline caches in three
>>>> forms ... monomorphic inline cache ... closed polymorphic inline cache
>>>> ... open polymorphic cache.  What’s nice is that while these sends are
>>>> increasingly expensive moving from monomorphic to megamorphic they are
>>>> also decreasingly common in roughly a 90%, 9% 0.9% ratio, at least for
>>>> typical Smalltalk programs"
>>>>
>>>> First, I'm curious: what is the relative performance of the three
>>>> different caches?
>>>>
>>>
>>> here's a measurement for Spur:
>>>
>>> | n null void homogenousImmediateA homogenousImmediateB homogenousNormal polymorphic megamorphic |
>>> Smalltalk garbageCollect.
>>> n := 1000000.
>>> void := Array new: 1024 withAll: nil.
>>> homogenousImmediateA := Array new: 1024 withAll: 1.
>>> homogenousImmediateB := Array new: 1024 withAll: $1.
>>> homogenousNormal := Array new: 1024 withAll: 1 / 2.
>>> polymorphic := Array new: 1024.
>>> 1 to: polymorphic size by: 4 do:
>>>     [:i|
>>>     polymorphic
>>>         at: i put: i;
>>>         at: i + 1 put: i asFloat;
>>>         at: i + 2 put: i / 1025;
>>>         at: i + 3 put: i * 1.0s1 "scaled decimal"].
>>> megamorphic := Array new: 1024.
>>> 1 to: megamorphic size by: 8 do:
>>>     [:i|
>>>     megamorphic
>>>         at: i put: i;
>>>         at: i + 1 put: i asFloat;
>>>         at: i + 2 put: i / 1025;
>>>         at: i + 3 put: i * 1.0s1; "scaled decimal"
>>>         at: i + 4 put: i class;
>>>         at: i + 5 put: i asFloat class;
>>>         at: i + 6 put: (i / 1025) class;
>>>         at: i + 7 put: (i * 1.0s1) class].
>>> null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
>>> { homogenousImmediateA. homogenousImmediateB. homogenousNormal. polymorphic. megamorphic } collect:
>>>     [:a|
>>>     "Smalltalk voidCogVMState."
>>>     ([1 to: 1024 * n do: [:k| | e | e := a at: k \\ 1024 + 1. e yourself ". e yourself"]] timeToRun; timeToRun) - null]
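>>>
>>> (Note that each "timeToRun; timeToRun" cascade runs its block twice and
>>> answers the second measurement, so every loop is already jitted and warm
>>> when it is timed.)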
>>>
>>> And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet
>>> state, so take these with a pinch of salt) are
>>>
>>> 64-bit: #(1510 1703 1612 1830 2451)
>>> 32-bit: #(2852 4133 2843 3534 5317)
>>>
>>> or (times in milliseconds)...
>>>                       64-bit    32-bit
>>> monomorphic to 1        1510      2852
>>> monomorphic to $1       1703      4133
>>> monomorphic to obj      1612      2843
>>> closed polymorphic      1830      3534
>>> open polymorphic        2451      5317
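>>>
>>> Very roughly, dividing by the 1024 * 1000000 sends in each run, the 64-bit
>>> monomorphic case works out to about 1510 / 1024e6 ms per send, i.e. around
>>> 1.5 nanoseconds (a handful of cycles at 2.3GHz), and the open polymorphic
>>> case to around 2.4 nanoseconds.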
>>>
>>> The sad thing is that 32-bit Spur's monomorphic sends to immediates are
>>> very slow because it supports 31-bit SmallInteger immediates (for
>>> compatibility with 32-bit V3).  Were it to use a simple two-bit tagging
>>> scheme, its sends would look more like the 64-bit ones.
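>>>
>>> For concreteness, a little workspace arithmetic (just a sketch of the
>>> encodings described here, not VM code): a 31-bit SmallInteger oop carries
>>> its value in the top 31 bits with a 1 in the low bit, so its low two bits
>>> are 2r01 when the value is even but 2r11 when it is odd, whereas a
>>> Character always tags as 2r10:
>>>
>>> #(0 1 2 3 4) collect: [:v| ((v bitShift: 1) + 1) bitAnd: 3]
>>>     "#(1 3 1 3 1) -- so a single cache tag cannot match every SmallInteger"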
>>>
>>> The difference between monomorphic sends to 1 and sends to $1 on 64-bit
>>> shows the variability in timing due to a heavily loaded machine.  There's
>>> no difference in the instructions executed.
>>>
>>> Here's the monomorphic cache checking sequence generated for the prolog
>>> of each method in (32-bit) V3:
>>>
>>> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
>>> xorl %edx, %edx
>>> Lfail:
>>> call ceMethodAbort0Args
>>> nop
>>> entry: # aligned on an 8 byte boundary
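>>> # %edx holds the receiver; %ecx holds the expected cache tag, loaded at the send site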
>>> movl %edx, %eax
>>> andl $1, %eax # only immediate class is SmallInteger
>>> jnz Lcompare
>>> movl (%edx), %eax
>>> andl $0x1f000, %eax # if non-zero, class is compact & the compact class index is used as the cache tag
>>> jnz Lcompare
>>> movl -4(%edx), %eax # normal class plus 2 bits of header type tag
>>> andl $0xfffffffc, %eax # eliminate the header type tag
>>> Lcompare:
>>> cmpl %ecx, %eax
>>> jnz Lfail
>>> noCheckEntry:
>>>
>>> Here's the monomorphic cache checking sequence generated for the prolog
>>> of each method in 32-bit Spur:
>>>
>>> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
>>> xorl %edx, %edx
>>> Lfail:
>>> call ceMethodAbort0Args
>>> nop
>>> Limmediate: # aligned on an 8 byte boundary
>>> andl $1, %eax # map SmallInteger to 1, Character to 0
>>> jmp Lcompare
>>> nop
>>> nop
>>> nop
>>> entry: # aligned on an 8 byte boundary
>>> movl %edx, %eax
>>> andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10
>>> jnz Limmediate
>>> movl (%edx), %eax
>>> andl $0x3fffff, %eax # cache tag is class index field
>>> Lcompare:
>>> cmpl %ecx, %eax
>>> jnz Lfail
>>> noCheckEntry:
>>>
>>> And here's the monomorphic cache checking sequence generated for the prolog
>>> of each method in 64-bit Spur:
>>>
>>> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
>>> xorq %rdx, %rdx
>>> Lfail:
>>> call ceMethodAbort0Args
>>> entry:
>>> movq %rdx, %rax
>>> andq $7, %rax # SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
>>> jnz Lcompare
>>> movq (%rdx), %rax
>>> andq $0x3fffff, %rax # cache tag is class index field
>>> Lcompare:
>>> cmpq %rcx, %rax
>>> jnz Lfail
>>> noCheckEntry:
>>>
>>> The 64-bit sequence is the cleanest.  But if 32-bit Spur had 30-bit
>>> SmallIntegers alongside its 30-bit Characters then that sequence would be
>>> equally simple.
>>>
>>> I suppose something like this could be quicker for 32-bit Spur with
>>> 31-bit SmallIntegers.  It would mean that sends to odd SmallIntegers were
>>> slower than sends to even ones, but would be quicker for Characters and
>>> even SmallIntegers:
>>>
>>> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
>>> xorl %edx, %edx
>>> xorl %eax, %eax # necessary because %eax may have some other value than %edx bitAnd: $3 when branching to LstackOverflow
>>> Lfail:
>>> cmpl $3, %eax
>>>
>>
>> Oops!! :
>>
>> jnz Limmediate
>>>
>>
>> I of course meant
>>
>> jz Limmediate
>>>
>>
>>
>>> call ceMethodAbort0Args
>>> Limmediate:
>>> andl $1, %eax # map odd SmallInteger (2r11) to 2r01
>>> jmp Lcompare
>>> entry: # aligned on an 8 byte boundary
>>> movl %edx, %eax
>>> andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10; SmallInteger cache tag is 2r01
>>> jnz Lcompare
>>> movl (%edx), %eax
>>> andl $0x3fffff, %eax # cache tag is class index field
>>> Lcompare:
>>> cmpl %ecx, %eax
>>> jnz Lfail
>>> noCheckEntry:
>>>
>>>
>>>> Second, I'm curious how Booleans are dealt with.  Boolean handling
>>>> must be fairly common, and at the Image level true and false are instances
>>>> of two different classes, which pushes such sends out of the monomorphic
>>>> inline cache and may have a significant impact on performance.
>>>>
>>>> I started wondering if, under the hood, the VM could treat the two
>>>> booleans as one class and store in the instance "which-boolean" it
>>>> actually is.  Then, for example, the #ifTrue:ifFalse: method of each
>>>> class would be compiled into a single method conditioned at the start
>>>> on "which-boolean" to select which path is executed.  How feasible would
>>>> that be, and would it make much of a difference to the performance of
>>>> booleans?
>>>>
>>>
>>> You can see from the closeness in performance between monomorphic and
>>> closed polymorphic sends that handling true and false as instances of
>>> two different classes costs relatively little.
>>>
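>>> As a rough illustration (a sketch in the style of the benchmark above, not
>>> a careful measurement), one can time sends to an array that alternates true
>>> and false, which exercises a two-case closed PIC at the send site:
>>>
>>> | n void booleans null |
>>> Smalltalk garbageCollect.
>>> n := 1000000.
>>> void := Array new: 1024 withAll: nil.
>>> booleans := (1 to: 1024) collect: [:i| i odd]. "alternating true/false receivers"
>>> null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
>>> ([1 to: 1024 * n do: [:k| | e | e := booleans at: k \\ 1024 + 1. e yourself]] timeToRun; timeToRun) - null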
>>>
>>>> Except then I realized that this situation is already bypassed by
>>>> common boolean methods being inlined by the compiler.  Still curious
>>>> if there are quick answers to my questions.
>>>>
>>>> cheers -ben
>>>>
>>>
>>> _,,,^..^,,,_
>>> best, Eliot
>>>
>>
>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>
>
>
> --
> _,,,^..^,,,_
> best, Eliot
>



-- 
_,,,^..^,,,_
best, Eliot
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AlternativeInlineCacheProbeFor32BitSpur.st
Type: application/octet-stream
Size: 20492 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20180803/599a42bd/attachment-0001.obj>

