[Vm-dev] in-line method lookup caching of booleans

Ben Coman btc at openinworld.com
Sat Aug 4 05:26:38 UTC 2018


Hi Eliot, Thanks for the great detail in your reply.

On 4 August 2018 at 08:35, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>
> BTW, I have changes for the suggested speedup that are not yet complete.  The complication is closed PICs which need to check for the odd SmallInteger in a different place.  If you're burning with curiosity and want to try and make this work find the changes I have so far attached...
>
> On Fri, Aug 3, 2018 at 4:04 PM, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>
>>
>>
>> On Fri, Aug 3, 2018 at 3:54 PM, Eliot Miranda <eliot.miranda at gmail.com> wrote:
>>>
>>> Hi Ben,
>>>
>>> On Thu, Aug 2, 2018 at 7:28 PM, Ben Coman <btc at openinworld.com> wrote:
>>>>
>>>>
>>>> Just a brain twitch about something I'd like to understand better...
>>>>
>>>> At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/
>>>> it says... "[Monomorphic] inline caching depends on the fact that in
>>>> most programs at most send sites there is no polymorphism and that
>>>> most sends bind to *exactly_one* class of receiver over some usefully
>>>> long period of time. ... In Cog we implement inline caches in three
>>>> forms ... monomorphic inline cache ... closed polymorphic inline cache
>>>> ... open polymorphic cache.  What’s nice is that while these sends are
>>>> increasingly expensive moving from monomorphic to megamorphic they are
>>>> also decreasingly common in roughly a 90%, 9% 0.9% ratio, at least for
>>>> typical Smalltalk programs"
>>>>
>>>> First, I'm curious what is the relative performance of the three
>>>> different caches ?
>>>
>>>
>>> here's a measurement for Spur:
>>>
>>> | n null void homogenousImmediateA homogenousImmediateB homogenousNormal polymorphic megamorphic |
>>> Smalltalk garbageCollect.
>>> n := 1000000.
>>> void := Array new: 1024 withAll: nil.
>>> homogenousImmediateA := Array new: 1024 withAll: 1.
>>> homogenousImmediateB := Array new: 1024 withAll: $1.
>>> homogenousNormal := Array new: 1024 withAll: 1 / 2.
>>> polymorphic := Array new: 1024.
>>> 1 to: polymorphic size by: 4 do:
>>>     [:i|
>>>     polymorphic
>>>         at: i put: i;
>>>         at: i + 1 put: i asFloat;
>>>         at: i + 2 put: i / 1025;
>>>         at: i + 3 put: i * 1.0s1 "scaled decimal"].
>>> megamorphic := Array new: 1024.
>>> 1 to: megamorphic size by: 8 do:
>>>     [:i|
>>>     megamorphic
>>>         at: i put: i;
>>>         at: i + 1 put: i asFloat;
>>>         at: i + 2 put: i / 1025;
>>>         at: i + 3 put: i * 1.0s1; "scaled decimal"
>>>         at: i + 4 put: i class;
>>>         at: i + 5 put: i asFloat class;
>>>         at: i + 6 put: (i / 1025) class;
>>>         at: i + 7 put: (i * 1.0s1) class].
>>> null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
>>> { homogenousImmediateA. homogenousImmediateB. homogenousNormal. polymorphic. megamorphic } collect:
>>>     [:a|
>>>     "Smalltalk voidCogVMState."
>>>     ([1 to: 1024 * n do: [:k| | e | e := a at: k \\ 1024 + 1. e yourself". e yourself"]] timeToRun; timeToRun) - null]
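
(Interjecting a note for other readers, as I read the harness: the
cascade `timeToRun; timeToRun` runs each block twice and answers the
second timing, so the first run pays the JIT-compilation and
cache-warming cost, and subtracting `null` removes the loop and #at:
overhead, leaving roughly just the send cost.  In isolation the
pattern is:

    | measure |
    "run the block twice; the cascade answers the second, warmed-up timing"
    measure := [:block| block timeToRun; timeToRun].
    measure value: [1 to: 1000000 do: [:i| i yourself]]
)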
>>>
>>> And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet state, so take these with a pinch of salt) are
>>>
>>> 64-bit: #(1510 1703 1612 1830 2451)
>>> 32-bit: #(2852 4133 2843 3534 5317)
>>>
>>> or...
>>>                          64-bit  32-bit
1. >>> monomorphic to 1        1510    2852
2. >>> monomorphic to $1       1703    4133
3. >>> monomorphic to obj      1612    2843
4. >>> closed polymorphic      1830    3534
5. >>> open polymorphic        2451    5317

>>> You can see that the closeness in performance between monomorphic and closed polymorphic sends means that handling true and false as instances of two different classes costs relatively little in performance.
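
To make that concrete in the style of your harness (my own sketch): a
send site whose receivers alternate between true and false sees
exactly two classes, True and False, so it should land near the
closed-polymorphic row rather than the monomorphic ones:

    | booleans n |
    n := 1000000.
    booleans := Array new: 1024.
    1 to: booleans size do: [:i| booleans at: i put: i odd]. "alternate true/false"
    [1 to: 1024 * n do:
        [:k| | e |
        e := booleans at: k \\ 1024 + 1.
        e yourself "this site sees True and False: a two-case closed PIC"]] timeToRun; timeToRun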


I presume those rows are a direct match for each of these
descriptions from the V3 article:
1. homogenous immediate     2.8 nsecs
2. homogenous compact       8.6 nsecs
3. homogenous normal        8.5 nsecs
4. polymorphic             11.2 nsecs
5. megamorphic             16.7 nsecs

The article says about V3 "the cost of class access is a significant
percentage of overall send costs: 8.6 – 2.8 = 5.8"

That cost seems to be gone from Spur.  I presume because...
>>> # cache tag is class index field
whereas (IIUC) in V3 the class tag was a pointer that needed to be
looked up elsewhere.


So a random thought (I'm out of my depth, but curious...): what is
the cost (in machine-code instructions) of distinguishing between
monomorphic and closed polymorphic caches?  If monomorphic and closed
polymorphic performance is so close, could it be faster to eliminate
the monomorphic cache checks and go straight to the polymorphic cache
lookup?  I suppose it's a matter of the space taken up by a fixed-size
polymorphic cache.  Could the size of the polymorphic cache be dynamic?

And a provocative thought, while it seems *REALLY* counter-intuitive:
do we actually need immediates?  That would seem to eliminate two
thirds of the 64-bit prolog.  But maybe I'm misled, if the stats above
cover only the method lookup and not getting hold of the values for
calculation.

Or is an explicit check for immediates only required for certain
methods that do a direct calculation with the values?
Or perhaps Sista might be able to determine that immediate objects
are not common at a particular send site, and avoid the immediate
checks at that site (presuming that if an immediate object does turn
up, a normal class lookup would still work?).

I presume here you mean 64-bit Spur, since the sequence uses 64-bit registers...
>>> Here's the monomorphic cache checking sequence generated for the prolog of each method in 32-bit Spur:
>>>
>>> LstackOverflow:          # first byte of code, aligned on an 8 byte boundary
>>>     xorq %rdx, %rdx
>>> Lfail:
>>>     call ceMethodAbort0Args
>>> entry:                   # receiver in %rdx, expected cache tag in %rcx
>>>     movq %rdx, %rax
>>>     andq $7, %rax        # SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
>>>     jnz Lcompare         # immediate: its tag bits are its cache tag
>>>     movq (%rdx), %rax    # ordinary object: fetch the header word
>>>     andq $0x3fffff, %rax # cache tag is class index field
>>> Lcompare:
>>>     cmpq %rcx, %rax      # does the receiver's tag match the cache?
>>>     jnz Lfail            # miss: abort into the run-time to relink the send
>>> noCheckEntry:
>>>
>>> The 64-bit sequence is the cleanest.
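
(As an aside, the three immediate classes behind those tag patterns
are visible from the image side; in a 64-bit Spur image I'd expect

    { 1. $1. 1.0 } collect: [:x| x class]

to answer {SmallInteger. Character. SmallFloat64}.)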

> BTW, I have changes for the suggested speedup that are not yet complete.  The complication is closed PICs which need to check for the odd SmallInteger receiver in a different place.  If you're burning with curiosity and want to try and make this work then find the changes I have so far attached...

I'm tempted, but I have a few other priorities atm.

cheers -ben

