[Vm-dev] in-line method lookup caching of booleans

Fri Aug 3 14:31:30 UTC 2018

Hi Clément, Hi Ben,

_,,,^..^,,,_ (phone)

> On Aug 3, 2018, at 1:19 AM, Clément Béra <bera.clement at gmail.com> wrote:
> 
> Hi,
> 
>> On Fri, Aug 3, 2018 at 4:28 AM, Ben Coman <btc at openinworld.com> wrote:
>>  
>> Just a brain twitch about something I'd like to understand better...
>> 
>> At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/
>> it says... "[Monomorphic] inline caching depends on the fact that in
>> most programs at most send sites there is no polymorphism and that
>> most sends bind to *exactly_one* class of receiver over some usefully
>> long period of time. ... In Cog we implement inline caches in three
>> forms ... monomorphic inline cache ... closed polymorphic inline cache
>> ... open polymorphic cache.  What’s nice is that while these sends are
>> increasingly expensive moving from monomorphic to megamorphic they are
>> also decreasingly common in roughly a 90%, 9% 0.9% ratio, at least for
>> typical Smalltalk programs"
> 
> Note that in my experience the ratio was, each time I measured in the Cog Simulator, 93%, 4%, 3% or something like that. Cliff Click told me recently he had the same feeling with Java code: sends are either mono or megamorphic, but rarely poly (though in the JVM polymorphism is up to 4).
> 
>> 
>> First, I'm curious what is the relative performance of the three
>> different caches ?

Towards the bottom of the blog post is an example that measures relative performance.  The table I give is as follows, but the data are from a V3 image.  Ben, maybe you could run the example on Spur and see if things are much different.

homogenous immediate     2.8 nsecs
homogenous compact       8.6 nsecs
homogenous normal        8.5 nsecs
polymorphic             11.2 nsecs
megamorphic             16.7 nsecs
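
To make the relative costs easier to read, here is a small Python calculation over the table above. Taking the "homogenous normal" row as the baseline is my own choice for illustration, and the numbers are still the V3 ones, so treat the ratios as indicative only.

```python
# Timings copied from the table above (V3 image), in nanoseconds per send.
timings_ns = {
    "homogenous immediate": 2.8,
    "homogenous compact": 8.6,
    "homogenous normal": 8.5,
    "polymorphic": 11.2,
    "megamorphic": 16.7,
}

# Assumption: use the monomorphic send to a normal-class receiver as baseline.
baseline = timings_ns["homogenous normal"]
relative_cost = {kind: ns / baseline for kind, ns in timings_ns.items()}

for kind, ratio in relative_cost.items():
    print(f"{kind:22s} {ratio:.2f}x")
```

On these figures a polymorphic send costs roughly 1.3 times the baseline and a megamorphic send roughly twice the baseline.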

> I guess you can measure that with micro benchs. Levente can help with that, I would write complete crap. Note that our polymorphism is up to 6.
> 
> In the case of Cog, I would expect monomorphic caches to be almost as fast as direct calls, polymorphic caches to be almost as fast as monomorphic, and megamorphic caches to be considerably slower. To be confirmed.
> 
> Note also that for some reason in Cog PICs are called ClosedPICs and megamorphic caches are called OpenPICs.

Here’s my explanation from the blog post:

In Cog we implement inline caches in three forms:

– a monomorphic inline cache; a load of a cache tag and a call of a target method whose prolog checks that the current receiver’s class matches the cache tag

– a closed polymorphic inline cache; a growable but finite jump table of class checks and jumps to method bodies, (closed because it deals with a closed set of classes).

– an open polymorphic cache; a first-level method lookup probe specialized for a particular selector, (open since it deals with an open set of classes).
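
The three forms can be pictured as states of a single send site. The following is only a toy Python model under my own assumptions, not Cog's machine code: `SendSite`, `full_lookup` and the state names are hypothetical, the class check is done at the send site rather than in the callee's prolog, and the limit of 6 cases is taken from Clément's remark above.

```python
MAX_PIC_CASES = 6  # "our polymorphism is up to 6", per Clément's note


def full_lookup(cls, selector):
    """Stand-in for the VM's full method lookup."""
    return getattr(cls, selector)


class SendSite:
    """Toy model of one send site; not Cog's real data structures."""

    def __init__(self, selector):
        self.selector = selector
        self.state = "unlinked"
        self.cases = []       # (class, method): 1 entry = monomorphic, 2..6 = closed PIC
        self.open_cache = {}  # class -> method, used once the site is megamorphic

    def send(self, receiver, *args):
        cls = type(receiver)
        if self.state == "open":
            # Open PIC: a lookup probe for this selector over an open set of classes.
            method = self.open_cache.get(cls)
            if method is None:
                method = self.open_cache[cls] = full_lookup(cls, self.selector)
            return method(receiver, *args)
        # Monomorphic cache or closed PIC: a finite sequence of class checks.
        for cached_cls, method in self.cases:
            if cached_cls is cls:
                return method(receiver, *args)
        # Cache miss: look the method up, then grow or promote the cache.
        method = full_lookup(cls, self.selector)
        if len(self.cases) < MAX_PIC_CASES:
            self.cases.append((cls, method))
            self.state = "monomorphic" if len(self.cases) == 1 else "closed PIC"
        else:
            self.open_cache = dict(self.cases)
            self.open_cache[cls] = method
            self.cases = []
            self.state = "open"
        return method(receiver, *args)
```

In this model a site that only ever sees one receiver class stays monomorphic, a site that sees both True and False would settle as a closed PIC with 2 cases, and a seventh distinct class pushes the site to the open form.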

>> Second, I'm curious how Booleans are dealt with.  Boolean handling
>> must be fairly common, and at the Image level these are two different
>> classes, which pushes out of the monomorphic inline cache, which may
>> be a significant impact on performance.
>> 
> 
> Control flow operations are inlined by the bytecode compiler, and they're the most critical performance-wise.
> 
> The VM fixes the addresses of specific objects (nil, true, false). Become and become forward don't work with those objects. The JIT can generate constants in machine code for the addresses of those objects. That lets it quicken inlined control flow operations.
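
A rough Python picture of why fixed addresses help: an inlined conditional branch can compare the receiver against constants instead of doing a send. The addresses and the name `jump_if_false` below are made up for illustration, and the fallback to #mustBeBoolean is what I would expect for a non-boolean receiver rather than something stated in this thread.

```python
# Hypothetical fixed addresses; the real values are chosen by the VM.
NIL_OOP, FALSE_OOP, TRUE_OOP = 0x1000, 0x1010, 0x1020


def jump_if_false(receiver_oop):
    """Model of an inlined conditional branch: constant compares, no send."""
    if receiver_oop == FALSE_OOP:
        return "take branch"
    if receiver_oop == TRUE_OOP:
        return "fall through"
    # Assumption: anything else falls back to a real send.
    return "send #mustBeBoolean"
```

Because the constants never change, no class check or cache is involved on the common path.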
> 
> Non-inlined operations are usually dealt with by PICs with 2 cases.
> 
> One issue I had in Sista was with early PIC promotion. Currently in Cog, if there's already a megamorphic cache for a selector, monomorphic caches are rewritten to the megamorphic cache directly and not to a PIC and then to megamorphic. This is typically a problem with sends such as #isNil. In some cases the isNil is megamorphic and using such a cache is relevant. In other cases there's just a given type and UndefinedObject, and there the VM currently uses a megamorphic cache instead of a PIC. This was especially annoying since megamorphic caches don't provide any runtime type feedback. It's a similar 2-case issue, just not with booleans.

Hence we disable the optimisation in the Sista JIT, right?
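
The early-promotion policy Clément describes can be sketched as a single decision taken when a monomorphic site misses. This is only a Python illustration of the rule as stated in this thread; the function name, the set of selectors and the `early_promotion` flag are hypothetical and do not correspond to actual VM settings.

```python
def relink_target_on_miss(selector, selectors_with_open_pic, early_promotion=True):
    """Decide what a monomorphic send site is relinked to after a class miss."""
    if early_promotion and selector in selectors_with_open_pic:
        # An open PIC already exists for this selector: jump straight to it,
        # skipping the closed PIC (and losing per-site type feedback).
        return "open PIC"
    return "closed PIC"


# A site sending #isNil that only sees one class plus UndefinedObject:
print(relink_target_on_miss("isNil", {"isNil"}))                         # open PIC
print(relink_target_on_miss("isNil", {"isNil"}, early_promotion=False))  # closed PIC
```

With the flag off, the 2-case isNil site keeps a closed PIC, which is the form that can still record which classes it has seen.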

> 
>> I started wondering if under the hood the VM could treat the two
>> booleans as one class and in the instance store "which-boolean" it
>> actually is.  Then for example the #ifTrue:ifFalse: method of each
>> class would be compiled into a single method conditioned at the start
>> on "which-boolean" as to which path is executed.  How feasible would
>> that be, and would it make much of a difference in performance of
>> booleans ?
>> 
> 
> Some VMs have booleans as primitive types or as immediate objects. Honestly, given our optimization that the booleans can't move in memory and that we can use their addresses as constants, I don't think any of these optimizations would make any significant difference in terms of performance.
> 
> What might make a difference is to add as primitive operations / inlined operations some other boolean methods, such as #not. It's a trade-off between system flexibility, boolean and non boolean performance, etc. 
>  
>> Except then I realized that this situation is already bypassed by
>> common boolean methods being inlined by the compiler.  Still curious,
>> if there are quick answers to my questions.
> 
> ..
>  
>> cheers -ben
> 
> -- 
> Clément Béra
> https://clementbera.github.io/
> https://clementbera.wordpress.com/