[Vm-dev] in-line method lookup caching of booleans

Eliot Miranda eliot.miranda at gmail.com
Fri Aug 3 22:54:28 UTC 2018


Hi Ben,

On Thu, Aug 2, 2018 at 7:28 PM, Ben Coman <btc at openinworld.com> wrote:

>
> Just a brain twitch about something I'd like to understand better...
>
> At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/
> it says... "[Monomorphic] inline caching depends on the fact that in
> most programs at most send sites there is no polymorphism and that
> most sends bind to *exactly_one* class of receiver over some usefully
> long period of time. ... In Cog we implement inline caches in three
> forms ... monomorphic inline cache ... closed polymorphic inline cache
> ... open polymorphic cache.  What’s nice is that while these sends are
> increasingly expensive moving from monomorphic to megamorphic they are
> also decreasingly common in roughly a 90%, 9% 0.9% ratio, at least for
> typical Smalltalk programs"
>
> First, I'm curious what is the relative performance of the three
> different caches ?
>

here's a measurement for Spur:

| n null void homogenousImmediateA homogenousImmediateB homogenousNormal polymorphic megamorphic |
Smalltalk garbageCollect.
n := 1000000.
void := Array new: 1024 withAll: nil.
homogenousImmediateA := Array new: 1024 withAll: 1.
homogenousImmediateB := Array new: 1024 withAll: $1.
homogenousNormal := Array new: 1024 withAll: 1 / 2.
polymorphic := Array new: 1024.
1 to: polymorphic size by: 4 do:
    [:i|
    polymorphic
        at: i put: i;
        at: i + 1 put: i asFloat;
        at: i + 2 put: i / 1025;
        at: i + 3 put: i * 1.0s1 "scaled decimal"].
megamorphic := Array new: 1024.
1 to: megamorphic size by: 8 do:
    [:i|
    megamorphic
        at: i put: i;
        at: i + 1 put: i asFloat;
        at: i + 2 put: i / 1025;
        at: i + 3 put: i * 1.0s1; "scaled decimal"
        at: i + 4 put: i class;
        at: i + 5 put: i asFloat class;
        at: i + 6 put: (i / 1025) class;
        at: i + 7 put: (i * 1.0s1) class].
null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
{ homogenousImmediateA. homogenousImmediateB. homogenousNormal. polymorphic. megamorphic } collect:
    [:a|
    "Smalltalk voidCogVMState."
    ([1 to: 1024 * n do: [:k| | e | e := a at: k \\ 1024 + 1. e yourself ". e yourself"]] timeToRun; timeToRun) - null]

And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet state,
so take these with a pinch of salt) are

64-bit: #(1510 1703 1612 1830 2451)
32-bit: #(2852 4133 2843 3534 5317)

or, in milliseconds...

                        64-bit    32-bit
monomorphic to 1          1510      2852
monomorphic to $1         1703      4133
monomorphic to obj        1612      2843
closed polymorphic        1830      3534
open polymorphic          2451      5317
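
For a rough sense of scale (a back-of-the-envelope figure, assuming the null-loop subtraction really does isolate the #yourself send): each run performs 1024 * 1000000, i.e. about 1.02e9, sends, so 1510 ms on the 64-bit VM is about 1.5 nanoseconds, or roughly 3.5 cycles at 2.3GHz, per monomorphic send, and 2451 ms is about 2.4 nanoseconds per open polymorphic send.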

The sad thing is that 32-bit Spur's monomorphic sends to immediates are very slow
because it supports 31-bit SmallInteger immediates (for compatibility with 32-bit
V3).  Were it to use a simple two-bit tagging scheme, sends would look more like
the 64-bit ones.
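
To make the tagging difference concrete, here's a minimal image-level sketch (plain Smalltalk, not VM code; the block names are just illustrative) of why a 31-bit SmallInteger yields two possible 2-bit tag patterns while a 30-bit SmallInteger would yield only one:

| encode31 encode30 |
encode31 := [:v| (v bitShift: 1) bitOr: 1].  "31-bit SmallInteger: one tag bit, so bit 1 is the value's low bit"
encode30 := [:v| (v bitShift: 2) bitOr: 1].  "hypothetical 30-bit SmallInteger: a fixed two-bit tag, say 2r01"
{ (encode31 value: 4) bitAnd: 3.    "2r01"
  (encode31 value: 3) bitAnd: 3.    "2r11 - so the entry sequence needs an extra case"
  (encode30 value: 4) bitAnd: 3.    "2r01"
  (encode30 value: 3) bitAnd: 3 }   "2r01 - always the same tag"

Hence with 31-bit SmallIntegers the tag check can't simply compare the low bits against a single cache tag value.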

The difference between monomorphic sends to 1 and sends to $1 on 64-bit
shows the variability in timing due to a heavily loaded machine.  There's
no difference in the instructions executed.

Here's the monomorphic cache checking sequence generated for the prolog of
each method in (32-bit) V3:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
xorl %edx, %edx
Lfail:
call ceMethodAbort0Args
nop
entry: # aligned on an 8 byte boundary
movl %edx, %eax
andl $1, %eax # only immediate class is SmallInteger
jnz Lcompare
movl (%edx), %eax
andl $0x1f000, %eax # if non-zero, class is compact & compact class index used as cache tag
jnz Lcompare
movl -4(%edx), %eax # normal class plus 2 bits of header type tag, eliminate header type tag
andl $0xfffffffc, %eax
Lcompare:
cmpl %ecx, %eax
jnz Lfail
noCheckEntry:
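
In other words the cache tag the V3 prologue computes is one of three things, depending on the receiver.  A minimal sketch of the same logic in plain Smalltalk (the block and its arguments are illustrative and simulate the header words rather than reading real memory):

| v3CacheTagFor |
v3CacheTagFor := [:oop :baseHeader :classWord|
    (oop bitAnd: 1) ~= 0
        ifTrue: [1]  "immediate: SmallInteger is the only immediate class, so the tag bit is the cache tag"
        ifFalse:
            [(baseHeader bitAnd: 16r1F000) ~= 0
                ifTrue: [baseHeader bitAnd: 16r1F000]  "compact class: the index bits double as the cache tag"
                ifFalse: [classWord bitAnd: 16rFFFFFFFC]]]  "normal class: the class word with the two header type tag bits masked off"

Whatever the case, the result is compared against the inline cache tag the send site loaded into %ecx.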

Here's the monomorphic cache checking sequence generated for the prolog of
each method in 32-bit Spur:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
xorl %edx, %edx
Lfail:
call ceMethodAbort0Args
nop
Limmediate: # aligned on an 8 byte boundary
andl $1, %eax # map SmallInteger to 1, Character to 0
jmp Lcompare
nop
nop
nop
entry: # aligned on an 8 byte boundary
movl %edx, %eax
andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10
jnz Limmediate
movl (%edx), %eax
andl $0x3fffff, %eax # cache tag is class index field
Lcompare:
cmpl %ecx, %eax
jnz Lfail
noCheckEntry:
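
So on 32-bit Spur the cache tag for an immediate is the receiver's tag bits folded through andl $1.  A one-liner (plain Smalltalk) evaluating the same bit operation on the three possible patterns:

#(2r01 2r11 2r10) collect: [:tagBits| tagBits bitAnd: 1]  "=> #(1 1 0)"

i.e. both SmallInteger patterns collapse to 1 and Character to 0, so one cache tag value per immediate class suffices.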

Here's the monomorphic cache checking sequence generated for the prolog of
each method in 64-bit Spur:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
xorq %rdx, %rdx
Lfail:
call ceMethodAbort0Args
entry:
movq %rdx, %rax
andq $7, %rax # SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
jnz Lcompare
movq (%rdx), %rax
andq $0x3fffff, %rax # cache tag is class index field
Lcompare:
cmpq %rcx, %rax
jnz Lfail
noCheckEntry:

The 64-bit sequence is the cleanest.  But if 32-bit Spur had 30-bit
SmallIntegers alongside its 30-bit Characters then that sequence would be
equally simple.
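
Either way, in both Spur sequences the cache tag is either the receiver's tag bits (for immediates) or the class index field of its header (for heap objects).  A minimal sketch with simulated inputs and illustrative names (32-bit Spur additionally folds the two SmallInteger patterns into one, as in the Limmediate case above):

| spurCacheTagFor |
spurCacheTagFor := [:oop :baseHeader :tagMask|  "tagMask is 3 on 32-bit Spur, 7 on 64-bit Spur"
    (oop bitAnd: tagMask) ~= 0
        ifTrue: [oop bitAnd: tagMask]  "immediate: the tag bits themselves serve as the cache tag"
        ifFalse: [baseHeader bitAnd: 16r3FFFFF]]  "heap object: the 22-bit class index field of the base header"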

I suppose something like this could be quicker for 32-bit Spur with 31-bit
SmallIntegers.  It would mean that sends to odd SmallIntegers would be slower
than sends to even ones, but it would be quicker for Characters and even
SmallIntegers:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
xorl %edx, %edx
xorl %eax, %eax # necessary because %eax may have some other value than %edx bitAnd: $3 when branching to LstackOverflow
Lfail:
cmpl $3, %eax
jz Limmediate # only an odd SmallInteger (tag 2r11) is remapped and retried; anything else is a genuine miss
call ceMethodAbort0Args
Limmediate:
andl $1, %eax # map odd SmallInteger (2r11) to 2r01
jmp Lcompare
entry: # aligned on an 8 byte boundary
movl %edx, %eax
andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10. SmallInteger cache tag is 2r01
jnz Lcompare
movl (%edx), %eax
andl $0x3fffff, %eax # cache tag is class index field
Lcompare:
cmpl %ecx, %eax
jnz Lfail
noCheckEntry:
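
For concreteness (reusing the 31-bit SmallInteger and the usual 32-bit Spur Character encodings from the sketch above), here are the three 2-bit patterns this entry sequence can see, and the path each takes:

{ ((4 bitShift: 1) bitOr: 1) bitAnd: 3.     "even SmallInteger -> 2r01: matches a SmallInteger cache tag directly"
  ((3 bitShift: 1) bitOr: 1) bitAnd: 3.     "odd SmallInteger -> 2r11: fails the compare, remapped at Limmediate and retried"
  ((65 bitShift: 2) bitOr: 2) bitAnd: 3 }   "Character (e.g. $A) -> 2r10: compared directly against a Character cache tag"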


> Second, I'm curious how Booleans are dealt with.  Boolean handling
> must be fairly common, and at the Image level these are two different
> classes, which pushes out of the monomorphic inline cache, which may
> be a significant impact on performance.
>
> I started wondering if under the hood the VM could treat the two
> booleans as one class and in the instance store "which-boolean" it
> actually is.  Then for example the #ifTrue:ifFalse: method of each
> class would be compiled into a single method conditioned at the start
> on "which-boolean" as to which path is executed.  How feasible would
> that be, and would it make much of a difference in performance of
> booleans ?
>

You can see that the closeness in performance between monomorphic and
closed polymorphic sends means that handling true and false as instances of
two different classes costs relatively little in performance.
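
If you want to measure the Boolean case directly, a small variation on the arrays above (a sketch along the same lines, untested) would do it: an all-true array exercises the monomorphic cache, while a 50/50 mix of true and false exercises a two-case closed PIC.  Plug them into the collect: loop in place of the other arrays:

| monoBoolean polyBoolean |
monoBoolean := Array new: 1024 withAll: true.
polyBoolean := Array new: 1024.
1 to: polyBoolean size by: 2 do:
    [:i|
    polyBoolean
        at: i put: true;
        at: i + 1 put: false].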


> Except then I realized that this situation is already bypassed by
> common boolean methods being inlined by the compiler.  Still curious,
> if there are quick answers to my questions.
>
> cheers -ben
>

_,,,^..^,,,_
best, Eliot

