Adaptive PIC scheme (Re: [Vm-dev] InterpreterSimulator)

Eliot Miranda eliot.miranda at gmail.com
Tue Mar 15 19:28:12 UTC 2016


Hi Guys,

On Tue, Mar 15, 2016 at 11:46 AM, Levente Uzonyi <leves at caesar.elte.hu>
wrote:

>
> Hi Clément,
>
> On Tue, 15 Mar 2016, Clément Bera wrote:
>
> (no quote again, sorry)
>
> So, the idea is to add a counter to each cache entry, and increase it on
> each hit. Once a counter reaches a given limit, divide all counters by
> two[1], and optionally do housekeeping stuff.
> By housekeeping I mean removing entries with zero counter value, resorting
> the entries based on the counter values, and optionally reverting to a
> single method when there's only one cache entry left.
> By optionally I mean that if you don't want to do the housekeeping every
> time the limit is reached, then another counter can be introduced, so that
> only every nth time will housekeeping occur.
> Since there are at most six cache entries, sorting can be done with
> insertion sort or a hardcoded optimal sorting network. The former would
> be simple and efficient; the latter could be even more efficient, but
> more complex as well.
>
> In practice this can be done (at least) two ways:
>
> 1. If there's a free byte at each entry, then the cache entries can store
> their own counters. If we use single byte counters, then it's practical to
> use 256 as the limit. That way division happens when the counter overflows.
> That takes at least 128 cache hits between two divisions, but if we
> eventually revert to a single method when there's only one entry left in
> the PIC, then the average number will be slightly higher than 128.
>
> Pros:
> - simple bookkeeping
> - sorting can be done in-place
> Cons:
> - there's probably no free byte
> - if we don't want to do the housekeeping too often, then the PIC will
> need a few more extra bytes to store another counter for that
>
> 2. Add a single 64-bit word to the PIC. The lower 6 bytes can be used for
> the 6 counters, one per cache entry. The remaining two bytes can hold the
> housekeeping counter.
>
> Pros:
> - still fairly simple bookkeeping, but overflows will have side-effects on
> the next byte. This can be worked around by leaving an empty bit
> between the counter bytes, at the cost of reducing the size of the
> housekeeping counter. Another option is to check the counter value before
> incrementing it to avoid overflows altogether.
> - dividing all counters by two can be done in a few machine instructions
> - 7-bit or 9-bit counters can be used as well
> Cons:
> - incrementing a counter is a bit more complex
> - sorting is trickier because the counter values will have to be swapped
> as well
>
> Assuming there's no free byte, I'd start straight with 2., but even adding
> a 64-bit word to a PIC doesn't seem too easy.
>
> Levente
>
> [1] This scheme holds information about not that recent hits as well, but
> the effect of the past decreases exponentially, so it's a fairly good
> approximation of the most recent hits without the need to store any
> additional data. I came up with this method a few years ago, yet I still
> couldn't find out what it's called.
>
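
(For concreteness, here is a minimal workspace sketch of the packed-word
variant in 2. above: six 8-bit counters in one 64-bit word, all halved
together with one shift and one mask.  The names and the
saturate-then-halve policy here are illustrative only, not actual VM code.

| counters limit incrementAt halveAll |
counters := 0.	"six 8-bit counters packed into one word; top two bytes spare"
limit := 255.
halveAll := [counters := (counters bitShift: -1) bitAnd: 16r7F7F7F7F7F7F7F7F].
incrementAt := [:entry | | shift count |
	shift := (entry - 1) * 8.
	count := (counters bitShift: shift negated) bitAnd: 16rFF.
	count = limit ifTrue: [halveAll value].	"halve all counters when one saturates"
	counters := counters + (1 bitShift: shift)].
1000 timesRepeat: [incrementAt value: 1].	"entry 1 is hot"
10 timesRepeat: [incrementAt value: 3].	"entry 3 sees a few recent hits"
(1 to: 6) collect: [:i| (counters bitShift: (i - 1) * -8) bitAnd: 16rFF]
	"entry 1 keeps a large decayed count, entry 3 holds 10, the rest stay 0"

The mask clears the bit that each byte's neighbour would otherwise shift
into it, so one shift-and-mask halves all six counters at once.)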


Reorganizing 6-element PICs doesn't really make sense.  The cost of the
counters would likely outweigh the benefits of reordering the PICs.  Look
at the time taken for a monomorphic send vs the worst case PIC send (6th
case), vs a megamorphic send:

| n |
n := 1000000000.
{[| v |
  v := 1.
  1 to: n do:
	[:i|
	 i < 1 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do:
	[:i|
	 v species.
	 i < 1 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do:
	[:i|
	 v species.
	 i < 6 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do:
	[:i|
	 v species.
	 i < 7 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]]} collect:
	[:ea| ea timeToRun]  "#(2716 7013 7376 8350)"


So on a 2.2GHz MacBook Pro Core i7 the monomorphic send of #species (block
2) takes (7.013 - 2.716) seconds / 1,000,000,000 iterations = 4.3
nanoseconds; the polymorphic send in the 6th case of a PIC (block 3) takes
(7.376 - 2.716) seconds / 1,000,000,000 = 4.66 nanoseconds.  Only the open
PIC send (full hashed lookup) takes significantly longer, at (8.35 - 2.716)
seconds / 1,000,000,000 = 5.63 nanoseconds.
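
In doIt form: the runs are milliseconds over 10^9 sends, so nanoseconds per
send is just (time - baseline) / 1000:

#(7013 7376 8350) collect: [:ms| (ms - 2716) / 1000.0]
	"#(4.297 4.66 5.634) ns for the monomorphic, 6th-case PIC and open PIC sends"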

Let's add a counter; this is probably a lot more expensive than a VM
counter implementation, but it gives us a sense of scale.

| n |
n := 1000000000.
{[| v c |
  c := 0. v := 1.
  1 to: n do:
	[:i|
	 c := c + 1.
	 i < 1 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v c |
  c := 0. v := 1.
  1 to: n do:
	[:i|
	 c := c + 1.
	 v species.
	 i < 1 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v c |
  c := 0. v := 1.
  1 to: n do:
	[:i|
	 c := c + 1.
	 v species.
	 i < 6 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v c |
  c := 0. v := 1.
  1 to: n do:
	[:i|
	 c := c + 1.
	 v species.
	 i < 7 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]]} collect:
	[:ea| ea timeToRun]  "#(3399 7755 8361 9017)"

So a counter adds approximately (#(7755 8361 9017) sum - #(7013 7376 8350)
sum) / (#(7013 7376 8350) collect: [:ea| ea - 2716]) sum = 16% overhead,
whereas the worst case polymorphic send is only (7376 - 2716) / (7013 -
2716) asFloat = 8% slower than a monomorphic send.  So the maximum
improvement counters could make would be 8%.  Even if their cost fell to
1/4 of the cost above they would still cost 50% of the maximum return they
could deliver.  It's not worth the complexity; the gains would be in the
noise.
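
Spelling that arithmetic out with explicit parentheses (all times in
milliseconds from the two runs above):

| baseline plain counted overhead penalty |
baseline := 2716.			"empty-loop time"
plain := #(7013 7376 8350).		"monomorphic, 6th-case PIC, open PIC"
counted := #(7755 8361 9017).	"the same three with the Smalltalk-level counter"
overhead := (counted sum - plain sum) /
	(plain inject: 0 into: [:sum :ms| sum + (ms - baseline)]).
penalty := (7376 - baseline) / (7013 - baseline).
{overhead asFloat. penalty asFloat}
	"{0.164 . 1.084}: counters cost ~16% of send time to chase at most an ~8% win"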


<N.B.>  The above PIC sends contain an error in my methodology.  Tim
redesigned PICs late last year so that a two-element PIC does a test and
then jumps to the 6th case to perform the next test.  If the PIC is
extended, that jump is modified to jump to the 5th case, the next extension
changes it to the 4th case, and so on.  So in fact, to measure the cost of
a send to the 6th element of the PIC we need to change v back to the class
of the 2nd case:

[| v |
  v := 1.
  1 to: n do:
	[:i|
	 v species.
	 i < 7 ifTrue: [v := #(1.0 #one 'one' #[1] 1.0 $1 #(1)) at: i]]].

If you try this, the above numbers don't change significantly, so I've been
lazy and haven't redone my analysis, but here's an example run:

| n |
n := 1000000000.
{[| v |
  v := 1.
  1 to: n do:
	[:i|
	 i < 1 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do:
	[:i|
	 v species.
	 i < 1 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do:
	[:i|
	 v species.
	 i < 7 ifTrue: [v := #(1.0 #one 'one' #[1] 1.0 $1 #(1)) at: i]]].
 [| v |
  v := 1.
  1 to: n do:
	[:i|
	 v species.
	 i < 7 ifTrue: [v := #(1.0 #one 'one' #[1] $1 #(1)) at: i]]]} collect:
	[:ea| ea timeToRun]  "#(2778 7370 7683 8683)"
</N.B.>
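
For illustration, here is a toy workspace model of that extension scheme, a
sketch based only on the description above and not the Cogit's actual
machine-code layout: cases fill from the far end towards the front, and the
front case's miss-jump is retargeted on each extension.

| slots jumpTarget addCase lookup |
slots := Array new: 6.
slots at: 1 put: SmallInteger -> #method1.	"the first case is always at the front"
slots at: 6 put: Float -> #method2.		"the second case goes in the 6th slot"
jumpTarget := 6.				"the front case's miss-jump target"
addCase := [:assoc |
	jumpTarget := jumpTarget - 1.		"each extension retargets the jump"
	slots at: jumpTarget put: assoc].
lookup := [:class | | probe |
	(slots at: 1) key == class
		ifTrue: [(slots at: 1) value]
		ifFalse:
			[probe := jumpTarget.		"jump, then fall through the cases"
			 [probe <= 6 and: [(slots at: probe) key ~~ class]]
				whileTrue: [probe := probe + 1].
			 probe <= 6 ifTrue: [(slots at: probe) value] ifFalse: [#picMiss]]].
addCase value: Symbol -> #method3.
{lookup value: Symbol. lookup value: Float. lookup value: String}
	"{#method3. #method2. #picMiss}"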


Some remarks.

Counters are expensive.  Sista already goes to some effort to minimise
their costs: we count conditional branches instead of full sends.  The
green book presents measurements on the Smalltalk-80 V2 image of 1983
showing that conditional branch bytecodes (as in inlined ifTrue:ifFalse:,
and:, to:do: et al) are about one sixth as frequent as send bytecodes, and
in today's code their relative frequency has likely fallen a little further
as people write more OO code.

Architectural solutions such as Spur and Sista offer much more substantial
gains; Spur shows a ~40% speedup across a range of benchmarks.  I expect
Sista will deliver more once we are more aggressive with register
allocation and broaden the range of optimizations we implement, although
right now we're seeing gains similar to Spur's.  My expectation is based
both on the success of other adaptive optimizers and on my experience in
tuning OO systems, where the first implementation is almost universally
slower than the one it replaces, the full gains coming only after careful
performance measurement and tuning; Sista is still very young.

_,,,^..^,,,_
best, Eliot