Hi Henrik,

On Tue, Jul 12, 2016 at 1:07 AM, Henrik Johansen <henrik.s.johansen@veloxit.no> wrote:
 
Could also be something gone wonky with the interpreter's method cache, causing excessive misses and the fallback code (lookupOrdinary... and addNewMethod..) to run.
Initially I'd have suspected changes that affect hashing (such as the classTag, the index in the class table, being used instead of the class pointer), but I guess that happened before the first version.
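For concreteness, here's a rough, self-contained sketch of the kind of first-level method-cache probe described above. The names, sizes, hash function and the fullLookup stand-in are all illustrative, not the actual StackInterpreter code; the point is only that the probe hashes selector and class tag together, and that anything that spreads those hashes less evenly turns hits into misses that run the slow path (full lookup plus addNewMethodToCache).

#include <stdio.h>

#define MethodCacheEntries 1024   /* must be a power of two */

typedef struct {
    long selector;
    long classTag;
    long method;
} MethodCacheEntry;

static MethodCacheEntry methodCache[MethodCacheEntries];

/* hypothetical stand-in for the real full lookup */
static long fullLookup(long selector, long classTag)
{
    return selector * 31 + classTag;   /* pretend this is the found method */
}

static unsigned cacheHash(long selector, long classTag)
{
    /* combine the two keys and mask down to the table size */
    return (unsigned)(selector ^ classTag) & (MethodCacheEntries - 1);
}

static long lookupCached(long selector, long classTag)
{
    MethodCacheEntry *e = &methodCache[cacheHash(selector, classTag)];
    if (e->selector == selector && e->classTag == classTag)
        return e->method;                            /* hit: fast path */
    long method = fullLookup(selector, classTag);    /* miss: slow path */
    e->selector = selector;                          /* cache the result */
    e->classTag = classTag;
    e->method = method;
    return method;
}

int main(void)
{
    printf("%ld\n", lookupCached(42, 7));   /* first probe misses, fills the entry */
    printf("%ld\n", lookupCached(42, 7));   /* second probe hits */
    return 0;
}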

I would note that in the generated code for Spur in the git repo, there's a case at the top of addNewMethodToCache that doesn't actually add the found method to the cache:
if (!(((GIV(newMethod) & (tagMask())) == 0)
      && (((((usqInt) (longAt(GIV(newMethod)))) >> (formatShift())) & (formatMask())) >= (firstCompiledMethodFormat())))) {
    primitiveFunctionPointer = primitiveInvokeObjectAsMethod;
    return;
}

I don't know exactly what firstCompiledMethodFormat() entails, but if it means the found method is a CompiledMethod, that would pretty much mean the MethodCache is entirely unused, no?

The expression is ! (not) "is it a non-immediate with compiled-method format".  So this code installs primitiveInvokeObjectAsMethod as the primitive whenever what the lookup finds is /not/ a compiled method.  This is the implementation of "objects as methods".  All compiled methods are non-immediates and have a format >= firstCompiledMethodFormat:

Behavior>>instSpec
    "Answer the instance specification part of the format that defines what kind of object
     an instance of the receiver is.  The formats are
        0 = 0 sized objects (UndefinedObject True False et al)
        1 = non-indexable objects with inst vars (Point et al)
        2 = indexable objects with no inst vars (Array et al)
        3 = indexable objects with inst vars (MethodContext AdditionalMethodState et al)
        4 = weak indexable objects with inst vars (WeakArray et al)
        5 = weak non-indexable objects with inst vars (ephemerons) (Ephemeron)
        6 = unused
        7 = immediates (SmallInteger, Character)
        8 = unused
        9 = 64-bit indexable
        10-11 = 32-bit indexable (Bitmap) (plus one odd bit, unused in 32-bits)
        12-15 = 16-bit indexable (plus two odd bits, one unused in 32-bits)
        16-23 = 8-bit indexable (plus three odd bits, one unused in 32-bits)
        24-31 = compiled methods (CompiledMethod) (plus three odd bits, one unused in 32-bits)
     Note that in the VM instances also have a 5 bit format field that relates to their class's format.
     Formats 11, 13-15, 17-23 & 25-31 are unused in classes but used in instances to define the
     number of elements missing up to the slot size.  For example, a 2-byte ByteString instance
     has format 18 in 32-bits, since its size is one 32-bit slot - 2 bytes ((18 bitAnd: 3) = 2), and
     22 in 64 bits, since its size is one 64-bit slot - 6 bytes ((22 bitAnd: 7) = 6)."
    ^(format bitShift: -16) bitAnd: 16r1F

So the format encodes both "it is a compiled method" and the number of unused trailing elements.
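For concreteness, here is a small sketch of that decoding. The constants follow the instSpec comment above (formats 24-31 are compiled methods; for the 8-bit formats the low bits give the unused trailing bytes), but the helper names are mine, not the VM's generated code:

#include <stdio.h>

#define FormatMask                 0x1F   /* the 5-bit format field */
#define FirstCompiledMethodFormat  24     /* formats 24-31 = compiled methods */

/* Is this 5-bit format one of the compiled-method formats? */
static int isCompiledMethodFormat(unsigned format)
{
    return (format & FormatMask) >= FirstCompiledMethodFormat;
}

/* For the 8-bit (and compiled-method) formats the odd bits encode how many
   bytes of the last slot are unused; slotBytes is 4 or 8. */
static unsigned unusedTrailingBytes(unsigned format, unsigned slotBytes)
{
    return format & (slotBytes - 1);
}

int main(void)
{
    /* The 2-byte ByteString from the comment above: format 18 in 32-bits,
       22 in 64-bits. */
    printf("32-bit unused bytes: %u\n", unusedTrailingBytes(18, 4));        /* 2 */
    printf("64-bit unused bytes: %u\n", unusedTrailingBytes(22, 8));        /* 6 */
    printf("format 24 compiled method? %d\n", isCompiledMethodFormat(24));  /* 1 */
    printf("format  2 compiled method? %d\n", isCompiledMethodFormat(2));   /* 0 */
    return 0;
}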


Cheers,
Henry

On 07 Jul 2016, at 10:04, Clément Bera <bera.clement@gmail.com> wrote:

Yes. Number of samples and percentage of overall samples. You're right, it's not that many. Here is the result when I run the bench 5 times (I usually use the profiler just as a hint to tweak performance; then the in-image results matter).

New VM:

% of vanilla vm code (% of total) (samples) (cumulative)
20.72%    (  9.98%) scavengeReferentsOf (2943) (20.72%)
10.52%    (  5.07%) lookupOrdinaryNoMNUEtcInClass (1494) (31.24%)
  6.63%    (  3.19%) processWeakSurvivor (941) (37.87%)
  5.62%    (  2.71%) copyAndForward (798) (43.49%)
  4.94%    (  2.38%) addNewMethodToCache (702) (48.43%)
  4.86%    (  2.34%) doScavenge (690) (53.29%)
  3.20%    (  1.54%) primitiveStringReplace (455) (56.50%)
  3.06%    (  1.48%) moveFramesInthroughtoPage (435) (59.56%)
  2.94%    (  1.42%) compact (418) (62.50%)
  2.87%    (  1.38%) interpret (407) (65.37%)
  2.52%    (  1.21%) ceSendFromInLineCacheMiss (358) (67.89%)

Old VM:

% of vanilla vm code (% of total) (samples) (cumulative)
27.19%    (10.97%) scavengeReferentsOf (2615) (27.19%)
  9.80%    (  3.95%) processWeakSurvivor (943) (36.99%)
  8.76%    (  3.54%) copyAndForward (843) (45.75%)
  7.06%    (  2.85%) doScavenge (679) (52.81%)
  4.81%    (  1.94%) primitiveStringReplace (463) (57.63%)
  4.28%    (  1.73%) moveFramesInthroughtoPage (412) (61.91%)
  4.15%    (  1.67%) lookupOrdinaryNoMNUEtcInClass (399) (66.06%)
  2.44%    (  0.99%) marryFrameSP (235) (68.50%)
  2.30%    (  0.93%) isWidowedContext (221) (70.80%)
  2.19%    (  0.88%) handleStackOverflow (211) (72.99%)
  1.64%    (  0.66%) primitiveCompareString (158) (74.63%)
  1.50%    (  0.60%) findMethodWithPrimitiveFromContextUpToContext (144) (76.13%)
  1.50%    (  0.60%) ceBaseFrameReturn (144) (77.63%)
  1.45%    (  0.58%) ceStackOverflow (139) (79.07%)
  1.16%    (  0.47%) ceNonLocalReturn (112) (80.24%)
  1.11%    (  0.45%) allocateNewSpaceSlotsformatclassIndex (107) (81.35%)
  1.00%    (  0.40%) isBytes (96) (82.35%)
  0.95%    (  0.38%) fetchClassOfNonImm (91) (83.29%)
  0.93%    (  0.37%) stackValue (89) (84.22%)
  0.90%    (  0.36%) returnToExecutivepostContextSwitch (87) (85.12%)
  0.90%    (  0.36%) interpret (87) (86.03%)
  0.84%    (  0.34%) addNewMethodToCache (81) (86.87%)
  0.83%    (  0.34%) ceSendFromInLineCacheMiss (80) (87.70%)

Now if I look at the 4 highlighted methods, instead of the 3 I looked at before, I see ~7% overhead. I'm not sure if compact got really slower or if it used to be inlined somewhere during Slang compilation. I still think we should use these results as a hint and rely on the real benchmark for performance evaluation.

If it's really a Slang inlining problem, it will have side effects on other functions (many will get a little bit faster), and it really seems that this is the problem.
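(Not Slang itself, just an illustration of the effect being described: when a helper is inlined, the profiler charges its samples to the caller; compiled out of line, it shows up as its own row, so a function's sudden appearance in the profile need not mean it got slower. The noinline attribute is GCC/Clang-specific and is only there to force the contrast; compile with -O2 so the inline version really is inlined.)

#include <stdio.h>

static inline long helperInlined(long x)           /* samples land in caller() */
{
    return x * 2 + 1;
}

static long __attribute__((noinline)) helperOutOfLine(long x)
{
    return x * 2 + 1;                               /* samples land here */
}

long caller(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += helperInlined(i) + helperOutOfLine(i);
    return sum;
}

int main(void)
{
    printf("%ld\n", caller(1000000));
    return 0;
}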



On Thu, Jul 7, 2016 at 8:50 PM, Holger Freyther <holger@freyther.de> wrote:

> On 07 Jul 2016, at 19:41, Clément Bera <bera.clement@gmail.com> wrote:
>
> Hi Holger,


Hi!


> I'm sorry for the delay since you reported the bug. Everyone working on the VM is very busy (in my case, my time is consumed by my PhD and Sista; I'm trying to open an alpha of the image+runtime running at ~1.5x before ESUG).

great.


>
> Now that the VM profiler is fully working I was able to look into the regression.
>
> Old VM vanilla code:
>
> 27.94%    (11.54%) scavengeReferentsOf (528) (27.94%)
> 10.21%    (  4.22%) processWeakSurvivor (193) (38.15%)
>   8.15%    (  3.36%) copyAndForward (154) (46.30%)
>   7.41%    (  3.06%) doScavenge (140) (53.70%)
>   5.19%    (  2.14%) lookupOrdinaryNoMNUEtcInClass (98) (58.89%)
>   4.71%    (  1.94%) primitiveStringReplace (89) (63.60%)
>   4.29%    (  1.77%) moveFramesInthroughtoPage (81) (67.88%)
>   2.06%    (  0.85%) isWidowedContext (39) (69.95%)
> [...]
>   0.90%    (  0.37%) ceSendFromInLineCacheMiss (17) (85.61%)
>   0.85%    (  0.35%) addNewMethodToCache (16) (86.46%)
>
> New VM vanilla code:
>
> % of vanilla vm code (% of total) (samples) (cumulative)
> 22.41%    (10.44%) scavengeReferentsOf (609) (22.41%)
> 14.46%    (  6.74%) lookupOrdinaryNoMNUEtcInClass (393) (36.87%)
>


> I highlighted the problematic methods. It is likely that this has to do with Slang inlining.
>
> The 3 highlighted methods seem to be responsible for an overhead of ~8.5% in the overall runtime. You seem to have an overhead of ~14% on my machine. There's quite a difference. The Slang inlining overhead may have impacted other functions, and C profilers are never 100% accurate, so we will see when the problem is fixed whether something else is also problematic. Now that we know what the problem is, I believe it will be fixed in the coming weeks when someone (likely Eliot, maybe me) has time. Thanks for reporting the regression.

What do (393) and (98) mean? Is it the number of samples?

holger


--
_,,,^..^,,,_
best, Eliot