Re: [squeak-dev] The Inbox: Kernel-cmm.1198.mcz

25 Nov 2018


Hi Levente,
Just a reminder, the original question I asked was:
...
...
...
...
Do you think the system would be noticeably slower if all the sends to
#class became a message send?  ...
and your response:
...
...
...
Yes, the bytecode is way quicker than the primitive or a primitive + a
send which is exactly what you suggested.
So even though you answered a different question, I was still curious
by your claim, and remembered that you're one who likes to communicate
with benchmarks.  That's why I ran and presented them to you, but I'm
not sure if we're interpreting the results relative to my question or
some other question...
...
...
It saves one send.  One.  That's only infinitesimally quicker:
_________
{ [1 xxxClass] bench.
[ 1 class ] bench.  }
----> #('99,000,000 per second. 10.1 nanoseconds per run.'
'126,000,000 per second. 7.93 nanoseconds per run.')
________
2 nanoseconds per send faster.  Inconsequential in any real-world
sense.  Furthermore, as soon as the message sent to the class does
*any work* whatsoever, that good-sounding 27% improvement is quickly
wiped out.  Look how much of the gain is lost doing as little as
creating one single Rectangle from another one:

"Compare creating a single Rectangle with inlined #class vs. a
(proposed) message-send of #class."
| someRectangle |   someRectangle := 100@50 corner: 320@200.
{  [someRectangle xxxClass origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench.
  [someRectangle class      origin: someRectangle topLeft corner:
someRectangle bottomRight ] bench.   }
--->  #('37,200,000 per second. 26.9 nanoseconds per run.'
'38,000,000 per second. 26.3 nanoseconds per run.')
____________
Real-world gain by the inlined send was reduced to...  whew!  I just
had to go learn about "Picosecond" because nanoseconds aren't even
small enough to measure the improvement.
So, amplify.  Crank it up to 100K:
__________
"Compare creating 100,000 Rectangles with inlined #class vs. a
message-send of #class."
| someRectangle |   someRectangle := 100@50 corner: 320@200.
{  [ 100000 timesRepeat: [someRectangle xxxClass origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.
  [ 100000 timesRepeat: [someRectangle class origin: someRectangle
topLeft corner: someRectangle bottomRight] ] bench.   }
---> #('364 per second. 2.75 milliseconds per run.'
'369 per second. 2.71 milliseconds per run.')
_________
Nothing times 100K is still nothing.
That's not the right way to measure things that are so quick, because the
overhead of block activation is comparable to the runtime of the code
inside the block. Also, #timesRepeat: is not a good choice for
measurements for the very same reason: block creation + lots of block
activation.
Also, the nearby bytecodes affect what the JIT does. When more things can
be executed without performing a send, the overall performance gains
will be higher.
There are three benchmarks, did you notice the first two?
- The first one measures the single-unit cost of #xxxClass over
#class.  This captures your theoretical maximum benefit of 27%, which
is terrible, because it can't come close to that in real code.
- The second demonstrates how 90% of that 27% benefit is wiped out
with no more than a single simple allocation -- what the vast majority
of class methods are responsible for.
- The third one measures "real world impact", and shows that this
particular in-line doesn't help the system in any way that helps any
human anywhere.
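For anyone who wants to check those percentages, they fall out of the
per-second figures reported above; the variable names are only for
illustration:
_________
"Relative gain of #class over #xxxClass in the first two benchmarks."
| unitGain rectGain |
unitGain := (126000000 / 99000000) asFloat - 1.   "about 0.27"
rectGain := (38000000 / 37200000) asFloat - 1.    "about 0.02"
"Last element: the share of the unit gain that disappears, about 0.92."
{ unitGain.  rectGain.  1 - (rectGain / unitGain) }
_________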
...
...
...
Also, removing the bytecode will make #class lose its atomicity. Any code
that relies on that behavior will silently break.
If THAT exists it needs a more intention-revealing selector than
#class that would let his peers know atomicity mattered there.
#basicClass is his friend.
All special selectors do the same e.g. #==, #ifNil:, #ifTrue:. Do you
think all of those need #basicXXX methods?
No just #class.  An identity-check should be an identity-check, even
against a Proxy.  And does that example help illustrate how using #==
when you DON'T need an identity-check is a breakage of encapsulation?
It makes false assumptions and enforces type-conformance in a system
that wants to be empowered by messaging.
...
...
...
...
...  I am surprised to see we have so many senders of #class in
trunk, but I have a feeling most are rarely ever called.
I doubt that. People don't sprinkle #class sends for no reason, do they?
Sorry, I should not have said "ever".  I was trying to say the system
probably spends more of its time sending to instance-side methods than
to class-side methods.
It's a common pattern to have instance-independent code on the class side.
Quick access to that is always a good thing.
It's still quick!  Levente, I challenge you to back up your claim by
identifying any one single method in the image which reports even only
a meaningfully better *bench* performance (much less real-world) by
calling it via #class instead of #xxxClass.
Anything whose performance matters at the level of one send is going to
use #basicClass anyway, just like the few places where we send
#basicNew instead of #new.
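To make the #basicNew / #new analogy concrete, here is a minimal sketch of
the split being proposed.  It is illustrative only, not code from any
package: SomeProxy and #realObject are hypothetical names, and it assumes
the compiler no longer emits the special bytecode for #class.
_________
Object >> basicClass
	"System-level access: answer the receiver's class via the same
	numbered primitive that Object>>class uses today."
	<primitive: 111>
	self primitiveFailed

SomeProxy >> class
	"User-level access: once #class is a real send, a proxy can
	forward it.  SomeProxy and #realObject are hypothetical."
	^ self realObject class
_________
Code that must see the proxy itself would send #basicClass; everything
else keeps sending #class.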
...
...
...
...
Not remove it, redirect it to #basicClass.
Right, but while the bytecode is in effect, you just can't redirect
it.
I'm racking my brain trying to understand this -- sorry...   By
"redirect" I just meant change the Compiler to generate bytecode 199
for sends to #basicClass, and just the regular "send" bytecode for
sends to #class.  Then, recompile all methods.  Would that work?
It might work, but you would need to identify and rewrite senders of
#class which rely on the presence of the bytecode. In my image there are
2174 senders, which is simply too much review in my opinion.
I repeat my challenge above!
...
I did some measurements and found that the JIT makes the numbered
primitive almost as quick as the bytecode. The slowdown is only about 10%.
Your suggestion, which is send + bytecode is about 85% slower and loses
the atomicity of the message. So, you'd better leave the implementation of
#class as it is right now, because that would be quicker and would
preserve the atomicity as long as nothing overrides it.
Huh?  No, you're only 27% faster in the *benchmark*, but near zero in
anything real-world.
My challenge above, stands.  I would love to be wrong, so I could shed
my suspicion of whether this is about something else not mentioned...
 :(
...
...
...
...
This is a reasonable and familiar pattern, right?  It provides users
full control and WYSIWYG between source and bytecodes due to a crystal
clear selector name.  No magic.
So, if
  performance is not really hurt, and
  we can keep sending #class if so insisted, and
  we still have #basicClass, just in case, together
  delineating an elegant seam between system-level vs. user-level access
  in a classic Smalltalky way that even *I* can understand and use,
  and give Squeak better Proxy support that helps Magma
then
  would you let me have this?
As I wrote a few emails earlier, I'd rather have a "switch" for this
than forcing it on everyone who doesn't use proxies at all (I presume that's
the current majority of Squeak users).
Whoa, hold on there.  You only ever made one argument -- "performance"
-- which was obliterated by the benchmarks.  Squeezing 27% more out of
a microbench of something called 0.0001% of the time results in no
benefit to anyone anywhere.
I see MY position as the pro-user position, and yours as the...
pro-fastest-lab-result position, which hurts this Squeak user.  I'm sad
that that alone isn't enough to support this.    :(
_______
Do you remember when Behavior>>#new didn't always make a call to
#initialize?  Even at a time when Squeak was 10X slower than it is now,
the people then had the wisdom to understand that the computer and
software exist to eventually serve _users_, and that serving users at
the cost of one single send, even when it was a much greater percentage
of impact back then, was still way worth it.