Apple hyping java...

Mon Apr 1 17:02:31 UTC 2002

At 02:35 PM 4/1/2002 +0200, Andreas Raab wrote:
>> That got noticeably painful and slow when things had to work on an
>> MP, and lo-and-behold, unsynchronized versions of Vector and Hashmap
>> appeared.  It's now one of the standard Java performance gotchas --
>> "oh, you used the synchronized data structures when you didn't need
>> to, and it was slow".  Is that something you want for Squeak?
>
>Most certainly not. But I'm interested if you got some data points on
>how the performance shifted when switching away from the green threads.
>Are there any publications on this?

None published.  I've taken various measurements over the years,
and I can give you some from memory.  This stuff varies from platform
to platform -- the one I know well is non-Xeon Pentium.

1. In an experiment, I added a single (call to a) locked-compare-and-
   swap in the copy-an-object code in a full stop-and-copy collector.
   GC took 60% longer.  Yes, it is that expensive.

2. Embedded in a benchmark, this piece of code takes 875 ms (iterated
   many times, run with a particular Java VM I know well).

  static int mul1(int a, int b) {
    int result = 0;
    for (int i = 0; i < 32; i++) {
      result += result;
      if (a < 0) {
        gratuitousCounter++;
        result += b;
      }
      a = a + a;
    }
    return result;
  }

   If I add synchronization and run on a 2-processor machine, it
   takes 5100 milliseconds.  (That's what I call expensive.)  No
   contention at all, just one thread running hard.

  static int mul6(int a, int b) {
    int result = 0;
    for (int i = 0; i < 32; i++) {
      synchronized (a_lock) {
        result += result;
        if (a < 0) {
          gratuitousCounter++;
          result += b;
        }
        a = a + a;
      }
    }
    return result;
  }

   If I run that code in 1-processor mode (uses compare-and-swap,
   but does not lock) it takes 2400 milliseconds.  The lock is
   that costly: the synchronized loop takes nearly 3x as long as
   the original loop body, and the difference is just the lock.

   If I run that code with the lock held outside the loop (the
   recursive lock is detected and goes much faster) it takes
   1640 milliseconds.  This is closer to what you might expect
   with green threads in the uncontended case.
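
For anyone who wants to try the three variants side by side, here is
a minimal sketch of a harness.  The class name LockCostSketch, the
timeMillis helper, the mul6Nested wrapper, the iteration count, and
the declarations of gratuitousCounter and a_lock are assumptions made
for illustration; this is not the original benchmark.  A current JIT
may coarsen or otherwise optimize these locks, so do not expect the
numbers quoted above.

```java
import java.util.function.IntBinaryOperator;

class LockCostSketch {
    // Hypothetical stand-ins for the fields the fragments refer to.
    static int gratuitousCounter = 0;
    static final Object a_lock = new Object();

    // Unsynchronized shift-and-add multiply (mul1).
    static int mul1(int a, int b) {
        int result = 0;
        for (int i = 0; i < 32; i++) {
            result += result;
            if (a < 0) {
                gratuitousCounter++;
                result += b;
            }
            a = a + a;
        }
        return result;
    }

    // Lock acquired and released on every iteration (mul6).
    static int mul6(int a, int b) {
        int result = 0;
        for (int i = 0; i < 32; i++) {
            synchronized (a_lock) {
                result += result;
                if (a < 0) {
                    gratuitousCounter++;
                    result += b;
                }
                a = a + a;
            }
        }
        return result;
    }

    // mul6 called with the lock already held, so every inner
    // acquire is a recursive one (hypothetical wrapper).
    static int mul6Nested(int a, int b) {
        synchronized (a_lock) {
            return mul6(a, b);
        }
    }

    // Runs op `iterations` times and returns elapsed wall-clock milliseconds.
    static long timeMillis(IntBinaryOperator op, int iterations) {
        int sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += op.applyAsInt(i, 12345);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        gratuitousCounter += sink; // keep the results live
        return elapsedMs;
    }

    public static void main(String[] args) {
        int n = 5_000_000; // arbitrary; tune for your machine
        System.out.println("mul1:       " + timeMillis(LockCostSketch::mul1, n) + " ms");
        System.out.println("mul6:       " + timeMillis(LockCostSketch::mul6, n) + " ms");
        System.out.println("mul6Nested: " + timeMillis(LockCostSketch::mul6Nested, n) + " ms");
    }
}
```

All three methods compute the same product (modulo 2^32), so any
difference in time is locking overhead plus whatever the JIT does
with it.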

What I've just described here is an honest benchmark, in that
I am telling you exactly what optimizations are taking place,
and there's nothing hidden.  An uncontended lock acquire is
a load, some bit-fiddling, a couple of tests, and perhaps a
locked compare-and-swap.  A release is similar.
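
To make that sequence concrete, here is a rough sketch of an
uncontended fast path built on java.util.concurrent.atomic.  The
class name ThinLockSketch, the lock-word layout (owner id in the high
24 bits, recursion count in the low 8), and the method names are all
hypothetical; this illustrates the shape of the operation and is not
the code of any particular VM.  Contention and count overflow simply
return false here, where a real VM would fall to a slow path.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ThinLockSketch {
    // Assumed layout: 0 means unlocked; otherwise (ownerId << 8) | recursionCount.
    private static final int COUNT_BITS = 8;
    private static final int COUNT_MASK = (1 << COUNT_BITS) - 1;

    private final AtomicInteger word = new AtomicInteger(0);

    // ownerId must be in 1 .. 2^24 - 1 so that a held word is never 0.
    boolean tryAcquire(int ownerId) {
        if (ownerId <= 0 || ownerId >= (1 << 24)) {
            throw new IllegalArgumentException("ownerId out of range");
        }
        int w = word.get();                              // the load
        if (w == 0) {                                    // test: unlocked?
            // The compare-and-swap; fails if another thread got there first.
            return word.compareAndSet(0, ownerId << COUNT_BITS);
        }
        if ((w >>> COUNT_BITS) == ownerId                // bit-fiddling, test: ours?
                && (w & COUNT_MASK) < COUNT_MASK) {
            word.set(w + 1);                             // recursive acquire, no CAS:
            return true;                                 // only the owner changes a held word
        }
        return false;                                    // contended or overflow: slow path not shown
    }

    // Returns false if ownerId does not hold the lock.
    boolean release(int ownerId) {
        int w = word.get();
        if (w == 0 || (w >>> COUNT_BITS) != ownerId) {
            return false;
        }
        word.set((w & COUNT_MASK) == 0 ? 0 : w - 1);     // unlock, or pop one recursion level
        return true;
    }
}
```

Note that the recursive path never touches the compare-and-swap in
this sketch, which is the same reason the lock-held-outside-the-loop
run above is so much cheaper.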

I hope this is helpful; I can provide a bit more detail if
I know what you want.  Other processors have different costs
for doing this sort of thing.  The difference between plain old
Pentium and Xeon is that Pentium locks the whole memory bus,
whereas Xeon only locks the relevant cache line in memory.

I'll reply to the other bit later -- I've got a lunch appointment.

David Chase