Magma High Availability Shutdown a Node gets a timeout

Wed Nov 25 18:39:33 UTC 2009

Hi,

> The answer to the other question I've asked:
> <snippet>
>  Furthermore: why has Node2 have to beWarmupBackupFor: aPrimaryLocation if
> it is already a warmup for that primary location. Is it normal that he tries
> to do that again?

Note that the guard in that method checks whether the node is already
a warm backup and, if so, skips #catchUp:to:, which is where the bulk
of the work happens.  That work is done only when it is necessary.

> Furthermore: if there is more than 3 nodes (say for
> instance 10 or more) each of them is again beWarmBackupFor the primary.
> </snippet>
> is still not clear to me. Is there a specific reason that this node2 again
> tries to beWarmupBackupFor: aPrimaryLocation even if it is already one? I've

The main reason is uniformity of the implementation.  See the
comment in #ensureCorrectNodeConfiguration.  HA must handle a variety
of scenarios * a variety of pre-conditions * a variety of timings of
possible events.  The idempotent property permits this relatively
uniform recovery process for all situations:

  1) assess and verify a client complaint
  2) adjust the Node object accordingly
  3) call #ensureCorrectNodeConfiguration - the entire Node is righted
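
To illustrate why the idempotent property matters, here is a minimal
sketch in Python rather than Smalltalk.  The class and function names
are made up for illustration and are not Magma's actual code; the
point is only that the guard makes a repeated call cheap and harmless.

```python
class Node:
    """Hypothetical stand-in for a secondary node; not Magma's class."""

    def __init__(self, name):
        self.name = name
        self.primary = None       # location this node is a warm backup for
        self.catch_up_count = 0   # how many times the expensive work ran

    def be_warm_backup_for(self, primary):
        # Guard: already a warm backup for this primary, nothing to do.
        if self.primary == primary:
            return
        self.catch_up(primary)
        self.primary = primary

    def catch_up(self, primary):
        # Placeholder for the expensive part (the bulk of the work).
        self.catch_up_count += 1


def ensure_correct_node_configuration(secondaries, primary):
    # Safe to call any number of times, whatever state each node is in.
    for node in secondaries:
        node.be_warm_backup_for(primary)
```

Calling ensure_correct_node_configuration twice with the same primary
runs catch_up once per node; the second call only hits the guard.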

> noticed that it takes quite a lot of cpu time to establish that connection.

But yes, I do take your point: spending 2-3 CPU seconds per server on
creating the adminSession, connecting, and assessing, only to
conclude that everything is a-ok, seems a bit expensive when all you
want to do is shut down one secondary.  Therefore, I've posted new
packages with a special check: if the only change is the removal of a
secondary, step 3 of the recovery process above is skipped.
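
Roughly, the special check amounts to the following, again sketched
in Python with hypothetical names (the dictionary layout, the "kind"
values, and the function itself are assumptions for illustration, not
the code in the new packages):

```python
def recover(node_config, change, ensure_configuration):
    """Apply a verified change to the node configuration.

    Returns True if the full configuration check (step 3) was run,
    False if it was skipped.
    """
    if change["kind"] == "remove_secondary":
        # Only a secondary is going away: adjust the node and stop.
        node_config["secondaries"].remove(change["location"])
        return False

    if change["kind"] == "add_secondary":
        node_config["secondaries"].append(change["location"])

    # Every other situation still goes through the uniform recovery step.
    ensure_configuration(node_config)
    return True
```

With this shape, shutting down one secondary no longer contacts the
remaining servers at all.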

> Trying to break it in an other way, I still found another (possible) issue.
> I can still trigger a timeout during shutdown in the following way: if i for
> instance have a primary and 3 secondary servers; and i shutdown immediately
> after each other secondary 2 and 3. Then a async request from the primary to
> all the secondary servers is issued to do
> MagmaEnsureCorrectNodeConfiguration. So secondary 2 and 3 are at the same
> time shutting down & receiving a warmup request.

The new versions of Magma client and Magma server, which I've posted
to the "Magma tester" project of squeaksource, should address this
issue as well.

I welcome your attempt to break it with these new packages loaded.

Regards,
  Chris