Magma High Availability Shutdown a Node gets a timeout

Thu Nov 26 11:32:15 UTC 2009

Chris,

Thanks again for your quick fixes.

I've attempted to trigger a timeout while stopping and starting nodes,
but I failed to do so. :-)

I've been testing the high availability and I have some bad news and
(possibly) some good news about it.

I'll start with the bad news.

I have 1 primary server with 3 secondary servers. Furthermore I have a
client which connects to this Magma node. In the server images I loaded the
latest fixes; in the client images I loaded r1_43. I've noticed the
following:

   - If I only start the primary and connect from the client (create a new
   Magma session), it takes around 5-6 seconds, which is fine.
   - If I start the first secondary and connect, it takes around 10 seconds,
   which is still fine.
   - However, if I start a second secondary and connect, it takes a lot more
   time: nearly a minute to connect.
   - It gets worse if I start a third secondary. Then the time to connect
   sometimes rises to two minutes or more.

This is of course very bad: adding more nodes increases the connect
time from a client to the Magma node quite a lot, and it also consumes a lot
of CPU on all the nodes.

Now for the good news: if I also load the latest changes into my client
images, the connect times are OK again. Connecting to only a primary takes
around 6 seconds. If any of the secondary nodes are up (one or all three), it
always takes around 10 seconds. So the recent changes you've made have also
fixed this connect-time problem. I suppose it is due to this change:
<snippet>
Name: Magma client-cmm.449
Author: cmm
Time: 23 November 2009, 2:26:08 pm
UUID: bd41d4e9-f1e2-4859-b35b-475fceaea2ba
Ancestors: Magma client-cmm.448

- MagmaEnsureCorrectNodeConfiguration was supposed to be an *async*
response!
</snippet>
Before this change, a client connecting to a server triggered a
MagmaEnsureCorrectNodeConfiguration request to the node that was handled
synchronously; now it is async.
I haven't checked this hypothesis, however.
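
The hypothesis above can be illustrated with a small sketch. It is not Magma
code (Magma is Smalltalk) and it does not try to reproduce the measured
times. The names `notify_node`, `connect_sync`, `connect_async` and the
per-node cost are hypothetical, chosen only to show why a connect that waits
for every secondary grows with the number of secondaries, while a connect
that sends the request without waiting does not.

```python
import threading
import time

# Hypothetical stand-in for the work a server does when it receives a
# MagmaEnsureCorrectNodeConfiguration request; the cost is an assumption.
PER_NODE_COST = 0.05  # seconds


def notify_node(name, log):
    """Simulate one server handling the request, then record its name."""
    time.sleep(PER_NODE_COST)
    log.append(name)


def connect_sync(secondaries):
    """Connect and wait for every server to answer before returning."""
    log = []
    start = time.perf_counter()
    for name in secondaries:
        notify_node(name, log)
    return time.perf_counter() - start, log


def connect_async(secondaries):
    """Connect, start the requests in the background, return at once."""
    log = []
    start = time.perf_counter()
    workers = [threading.Thread(target=notify_node, args=(name, log))
               for name in secondaries]
    for worker in workers:
        worker.start()
    elapsed = time.perf_counter() - start
    return elapsed, log, workers


if __name__ == "__main__":
    nodes = ["secondary1", "secondary2", "secondary3"]
    sync_time, _ = connect_sync(nodes)
    async_time, _, workers = connect_async(nodes)
    for worker in workers:
        worker.join()  # let the background requests finish
    print(f"sync connect:  {sync_time:.3f}s")
    print(f"async connect: {async_time:.3f}s")
```

In the synchronous variant the caller pays the per-node cost once per
secondary; in the asynchronous variant the caller only pays for starting the
requests.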

Furthermore, given that:
<snippet>
Name: Magma client-cmm.450
Author: cmm
Time: 24 November 2009, 10:34:33 pm
UUID: b1d8bfe8-b928-4464-ac65-c23e34cbdf99
Ancestors: Magma client-cmm.449

- Optimization for removing a secondary server from a Node. Before, removing
a Node would trigger a #ensureCorrectNodeConfiguration which lead to a
#beWarmBackupFor: sent to each other secondary, on account of nothing other
than some other secondary shutdown. Thanks to Bart Gauquie for complaining!
</snippet>

I suppose that this is also a case in which #ensureCorrectNodeConfiguration
is not necessary, but I'm not sure; you should also check this.
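
A minimal sketch of the optimization described in the commit note, under
stated assumptions: the `Node` class, its attributes and the `reconfigure`
flag are hypothetical Python stand-ins, not Magma's actual Smalltalk
implementation. It only shows the difference between removing a secondary
with and without a follow-up reconfiguration of the remaining secondaries.

```python
class Node:
    """Hypothetical model of a Magma Node: one primary plus secondaries."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)
        self.warm_backup_requests = []  # records each simulated request

    def ensure_correct_node_configuration(self):
        # Stand-in for #ensureCorrectNodeConfiguration: ask every
        # remaining secondary to be a warm backup for the primary.
        for secondary in self.secondaries:
            self.warm_backup_requests.append((secondary, self.primary))

    def remove_secondary(self, secondary, reconfigure=False):
        # With reconfigure=False (the behaviour the commit note
        # describes), the other secondaries are left alone.
        self.secondaries.remove(secondary)
        if reconfigure:
            self.ensure_correct_node_configuration()


node = Node("primary", ["secondary1", "secondary2", "secondary3"])
node.remove_secondary("secondary2", reconfigure=True)  # old behaviour
print(len(node.warm_backup_requests))
node.remove_secondary("secondary3")                    # optimized behaviour
print(len(node.warm_backup_requests))
```

The request count grows only in the first call; the second removal leaves
the remaining secondary untouched.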

Anyhow, thanks again for your good support.

Kind regards,

Bart

On Wed, Nov 25, 2009 at 7:39 PM, Chris Muller <asqueaker at gmail.com> wrote:

> Hi,
>
> > The answer to the other question I've asked:
> > <snippet>
> >  Furthermore: why has Node2 have to beWarmupBackupFor: aPrimaryLocation
> > if it is already a warmup for that primary location. Is it normal that
> > he tries to do that again?
>
> Note the guard in that method does check whether it is already a
> warm-backup and, if so, avoids the #catchUp:to:, which is where the
> bulk of the work would only be done if necessary to do so.
>
> > Furthermore: if there is more than 3 nodes (say for
> > instance 10 or more) each of them is again beWarmBackupFor the primary.
> > </snippet>
> > is still not clear to me. Is there a specific reason that this node2
> > again tries to beWarmupBackupFor: aPrimaryLocation even if it is already
> > one? I've
>
> The main reason is for uniformity of the implementation.  See the
> comment in #ensureCorrectNodeConfiguration.  HA must handle a variety
> of scenarios * a variety of pre-conditions * a variety of timings of
> possible events.  The idempotent property permits this relatively
> uniform recovery process for all situations:
>
>  1) assess and verify a client complaint
>  2) adjust the Node object accordingly
>  3) call #ensureCorrectNodeConfiguration - the entire Node is righted
>
> > noticed that it takes quite a lot of cpu time to establish that
> > connection.
>
> But yes, I do take your point, that even these 2-3 CPU seconds per
> server related to creation of the adminSession, connection, and
> assessment, after all that, that everything is a-ok, seems a bit
> expensive when all you want to do is shut down one secondary.
> Therefore, I've posted new packages with a special check for the case
> of only removing a secondary which, if so, skips step 3, above, of the
> recovery process.
>
> > Trying to break it in an other way, I still found another (possible)
> > issue. I can still trigger a timeout during shutdown in the following
> > way: if i for instance have a primary and 3 secondary servers; and i
> > shutdown immediately after each other secondary 2 and 3. Then a async
> > request from the primary to all the secondary servers is issued to do
> > MagmaEnsureCorrectNodeConfiguration. So secondary 2 and 3 are at the same
> > time shutting down & receiving a warmup request.
>
> The new versions of Magma client and Magma server, which I've posted
> to the "Magma tester" project of squeaksource, should address this
> issue as well.
>
> I welcome your attempt to break it with these new packages loaded.
>
> Regards,
>   Chris
>
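
The guard and the idempotent recovery step described in the quoted reply can
be sketched as follows. This is an illustration under assumptions, not the
Magma source: the `Secondary` class, its attributes and the function below
are hypothetical Python names that only mirror the Smalltalk selectors
#beWarmBackupFor:, #catchUp:to: and #ensureCorrectNodeConfiguration.

```python
class Secondary:
    """Hypothetical secondary server; not the real Magma class."""

    def __init__(self, name):
        self.name = name
        self.warm_backup_for = None
        self.catch_ups = 0  # how often the expensive step ran

    def be_warm_backup_for(self, primary):
        # Guard: if already a warm backup for this primary, skip the
        # expensive catch-up, so repeated calls are harmless.
        if self.warm_backup_for == primary:
            return False
        self._catch_up_to(primary)
        self.warm_backup_for = primary
        return True

    def _catch_up_to(self, primary):
        # Stand-in for #catchUp:to:, where the bulk of the work is done.
        self.catch_ups += 1


def ensure_correct_node_configuration(primary, secondaries):
    """Step 3 of the recovery: right the whole Node, whatever its state."""
    return [s.be_warm_backup_for(primary) for s in secondaries]


a, b = Secondary("a"), Secondary("b")
print(ensure_correct_node_configuration("primary", [a, b]))
print(ensure_correct_node_configuration("primary", [a, b]))
```

Calling the function a second time does no catch-up work, which is the
property that lets one recovery routine serve every scenario. The sketch
leaves out the session and connection cost per server that the reply
mentions, which is paid even when the guard skips the catch-up.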

-- 
imagination is more important than knowledge - Albert Einstein
Logic will get you from A to B. Imagination will take you everywhere -
Albert Einstein
Learn from yesterday, live for today, hope for tomorrow. The important thing
is not to stop questioning. - Albert Einstein
The true sign of intelligence is not knowledge but imagination. - Albert
Einstein
Gravitation is not responsible for people falling in love. - Albert Einstein