Magma High Availability Shutdown a Node gets a timeout

Mon Nov 23 21:04:25 UTC 2009

Thanks for the great note, Bart.  An impressive analysis; it appears
you have indeed uncovered a bug.  I do have a fix, but first, please
let me clarify the term "Node" as it relates to Magma.  A MagmaNode
represents a collection of servers all supporting _one_ repository.
Each server maintains its own copy of that one repository.  The goal
of a "Node" is to provide connecting MagmaSessions the illusion of one
single repository that never goes down.  Each member of the Node is
simply referred to as a "server", either "the primary" or "a
secondary".

Incidentally, multiple Nodes are introduced by applications
specifically written to connect objects _between_ repositories via
MagmaForwardingProxy instances.  It's an advanced feature that lets
Magma applications scale along a dimension beyond that provided by
multi-server MagmaNodes: the applications create "bookmarks" to
objects in other physical repositories, so that they can be handled by
separate CPUs.  But that is a separate subject and something I doubt
you are yet using.

So, your assessment of the problem is spot-on.  However, the correct
solution is to implement the missing method:

  MagmaEnsureCorrectNodeConfiguration>>#wantsResponse
	^ false

The group of servers that make up a MagmaNode communicate with each
other for administrative tasks via a client/server model just like
the one used between a MagmaSession and a Magma server.  In this c/s
model, the primary is the "server," and the secondaries are the
"clients".  Secondaries may make synchronous requests to the primary
(i.e., wait for a response), but the primary must only send async
requests to the secondaries; otherwise a deadlock could potentially
occur.

The "Ma client server" framework allows any request to be processed
asynchronously by answering false to #wantsResponse.
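
As a rough sketch of the idea only (this is not the framework's actual
source): of the names below, only MagmaEnsureCorrectNodeConfiguration
and #wantsResponse are real.  HypotheticalRequest, HypotheticalLink,
#submit:, #transmit: and #awaitReplyTo: are made up for illustration,
and a default answer of true is an assumption.

  "Hypothetical base request: by default the sender waits for a reply."
  HypotheticalRequest>>#wantsResponse
	^ true

  "The override from the fix: the sender does not wait."
  MagmaEnsureCorrectNodeConfiguration>>#wantsResponse
	^ false

  "Hypothetical sending side: block only if the request asks for it."
  HypotheticalLink>>#submit: aRequest
	self transmit: aRequest.
	^ aRequest wantsResponse
		ifTrue: [ self awaitReplyTo: aRequest ]
		ifFalse: [ nil ]

With something along these lines, the primary can hand the request to
a secondary and carry on immediately, rather than blocking while the
secondary is itself waiting on the primary.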

=====

Ok, I have posted new packages to MagmaTester with the above-mentioned
fix.  Please load the (3) updated packages and let me know if you have
further problems.  I think I smell an r44 around the corner..

 - Chris

On Sun, Nov 22, 2009 at 7:55 AM, Bart Gauquie <bart.gauquie at gmail.com> wrote:
> Dear all,
>
> I'm using Pharo1.0rc1 Latest update: #10493, with Magma r43final.
>
> I've been experimenting with Magma High Availability. It's working for me,
> except that shutting down a node always throws a timeout exception.
> If I have 1 root server & 1 node, everything works.
> If I have 1 root server & 2 attached nodes and shut down one of them, a
> timeout is thrown.
> I've been looking into it and I have some questions about how things work in
> Magma.
> Let me explain the flow I've seen and where it fails.
> I have a node with the following configuration: 'a MagmaNode
> magma at craptop:51001, magma at craptop:51003, magma at craptop:51004' ;
> in which
>
> magma at craptop:51001 is the primary,
> magma at craptop:51003 is Node 2,
> magma at craptop:51004 is Node 3
>
> If I shut down Node 3 by calling shutdown on the server console, a
> 'MaRemoveSecondaryLocationRequest' is sent to the primary. On the primary, a
> MagmaNodeUpdate is initialized with Node 3 as its remove field. This is
> applied to the MagmaNode of the primary, and also committed to each node
> (MagmaNodeUpdate processUsing: aMagmaServerConsole). I can check this
> because on the primary, Node 2 and Node 3 a new commitxxx.log appears with a
> new timestamp.
>
> Then MagmaServerConsole>>ensureCorrectNodeConfiguration is executed on the
> primary.  Since it is the primary, it also executes:
> 'self sessionsForOtherLocationsDo: [ : each | each
> ensureCorrectNodeConfiguration ] ', which happens only on Node 2 (Node 3
> was successfully removed from the MagmaNode).
> If I then debug in Node 2, it again executes
> MagmaServerConsole>>ensureCorrectNodeConfiguration, but since this is not a
> primary, it executes:
> beWarmBackupFor: primaryLocation. This sets up an admin session to the
> primary and registers itself as a warm backup for it. However, this takes a
> lot of time, and in the meantime Node 3, which was still waiting on a reply
> to the original 'MaRemoveSecondaryLocationRequest', times out.
> Furthermore: why does Node 2 have to beWarmBackupFor: aPrimaryLocation if it
> is already a warm backup for that primary location? Is it normal that it
> tries to do that again? And if there are more than 3 nodes (say, for
> instance, 10 or more), each of them again does beWarmBackupFor: the primary.
> The way I fixed it is as follows.
> I added the following:
> MagmaServerConsole>>isWarmBackupFor: primaryLocation
> ^primaryLocation = self node primaryLocation
>
> which answers whether this server console is already a warm backup for the
> given primary location.
> And I added the following:
> MagmaServerConsole>>beWarmBackupFor: primaryLocation
>   (self isWarmBackupFor: primaryLocation)
>     ifTrue: [^nil].
>
> which is a guard clause that checks whether the node is already a warm
> backup for the given primary location; if so, it just bails out early and
> does nothing.
> With this fix, the shutdown of Node 3 works.
> Is this a known issue? Is my solution correct? I do not know enough about
> the internals of Magma to judge it correctly.
> Thanks in advance for any help.
> I've attached a change set for both changed methods. I did not write any
> test for it :-(, and did not run the other Magma tests.
> Kind regards,
> Bart