[squeak-dev] trunk process resilience

David T. Lewis lewis at mail.msen.com
Fri Nov 8 18:26:36 UTC 2013


On Thu, Nov 07, 2013 at 03:07:59PM -0600, Chris Muller wrote:
> Lately we've had some problems with the SqueakSource server that supports
> our vital trunk process.  Ken and I burned several hours on it this week.
> The experience has caused me to consider an idea for improved continuity
> of our trunk repository.
> 
> Very simply, it's a second running copy of trunk (and inbox, et al).  Each
> instance keeps itself up to date from the other.  If one goes down, the
> other can be pointed to for updates AND commits to minimize disruption.
> 
> Right now, we actually already have two trunks.  Now, I'm pleased to
> announce that new-trunk running on box4.squeak.org is now a *full-copy* of
> old-trunk on box2.  (Before it was only trunk, now it includes Inbox,
> Etoys, etc.).  Using newer and better code and VM and also Magma, this copy
> of trunk was originally brought up simply to provide MC method history
> directly into the IDE, but now I can see its role being to improve trunk
> process stability so that community development can be continuous until it
> eventually becomes the de facto trunk (e.g., running source.squeak.org).
> 
> There are other side-benefits too, like the ability to move or upgrade the
> trunk without a service interruption.  We are assured to be ready to move
> to a different server at a moment's notice, e.g., break the link with
> Hetzner.
> 

I like the idea of building some resilience into the SqueakSource servers.
I also like the idea of using Magma to support this, because I know that
Magma has been used to address similar issues on much larger scale systems.

I do have some concerns of a non-technical nature:

1) From an operational point of view, we need to keep our systems as simple
as possible. There are very few people supporting the servers, and their
availability comes and goes over time, so we need to keep things simple
enough that any box-admins person can always figure out how to get things
running even if the expert is not available.

2) We need to be careful not to add more failure modes than we remove. This
is a painfully common mistake: people add high-availability features to an
existing system, and the new failure modes they introduce turn out to be
worse than the ones they were attempting to mitigate.

As an example, I would point to the recent downtime on GitHub
(see the excellent recap provided by Philippe Marschall at
https://github.com/blog/1346-network-problems-last-friday). The system
had availability problems for an extended period of time, and the cause
was a (human-error-induced) failure in some redundant networking gear.
The high-availability networking introduced additional failure modes, and
the combination of human error and system complexity reduced the resilience
of the system as a whole.

This is meant only as a cautionary note. I really *do* like the idea of
building in some redundancy, and I think that the work you (Chris) have
done with box4.squeak.org might be a good way to do it.

> 
> So, I guess I'm proposing that we have some elements in the image "aware"
> of a second trunk.  But before wrangling out exactly what form that
> awareness would take, what do you think so far?
> 

We should keep any changes in the image to a minimum, but the general idea
sounds good to me.

Dave


