[Box-Admins] Source.Squeak.org how-to-fix guidelines

Sat Dec 21 17:44:57 UTC 2013

I wish I could provide a TLDR, but really I ask you to try to persist in 
reading through this.  I think, more than I originally expected when I 
started to write this, that it contains a fair amount of my philosophy 
regarding the maintenance of the Squeak servers and services.  It 
certainly turned out much longer than I expected and could use an editor.

Back to the original email:

I mentioned in my recent email about an issue with squeaksource.com that 
it helps if those who know give guidance to everyone else about easy 
things that can be done to fix issues and how to make decisions about 
what to do.

In that vein I will do my best to provide the same for 
source.squeak.org.  Presumably much of this is also true for 
squeaksource.com but I'm not going to assume it.

So the scenario is this: source.squeak.org is unresponsive or the page 
does not render in full; the first is much more common.

First, because it has happened before, you need to make sure the problem
is not further up the chain, that is, that it is not a problem with
Apache on the box2 server.  The easiest way to do this is to check one
or more of the other services that run on box2 and see whether they are
working.

Here is a list (probably not exhaustive):

bugs.squeak.org (Apache/FastCGI PHP)
www.squeak.org (Apache/AIDA Squeak)
lists.squeakfoundation.org (Apache/C & Python CGI)
ezmlm.squeak.org (Apache, defunct but still exists)

The last two are notable in that they rely on very little outside of 
Apache to work.  If Apache is working but those services are not working 
then the server is in bad shape indeed.

So in brief if you go to ezmlm.squeak.org and get a page that says:

Some old Mailing Lists

Click Here for List Archives and Information

and you can click the second line and get a list of defunct/dead mailing 
lists then Apache is probably not the source of the problem.
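
The check above can be scripted.  What follows is only a minimal sketch,
assuming curl is installed locally, that the site answers over plain
http, and that a healthy page contains the heading quoted above.  The
name page_looks_ok is made up for illustration.

```shell
#!/bin/sh
# page_looks_ok: read a page body on stdin and succeed only if it contains
# the heading quoted above. (Hypothetical helper, not an existing tool.)
page_looks_ok() {
    grep -q 'Some old Mailing Lists'
}

# Usage (needs network access): give up after 15 seconds so that a hung
# Apache shows up as a failure instead of blocking forever.
#   if curl -sS --max-time 15 http://ezmlm.squeak.org/ | page_looks_ok; then
#       echo "Apache is serving pages"
#   else
#       echo "Apache may be the problem"
#   fi
```

Matching on the page text rather than only on the HTTP status guards
against an error page that happens to be served successfully.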

If Apache is not working then the thing to do is:

sudo /etc/init.d/apache2 restart

If this gives a problem, or just doesn't work (be patient) then

sudo /etc/init.d/apache2 stop

Note the message printed and wait a moment, if it appears to have 
stopped fine then

sudo /etc/init.d/apache2 start

Again wait a moment, and then check ezmlm.squeak.org again.

If this is still not working then I guess it is time to reboot the
server.  However, I would ask that you really only do this as a last 
resort and that you not do so as a quick decision.  First email 
box-admins, feel free to Cc me directly if you want.  Wait some time, 30 
minutes, an hour, two hours.  You can make your own judgment call on 
that.  I have and it's not always consistent.

The point is that I or someone else may want to look at the situation 
first if only for information gathering purposes.  Ultimately restarting 
the server once or twice without waiting for others to chime in is not 
going to get you in any trouble.  A history of it is likely to start to 
annoy me, I assume it would annoy others as well.  The reasoning is that 
while the system is in its broken state there is the possibility of 
gathering information about the problem that is not recoverable, at 
least not easily, once the system has been rebooted.

OK but if rebooting is the answer then it is simply

sudo reboot

Be aware that the server tends to take several minutes to become
responsive again.  (It has been a while; I remember it seeming like a
long time, but I don't remember how long it really tends to be.)  My
habit after I have been booted off the server is to

ping squeak

And you say 'Huh'?

Yeah, this is beginning to digress, but I will persist nonetheless.  For 
my own convenience some time ago I modified my /etc/hosts file. 
(Clearly we are getting off into the bushes and this is only directly 
applicable if you run Linux and friends locally, if you run MacOSX this 
may still apply to you pretty closely, but I don't know.  Those of you 
on Windows: the fundamental facts are all true but the details have been 
changed.  Google it.)

$ cat /etc/hosts
<snip>
# utility
85.10.195.197	squeak
173.246.101.237 box3
173.246.104.42  box4
<snip>

From the naming pattern you can guess that I started this before box3
and box4 existed.  If you choose to do the same you can use any names 
you like, just don't mask any names used in your local network if there 
are any.

Back to the point at hand:

ken@neue:~$ ping squeak
PING squeak (85.10.195.197) 56(84) bytes of data.
64 bytes from squeak (85.10.195.197): icmp_seq=1 ttl=47 time=128 ms
64 bytes from squeak (85.10.195.197): icmp_seq=2 ttl=47 time=128 ms
64 bytes from squeak (85.10.195.197): icmp_seq=3 ttl=47 time=127 ms

This is what you want to see.  While the server is restarting though, 
and the TCP/IP stack is down, ping is just going to be silent; but it 
will keep trying.  So leave this and go back to whatever else you were 
doing.  Check it occasionally and unless catastrophe has occurred you 
should in time see the above (of course your ping times will vary).

At this point the server has restarted or is still restarting.  It has
come up far enough to answer ICMP echo requests (ping).  That does not
mean everything is working yet, including sshd and Apache.  Nonetheless
you can go ahead and try to ssh to the server or check any of the web
services.  If they don't work immediately, don't despair; it takes
several minutes for everything to come back up even at the best of
times.  Ultimately everything should start back up as normal.  If it
doesn't then it is time to call for help.
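
If you would rather not watch ping by hand, the waiting can be wrapped
in a small retry loop.  This is a sketch only; wait_until is a made-up
helper, and the retry counts and delays in the usage lines are guesses,
not measured boot times.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...: run CMD up to TRIES times, sleeping DELAY
# seconds between failed attempts. Succeeds as soon as CMD succeeds.
# (Hypothetical helper for illustration.)
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        # Skip the sleep once the last attempt has failed.
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Usage after a reboot, with the 'squeak' alias from /etc/hosts:
#   wait_until 60 10 ping -c 1 squeak && echo "answers ping"
#   wait_until 30 10 ssh -o ConnectTimeout=5 squeak true && echo "sshd is up"
```

Checking ssh separately from ping reflects the point above: an answer
to ping only shows that the network stack is up, not that the services
are.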

OK, so end of the 'reboot the server' digression.

The scenario is now this:  source.squeak.org is not working but we have 
checked and we believe that Apache is fine.  This tells us that the 
problem is almost certainly isolated to the source.squeak.org Squeak 
process.  Probably the thing to do is kill it and let daemontools 
restart it.  But don't do that without checking out the image that will 
be used first.

$ ls -lh ~squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 35M Dec 21 16:25 /home/squeaksource/Squeak3.11-8824-SS.image

The main thing to look at here is the file size (5th column).  The size
should be in this vicinity; it does grow, but slowly under normal
conditions.  Sometimes, though, and this is a danger of the easy
save-the-world persistence strategy we use, the image is saved after
something has gone wrong and the heap has grown tremendously.  I believe
I have seen this file saved at between 150 and 200MB before.  When
restarting the image does not work, it is nearly always the case that
this file is much larger than expected.  I can't remember any case in
which it has restarted properly when the file was larger than normal.
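
The size check can also be expressed as a small test.  This is a sketch
under assumptions: image_size_ok is a made-up helper, and the 60MB
threshold in the usage line is my own guess at "well above the usual
35M", not an established limit.

```shell
#!/bin/sh
# image_size_ok FILE MAX_MB: succeed if FILE exists and is no larger than
# MAX_MB megabytes. (Hypothetical helper; the threshold is a judgment call.)
image_size_ok() {
    file=$1
    max_mb=$2
    [ -f "$file" ] || return 1
    # The arithmetic expansion strips any padding wc puts around the count.
    bytes=$(( $(wc -c < "$file") ))
    [ "$bytes" -le $((max_mb * 1024 * 1024)) ]
}

# Usage: the image is normally around 35M, so 60 leaves room for slow growth.
#   image_size_ok ~squeaksource/Squeak3.11-8824-SS.image 60 \
#       || echo "image is suspiciously large; check the backups before restarting"
```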

If you forget to look and just kill the process (I will get to that
shortly) it is not the end of the world.  It may just mean that killing
it and having it restarted does not fix the problem.  Still, in my
opinion it is better to look at the image first and have some confidence
that it is not corrupted.

If it is corrupted you can find recent backups of the bulk of the 
filesystem under /var/cache/rsnapshot/.  This directory will look like

$ ls /var/cache/rsnapshot/
daily.0  daily.1  daily.2  daily.3  daily.4  daily.5  daily.6

daily.0 is the most recent backup (within the last 24 hours), daily.1 
the next most recent, etc.

What I might do then if I'm looking for a good backup image is this

~$ ls -lh /var/cache/rsnapshot/*/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 35M Dec 21 16:25 /var/cache/rsnapshot/daily.0/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 37M Dec 20 17:25 /var/cache/rsnapshot/daily.1/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 35M Dec 19 15:24 /var/cache/rsnapshot/daily.2/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 35M Dec 18 16:24 /var/cache/rsnapshot/daily.3/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 35M Dec 17 12:23 /var/cache/rsnapshot/daily.4/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 35M Dec 16 12:23 /var/cache/rsnapshot/daily.5/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r--  1 squeaksource squeaksource 35M Dec 15 11:22 /var/cache/rsnapshot/daily.6/localhost/home/squeaksource/Squeak3.11-8824-SS.image

The listing is wide for an email but I hope you get the idea.  If
possible you want to use the most recent backup, which will be found
in the daily.0 directory.  But it is very possible that the image was
backed up after it was corrupted, in which case you consider daily.1,
and so on.  In any case, once you find a copy that looks like it is
probably OK, first make a backup copy of the corrupted image and changes
files in case someone wants to take a look at them, then copy both the
image and changes files from the backup you picked into the squeaksource
home directory.
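
As a sketch of that save-then-restore step: restore_image below is a
made-up helper, and it assumes the changes file sits next to the image
and shares its base name (Squeak3.11-8824-SS.changes), which I have not
verified on the server.  Run it as a user who can write to the
squeaksource home directory.

```shell
#!/bin/sh
# restore_image SNAPSHOT_DIR HOME_DIR BASE: keep the suspect BASE.image and
# BASE.changes in HOME_DIR under a timestamped suffix, then copy the same two
# files over from SNAPSHOT_DIR. (Hypothetical helper; it assumes the changes
# file shares the image's base name.)
restore_image() {
    snap=$1
    home=$2
    base=$3
    stamp=$(date +%Y%m%d-%H%M%S)
    for ext in image changes; do
        # Refuse to touch anything if the snapshot is incomplete.
        [ -f "$snap/$base.$ext" ] || return 1
    done
    for ext in image changes; do
        if [ -f "$home/$base.$ext" ]; then
            mv "$home/$base.$ext" "$home/$base.$ext.corrupt-$stamp" || return 1
        fi
        # -p preserves mode and timestamps (and ownership when run as root).
        cp -p "$snap/$base.$ext" "$home/$base.$ext" || return 1
    done
}

# Usage, falling back to daily.1 because daily.0 looked too large:
#   restore_image /var/cache/rsnapshot/daily.1/localhost/home/squeaksource \
#       /home/squeaksource Squeak3.11-8824-SS
```

Restoring the image and changes files as a pair from the same snapshot
matters, since an image paired with a changes file from a different day
may not match.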

Hopefully killing any existing process and having daemontools start it
back up will now work.  To do this you first have to identify the
relevant process.  One way is

$ ps auwx | grep squeaks

which should produce a list something like

root      2150  0.0  0.0  1360  268 ?        S    Nov28   0:00 supervise squeaksource
squeaks   2176  3.6  8.0 1051344 77724 ?     S    Nov28 1256:48 /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none /home/squeaksource/Squeak3.11-8824-SS.image
website  30990 25.2 10.8 1051420 105060 ?    S    16:59   9:31 /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
kencaus   1409  0.0  0.0  1552  524 pts/0    S+   17:37   0:00 grep squeaks

The relevant one is of course the squeakvm process referencing the
proper image, the second one in this list.  Take its process ID (second
column) and do

$ kill 2176

for example.  The ID will of course vary.  Check again, and if the
process is stuck (the one with the given ID does not disappear from the
list) then you may have to

$ kill -9 2176

In any case if you repeatedly look at the relevant list of running 
processes you should soon see another ... squeakvm ... 
Squeak3.11-8824-SS.image process running with a new process ID and 
hopefully if you check http://source.squeak.org you will get what you 
expect.

Let me note that the filtered list of processes above also includes, on
the first line, the daemontools process that 'supervises' the
source.squeak.org service.  If you don't see this in the list then there
is probably a problem with daemontools itself, and in that case I don't
expect a new process to be started when you kill the old one.
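
Picking the process ID and checking for the supervise process can both
be done by filtering the ps output.  This is a sketch only; vm_pid and
supervised are made-up helpers, and they assume each process appears on
one unwrapped line with the image path (or the service name) as the last
field, as in the listing above.

```shell
#!/bin/sh
# vm_pid IMAGE: read `ps auwx` style output on stdin and print the PID
# (second column) of each squeakvm process whose command line ends with
# IMAGE. (Hypothetical helper for illustration.)
vm_pid() {
    awk -v image="$1" '/squeakvm/ && $NF == image { print $2 }'
}

# supervised NAME: succeed if the same kind of input contains a daemontools
# "supervise NAME" process. (Also hypothetical.)
supervised() {
    awk -v name="$1" '
        NF >= 2 && $(NF-1) == "supervise" && $NF == name { found = 1 }
        END { exit !found }'
}

# Usage (the extra w asks ps not to truncate long command lines):
#   ps auwwx | supervised squeaksource \
#       || echo "no supervise process; a killed VM will not be restarted"
#   pid=$(ps auwwx | vm_pid /home/squeaksource/Squeak3.11-8824-SS.image)
#   [ -n "$pid" ] && kill "$pid"
```

Matching on the full image path keeps the website image and the grep
process itself out of the result.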

However I'm going to draw this to a close here and leave that for 
another time.

Ken