[Box-Admins] Source.Squeak.org how-to-fix guidelines
Ken Causey
ken at kencausey.com
Sat Dec 21 17:44:57 UTC 2013
I wish I could provide a TLDR, but really I ask you to try to persist in
reading through this. I think, more than I originally expected when I
started to write this, that it contains a fair amount of my philosophy
regarding the maintenance of the Squeak servers and services. It
certainly turned out much longer than I expected and could use an editor.
Back to the original email:
I mentioned in my recent email about an issue with squeaksource.com that
it helps if those who know give guidance to everyone else about easy
things that can be done to fix issues and how to make decisions about
what to do.
In that vein I will do my best to provide the same for
source.squeak.org. Presumably much of this is also true for
squeaksource.com but I'm not going to assume it.
So the scenario is this: source.squeak.org is unresponsive or the page
does not render in full; the first is much more common.
First, because it has happened before, you need to make sure the problem
is not at a higher/later level, that is, that it is not a problem with
Apache on the box2 server. The easiest way to do this is to check one
other service, or better several of them, that run on box2 and see
whether they are working.
Here is a list (probably not exhaustive):
bugs.squeak.org (Apache/FastCGI PHP)
www.squeak.org (Apache/AIDA Squeak)
lists.squeakfoundation.org (Apache/C & Python CGI)
ezmlm.squeak.org (Apache, defunct but still exists)
The last two are notable in that they rely on very little outside of
Apache to work. If Apache is working but those services are not working
then the server is in bad shape indeed.
So in brief if you go to ezmlm.squeak.org and get a page that says:
Some old Mailing Lists
Click Here for List Archives and Information
and you can click the second line and get a list of defunct/dead mailing
lists then Apache is probably not the source of the problem.
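As a rough sketch of the check just described, the following shell
function reports whether a fetched page contains the expected heading.
The function name ezmlm_page_ok is made up for illustration, and the curl
invocation in the comment assumes curl is installed locally and that the
site answers over plain HTTP; neither is stated above.

```shell
# Hypothetical helper: succeed if the page text on stdin contains the
# heading that ezmlm.squeak.org is expected to show.
ezmlm_page_ok() {
    grep -q 'Some old Mailing Lists'
}

# Possible usage (needs network access and curl; the URL scheme is assumed):
#   curl -fsS --max-time 20 http://ezmlm.squeak.org/ | ezmlm_page_ok \
#       && echo "Apache is probably fine"
```

This only covers the first half of the manual check; clicking through to
the list of old mailing lists is still worth doing by hand.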
If Apache is not working then the thing to do is:
sudo /etc/init.d/apache2 restart
If this gives a problem, or just doesn't work (be patient) then
sudo /etc/init.d/apache2 stop
Note the message printed and wait a moment. If it appears to have
stopped fine then
sudo /etc/init.d/apache2 start
Again wait a moment, and then check ezmlm.squeak.org again.
If this is still not working then I guess it is time to reboot the
server. However, I would ask that you really only do this as a last
resort and that you not do so as a quick decision. First email
box-admins, feel free to Cc me directly if you want. Wait some time, 30
minutes, an hour, two hours. You can make your own judgment call on
that. I have and it's not always consistent.
The point is that I or someone else may want to look at the situation
first if only for information gathering purposes. Ultimately restarting
the server once or twice without waiting for others to chime in is not
going to get you in any trouble. A history of it is likely to start to
annoy me, I assume it would annoy others as well. The reasoning is that
while the system is in its broken state there is the possibility of
gathering information about the problem that is not recoverable, at
least not easily, once the system has been rebooted.
OK but if rebooting is the answer then it is simply
sudo reboot
Be aware that the server tends to take multiple minutes to become
responsive again (it has been a while; I remember it seeming like a long
time, but I don't remember how long it really tends to be). My habit
after I have been booted off the server is to
ping squeak
And you say 'Huh'?
Yeah, this is beginning to digress, but I will persist nonetheless. For
my own convenience some time ago I modified my /etc/hosts file.
(Clearly we are getting off into the bushes and this is only directly
applicable if you run Linux and friends locally, if you run MacOSX this
may still apply to you pretty closely, but I don't know. Those of you
on Windows: the fundamental facts are all true but the details have been
changed. Google it.)
$ cat /etc/hosts
<snip>
# utility
85.10.195.197 squeak
173.246.101.237 box3
173.246.104.42 box4
<snip>
From the naming pattern you can guess that I started this before box3
and box4 existed. If you choose to do the same you can use any names
you like, just don't mask any names used in your local network if there
are any.
Back to the point at hand:
ken@neue:~$ ping squeak
PING squeak (85.10.195.197) 56(84) bytes of data.
64 bytes from squeak (85.10.195.197): icmp_seq=1 ttl=47 time=128 ms
64 bytes from squeak (85.10.195.197): icmp_seq=2 ttl=47 time=128 ms
64 bytes from squeak (85.10.195.197): icmp_seq=3 ttl=47 time=127 ms
This is what you want to see. While the server is restarting though,
and the TCP/IP stack is down, ping is just going to be silent; but it
will keep trying. So leave this and go back to whatever else you were
doing. Check it occasionally and unless catastrophe has occurred you
should in time see the above (of course your ping times will vary).
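If you would rather be told when the server answers than check by hand,
something like the following could work. It is only a sketch: the
function name wait_for_host and the PROBE override are invented here (the
override exists so the loop can be tried without a network), and it sends
one ping per attempt instead of leaving a single ping running.

```shell
# wait_for_host HOST [INTERVAL_SECONDS]
# Poll HOST with one ping per attempt until it answers, then say so.
# PROBE is a hypothetical hook: set it to another command to replace ping.
wait_for_host() {
    host=$1
    interval=${2:-10}
    # The expansion is deliberately unquoted so "ping -c 1" splits into words.
    until ${PROBE:-ping -c 1} "$host" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "$host is answering"
}

# Possible usage, with the /etc/hosts alias from above:
#   wait_for_host squeak 15
```

As noted, an answer to ping only means the network stack is up, not that
sshd or Apache are ready.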
At this point the server has restarted or is still restarting. It has
reached the point that the echo service is working. That however does
not mean everything is working yet, including sshd and Apache.
Nonetheless you can go ahead and try to ssh to the server or check any of
the web services. But if they don't immediately work, don't despair: it
takes minutes for everything to come back up, even at the best of times.
Ultimately everything should start back up as normal. If it doesn't
then it is time to call for help.
OK, so end of the 'reboot the server' digression.
The scenario is now this: source.squeak.org is not working but we have
checked and we believe that Apache is fine. This tells us that the
problem is almost certainly isolated to the source.squeak.org Squeak
process. Probably the thing to do is kill it and let daemontools
restart it. But don't do that without first checking the image that
will be used.
$ ls -lh ~squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25 /home/squeaksource/Squeak3.11-8824-SS.image
The main thing to look at here is the file size (5th column). The size
of the file should be in this vicinity; it does grow, but slowly. That
is, it grows slowly under normal conditions. Sometimes, and this is a
danger of the easy save-the-world persistence strategy we use, the image
is saved when something has gone wrong and the heap has grown
tremendously. I believe I have seen this file saved at between 150 and
200MB before. When restarting the image does not work, it is nearly
always the case that this file is much larger than expected. I can't
remember any case in which it has restarted properly when the file is
larger than normal.
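The size check can be sketched as a small shell function. The name
image_size_ok and the default limit of 100 MB are assumptions chosen for
illustration; the only figures given above are roughly 35M for a healthy
image and 150-200MB for a bad one, so pick whatever limit you are
comfortable with.

```shell
# image_size_ok FILE [MAX_MB]
# Print the size of FILE in whole MB and succeed only if it is at most
# MAX_MB. The default of 100 is an assumed threshold, not an official one.
image_size_ok() {
    image=$1
    max_mb=${2:-100}
    bytes=$(wc -c < "$image") || return 2   # unreadable or missing file
    mb=$(( bytes / 1024 / 1024 ))
    echo "$image: ${mb} MB"
    [ "$mb" -le "$max_mb" ]
}

# Possible usage:
#   image_size_ok ~squeaksource/Squeak3.11-8824-SS.image || echo "suspicious"
```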
If you forget to look and just kill the process (I will get to that
shortly) it is not the end of the world. It may just mean that killing
it and having it restarted does not fix the problem. In my opinion,
though, it is better to look at the image first and have some confidence
that it is not corrupted.
If it is corrupted you can find recent backups of the bulk of the
filesystem under /var/cache/rsnapshot/. This directory will look like
$ ls /var/cache/rsnapshot/
daily.0 daily.1 daily.2 daily.3 daily.4 daily.5 daily.6
daily.0 is the most recent backup (within the last 24 hours), daily.1
the next most recent, etc.
What I might do then if I'm looking for a good backup image is this
~$ ls -lh /var/cache/rsnapshot/*/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25 /var/cache/rsnapshot/daily.0/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 37M Dec 20 17:25 /var/cache/rsnapshot/daily.1/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 19 15:24 /var/cache/rsnapshot/daily.2/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 18 16:24 /var/cache/rsnapshot/daily.3/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 17 12:23 /var/cache/rsnapshot/daily.4/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 16 12:23 /var/cache/rsnapshot/daily.5/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 15 11:22 /var/cache/rsnapshot/daily.6/localhost/home/squeaksource/Squeak3.11-8824-SS.image
If possible you want to use the most recent backup, which will be found
within the daily.0 directory. But it is very possible that the backup
backed it up after it was corrupted, in which case you consider daily.1,
and so on. In any case once you find a copy that looks like it is
probably OK then make a backup copy of the corrupted image and changes
just in case someone wants to take a look at it, then copy over both the
image and changes from the backup you picked into the squeaksource home
directory.
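The copy steps in the previous paragraph might look roughly like this.
Treat it as a sketch: the function name restore_image and the
".corrupt-<timestamp>" suffix are invented for illustration, and it
assumes the changes file shares the image's base name with a .changes
extension, which is not spelled out above.

```shell
# restore_image SNAPSHOT_DIR HOME_DIR BASENAME
# For both BASENAME.image and BASENAME.changes: keep the suspect file
# under a new name in HOME_DIR, then copy the snapshot's version over it.
restore_image() {
    snap=$1
    home=$2
    base=$3
    stamp=$(date +%Y%m%d-%H%M%S)
    for ext in image changes; do
        # Preserve the possibly corrupted file in case someone wants it.
        cp -p "$home/$base.$ext" "$home/$base.$ext.corrupt-$stamp" || return 1
        # Replace it with the copy from the chosen backup.
        cp -p "$snap/$base.$ext" "$home/$base.$ext" || return 1
    done
}

# Possible usage, if daily.1 turned out to be the newest good backup:
#   restore_image /var/cache/rsnapshot/daily.1/localhost/home/squeaksource \
#                 /home/squeaksource Squeak3.11-8824-SS
```

It needs to run as a user that can write to the squeaksource home
directory, and the copies of the bad files take up space until someone
removes them.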
Hopefully now killing any existing process and having daemontools start
it back up will work. To do this you first have to identify the
relevant process. One way is
$ ps auwx | grep squeaks
which should produce a list something like
root 2150 0.0 0.0 1360 268 ? S Nov28 0:00 supervise squeaksource
squeaks 2176 3.6 8.0 1051344 77724 ? S Nov28 1256:48 /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none /home/squeaksource/Squeak3.11-8824-SS.image
website 30990 25.2 10.8 1051420 105060 ? S 16:59 9:31 /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
kencaus 1409 0.0 0.0 1552 524 pts/0 S+ 17:37 0:00 grep squeaks
The relevant one is, of course, the squeakvm process referencing the
proper image, the second one in this list. Take its process ID (second
column) and do
$ kill 2176
for example. The ID will of course vary. Check again, and if the
process is stuck (the one with the given ID does not disappear from the
list) then you may have to
$ kill -9 2176
In any case if you repeatedly look at the relevant list of running
processes you should soon see another ... squeakvm ...
Squeak3.11-8824-SS.image process running with a new process ID and
hopefully if you check http://source.squeak.org you will get what you
expect.
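Picking the process ID out of the ps listing by eye is easy enough, but
for what it is worth here is a sketch that does it with awk. The
function name squeaksource_pid is made up, and it assumes the column
layout shown in the listing above, where the command is the 11th field
of each line of "ps auwx" output.

```shell
# squeaksource_pid [IMAGE_NAME]
# Read "ps auwx" output on stdin and print the PID (2nd column) of every
# squeakvm process whose command line mentions IMAGE_NAME.
squeaksource_pid() {
    awk -v img="${1:-Squeak3.11-8824-SS.image}" \
        '$11 ~ /squeakvm$/ && index($0, img) { print $2 }'
}

# Possible usage:
#   pid=$(ps auwx | squeaksource_pid)
#   kill "$pid"
```

Matching on the command column rather than the whole line keeps the
grep or awk process itself out of the result.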
Let me note that the filtered list of processes above also includes, on
the first line, the daemontools process that 'supervises' the
source.squeak.org service. Note that if you don't see this in the list
then there is probably a problem with daemontools itself, and in that
case I don't expect that a new process will be started when you kill the
old one.
However I'm going to draw this to a close here and leave that for
another time.
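Checking for the supervising process before killing anything can be
sketched the same way. The function name supervise_running is invented,
and it assumes, as in the listing above, that the command appears as
"supervise squeaksource" in the 11th and 12th fields of "ps auwx" output.

```shell
# supervise_running [SERVICE_NAME]
# Read "ps auwx" output on stdin; succeed only if a "supervise SERVICE_NAME"
# process is present.
supervise_running() {
    awk -v svc="${1:-squeaksource}" \
        '$11 == "supervise" && $12 == svc { found = 1 } END { exit !found }'
}

# Possible usage:
#   ps auwx | supervise_running || echo "no supervise process; ask for help"
```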
Ken