[Box-Admins] Source.Squeak.org how-to-fix guidelines
ken at kencausey.com
Sat Dec 21 17:52:25 UTC 2013
See an 'edit' below:
On 12/21/2013 11:44 AM, Ken Causey wrote:
> I wish I could provide a TLDR, but really I ask you to try to persist in
> reading through this. I think, more than I originally expected when I
> started to write this, that it contains a fair amount of my philosophy
> regarding the maintenance of the Squeak servers and services. It
> certainly turned out much longer than I expected and could use an editor.
> Back to the original email:
> I mentioned in my recent email about an issue with squeaksource.com that
> it helps if those who know give guidance to everyone else about easy
> things that can be done to fix issues and how to make decisions about
> what to do.
> In that vein I will do my best to provide the same for
> source.squeak.org. Presumably much of this is also true for
> squeaksource.com but I'm not going to assume it.
> So the scenario is this: source.squeak.org is unresponsive or the page
> does not render in full; the first is much more common.
> First, because it has happened before, you need to ensure the problem is
> not at a higher/later level, that is that it is not a problem with
> Apache on the box2 server. The easiest way to do this is to check any
> other service, perhaps most other services, that run on box2 and see if
> they are working OK or not.
> Here is a list (probably not exhaustive):
> bugs.squeak.org (Apache/FastCGI PHP)
> www.squeak.org (Apache/AIDA Squeak)
> lists.squeakfoundation.org (Apache/C & Python CGI)
> ezmlm.squeak.org (Apache, defunct but still exists)
> The last two are notable in that they rely on very little outside of
> Apache to work. If Apache is working but those services are not working
> then the server is in bad shape indeed.
> So in brief if you go to ezmlm.squeak.org and get a page that says:
> Some old Mailing Lists
> Click Here for List Archives and Information
> and you can click the second line and get a list of defunct/dead mailing
> lists then Apache is probably not the source of the problem.
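
The "check the other services" step can be scripted from your own machine. What follows is only a minimal sketch, assuming curl is installed locally; the function names http_ok and check_box2_sites are made up for illustration, and the host list is simply the one given above.

```shell
#!/bin/sh
# Hypothetical helper: treat any 2xx/3xx HTTP status as "the site answered".
http_ok() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: probe each site listed in this email and report it.
# curl prints 000 as the status when it cannot connect or times out.
check_box2_sites() {
    for host in ezmlm.squeak.org lists.squeakfoundation.org \
                bugs.squeak.org www.squeak.org; do
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 15 "http://$host/")
        if http_ok "$code"; then
            echo "$host: OK ($code)"
        else
            echo "$host: FAILED ($code)"
        fi
    done
}
```

Run check_box2_sites after loading the functions. If every host fails, Apache (or the whole box) is the likely suspect; if only source.squeak.org is down, the problem is probably further in.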
> If Apache is not working then the thing to do is:
> sudo /etc/init.d/apache2 restart
> If this gives a problem, or just doesn't work (be patient) then
> sudo /etc/init.d/apache2 stop
> Note the message printed and wait a moment. If it appears to have
> stopped fine then
> sudo /etc/init.d/apache2 start
> Again wait a moment, and then check ezmlm.squeak.org again.
> If this is still not working then I guess it is time to reboot the
> server. However, I would ask that you really only do this as a last
> resort and that you not do so as a quick decision. First email
> box-admins, feel free to Cc me directly if you want. Wait some time, 30
> minutes, an hour, two hours. You can make your own judgment call on
> that. I have and it's not always consistent.
> The point is that I or someone else may want to look at the situation
> first if only for information gathering purposes. Ultimately restarting
> the server once or twice without waiting for others to chime in is not
> going to get you in any trouble. A history of it is likely to start to
> annoy me, I assume it would annoy others as well. The reasoning is that
> while the system is in its broken state there is the possibility of
> gathering information about the problem that is not recoverable, at
> least not easily, once the system has been rebooted.
> OK but if rebooting is the answer then it is simply
> sudo reboot
> Be aware that the server tends to take several minutes to become
> responsive again. (It has been a while; I remember it seeming like a
> long time, but I don't remember how long it really tends to be.) My
> habit after I have been booted off the server is to
> ping squeak
> And you say 'Huh'?
> Yeah, this is beginning to digress, but I will persist nonetheless. For
> my own convenience some time ago I modified my /etc/hosts file. (Clearly
> we are getting off into the bushes and this is only directly applicable
> if you run Linux and friends locally, if you run MacOSX this may still
> apply to you pretty closely, but I don't know. Those of you on Windows:
> the fundamental facts are all true but the details have been changed.
> Google it.)
> $ cat /etc/hosts
> # utility
> 184.108.40.206 squeak
> 220.127.116.11 box3
> 18.104.22.168 box4
> From the naming pattern you can guess that I started this before box3
> and box4 existed. If you choose to do the same you can use any names
> you like, just don't mask any names used in your local network if there
> are any.
> Back to the point at hand:
> ken at neue:~$ ping squeak
> PING squeak (22.214.171.124) 56(84) bytes of data.
> 64 bytes from squeak (126.96.36.199): icmp_seq=1 ttl=47 time=128 ms
> 64 bytes from squeak (188.8.131.52): icmp_seq=2 ttl=47 time=128 ms
> 64 bytes from squeak (184.108.40.206): icmp_seq=3 ttl=47 time=127 ms
> This is what you want to see. While the server is restarting though,
> and the TCP/IP stack is down, ping is just going to be silent; but it
> will keep trying. So leave this and go back to whatever else you were
> doing. Check it occasionally and unless catastrophe has occurred you
> should in time see the above (of course your ping times will vary).
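
If you would rather be told when the box answers than keep glancing at the ping window, a small retry loop does the same job. This is a sketch only: wait_for_host and ping_once are hypothetical names, 'squeak' is the /etc/hosts alias from above, and the -W timeout is in seconds on Linux iputils ping but differs on other systems.

```shell
#!/bin/sh
# Hypothetical helper: send one echo request, waiting at most 2 seconds.
# The -W timeout is in seconds with Linux iputils ping; other systems differ.
ping_once() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Hypothetical helper: retry the probe until it succeeds or attempts run out.
# Usage: wait_for_host HOST [MAX_TRIES] [DELAY_SECONDS] [PROBE_COMMAND]
wait_for_host() {
    host=$1
    max_tries=${2:-120}
    delay=${3:-5}
    probe=${4:-ping_once}
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$probe" "$host"; then
            echo "$host answered on attempt $try"
            return 0
        fi
        sleep "$delay"
        try=$((try + 1))
    done
    echo "$host still silent after $max_tries attempts" >&2
    return 1
}
```

For example, 'wait_for_host squeak' retries for roughly ten minutes with the defaults. The probe is passed in as a parameter so it can be swapped for something else, such as an ssh or HTTP check.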
> At this point the server has restarted or is still restarting. It has
> reached the point that the echo service is working. That however does
> not mean everything is working yet, including sshd and Apache.
> Nonetheless you can go ahead and try to ssh to the server or check any
> of the web services. But if they don't immediately work, don't despair;
> it takes minutes for everything to come back up at the best of times.
> Ultimately everything should start back up as normal. If it doesn't
> then it is time to call for help.
> OK, so end of the 'reboot the server' digression.
> The scenario is now this: source.squeak.org is not working but we have
> checked and we believe that Apache is fine. This tells us that the
> problem is almost certainly isolated to the source.squeak.org Squeak
> process. Probably the thing to do is kill it and let daemontools
> restart it. But don't do that without checking out the image that will
> be used first.
> $ ls -lh ~squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25
> The main thing to look at here is the file size (5th column). The size
> of the file should be in this vicinity; it does grow, but slowly. That
> is, it grows slowly under normal conditions. Sometimes, and this is a
> danger of the easy save-the-world persistence strategy we use, the image
> is saved when something has gone wrong and the heap has grown
> tremendously. I believe I have seen this file saved at between
> 150-200MB before. When restarting the image does not work then it is
> nearly always the case that this file is much larger than expected. I
> can't remember any case in which it has restarted properly when the file
> is larger than normal.
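
The size check is easy to turn into a yes/no test. A minimal sketch follows; image_size_ok is a hypothetical name, and the 100 MB default is only my assumption of a line somewhere between the normal ~35M and the 150-200MB seen in bad saves, not an established limit.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE exists and is at most MAX_MB
# megabytes. The 100 MB default is an assumption, not a documented limit.
image_size_ok() {
    file=$1
    max_mb=${2:-100}
    [ -f "$file" ] || { echo "$file: not found" >&2; return 2; }
    bytes=$(wc -c < "$file")
    mb=$((bytes / 1024 / 1024))
    if [ "$mb" -le "$max_mb" ]; then
        echo "$file: ${mb}MB, looks normal"
        return 0
    fi
    echo "$file: ${mb}MB, larger than ${max_mb}MB, treat as suspect"
    return 1
}
```

Usage would be something like 'image_size_ok ~squeaksource/Squeak3.11-8824-SS.image'. A passing check is not proof that the image is healthy, only that it has not ballooned.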
> If you forget to look and just kill the process (I will get to that
> shortly) it is not the end of the world. It may just mean that killing
> it and having it restarted does not fix the problem. Still, in my
> opinion it is better to look at the image first and have some
> confidence that it is not corrupted.
> If it is corrupted you can find recent backups of the bulk of the
> filesystem under /var/cache/rsnapshot/. This directory will look like
> $ ls /var/cache/rsnapshot/
> daily.0 daily.1 daily.2 daily.3 daily.4 daily.5 daily.6
> daily.0 is the most recent backup (within the last 24 hours), daily.1
> the next most recent, etc.
> What I might do then if I'm looking for a good backup image is this
> ~$ ls -lh
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25
> -rw-r--r-- 1 squeaksource squeaksource 37M Dec 20 17:25
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 19 15:24
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 18 16:24
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 17 12:23
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 16 12:23
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 15 11:22
> It's not well formatted in this email but I hope you get the idea. If
> possible you want to use the most recent backup which will be found
> within the daily.0 directory. But it is very possible that the backup
> backed it up after it was corrupted, in which case you consider daily.1,
> and so on. In any case once you find a copy that looks like it is
> probably OK then make a backup copy of the corrupted image and changes
> just in case someone wants to take a look at it, then copy over both the
> image and changes from the backup you picked into the squeaksource home
> directory.
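
Walking the snapshots from newest to oldest can be sketched as below. pick_snapshot is a hypothetical helper, and the path of the image inside each daily.N directory is an assumption that you pass in yourself; look at the real layout under /var/cache/rsnapshot/ before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the newest snapshot copy of a file that is no
# larger than MAX_BYTES. REL_PATH is the file's path inside each daily.N
# directory; that layout is an assumption, so check it on the server first.
# Usage: pick_snapshot SNAPSHOT_ROOT REL_PATH MAX_BYTES
pick_snapshot() {
    root=$1
    rel_path=$2
    max_bytes=$3
    n=0
    while [ -d "$root/daily.$n" ]; do
        candidate="$root/daily.$n/$rel_path"
        if [ -f "$candidate" ]; then
            size=$(( $(wc -c < "$candidate") ))
            if [ "$size" -le "$max_bytes" ]; then
                echo "$candidate"
                return 0
            fi
        fi
        n=$((n + 1))
    done
    return 1
}
```

It only prints a candidate; the copying is left to you on purpose. Set the corrupted image and changes aside first (cp -p keeps the timestamps), then copy the chosen image and its matching changes file into place.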
> Hopefully now killing any existing process and having daemontools start
> it back up will work. And to do this you first have to identify
> the relevant process. One way is
David, in an email that came in just after I hit send on this one,
reminded me that there is a simpler way (than that shown below). The
quick way to restart any daemontools monitored service is
$ sudo svc -t <service name>
in this case
$ sudo svc -t squeaksource
Don't remember the service name?
$ ls -l /service/ | grep squeaksource
lrwxrwxrwx 1 root root 26 Oct 10 2006 squeaksource ->
I specified squeaksource above because that is the name of the user/home
directory under which the service info resides for the service in
question. Of course if you are trying to restart a different service
then substitute the other username as appropriate.
The info below is still of some value if svc -t does not seem to be
working. 98% of the time though, I expect it will.
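
To confirm that svc -t really did restart the service, daemontools' svstat reports the state, pid and uptime. A sketch, with svstat_pid and restart_service as hypothetical helper names: svc and svstat take the service directory as their argument, so the full /service/<name> path is passed here, and the sample line in the comment shows the usual svstat format rather than real output from box2.

```shell
#!/bin/sh
# Hypothetical helper: pull the pid out of one line of svstat output, e.g.
#   /service/squeaksource: up (pid 2176) 1992000 seconds
# Prints nothing and fails if the line does not report the service as up.
svstat_pid() {
    pid=$(printf '%s\n' "$1" | sed -n 's/^.*: up (pid \([0-9][0-9]*\)).*$/\1/p')
    if [ -n "$pid" ]; then
        echo "$pid"
        return 0
    fi
    return 1
}

# Hypothetical helper: restart a supervised service and show its new state.
# Usage: restart_service squeaksource
restart_service() {
    dir=/service/$1
    sudo svstat "$dir"      # state and pid before the restart
    sudo svc -t "$dir"      # send TERM; supervise starts the service again
    sleep 5
    sudo svstat "$dir"      # expect "up" with a new pid and a small uptime
}
```

If the pid is unchanged and the uptime keeps counting after svc -t, the process ignored the TERM and the manual approach below is the fallback.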
> $ ps auwx | grep squeaks
> which should produce a list something like
> root 2150 0.0 0.0 1360 268 ? S Nov28 0:00 supervise
> squeaks 2176 3.6 8.0 1051344 77724 ? S Nov28 1256:48
> /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none
> website 30990 25.2 10.8 1051420 105060 ? S 16:59 9:31
> /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
> kencaus 1409 0.0 0.0 1552 524 pts/0 S+ 17:37 0:00 grep squeaks
> The relevant one is the squeakvm process referencing the proper image of
> course, the second one in this list. At which point you would take the
> process ID (second column) and do
> $ kill 2176
> for example. The ID will of course vary. Check again, and if the
> process is stuck (the one with the given ID does not disappear from the
> list), then you may have to
> $ kill -9 2176
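
The "plain kill first, kill -9 only if it is stuck" sequence can be wrapped up as follows. This is a sketch, not something installed on the server: stop_pid is a hypothetical name and the 10 second grace period is an arbitrary assumption. Run it as a user allowed to signal the process.

```shell
#!/bin/sh
# Hypothetical helper: ask a process to exit, escalating to KILL if needed.
# Usage: stop_pid PID [GRACE_SECONDS]
stop_pid() {
    pid=$1
    grace=${2:-10}
    if ! kill "$pid" 2>/dev/null; then
        # kill fails when the pid is gone or is not ours to signal
        echo "cannot signal pid $pid" >&2
        return 2
    fi
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after plain TERM
        sleep 1
        waited=$((waited + 1))
    done
    echo "pid $pid ignored TERM for ${grace}s, sending KILL" >&2
    kill -9 "$pid" 2>/dev/null
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        return 1                                 # still there even after KILL
    fi
    return 0
}
```

The point of the grace period is the same as above: give the process a chance to exit on the plain signal before forcing it.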
> In any case if you repeatedly look at the relevant list of running
> processes you should soon see another ... squeakvm ...
> Squeak3.11-8824-SS.image process running with a new process ID and
> hopefully if you check http://source.squeak.org you will get what you
> expect.
> Let me note that the filtered list of processes above also includes, on
> the first line, the daemontools process that 'supervises' the
> source.squeak.org service. Note that if you don't see this in the list
> then there is probably a problem with daemontools itself and in any case
> when you kill the process I don't expect that a new one will be started.
> However I'm going to draw this to a close here and leave that for
> another time.