[Box-Admins] Source.Squeak.org how-to-fix guidelines
David T. Lewis
lewis at mail.msen.com
Sat Dec 21 23:23:00 UTC 2013
Thanks for providing this Ken.
I have read through this quickly and will read more carefully again later.
You quite rightly raised the question of what we as box-admins should
do to document our processes, and in that light I think it would be very
helpful if you can save the information in this note in a README file
(or README-box-admins or something of that sort). I guess in this case
the README might be located in ~website on box2.
I know this is not a very high-tech solution, but pretty much everybody
understands the convention, regardless of their computing background or
level of experience. So I think this might be a small thing that would
really help. The same applies to build.squeak.org or anything else that
one of us sets up on the servers - if we leave some sort of README breadcrumbs
behind, it can be a big help to whoever needs to figure out how to fix
a problem under pressure when the original expert is not available.
Regarding the current squeaksource.com problem:
The problem with our squeaksource.com web page rendering is internal to
the squeaksource image itself. I get the same symptoms running the image
on my PC at home, and have no problems with a two week old backup copy
of the same image.
I just posted a question to the seaside and squeak-dev lists asking if
anyone has seen this kind of problem before.
For now, the SSC repository is fine, but the web interface is a mess.
If I cannot figure out the problem by tomorrow, I will restart from the
two week old backup image and tidy up whatever problems might arise from
On Sat, Dec 21, 2013 at 11:44:57AM -0600, Ken Causey wrote:
> I wish I could provide a TLDR, but really I ask you to try to persist in
> reading through this. I think, more than I originally expected when I
> started to write this, that it contains a fair amount of my philosophy
> regarding the maintenance of the Squeak servers and services. It
> certainly turned out much longer than I expected and could use an editor.
> Back to the original email:
> I mentioned in my recent email about an issue with squeaksource.com that
> it helps if those who know give guidance to everyone else about easy
> things that can be done to fix issues and how to make decisions about
> what to do.
> In that vein I will do my best to provide the same for
> source.squeak.org. Presumably much of this is also true for
> squeaksource.com but I'm not going to assume it.
> So the scenario is this: source.squeak.org is unresponsive or the page
> does not render in full; the first is much more common.
> First, because it has happened before, you need to ensure the problem is
> not at a higher/later level, that is, that it is not a problem with
> Apache on the box2 server. The easiest way to do this is to check any
> other service, perhaps most other services, that run on box2 and see if
> they are working OK or not.
> Here is a list (probably not exhaustive):
> bugs.squeak.org (Apache/FastCGI PHP)
> www.squeak.org (Apache/AIDA Squeak)
> lists.squeakfoundation.org (Apache/C & Python CGI)
> ezmlm.squeak.org (Apache, defunct but still exists)
> The last two are notable in that they rely on very little outside of
> Apache to work. If Apache is working but those services are not working
> then the server is in bad shape indeed.
> So in brief if you go to ezmlm.squeak.org and get a page that says:
> Some old Mailing Lists
> Click Here for List Archives and Information
> and you can click the second line and get a list of defunct/dead mailing
> lists then Apache is probably not the source of the problem.
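A minimal sketch of that check as a script, assuming curl is available on the machine you are checking from. The function names are made up for illustration, and the match string is simply the first line of the page as quoted above.

```shell
#!/bin/sh
# Sketch only: decide whether Apache on box2 looks healthy by fetching
# the static ezmlm page and looking for the heading quoted in the note.
# ASSUMPTION: curl is installed; the function names are hypothetical.

# page_looks_ok: read a page body on stdin and succeed if it contains
# the expected heading text.
page_looks_ok() {
    grep -q "Some old Mailing Lists"
}

# apache_check URL: fetch the page (15 second limit) and test the body.
apache_check() {
    if curl -fs --max-time 15 "$1" | page_looks_ok; then
        echo "Apache is probably fine"
    else
        echo "Apache may be the problem"
        return 1
    fi
}

# Usage (not run automatically):
#   apache_check http://ezmlm.squeak.org/
```

This only covers the first half of the manual check; clicking through to the list of defunct mailing lists is still worth doing by hand.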
> If Apache is not working then the thing to do is:
> sudo /etc/init.d/apache2 restart
> If this gives a problem, or just doesn't work (be patient) then
> sudo /etc/init.d/apache2 stop
> Note the message printed and wait a moment; if it appears to have
> stopped fine then
> sudo /etc/init.d/apache2 start
> Again wait a moment, and then check ezmlm.squeak.org again.
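The restart-then-stop/start sequence above could be wrapped up roughly like this. It is a sketch, not something installed on box2. The SUDO and PAUSE variables are hypothetical knobs added here so the sequence can be rehearsed with SUDO=echo before running it for real.

```shell
#!/bin/sh
# Sketch only: try a restart first, then fall back to stop, pause, start.
# ASSUMPTION: SUDO and PAUSE are hypothetical variables, not part of the
# original procedure. SUDO=echo prints the commands instead of running them.
SUDO=${SUDO:-sudo}
INIT=/etc/init.d/apache2

apache_restart() {
    if $SUDO "$INIT" restart; then
        return 0
    fi
    echo "restart failed, trying stop then start" >&2
    # Only go on to start if the stop reported success.
    $SUDO "$INIT" stop || return 1
    sleep "${PAUSE:-5}"
    $SUDO "$INIT" start
}

# Usage (not run automatically):
#   apache_restart && echo "now check ezmlm.squeak.org again"
```

A script cannot read the printed messages and be patient the way a person can, so treat it as a convenience rather than a replacement for looking at the output.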
> If this is still not working then I guess it is time to reboot the
> server. However, I would ask that you really only do this as a last
> resort and that you not do so as a quick decision. First email
> box-admins, feel free to Cc me directly if you want. Wait some time, 30
> minutes, an hour, two hours. You can make your own judgment call on
> that. I have and it's not always consistent.
> The point is that I or someone else may want to look at the situation
> first if only for information gathering purposes. Ultimately restarting
> the server once or twice without waiting for others to chime in is not
> going to get you in any trouble. A history of it is likely to start to
> annoy me, I assume it would annoy others as well. The reasoning is that
> while the system is in its broken state there is the possibility of
> gathering information about the problem that is not recoverable, at
> least not easily, once the system has been rebooted.
> OK but if rebooting is the answer then it is simply
> sudo reboot
> Be aware that the server does tend to take multiple minutes (It has been
> a while, I remember it seems like a long time, I don't remember how long
> it really tends to be) to be responsive again. My habit after I have
> been booted off the server is to
> ping squeak
> And you say 'Huh'?
> Yeah, this is beginning to digress, but I will persist nonetheless. For
> my own convenience some time ago I modified my /etc/hosts file.
> (Clearly we are getting off into the bushes and this is only directly
> applicable if you run Linux and friends locally, if you run MacOSX this
> may still apply to you pretty closely, but I don't know. Those of you
> on Windows: the fundamental facts are all true but the details have been
> changed. Google it.)
> $ cat /etc/hosts
> # utility
> 126.96.36.199 squeak
> 188.8.131.52 box3
> 184.108.40.206 box4
> From the naming pattern you can guess that I started this before box3
> and box4 existed. If you choose to do the same you can use any names
> you like, just don't mask any names used in your local network if there
> are any.
> Back to the point at hand:
> ken at neue:~$ ping squeak
> PING squeak (220.127.116.11) 56(84) bytes of data.
> 64 bytes from squeak (18.104.22.168): icmp_seq=1 ttl=47 time=128 ms
> 64 bytes from squeak (22.214.171.124): icmp_seq=2 ttl=47 time=128 ms
> 64 bytes from squeak (126.96.36.199): icmp_seq=3 ttl=47 time=127 ms
> This is what you want to see. While the server is restarting though,
> and the TCP/IP stack is down, ping is just going to be silent; but it
> will keep trying. So leave this and go back to whatever else you were
> doing. Check it occasionally and unless catastrophe has occurred you
> should in time see the above (of course your ping times will vary).
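If you would rather be told when the server is back than keep glancing at a ping window, something like the following might do. It is a sketch: the host name squeak is the /etc/hosts alias described above, and PROBE is a hypothetical override added for rehearsal and testing.

```shell
#!/bin/sh
# Sketch only: poll a host with single pings until one is answered.
# ASSUMPTION: "ping -c 1" sends one echo request (Linux and BSD ping);
# PROBE is a hypothetical variable that swaps in another check command.

# wait_until_up HOST [INTERVAL_SECONDS]
wait_until_up() {
    host=$1
    interval=${2:-30}
    # PROBE is deliberately unquoted so "ping -c 1" splits into words.
    until ${PROBE:-ping -c 1} "$host" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "$host is answering pings"
}

# Usage (not run automatically):
#   wait_until_up squeak 30
```

As noted above, an answered ping only means the network stack is up; sshd and Apache may need a few more minutes.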
> At this point the server has restarted or is still restarting. It has
> reached the point that the echo service is working. That however does
> not mean everything is working yet, including sshd and Apache.
> Nonetheless you can go ahead and try to ssh to the server or check any
> of the web services. But if they don't work immediately, don't despair:
> it takes minutes for everything to come back up even at the best of times.
> Ultimately everything should start back up as normal. If it doesn't
> then it is time to call for help.
> OK, so end of the 'reboot the server' digression.
> The scenario is now this: source.squeak.org is not working but we have
> checked and we believe that Apache is fine. This tells us that the
> problem is almost certainly isolated to the source.squeak.org Squeak
> process. Probably the thing to do is kill it and let daemontools
> restart it. But don't do that without checking out the image that will
> be used first.
> $ ls -lh ~squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25
> The main thing to look at here is the file size (5th column). The size
> of the file should be in this vicinity; it does grow, but slowly. That
> is, it grows slowly under normal conditions. Sometimes, and this is a
> danger of the easy save-the-world persistence strategy we use, the image
> is saved when something has gone wrong and the heap has grown
> tremendously. I believe I have seen this file saved at between
> 150 and 200MB before. When restarting the image does not work then it is
> nearly always the case that this file is much larger than expected. I
> can't remember any case in which it has restarted properly when the file
> is larger than normal.
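The size check could be scripted along these lines. This is a sketch, and the 100MB default is an assumption picked to sit between the roughly 35MB normal size and the 150 to 200MB bad saves mentioned above; it is not a limit anyone has agreed on.

```shell
#!/bin/sh
# Sketch only: flag an image file that is larger than expected.
# ASSUMPTION: the 100MB default threshold is illustrative.

# image_size_ok FILE [MAX_MB]
# Exit status: 0 if within the limit, 1 if too large, 2 if unreadable.
image_size_ok() {
    file=$1
    max_mb=${2:-100}
    bytes=$(wc -c < "$file") || return 2
    mb=$(( bytes / 1024 / 1024 ))
    if [ "$mb" -le "$max_mb" ]; then
        echo "OK: $file is ${mb}MB"
    else
        echo "SUSPECT: $file is ${mb}MB (limit ${max_mb}MB)"
        return 1
    fi
}

# Usage (not run automatically):
#   image_size_ok ~squeaksource/Squeak3.11-8824-SS.image
```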
> If you forget to look and just kill the process (I will get to that
> shortly) it is not the end of the world. It may just mean that killing
> it and having it restarted does not fix the problem and in my opinion it
> is better to look at the image first and have some confidence that it is
> not corrupted.
> If it is corrupted you can find recent backups of the bulk of the
> filesystem under /var/cache/rsnapshot/. This directory will look like
> $ ls /var/cache/rsnapshot/
> daily.0 daily.1 daily.2 daily.3 daily.4 daily.5 daily.6
> daily.0 is the most recent backup (within the last 24 hours), daily.1
> the next most recent, etc.
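Since the snapshots are numbered, a small loop can show the same file in each of them, newest first. This is a sketch: the path of the image inside each snapshot is not spelled out here, so it is left as an argument (REL_PATH is a placeholder, not a real path).

```shell
#!/bin/sh
# Sketch only: list one file as it appears in each daily.N snapshot.
# ASSUMPTION: snapshots are numbered from daily.0 with no gaps; the
# relative path inside a snapshot is supplied by the caller.

# list_backups SNAPSHOT_ROOT REL_PATH
list_backups() {
    root=$1
    rel=$2
    n=0
    while [ -d "$root/daily.$n" ]; do
        ls -lh "$root/daily.$n/$rel" 2>/dev/null \
            || echo "daily.$n: $rel not found"
        n=$(( n + 1 ))
    done
}

# Usage (not run automatically; REL_PATH is whatever path the image has
# inside a snapshot):
#   list_backups /var/cache/rsnapshot REL_PATH
```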
> What I might do then if I'm looking for a good backup image is this
> ~$ ls -lh
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25
> -rw-r--r-- 1 squeaksource squeaksource 37M Dec 20 17:25
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 19 15:24
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 18 16:24
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 17 12:23
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 16 12:23
> -rw-r--r-- 1 squeaksource squeaksource 35M Dec 15 11:22
> It's not well formatted in this email but I hope you get the idea. If
> possible you want to use the most recent backup which will be found
> within the daily.0 directory. But it is very possible that the backup
> backed it up after it was corrupted, in which case you consider daily.1,
> and so on. In any case once you find a copy that looks like it is
> probably OK then make a backup copy of the corrupted image and changes
> just in case someone wants to take a look at it, then copy over both the
> image and changes from the backup you picked into the squeaksource home
> directory.
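The keep-a-copy-then-restore step might look like the following. It is a sketch under assumptions: the function name and the ".suspect-TIMESTAMP" suffix are invented here, and the snapshot directory is passed in because its exact path depends on which daily.N you picked.

```shell
#!/bin/sh
# Sketch only: keep the suspect image and changes, then copy in the pair
# from a chosen snapshot directory.
# ASSUMPTION: the names below are hypothetical; run it as a user that
# can write to the target home directory.

# restore_image SNAPSHOT_DIR HOME_DIR BASENAME
restore_image() {
    snap=$1
    home=$2
    base=$3
    stamp=$(date +%Y%m%d-%H%M%S)
    # Refuse to start unless the snapshot has both files.
    for ext in image changes; do
        if [ ! -f "$snap/$base.$ext" ]; then
            echo "missing $snap/$base.$ext" >&2
            return 1
        fi
    done
    for ext in image changes; do
        # Keep the possibly corrupted file in case someone wants it.
        if [ -f "$home/$base.$ext" ]; then
            cp -p "$home/$base.$ext" \
                "$home/$base.$ext.suspect-$stamp" || return 1
        fi
        cp -p "$snap/$base.$ext" "$home/$base.$ext" || return 1
    done
}

# Usage (not run automatically; SNAPSHOT_DIR is the directory holding
# the image inside the daily.N you picked):
#   restore_image SNAPSHOT_DIR ~squeaksource Squeak3.11-8824-SS
```

Copying the image and changes as a pair matters, since an image restored without its matching changes file will not line up with its source history.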
> Hopefully now killing any existing process and having daemontools start
> it back up will work. And to do this you first have to identify
> the relevant process. One way is
> $ ps auwx | grep squeaks
> which should produce a list something like
> root 2150 0.0 0.0 1360 268 ? S Nov28 0:00 supervise
> squeaks 2176 3.6 8.0 1051344 77724 ? S Nov28 1256:48
> /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none
> website 30990 25.2 10.8 1051420 105060 ? S 16:59 9:31
> /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
> kencaus 1409 0.0 0.0 1552 524 pts/0 S+ 17:37 0:00 grep squeaks
> The relevant one is the squeakvm process referencing the proper image of
> course, the second one in this list. At which point you would take the
> process ID (second column) and do
> $ kill 2176
> for example. The ID will of course vary. Check again and if the
> process is stuck, the one with the given ID does not disappear from the
> list, then you may have to
> $ kill -9 2176
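The kill, check, then kill -9 escalation can be sketched as one function. The 30 second default wait is an assumption, not a figure from the note, and the function name is made up.

```shell
#!/bin/sh
# Sketch only: ask a process to exit, wait a while, then force it.
# ASSUMPTION: the 30 second default wait is illustrative.

# stop_process PID [TIMEOUT_SECONDS]
stop_process() {
    pid=$1
    timeout=${2:-30}
    if ! kill "$pid" 2>/dev/null; then
        echo "no such process: $pid"
        return 1
    fi
    waited=0
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still running after ${timeout}s, sending KILL"
            kill -9 "$pid"
            return 0
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    echo "process $pid exited"
}

# Usage (not run automatically; the PID will of course vary):
#   stop_process 2176
```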
> In any case if you repeatedly look at the relevant list of running
> processes you should soon see another ... squeakvm ...
> Squeak3.11-8824-SS.image process running with a new process ID and
> hopefully if you check http://source.squeak.org you will get what you
> expect.
> Let me note that the filtered list of processes above also includes, on
> the first line, the daemontools process that 'supervises' the
> source.squeak.org service. Note that if you don't see this in the list
> then there is probably a problem with daemontools itself and in any case
> when you kill the process I don't expect that a new one will be started.
> However I'm going to draw this to a close here and leave that for
> another time.
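Before killing anything, it may be worth confirming that the supervise process is actually there. A sketch, assuming the `ps auwx` column layout shown above, where the command name is the 11th field; the function name is made up.

```shell
#!/bin/sh
# Sketch only: read "ps auwx" style output on stdin and succeed if a
# daemontools supervise process is listed.
# ASSUMPTION: the command name is field 11, as in the listing above.
has_supervisor() {
    awk '$11 == "supervise" { found = 1 } END { exit !found }'
}

# Usage (not run automatically):
#   ps auwx | has_supervisor || echo "supervise is not running"
```

Matching on the command field rather than grepping the whole line avoids the false hit from the grep process itself that shows up in the listing above.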