[Box-Admins] Source.Squeak.org how-to-fix guidelines

Sat Dec 21 23:23:00 UTC 2013

Thanks for providing this Ken.

I have read through this quickly and will read more carefully again later.
You quite rightly raised the question of what we as box-admins should
do to document our processes, and in that light I think it would be very
helpful if you can save the information in this note in a README file
(or README-box-admins or something of that sort). I guess in this case
the README might be located in ~website on box2.

I know this is not a very high-tech solution, but pretty much everybody
understands the convention, regardless of their computing background or
level of experience. So I think this might be a small thing that would
really help. The same applies to build.squeak.org or anything else that
one of us sets up on the servers - if we leave some sort of README
breadcrumbs behind, it can be a big help to whoever needs to figure out
how to fix a problem under pressure when the original expert is not
available.

Regarding the current squeaksource.com problem:

The problem with our squeaksource.com web page rendering is internal to
the squeaksource image itself. I get the same symptoms running the image
on my PC at home, and have no problems with a two-week-old backup copy
of the same image.

I just posted a question to the seaside and squeak-dev lists asking if
anyone has seen this kind of problem before.

For now, the SSC repository is fine, but the web interface is a mess.
If I cannot figure out the problem by tomorrow, I will restart from the
two-week-old backup image and tidy up whatever problems might arise from
there.

Dave

On Sat, Dec 21, 2013 at 11:44:57AM -0600, Ken Causey wrote:
> I wish I could provide a TLDR, but really I ask you to try to persist in 
> reading through this.  I think, more than I originally expected when I 
> started to write this, that it contains a fair amount of my philosophy 
> regarding the maintenance of the Squeak servers and services.  It 
> certainly turned out much longer than I expected and could use an editor.
> 
> Back to the original email:
> 
> I mentioned in my recent email about an issue with squeaksource.com that 
> it helps if those who know give guidance to everyone else about easy 
> things that can be done to fix issues and how to make decisions about 
> what to do.
> 
> In that vein I will do my best to provide the same for 
> source.squeak.org.  Presumably much of this is also true for 
> squeaksource.com but I'm not going to assume it.
> 
> So the scenario is this: source.squeak.org is unresponsive or the page 
> does not render in full; the first is much more common.
> 
> First, because it has happened before, you need to ensure the problem is 
> not at a higher level, that is, that it is not a problem with Apache on 
> the box2 server.  The easiest way to do this is to check several of the 
> other services that run on box2 and see whether they are working.
> 
> Here is a list (probably not exhaustive):
> 
> bugs.squeak.org (Apache/FastCGI PHP)
> www.squeak.org (Apache/AIDA Squeak)
> lists.squeakfoundation.org (Apache/C & Python CGI)
> ezmlm.squeak.org (Apache, defunct but still exists)
> 
> The last two are notable in that they rely on very little outside of 
> Apache to work.  If Apache is working but those services are not working 
> then the server is in bad shape indeed.
> 
> So in brief if you go to ezmlm.squeak.org and get a page that says:
> 
> Some old Mailing Lists
> 
> Click Here for List Archives and Information
> 
> and you can click the second line and get a list of defunct/dead mailing 
> lists then Apache is probably not the source of the problem.
> 
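A minimal sketch of that check, in case it is useful for a README. The
helper names, the http:// scheme and the 10 second limit are assumptions,
not part of Ken's note; the only thing taken from the note is the heading
text the page is expected to contain.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the page text on stdin contains the
# heading quoted above for ezmlm.squeak.org.
is_ezmlm_page() {
    grep -q 'Some old Mailing Lists'
}

# Hypothetical helper: fetch the page (assumed URL, 10 second limit)
# and check it. A failure here suggests Apache itself is in trouble.
apache_looks_ok() {
    curl -fsS --max-time 10 http://ezmlm.squeak.org/ | is_ezmlm_page
}
```

Usage would be something like "apache_looks_ok || echo 'check Apache'".
Clicking through to the archive list, as described above, is still a
manual step.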
> If Apache is not working then the thing to do is:
> 
> sudo /etc/init.d/apache2 restart
> 
> If this gives a problem, or just doesn't work (be patient) then
> 
> sudo /etc/init.d/apache2 stop
> 
> Note the message printed and wait a moment.  If it appears to have 
> stopped fine then
> 
> sudo /etc/init.d/apache2 start
> 
> Again wait a moment, and then check ezmlm.squeak.org again.
> 
> If this is still not working then I guess it is time to reboot the 
> server.  However, I would ask that you really only do this as a last 
> resort and that you not do so as a quick decision.  First email 
> box-admins, feel free to Cc me directly if you want.  Wait some time, 30 
> minutes, an hour, two hours.  You can make your own judgment call on 
> that.  I have and it's not always consistent.
> 
> The point is that I or someone else may want to look at the situation 
> first if only for information gathering purposes.  Ultimately restarting 
> the server once or twice without waiting for others to chime in is not 
> going to get you in any trouble.  A history of it is likely to start to 
> annoy me, I assume it would annoy others as well.  The reasoning is that 
> while the system is in its broken state there is the possibility of 
> gathering information about the problem that is not recoverable, at 
> least not easily, once the system has been rebooted.
> 
> OK but if rebooting is the answer then it is simply
> 
> sudo reboot
> 
> Be aware that the server tends to take several minutes to be responsive 
> again.  (It has been a while; I remember it seeming like a long time, 
> but I don't remember how long it really tends to be.)  My habit after I 
> have been booted off the server is to
> 
> ping squeak
> 
> And you say 'Huh'?
> 
> Yeah, this is beginning to digress, but I will persist nonetheless.  For 
> my own convenience some time ago I modified my /etc/hosts file. 
> (Clearly we are getting off into the bushes and this is only directly 
> applicable if you run Linux and friends locally.  If you run Mac OS X 
> this may still apply to you pretty closely, but I don't know.  Those of you 
> on Windows: the fundamental facts are all true but the details have been 
> changed.  Google it.)
> 
> $ cat /etc/hosts
> <snip>
> # utility
> 85.10.195.197	squeak
> 173.246.101.237 box3
> 173.246.104.42  box4
> <snip>
> 
> From the naming pattern you can guess that I started this before box3 
> and box4 existed.  If you choose to do the same you can use any names 
> you like, just don't mask any names used in your local network if there 
> are any.
> 
> Back to the point at hand:
> 
> ken@neue:~$ ping squeak
> PING squeak (85.10.195.197) 56(84) bytes of data.
> 64 bytes from squeak (85.10.195.197): icmp_seq=1 ttl=47 time=128 ms
> 64 bytes from squeak (85.10.195.197): icmp_seq=2 ttl=47 time=128 ms
> 64 bytes from squeak (85.10.195.197): icmp_seq=3 ttl=47 time=127 ms
> 
> This is what you want to see.  While the server is restarting though, 
> and the TCP/IP stack is down, ping is just going to be silent; but it 
> will keep trying.  So leave this and go back to whatever else you were 
> doing.  Check it occasionally and unless catastrophe has occurred you 
> should in time see the above (of course your ping times will vary).
> 
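The "leave it and check occasionally" step can be sketched as a small
retry loop. The function name, the retry count and the delay are
assumptions for illustration; "squeak" is the /etc/hosts alias shown
above, so substitute whatever name or address you use.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to TRIES times, sleeping DELAY
# seconds between attempts; succeeds as soon as the command succeeds.
retry() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example: send one probe per attempt, 60 attempts, 10 seconds apart.
# retry 60 10 ping -c 1 squeak && echo "server answers ping again"
```

A successful ping only means the network stack is back; sshd and Apache
may need a few more minutes.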
> At this point then the server has restarted or is restarting.  It has 
> reached the point that it answers ping (ICMP echo) requests.  That 
> however does not mean everything is working yet, including sshd and 
> Apache.  Nonetheless you can go ahead and try to ssh to the server or 
> check any of the web services.  But if they don't immediately work, 
> don't despair; it takes minutes for everything to come back up at the 
> best of times.  Ultimately everything should start back up as normal.  
> If it doesn't then it is time to call for help.
> 
> OK, so end of the 'reboot the server' digression.
> 
> The scenario is now this:  source.squeak.org is not working but we have 
> checked and we believe that Apache is fine.  This tells us that the 
> problem is almost certainly isolated to the source.squeak.org Squeak 
> process.  Probably the thing to do is kill it and let daemontools 
> restart it.  But don't do that without checking out the image that will 
> be used first.
> 
> $ ls -lh ~squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 21 16:25 
> /home/squeaksource/Squeak3.11-8824-SS.image
> 
> The main thing to look at here is the file size (5th column).  The size 
> of the file should be in this vicinity; it does grow, but slowly.  That 
> is, it grows slowly under normal conditions.  Sometimes, and this is a 
> danger of the easy save-the-world persistence strategy we use, the image 
> is saved after something has gone wrong and the heap has grown 
> tremendously.  I believe I have seen this file saved at between 
> 150 and 200MB before.  When restarting the image does not work, it is 
> nearly always the case that this file is much larger than expected.  I 
> can't remember any case in which it has restarted properly when the file 
> is larger than normal.
> 
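A rough sketch of that size check as a script. The function name and the
60 MB limit in the example are assumptions; the note only says the image
is normally around 35M and has been seen at 150 to 200MB when things go
wrong, so choose a limit that suits you.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if FILE exists and is no larger than
# MAX_MB megabytes.
image_size_ok() {
    file=$1
    max_mb=$2
    [ -f "$file" ] || return 1
    # wc -c prints the size in bytes; the arithmetic expansion strips
    # any padding some wc implementations add.
    bytes=$(( $(wc -c < "$file") ))
    [ "$bytes" -le $((max_mb * 1024 * 1024)) ]
}

# Example, with 60 MB as an assumed limit:
# image_size_ok ~squeaksource/Squeak3.11-8824-SS.image 60 ||
#     echo "image is missing or larger than expected"
```

This only automates the eyeball test; a normal size is a good sign, not
proof that the image is healthy.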
> If you forget to look and just kill the process (I will get to that 
> shortly) it is not the end of the world.  It may just mean that killing 
> it and having it restarted does not fix the problem.  Still, in my 
> opinion it is better to look at the image first and have some confidence 
> that it is not corrupted.
> 
> If it is corrupted you can find recent backups of the bulk of the 
> filesystem under /var/cache/rsnapshot/.  This directory will look like
> 
> $ ls /var/cache/rsnapshot/
> daily.0  daily.1  daily.2  daily.3  daily.4  daily.5  daily.6
> 
> daily.0 is the most recent backup (within the last 24 hours), daily.1 
> the next most recent, etc.
> 
> What I might do then if I'm looking for a good backup image is this
> 
> ~$ ls -lh 
> /var/cache/rsnapshot/*/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 21 16:25 
> /var/cache/rsnapshot/daily.0/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 37M Dec 20 17:25 
> /var/cache/rsnapshot/daily.1/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 19 15:24 
> /var/cache/rsnapshot/daily.2/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 18 16:24 
> /var/cache/rsnapshot/daily.3/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 17 12:23 
> /var/cache/rsnapshot/daily.4/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 16 12:23 
> /var/cache/rsnapshot/daily.5/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 15 11:22 
> /var/cache/rsnapshot/daily.6/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> 
> It's not well formatted in this email but I hope you get the idea.  If 
> possible you want to use the most recent backup which will be found 
> within the daily.0 directory.  But it is very possible that the backup 
> backed it up after it was corrupted, in which case you consider daily.1, 
> and so on.  In any case once you find a copy that looks like it is 
> probably OK then make a backup copy of the corrupted image and changes 
> just in case someone wants to take a look at it, then copy over both the 
> image and changes from the backup you picked into the squeaksource home 
> directory.
> 
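Picking the newest backup that still looks sane could be sketched like
this. The function name and the size limit are assumptions, and file size
is only the heuristic described above, not a real corruption test. The
script prints a candidate path and copies nothing; saving the suspect
image and changes and copying the backup pair into place stay manual.

```shell
#!/bin/sh
# Hypothetical helper: print the newest snapshot copy of the image that
# is no larger than MAX_MB megabytes, trying daily.0, daily.1, ... in
# order. ROOT is the snapshot directory, REL the image path inside each
# snapshot.
newest_sane_backup() {
    root=$1
    rel=$2
    max_mb=$3
    n=0
    while [ -d "$root/daily.$n" ]; do
        candidate="$root/daily.$n/$rel"
        if [ -f "$candidate" ] &&
           [ $(( $(wc -c < "$candidate") )) -le $((max_mb * 1024 * 1024)) ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Example (60 MB is an assumed limit):
# newest_sane_backup /var/cache/rsnapshot \
#     localhost/home/squeaksource/Squeak3.11-8824-SS.image 60
```

Remember that the matching changes file has to come from the same daily.N
directory as the image.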
> Hopefully now killing any existing process and having daemontools start 
> it back up will work.  To do this you first have to identify the 
> relevant process.  One way is
> 
> $ ps auwx | grep squeaks
> 
> which should produce a list something like
> 
> root      2150  0.0  0.0  1360  268 ?        S    Nov28   0:00 supervise 
> squeaksource
> squeaks   2176  3.6  8.0 1051344 77724 ?     S    Nov28 1256:48 
> /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none 
> /home/squeaksource/Squeak3.11-8824-SS.image
> website  30990 25.2 10.8 1051420 105060 ?    S    16:59   9:31 
> /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
> kencaus   1409  0.0  0.0  1552  524 pts/0    S+   17:37   0:00 grep squeaks
> 
> The relevant one is the squeakvm process referencing the proper image of 
> course, the second one in this list.  You would then take the process ID 
> (second column) and do
> 
> $ kill 2176
> 
> for example.  The ID will of course vary.  Check again, and if the 
> process is stuck (the one with the given ID does not disappear from the 
> list) then you may have to
> 
> $ kill -9 2176
> 
> In any case if you repeatedly look at the relevant list of running 
> processes you should soon see another ... squeakvm ... 
> Squeak3.11-8824-SS.image process running with a new process ID and 
> hopefully if you check http://source.squeak.org you will get what you 
> expect.
> 
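The find-then-kill steps can be sketched as two small functions. The
names and the 10 second grace period are assumptions. The first function
expects real ps output, where each process is on one line (the listing
above is wrapped by email), and the second must be run by a user who is
allowed to signal the process.

```shell
#!/bin/sh
# Hypothetical helper: read `ps auwx` output on stdin and print the PID
# (second column) of each squeakvm process whose command line names IMAGE.
squeakvm_pids() {
    awk -v image="$1" '/squeakvm/ && index($0, image) { print $2 }'
}

# Hypothetical helper: send TERM, wait up to 10 seconds for the process
# to go away, then fall back to KILL (kill -9) if it is still there.
stop_pid() {
    kill "$1" 2>/dev/null || return 1
    i=0
    while kill -0 "$1" 2>/dev/null && [ "$i" -lt 10 ]; do
        sleep 1
        i=$((i + 1))
    done
    if kill -0 "$1" 2>/dev/null; then
        kill -9 "$1"
    fi
}

# Example:
# pid=$(ps auwx | squeakvm_pids /home/squeaksource/Squeak3.11-8824-SS.image)
# [ -n "$pid" ] && stop_pid "$pid"
```

Matching on the full image path keeps the www.squeak.org squeakvm process
(squeaksite.image) out of the result.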
> Let me note that the filtered list of processes above also includes, on 
> the first line, the daemontools process that 'supervises' the 
> source.squeak.org service.  If you don't see this in the list then 
> there is probably a problem with daemontools itself, and in any case I 
> don't expect a new process to be started when you kill the old one.
> 
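Checking for the supervisor before killing anything could be sketched as
below. The function name is an assumption, and the pattern is taken only
from the "supervise squeaksource" line in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: read `ps auwx` output on stdin; succeeds if a
# line ends with "supervise squeaksource", as in the listing above.
supervisor_present() {
    grep -Eq 'supervise squeaksource[[:space:]]*$'
}

# Example:
# ps auwx | supervisor_present ||
#     echo "no supervise process; a killed image will not be restarted"
```

Anchoring the pattern at the end of the line keeps the grep process
itself from counting as a match.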
> However I'm going to draw this to a close here and leave that for 
> another time.
> 
> Ken