[Box-Admins] Source.Squeak.org how-to-fix guidelines

Sat Dec 21 17:52:25 UTC 2013

See an 'edit' below:

On 12/21/2013 11:44 AM, Ken Causey wrote:
> I wish I could provide a TLDR, but really I ask you to try to persist in
> reading through this.  I think, more than I originally expected when I
> started to write this, that it contains a fair amount of my philosophy
> regarding the maintenance of the Squeak servers and services.  It
> certainly turned out much longer than I expected and could use an editor.
>
> Back to the original email:
>
> I mentioned in my recent email about an issue with squeaksource.com that
> it helps if those who know give guidance to everyone else about easy
> things that can be done to fix issues and how to make decisions about
> what to do.
>
> In that vein I will do my best to provide the same for
> source.squeak.org.  Presumably much of this is also true for
> squeaksource.com but I'm not going to assume it.
>
> So the scenario is this: source.squeak.org is unresponsive or the page
> does not render in full; the first is much more common.
>
> First, because it has happened before, you need to make sure the
> problem is not further up the chain, that is, that it is not a problem
> with Apache on the box2 server.  The easiest way to do this is to check
> any other service, or better most of the other services, that run on
> box2 and see whether they are working.
>
> Here is a list (probably not exhaustive):
>
> bugs.squeak.org (Apache/FastCGI PHP)
> www.squeak.org (Apache/AIDA Squeak)
> lists.squeakfoundation.org (Apache/C & Python CGI)
> ezmlm.squeak.org (Apache, defunct but still exists)
>
> The last two are notable in that they rely on very little outside of
> Apache to work.  If Apache is working but those services are not working
> then the server is in bad shape indeed.
>
> So in brief if you go to ezmlm.squeak.org and get a page that says:
>
> Some old Mailing Lists
>
> Click Here for List Archives and Information
>
> and you can click the second line and get a list of defunct/dead mailing
> lists then Apache is probably not the source of the problem.
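
The ezmlm check above can be scripted.  This is a rough sketch only, not
something installed on box2.  The function name apache_looks_ok is made
up for illustration, and the usage line assumes curl is available and
that the site answers on plain http.  The function reads a page body on
standard input and succeeds only if the heading quoted above is present.

```shell
# Hypothetical helper: succeeds if the page body on stdin contains the
# heading that ezmlm.squeak.org is described as serving.
apache_looks_ok() {
    grep -q 'Some old Mailing Lists'
}

# Usage (needs network access and curl, both assumptions):
#   curl -fsS --max-time 20 http://ezmlm.squeak.org/ | apache_looks_ok \
#       && echo "Apache is probably fine" \
#       || echo "Apache may be the problem"
```

This only covers the first half of the manual check.  Clicking through
to the list of defunct mailing lists is still worth doing by hand.
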
>
> If Apache is not working then the thing to do is:
>
> sudo /etc/init.d/apache2 restart
>
> If this gives a problem, or just doesn't work (be patient) then
>
> sudo /etc/init.d/apache2 stop
>
> Note the message printed and wait a moment.  If it appears to have
> stopped fine then
>
> sudo /etc/init.d/apache2 start
>
> Again wait a moment, and then check ezmlm.squeak.org again.
>
> If this is still not working then I guess it is time to reboot the
> server.  However, I would ask that you really only do this as a last
> resort and that you not do so as a quick decision.  First email
> box-admins, feel free to Cc me directly if you want.  Wait some time, 30
> minutes, an hour, two hours.  You can make your own judgment call on
> that.  I have and it's not always consistent.
>
> The point is that I or someone else may want to look at the situation
> first if only for information gathering purposes.  Ultimately restarting
> the server once or twice without waiting for others to chime in is not
> going to get you in any trouble.  A history of it is likely to start to
> annoy me, I assume it would annoy others as well.  The reasoning is that
> while the system is in its broken state there is the possibility of
> gathering information about the problem that is not recoverable, at
> least not easily, once the system has been rebooted.
>
> OK but if rebooting is the answer then it is simply
>
> sudo reboot
>
> Be aware that the server tends to take multiple minutes to become
> responsive again.  (It has been a while; I remember it seeming like a
> long time but not how long it really tends to be.)  My habit after I
> have been booted off the server is to
>
> ping squeak
>
> And you say 'Huh'?
>
> Yeah, this is beginning to digress, but I will persist nonetheless.  For
> my own convenience some time ago I modified my /etc/hosts file. (Clearly
> we are getting off into the bushes and this is only directly applicable
> if you run Linux and friends locally, if you run MacOSX this may still
> apply to you pretty closely, but I don't know.  Those of you on Windows:
> the fundamental facts are all true but the details have been changed.
> Google it.)
>
> $ cat /etc/hosts
> <snip>
> # utility
> 85.10.195.197    squeak
> 173.246.101.237 box3
> 173.246.104.42  box4
> <snip>
>
> From the naming pattern you can guess that I started this before box3
> and box4 existed.  If you choose to do the same you can use any names
> you like, just don't mask any names used in your local network if there
> are any.
>
> Back to the point at hand:
>
> ken@neue:~$ ping squeak
> PING squeak (85.10.195.197) 56(84) bytes of data.
> 64 bytes from squeak (85.10.195.197): icmp_seq=1 ttl=47 time=128 ms
> 64 bytes from squeak (85.10.195.197): icmp_seq=2 ttl=47 time=128 ms
> 64 bytes from squeak (85.10.195.197): icmp_seq=3 ttl=47 time=127 ms
>
> This is what you want to see.  While the server is restarting though,
> and the TCP/IP stack is down, ping is just going to be silent; but it
> will keep trying.  So leave this and go back to whatever else you were
> doing.  Check it occasionally and unless catastrophe has occurred you
> should in time see the above (of course your ping times will vary).
>
> At this point the server has restarted or is still restarting.  It has
> reached the point that it answers ping (ICMP echo) requests.  That does
> not mean that everything is working yet, and that includes sshd and
> apache.  Nonetheless you can go ahead and try to ssh to the server or
> check any of the web services.  But if they don't immediately work,
> don't despair; it takes minutes for everything to come back up at the
> best of times.
> Ultimately everything should start back up as normal.  If it doesn't
> then it is time to call for help.
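
If you would rather not keep glancing at a ping window, the waiting can
be wrapped in a small loop.  This is a sketch under assumptions: the
name wait_until_up is invented here, and the attempt count and delay in
the usage line are arbitrary.  It retries whatever probe command you
hand it until that command succeeds or the attempts run out.

```shell
# Hypothetical helper: run a probe command until it succeeds or the
# attempts run out.  Arguments: attempts, delay in seconds, then the
# probe command and its arguments.
wait_until_up() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Only pause if another attempt is still to come.
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage, assuming the 'squeak' alias from the /etc/hosts example:
#   wait_until_up 60 10 ping -c 1 squeak && echo "answering pings again"
```

The same function can be pointed at ssh or a web check afterwards, since
answering pings does not mean the other services are back.
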
>
> OK, so end of the 'reboot the server' digression.
>
> The scenario is now this:  source.squeak.org is not working but we have
> checked and we believe that Apache is fine.  This tells us that the
> problem is almost certainly isolated to the source.squeak.org Squeak
> process.  Probably the thing to do is kill it and let daemontools
> restart it.  But don't do that without checking out the image that will
> be used first.
>
> $ ls -lh ~squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 21 16:25
> /home/squeaksource/Squeak3.11-8824-SS.image
>
> The main thing to look at here is the file size (5th column).  The size
> of the file should be in this vicinity; under normal conditions it does
> grow, but slowly.  Sometimes, though, the image is saved after something
> has gone wrong and the heap has grown tremendously.  This is a danger
> of the easy save-the-world persistence strategy we use.  I believe I
> have seen this file saved at between 150 and 200MB before.  When
> restarting the image does not work it is nearly always the case that
> this file is much larger than expected.  I can't remember any case in
> which it has restarted properly when the file is larger than normal.
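
The size check can be expressed as a small test.  This is a sketch, not
an established tool.  The name image_size_ok is hypothetical, and the
60 MB threshold in the usage line is an arbitrary example; nothing in
the setup defines a limit, so pick one well above the usual 35M or so
and well below the 150-200MB seen when things have gone wrong.

```shell
# Hypothetical helper: succeeds if the file is no larger than a limit
# given in MB.  Returns 2 if the file cannot be read at all.
image_size_ok() {
    file=$1
    limit_mb=$2
    [ -r "$file" ] || return 2
    bytes=$(wc -c < "$file")
    # $((bytes)) also strips the padding some wc implementations print.
    [ $((bytes)) -le $((limit_mb * 1024 * 1024)) ]
}

# Usage on box2, with 60 MB as an arbitrary example threshold:
#   image_size_ok ~squeaksource/Squeak3.11-8824-SS.image 60 \
#       || echo "image is suspiciously large, look at the backups first"
```
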
>
> If you forget to look and just kill the process (I will get to that
> shortly) it is not the end of the world.  It may just mean that killing
> it and having it restarted does not fix the problem.  Still, in my
> opinion it is better to look at the image first and have some
> confidence that it is not corrupted.
>
> If it is corrupted you can find recent backups of the bulk of the
> filesystem under /var/cache/rsnapshot/.  This directory will look like
>
> $ ls /var/cache/rsnapshot/
> daily.0  daily.1  daily.2  daily.3  daily.4  daily.5  daily.6
>
> daily.0 is the most recent backup (within the last 24 hours), daily.1
> the next most recent, etc.
>
> What I might do then if I'm looking for a good backup image is this
>
> ~$ ls -lh
> /var/cache/rsnapshot/*/localhost/home/squeaksource/Squeak3.11-8824-SS.image
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 21 16:25
> /var/cache/rsnapshot/daily.0/localhost/home/squeaksource/Squeak3.11-8824-SS.image
>
> -rw-r--r--  1 squeaksource squeaksource 37M Dec 20 17:25
> /var/cache/rsnapshot/daily.1/localhost/home/squeaksource/Squeak3.11-8824-SS.image
>
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 19 15:24
> /var/cache/rsnapshot/daily.2/localhost/home/squeaksource/Squeak3.11-8824-SS.image
>
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 18 16:24
> /var/cache/rsnapshot/daily.3/localhost/home/squeaksource/Squeak3.11-8824-SS.image
>
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 17 12:23
> /var/cache/rsnapshot/daily.4/localhost/home/squeaksource/Squeak3.11-8824-SS.image
>
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 16 12:23
> /var/cache/rsnapshot/daily.5/localhost/home/squeaksource/Squeak3.11-8824-SS.image
>
> -rw-r--r--  1 squeaksource squeaksource 35M Dec 15 11:22
> /var/cache/rsnapshot/daily.6/localhost/home/squeaksource/Squeak3.11-8824-SS.image
>
>
> It's not well formatted in this email but I hope you get the idea.  If
> possible you want to use the most recent backup which will be found
> within the daily.0 directory.  But it is very possible that the backup
> backed it up after it was corrupted, in which case you consider daily.1,
> and so on.  In any case once you find a copy that looks like it is
> probably OK then make a backup copy of the corrupted image and changes
> just in case someone wants to take a look at it, then copy over both the
> image and changes from the backup you picked into the squeaksource home
> directory.
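
The search through daily.0, daily.1 and so on can be sketched as a loop.
Treat this as an illustration under assumptions: the function name
newest_sane_backup and the size threshold are invented here, and file
size is only the rough heuristic described above, not proof that an
image is good.  The function only prints a candidate path; it copies
nothing.

```shell
# Hypothetical helper: print the newest rsnapshot copy of the image
# whose size is at or below a limit in MB.  Arguments: snapshot root,
# path of the image inside each snapshot, limit in MB.
newest_sane_backup() {
    snap_root=$1
    rel_path=$2
    limit_bytes=$(($3 * 1024 * 1024))
    n=0
    # daily.0 is the most recent snapshot, so walk upwards from 0.
    while [ -d "$snap_root/daily.$n" ]; do
        candidate="$snap_root/daily.$n/$rel_path"
        if [ -r "$candidate" ]; then
            bytes=$(wc -c < "$candidate")
            if [ $((bytes)) -le "$limit_bytes" ]; then
                printf '%s\n' "$candidate"
                return 0
            fi
        fi
        n=$((n + 1))
    done
    return 1
}

# Usage on box2, with 60 MB as an arbitrary example threshold:
#   newest_sane_backup /var/cache/rsnapshot \
#       localhost/home/squeaksource/Squeak3.11-8824-SS.image 60
```

Whichever snapshot is chosen, the changes file should come from that
same snapshot, and the suspect image and changes should be set aside
first as described above.
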
>
> Hopefully now killing any existing process and having daemontools start
> it back up will work.  To do this you first have to identify the
> relevant process.  One way is

David, in an email that came in just after I hit send on this one, 
reminded me that there is a simpler way (than that shown below).  The 
quick way to restart any daemontools monitored service is

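$ sudo svc -t /service/<service name>

in this case

$ sudo svc -t /service/squeaksource

(svc takes the service directory as its argument, so the bare name only 
works if you are already in /service.)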

Don't remember the service name?

$ ls -l /service/ | grep squeaksource
lrwxrwxrwx  1 root root 26 Oct 10  2006 squeaksource -> 
/home/squeaksource/service

I specified squeaksource above because that is the name of the user/home 
directory under which the service info resides for the service in 
question.  Of course if you are trying to restart a different service 
then substitute the other username as appropriate.

The info below is still of some value if svc -t does not seem to be 
working.  98% of the time though, I expect it will.
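
One way to confirm that svc -t actually took effect is to compare the 
pid that svstat (also part of daemontools) reports before and after.  
This is a sketch only: the helper name svstat_pid is made up, and it 
assumes the usual "up (pid N) S seconds" wording of svstat output.

```shell
# Hypothetical helper: read svstat output on stdin and print the pid if
# the service is reported as up.  Fails if no pid is found.
svstat_pid() {
    sed -n 's/^.*: up (pid \([0-9][0-9]*\)).*$/\1/p' | grep .
}

# Usage on box2 (the 5 second pause is an arbitrary choice):
#   before=$(sudo svstat /service/squeaksource | svstat_pid)
#   sudo svc -t /service/squeaksource
#   sleep 5
#   after=$(sudo svstat /service/squeaksource | svstat_pid)
#   [ -n "$after" ] && [ "$after" != "$before" ] \
#       && echo "restarted as pid $after"
```
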

>
> $ ps auwx | grep squeaks
>
> which should produce a list something like
>
> root      2150  0.0  0.0  1360  268 ?        S    Nov28   0:00 supervise
> squeaksource
> squeaks   2176  3.6  8.0 1051344 77724 ?     S    Nov28 1256:48
> /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none
> /home/squeaksource/Squeak3.11-8824-SS.image
> website  30990 25.2 10.8 1051420 105060 ?    S    16:59   9:31
> /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
> kencaus   1409  0.0  0.0  1552  524 pts/0    S+   17:37   0:00 grep squeaks
>
> The relevant one is the squeakvm process referencing the proper image of
> course, the second one in this list.  At which point you would take the
> process ID (second column) and do
>
> $ kill 2176
>
> for example.  The ID will of course vary.  Check again, and if the
> process is stuck (the one with the given ID does not disappear from
> the list) then you may have to
>
> $ kill -9 2176
>
> In any case if you repeatedly look at the relevant list of running
> processes you should soon see another ... squeakvm ...
> Squeak3.11-8824-SS.image process running with a new process ID and
> hopefully if you check http://source.squeak.org you will get what you
> expect.
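
Picking the right line out of the ps listing by eye is easy to get
wrong when several squeakvm processes are running.  The filter below is
a sketch with an invented name, squeakvm_pids.  It assumes ps auwx
style output with the pid in the second column, and prints the pid of
each squeakvm process that references the given image.

```shell
# Hypothetical helper: read 'ps auwx' style lines on stdin and print the
# pid (second column) of each squeakvm process running the given image.
squeakvm_pids() {
    image=$1
    awk -v image="$image" '
        # Skip this awk process itself, whose arguments also match.
        index($0, "squeakvm") && index($0, image) && !index($0, "awk") {
            print $2
        }
    '
}

# Usage on box2:
#   ps auwx | squeakvm_pids Squeak3.11-8824-SS.image
# then kill the pid it prints, and fall back to kill -9 only if that
# pid is still listed afterwards.
```
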
>
> Let me note that the filtered list of processes above also includes, on
> the first line, the daemontools process that 'supervises' the
> source.squeak.org service.  Note that if you don't see this in the list
> then there is probably a problem with daemontools itself and in any case
> when you kill the process I don't expect that a new one will be started.
>
> However I'm going to draw this to a close here and leave that for
> another time.
>
> Ken
>
>