I wish I could provide a TLDR, but really I ask you to try to persist in reading through this. I think, more than I originally expected when I started to write this, that it contains a fair amount of my philosophy regarding the maintenance of the Squeak servers and services. It certainly turned out much longer than I expected and could use an editor.
Back to the original email:
I mentioned in my recent email about an issue with squeaksource.com that it helps if those who know give guidance to everyone else about easy things that can be done to fix issues and how to make decisions about what to do.
In that vein I will do my best to provide the same for source.squeak.org. Presumably much of this is also true for squeaksource.com but I'm not going to assume it.
So the scenario is this: source.squeak.org is unresponsive or the page does not render in full; the first is much more common.
First, because it has happened before, you need to ensure the problem is not at a higher/later level, that is that it is not a problem with Apache on the box2 server. The easiest way to do this is to check any other service, perhaps most other services, that run on box2 and see if they are working OK or not.
Here is a list (probably not exhaustive):
bugs.squeak.org (Apache/FastCGI PHP)
www.squeak.org (Apache/AIDA Squeak)
lists.squeakfoundation.org (Apache/C & Python CGI)
ezmlm.squeak.org (Apache, defunct but still exists)
The last two are notable in that they rely on very little outside of Apache to work. If Apache is working but those services are not working then the server is in bad shape indeed.
So in brief if you go to ezmlm.squeak.org and get a page that says:
Some old Mailing Lists
Click Here for List Archives and Information
and you can click the second line and get a list of defunct/dead mailing lists then Apache is probably not the source of the problem.
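The check described above can also be done from a terminal. What follows is only a rough sketch, assuming curl is installed locally and that the sites answer over plain http; the function names, the 10 second timeout and the choice to count only 2xx/3xx answers as working are my own assumptions, not an established procedure.

```shell
# Hostnames are the box2 services listed above.
BOX2_SITES="bugs.squeak.org www.squeak.org lists.squeakfoundation.org ezmlm.squeak.org"

# Print the HTTP status code for a host; curl prints 000 if it could not connect.
probe() {
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' "http://$1/"
}

# Probe every site and print one line per host.
# Returns 0 if at least one site answered with a 2xx or 3xx status,
# which suggests Apache itself is alive; returns 1 otherwise.
check_box2() {
    alive=1
    for site in $BOX2_SITES; do
        code=$(probe "$site")
        case "$code" in
            2??|3??) echo "OK   $site ($code)"; alive=0 ;;
            *)       echo "FAIL $site ($code)" ;;
        esac
    done
    return $alive
}
```

Running check_box2 prints one line per site. If ezmlm.squeak.org shows OK while source.squeak.org is still broken, Apache is probably not the culprit.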
If Apache is not working then the thing to do is:
sudo /etc/init.d/apache2 restart
If this gives a problem, or just doesn't work (be patient) then
sudo /etc/init.d/apache2 stop
Note the message printed and wait a moment. If it appears to have stopped fine then
sudo /etc/init.d/apache2 start
Again wait a moment, and then check ezmlm.squeak.org again.
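The restart-then-stop/start sequence above can be wrapped in a small function. This is a sketch only: APACHE_INIT, SUDO and PAUSE are hypothetical knobs I made up so the function can be tried without touching the real init script, and the 5 second pause is an arbitrary stand-in for "wait a moment".

```shell
# Hypothetical knobs (not part of the original procedure):
#   APACHE_INIT  path of the init script
#   SUDO         command used to gain root; set to empty to run directly
#   PAUSE        seconds to wait between stop and start
APACHE_INIT="${APACHE_INIT:-/etc/init.d/apache2}"
SUDO="${SUDO-sudo}"

restart_apache() {
    # First try a plain restart.
    if $SUDO "$APACHE_INIT" restart; then
        return 0
    fi
    echo "restart failed; trying stop, then start" >&2
    # Fall back to an explicit stop, a short pause, then start.
    $SUDO "$APACHE_INIT" stop || return 1
    sleep "${PAUSE:-5}"
    $SUDO "$APACHE_INIT" start
}
```

The function only automates the typing. You still want to read the messages the init script prints and then check ezmlm.squeak.org by hand.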
If this is still not working then I guess it is time to reboot the server. However, I would ask that you really only do this as a last resort and that you not do so as a quick decision. First email box-admins, feel free to Cc me directly if you want. Wait some time, 30 minutes, an hour, two hours. You can make your own judgment call on that. I have and it's not always consistent.
The point is that I or someone else may want to look at the situation first if only for information gathering purposes. Ultimately restarting the server once or twice without waiting for others to chime in is not going to get you in any trouble. A history of it is likely to start to annoy me, I assume it would annoy others as well. The reasoning is that while the system is in its broken state there is the possibility of gathering information about the problem that is not recoverable, at least not easily, once the system has been rebooted.
OK but if rebooting is the answer then it is simply
sudo reboot
Be aware that the server tends to take several minutes to be responsive again (it has been a while; I remember it seeming like a long time, but I don't remember how long it really takes). My habit after I have been booted off the server is to
ping squeak
And you say 'Huh'?
Yeah, this is beginning to digress, but I will persist nonetheless. For my own convenience some time ago I modified my /etc/hosts file. (Clearly we are getting off into the bushes and this is only directly applicable if you run Linux and friends locally, if you run MacOSX this may still apply to you pretty closely, but I don't know. Those of you on Windows: the fundamental facts are all true but the details have been changed. Google it.)
$ cat /etc/hosts
<snip>
# utility
85.10.195.197 squeak
173.246.101.237 box3
173.246.104.42 box4
<snip>
From the naming pattern you can guess that I started this before box3 and box4 existed. If you choose to do the same you can use any names you like, just don't mask any names used in your local network if there are any.
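Before adding a short name like this, it is worth checking that the name is not already taken in your hosts file. A minimal sketch, assuming a POSIX awk; the function name hosts_has_name is made up for illustration, and it only looks at the hosts file, not at DNS or anything else on your local network.

```shell
# hosts_has_name NAME [HOSTS_FILE]
# Succeeds if NAME appears as a hostname or alias (any field after the
# address) on a non-comment line of the hosts file.
hosts_has_name() {
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v n="$name" '
        { sub(/#.*/, "") }                                     # drop comments
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }   # skip the address field
        END { exit !found }
    ' "$file"
}
```

For example, `hosts_has_name squeak && echo "already defined"` tells you whether an entry named squeak is already in place.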
Back to the point at hand:
ken@neue:~$ ping squeak
PING squeak (85.10.195.197) 56(84) bytes of data.
64 bytes from squeak (85.10.195.197): icmp_seq=1 ttl=47 time=128 ms
64 bytes from squeak (85.10.195.197): icmp_seq=2 ttl=47 time=128 ms
64 bytes from squeak (85.10.195.197): icmp_seq=3 ttl=47 time=127 ms
This is what you want to see. While the server is restarting though, and the TCP/IP stack is down, ping is just going to be silent; but it will keep trying. So leave this and go back to whatever else you were doing. Check it occasionally and unless catastrophe has occurred you should in time see the above (of course your ping times will vary).
At this point then the server has restarted or is restarting. It has reached the point where it answers ping (ICMP echo). That, however, does not mean everything is working yet, and that includes sshd and Apache. Nonetheless you can go ahead and try to ssh to the server or check any of the web services. But if they don't immediately work, don't despair; several minutes are required for everything to come back up at the best of times. Ultimately everything should start back up as normal. If it doesn't then it is time to call for help.
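If you would rather not keep glancing at a ping window, the waiting can be scripted. This is a sketch under assumptions: the function name, the 10 second interval and the 60 try limit are arbitrary choices of mine, and the -W 2 timeout flag is the Linux iputils meaning (other platforms interpret -W differently).

```shell
# wait_for_ping HOST [INTERVAL_SECONDS] [MAX_TRIES]
# Polls HOST with single pings until one is answered or MAX_TRIES is reached.
wait_for_ping() {
    host="$1"
    interval="${2:-10}"
    max_tries="${3:-60}"
    tries=0
    # -c 1 sends a single echo request; -W 2 waits up to 2 seconds for a
    # reply on Linux iputils ping.
    until ping -c 1 -W 2 "$host" >/dev/null 2>&1; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            echo "$host still silent after $tries tries" >&2
            return 1
        fi
        sleep "$interval"
    done
    echo "$host answers ping"
}
```

Something like `wait_for_ping squeak` then sits quietly until the server answers. As noted above, an answer to ping only means the network stack is up, so sshd and Apache may need a few more minutes.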
OK, so end of the 'reboot the server' digression.
The scenario is now this: source.squeak.org is not working but we have checked and we believe that Apache is fine. This tells us that the problem is almost certainly isolated to the source.squeak.org Squeak process. Probably the thing to do is kill it and let daemontools restart it. But don't do that without checking out the image that will be used first.
$ ls -lh ~squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25 /home/squeaksource/Squeak3.11-8824-SS.image
The main thing to look at here is the file size (5th column). The size of the file should be in this vicinity; it does grow, but slowly. That is, it grows slowly under normal conditions. Sometimes, and this is a danger of the easy save-the-world persistence strategy we use, the image is saved when something has gone wrong and the heap has grown tremendously. I believe I have seen this file saved at 150-200MB before. When restarting the image does not work, it is nearly always the case that this file is much larger than expected. I can't remember any case in which it has restarted properly when the file is larger than normal.
If you forget to look and just kill the process (I will get to that shortly) it is not the end of the world. It may just mean that killing it and having it restarted does not fix the problem. Still, in my opinion it is better to look at the image first and have some confidence that it is not corrupted.
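The size check above is easy to script. In this sketch the function name and the 100 MB default limit are my own assumptions; the limit is simply a value between the roughly 35M normal size and the 150-200MB bad cases mentioned above, so adjust it as the image grows.

```shell
# image_size_ok IMAGE [MAX_MB]
# Returns 0 if IMAGE is no larger than MAX_MB megabytes, 1 if it is larger,
# and 2 if the file does not exist.
image_size_ok() {
    image="$1"
    max_mb="${2:-100}"        # hypothetical threshold, not an official figure
    if [ ! -f "$image" ]; then
        echo "missing: $image" >&2
        return 2
    fi
    bytes=$(( $(wc -c < "$image") ))
    limit=$(( max_mb * 1024 * 1024 ))
    if [ "$bytes" -le "$limit" ]; then
        echo "ok: $image is $bytes bytes"
    else
        echo "suspicious: $image is $bytes bytes (limit $limit)"
        return 1
    fi
}
```

On box2 this would be called as `image_size_ok ~squeaksource/Squeak3.11-8824-SS.image`. A "suspicious" answer is a hint to go look at the backups, not proof of corruption.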
If it is corrupted you can find recent backups of the bulk of the filesystem under /var/cache/rsnapshot/. This directory will look like
$ ls /var/cache/rsnapshot/
daily.0 daily.1 daily.2 daily.3 daily.4 daily.5 daily.6
daily.0 is the most recent backup (within the last 24 hours), daily.1 the next most recent, etc.
What I might do then if I'm looking for a good backup image is this
~$ ls -lh /var/cache/rsnapshot/*/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 21 16:25 /var/cache/rsnapshot/daily.0/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 37M Dec 20 17:25 /var/cache/rsnapshot/daily.1/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 19 15:24 /var/cache/rsnapshot/daily.2/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 18 16:24 /var/cache/rsnapshot/daily.3/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 17 12:23 /var/cache/rsnapshot/daily.4/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 16 12:23 /var/cache/rsnapshot/daily.5/localhost/home/squeaksource/Squeak3.11-8824-SS.image
-rw-r--r-- 1 squeaksource squeaksource 35M Dec 15 11:22 /var/cache/rsnapshot/daily.6/localhost/home/squeaksource/Squeak3.11-8824-SS.image
It's not well formatted in this email but I hope you get the idea. If possible you want to use the most recent backup which will be found within the daily.0 directory. But it is very possible that the backup backed it up after it was corrupted, in which case you consider daily.1, and so on. In any case once you find a copy that looks like it is probably OK then make a backup copy of the corrupted image and changes just in case someone wants to take a look at it, then copy over both the image and changes from the backup you picked into the squeaksource home directory.
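Picking the newest backup that still looks sane can be sketched as a loop over daily.0, daily.1 and so on. The function name and the 100 MB default limit are assumptions of mine, and size is the only thing it checks, so treat the result as a candidate to inspect rather than a verdict.

```shell
# newest_good_image SNAPSHOT_ROOT RELATIVE_PATH [MAX_MB]
# Walks daily.0, daily.1, ... under SNAPSHOT_ROOT and prints the first copy
# of RELATIVE_PATH that is no larger than MAX_MB megabytes.
# Returns 1 if no snapshot holds an acceptable copy.
newest_good_image() {
    root="$1"
    rel="$2"
    max_mb="${3:-100}"        # hypothetical threshold
    limit=$(( max_mb * 1024 * 1024 ))
    n=0
    while [ -d "$root/daily.$n" ]; do
        candidate="$root/daily.$n/$rel"
        if [ -f "$candidate" ] && [ "$(( $(wc -c < "$candidate") ))" -le "$limit" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Example (not run here):
#   newest_good_image /var/cache/rsnapshot \
#       localhost/home/squeaksource/Squeak3.11-8824-SS.image
```

Once a candidate is found, the remaining steps stay manual: set the suspect image and changes aside under another name, then copy the image and its changes file from the same snapshot directory into the squeaksource home directory.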
Hopefully now killing any existing process and having daemontools start it back up will work. And to do this you first have to identify the relevant process. One way is
$ ps auwx | grep squeaks
which should produce a list something like
root 2150 0.0 0.0 1360 268 ? S Nov28 0:00 supervise squeaksource
squeaks 2176 3.6 8.0 1051344 77724 ? S Nov28 1256:48 /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none /home/squeaksource/Squeak3.11-8824-SS.image
website 30990 25.2 10.8 1051420 105060 ? S 16:59 9:31 /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
kencaus 1409 0.0 0.0 1552 524 pts/0 S+ 17:37 0:00 grep squeaks
The relevant one is the squeakvm process referencing the proper image of course, the second one in this list. At which point you would take the process ID (second column) and do
$ kill 2176
for example. The ID will of course vary. Check again, and if the process is stuck (the one with the given ID does not disappear from the list), then you may have to
$ kill -9 2176
In any case if you repeatedly look at the relevant list of running processes you should soon see another ... squeakvm ... Squeak3.11-8824-SS.image process running with a new process ID and hopefully if you check http://source.squeak.org you will get what you expect.
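The kill, check, kill -9 escalation above can be sketched as one function. The name stop_pid and the 10 second default wait are assumptions of mine. It has to run as a user allowed to signal the process, and it only stops the process; bringing the service back is still left to daemontools.

```shell
# stop_pid PID [WAIT_SECONDS]
# Sends SIGTERM, waits up to WAIT_SECONDS for the process to exit, and only
# then sends SIGKILL. Returns 0 once the process is gone.
stop_pid() {
    pid="$1"
    wait_secs="${2:-10}"
    # If the signal cannot be delivered (no such process, or no permission),
    # report failure instead of guessing.
    kill "$pid" || return 1
    waited=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$wait_secs" ]; then
            echo "PID $pid ignored SIGTERM; sending SIGKILL" >&2
            kill -9 "$pid" 2>/dev/null
            sleep 1
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
    ! kill -0 "$pid" 2>/dev/null
}
```

With the example listing above this would be `stop_pid 2176`. After that, look at the process list again for a new squeakvm process with a different ID.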
Let me note that the filtered list of processes above also includes, on the first line, the daemontools process that 'supervises' the source.squeak.org service. Note that if you don't see this in the list then there is probably a problem with daemontools itself and in any case when you kill the process I don't expect that a new one will be started.
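Reading the process list by eye works, but the two things to look for (the VM process for the right image, and the supervise process for the service) can also be picked out with awk. A sketch, with both function names made up for illustration; it assumes ps auwx style output on standard input, with the PID in the second column and the image path as the last argument.

```shell
# vm_pid_for IMAGE_PATH
# Prints the PID (column 2) of every squeakvm process whose last argument
# is exactly IMAGE_PATH.
vm_pid_for() {
    awk -v image="$1" '/squeakvm/ && $NF == image { print $2 }'
}

# has_supervisor SERVICE
# Succeeds if a "supervise SERVICE" process appears in the listing.
has_supervisor() {
    awk -v svc="$1" '
        NF >= 2 && $(NF-1) == "supervise" && $NF == svc { found = 1 }
        END { exit !found }
    '
}

# Example (not run here):
#   ps auwx | vm_pid_for /home/squeaksource/Squeak3.11-8824-SS.image
#   ps auwx | has_supervisor squeaksource || echo "supervise is not running"
```

Matching on the exact image path avoids picking up the website VM or the grep process itself, which a plain grep squeaks does not.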
However I'm going to draw this to a close here and leave that for another time.
Ken
See an 'edit' below:
On 12/21/2013 11:44 AM, Ken Causey wrote:
Hopefully now killing any existing process and having daemontools start it back up will work. And to do this you first have to identify the relevant process. One way is
David, in an email that came in just after I hit send on this one, reminded me that there is a simpler way (than that shown below). The quick way to restart any daemontools monitored service is
$ sudo svc -t <service name>
in this case
$ sudo svc -t squeaksource
Don't remember the service name?
$ ls -l /service/ | grep squeaksource
lrwxrwxrwx 1 root root 26 Oct 10 2006 squeaksource -> /home/squeaksource/service
I specified squeaksource above because that is the name of the user/home directory under which the service info resides for the service in question. Of course if you are trying to restart a different service then substitute the other username as appropriate.
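The lookup and the restart can be combined so that a mistyped service name is caught before svc is run. A sketch only: SERVICE_ROOT and SUDO are hypothetical knobs of mine, and I pass svc the full service directory (/service/<name>) because, as far as I know, daemontools' svc expects a directory path, which a bare name only satisfies when you are already inside /service.

```shell
# Hypothetical knobs (not part of the original procedure):
#   SERVICE_ROOT  directory scanned by daemontools
#   SUDO          command used to gain root; set to empty to run directly
SERVICE_ROOT="${SERVICE_ROOT:-/service}"
SUDO="${SUDO-sudo}"

# restart_service NAME
# Refuses to run if NAME is not a service directory, then asks supervise
# to send TERM to the service so that it gets restarted.
restart_service() {
    dir="$SERVICE_ROOT/$1"
    if [ ! -d "$dir" ]; then
        echo "no such service: $dir" >&2
        return 1
    fi
    # svc -t sends TERM to the supervised process; supervise then restarts it.
    $SUDO svc -t "$dir"
}
```

For source.squeak.org that is `restart_service squeaksource`. Afterwards, daemontools' svstat on the same directory should report a fresh uptime if the restart took.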
The info below is still of some value if svc -t does not seem to be working. 98% of the time though, I expect it will.
$ ps auwx | grep squeaks
which should produce a list something like
root 2150 0.0 0.0 1360 268 ? S Nov28 0:00 supervise squeaksource
squeaks 2176 3.6 8.0 1051344 77724 ? S Nov28 1256:48 /usr/local/lib/squeak/3.11.3-2135/squeakvm -vm-display-none /home/squeaksource/Squeak3.11-8824-SS.image
website 30990 25.2 10.8 1051420 105060 ? S 16:59 9:31 /usr/bin/squeakvm -vm-display=none /home/website/website/squeaksite.image
kencaus 1409 0.0 0.0 1552 524 pts/0 S+ 17:37 0:00 grep squeaks
The relevant one is the squeakvm process referencing the proper image of course, the second one in this list. At which point you would take the process ID (second column) and do
$ kill 2176
for example. The ID will of course vary. Check again, and if the process is stuck (the one with the given ID does not disappear from the list), then you may have to
$ kill -9 2176
In any case if you repeatedly look at the relevant list of running processes you should soon see another ... squeakvm ... Squeak3.11-8824-SS.image process running with a new process ID and hopefully if you check http://source.squeak.org you will get what you expect.
Let me note that the filtered list of processes above also includes, on the first line, the daemontools process that 'supervises' the source.squeak.org service. Note that if you don't see this in the list then there is probably a problem with daemontools itself and in any case when you kill the process I don't expect that a new one will be started.
However I'm going to draw this to a close here and leave that for another time.
Ken
Thanks for providing this Ken.
I have read through this quickly and will read more carefully again later. You quite rightly raised the question of what we as box-admins should do to document our processes, and in that light I think it would be very helpful if you can save the information in this note in a README file (or README-box-admins or something of that sort). I guess in this case the README might be located in ~website on box2.
I know this is not a very high-tech solution, but pretty much everybody understands the convention, regardless of their computing background or level of experience. So I think this might be a small thing that would really help. The same applies to build.squeak.org or anything else that one of us sets up on the servers - if we leave some sort of README breadcrumbs behind, it can be a big help to whoever needs to figure out how to fix a problem under pressure when the original expert is not available.
Regarding the current squeaksource.com problem:
The problem with our squeaksource.com web page rendering is internal to the squeaksource image itself. I get the same symptoms running the image on my PC at home, and have no problems with a two week old backup copy of the same image.
I just posted a question to the seaside and squeak-dev lists asking if anyone has seen this kind of problem before.
For now, the SSC repository is fine, but the web interface is a mess. If I cannot figure out the problem by tomorrow, I will restart from the two week old backup image and tidy up whatever problems might arise from there.
Dave
On Sat, Dec 21, 2013 at 11:44:57AM -0600, Ken Causey wrote:
box-admins@lists.squeakfoundation.org