TL;DR: OOM at 7am, "Too many open files" at 11am.
I cycled the image, let's wait…
To quote the log:
2017-02-16 07:31:22.318429500
2017-02-16 07:31:22.318431500 out of memory
2017-02-16 07:31:22.318431500
2017-02-16 07:31:22.318431500 1066476624 Behavior>new
2017-02-16 07:31:22.318432500 1066476532 Exception class>signal
2017-02-16 07:31:22.318432500 1066476440 Behavior>basicNew
2017-02-16 07:31:22.318432500 1066476348 Behavior>new
2017-02-16 07:31:22.318433500 1066476256 Exception class>signal
2017-02-16 07:31:22.318433500 1066476164 Behavior>basicNew
2017-02-16 07:31:22.318433500 1066476072 Behavior>new
2017-02-16 07:31:22.318434500 1066475980 Exception class>signal
2017-02-16 07:31:22.318465500 1066475796 Behavior>basicNew
2017-02-16 07:31:22.318466500 1066475704 Behavior>new
2017-02-16 07:31:22.318466500 1066475612 Exception class>signal
2017-02-16 07:31:22.318466500 1066475520 Behavior>basicNew
2017-02-16 07:31:22.318467500 1066475428 Behavior>new
2017-02-16 07:31:22.318467500 1066475336 Exception class>signal
2017-02-16 07:31:22.318467500 1066475244 Behavior>basicNew
2017-02-16 07:31:22.318468500 1066475152 Behavior>new
2017-02-16 07:31:22.318468500 1066475060 Exception class>signal
2017-02-16 07:31:22.318471500 1066474968 Behavior>basicNew
2017-02-16 07:31:22.318471500 1066474876 Behavior>new
2017-02-16 07:31:22.318471500 1066474784 Exception class>signal
2017-02-16 07:31:22.318472500 1066474692 Behavior>basicNew:
2017-02-16 07:31:22.318472500 1066474600 Array class>new:
2017-02-16 07:31:22.318472500 1066474508 EventSensor>fetchMoreEvents
2017-02-16 07:31:22.318473500 1066473864 EventSensor>eventTickler
2017-02-16 07:31:22.318473500 1066473716 BlockClosure>on:do:
2017-02-16 07:31:22.318475500 817936580 EventSensor>eventTickler
2017-02-16 07:31:22.318475500 817936488 EventSensor>installEventTickler
2017-02-16 07:31:22.318476500 817936672 BlockClosure>newProcess
2017-02-16 07:31:22.388240500 starting ./run in pid 15028
2017-02-16 07:31:22.388284500 cd /…/squeaksourcecom/SqueakSource
2017-02-16 07:31:22.388379500 exec setuidgid squeaksourcecom /usr/local/bin/squeak -vm-display-null /srv/squeaksourcecom/SqueakSource/squeaksource.6.image
2017-02-16 10:41:23.678772500 acceptHandler: Too many open files
2017-02-16 10:41:23.680035500 acceptHandler: aborting server 7 pss=0x1a136a0
2017-02-16 10:42:13.734814500 socketStatus: freeing invalidated pss=0x1a136a0
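For anyone who wants to scan such a log for the two failures mentioned above, here is a minimal Python sketch. It assumes each entry starts with a timestamp in the form shown in the quoted log (date, time, nine fractional digits). The function name `find_failures` and the `MARKERS` list are made up for illustration and are not part of any existing tooling on the box.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted log: date, time, 9 fractional digits.
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d{9}\s*(.*)$")

# Messages taken from the quoted log; illustrative only, not an exhaustive list.
MARKERS = ("out of memory", "Too many open files")


def find_failures(lines):
    """Return (timestamp, message) pairs for entries containing a known marker."""
    hits = []
    for line in lines:
        match = STAMP.match(line.strip())
        if not match:
            continue  # skip lines that lack the expected timestamp prefix
        stamp, message = match.groups()
        if any(marker in message for marker in MARKERS):
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            hits.append((when, message))
    return hits


sample = [
    "2017-02-16 07:31:22.318431500 out of memory",
    "2017-02-16 07:31:22.318431500 1066476624 Behavior>new",
    "2017-02-16 10:41:23.678772500 acceptHandler: Too many open files",
]
for when, message in find_failures(sample):
    print(when.isoformat(), message)
```

The fractional seconds are dropped on purpose, since second resolution is enough to tell the 07:31 and 10:41 events apart.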
Thanks Tobias,
There is definitely something wrong with the image. It is growing in memory and currently using a lot of CPU. I will attempt an off-line fix, but brief outages are possible over the next hour or two.
The image normally saves itself every hour, and it looks like this last happened successfully three days ago:
squeaksourcecom@dan:~/SqueakSource$ ls -lt squeaksource.6.*
-rw-r--r-- 1 squeaksourcecom www-data  16317608 Feb 16 14:13 squeaksource.6.changes
-rw-r--r-- 1 squeaksourcecom www-data 814463464 Feb 14 08:33 squeaksource.6.image
That means that today's restart would have begun with the Feb 14 copy of the image, so there may be some data loss related to this (I'll check it later).
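The same check (has the hourly save stopped happening?) could be automated along these lines. This is only a sketch: the two-hour tolerance and the function names are assumptions, not something that exists on the server.

```python
import os
import time


def seconds_since_save(image_path, now=None):
    """Seconds elapsed since the file at image_path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(image_path)


def is_stale(image_path, max_age_hours=2.0, now=None):
    """True if the image has not been saved within max_age_hours.

    max_age_hours is an assumed tolerance: the image is said to save
    itself hourly, so two hours leaves room for one missed save.
    """
    return seconds_since_save(image_path, now) > max_age_hours * 3600
```

Run against the listing above, `squeaksource.6.image` (last written Feb 14 08:33) would be reported as stale, while the changes file (Feb 16 14:13) shows the image was still running.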
Dave
On Thu, Feb 16, 2017 at 03:16:43PM +0100, Tobias Pape wrote:
On Thu, Feb 16, 2017 at 07:22:22PM -0500, David T. Lewis wrote:
I fixed the image off-line and restarted it. Memory and CPU are back to normal. Notes added to dan.box.squeak.org:~squeaksourcecom/README:
-----------------
Fri Feb 17 00:55:29 UTC 2017 dtl
Now running squeaksource.7.image from the run script. There is some sort of problem that is occasionally running the image out of memory. It appears to be an error handler in Seaside that tries to send an email via smtp to mail.squeak.org, but is unable to connect to smtp for some reason. The memory usage is all related to that single process. Although the image runs out of memory on this server, I am able to download and run it locally on another machine, terminate the process, and save the image for upload back to dan.box.squeak.org.
The download-and-fix procedure is needed because of a separate and unrelated issue. The VNC server in image is not working when running on dan.box.squeak.org. I am not sure, but this may be related to name server lookup of localhost on current Linux distros, which is a bug that was fixed in less ancient Squeak images. It is probably time to update the image a bit, at least to Squeak 4.6. Note, we should move to 64-bit Spur but I would prefer to wait for source.squeak.org to do that first.
Summary - healthy for now, but we need to check about once a week or so to make sure the memory size (size of squeaksource.7.image on disk) is not getting out of hand.
-----------------
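The weekly size check from the summary could be scripted roughly as follows. This is a sketch under assumptions: the limit is a value someone would have to choose (the notes do not give one), and the function names are hypothetical.

```python
import os


def image_size_mb(path):
    """Size of the file at path in MiB."""
    return os.path.getsize(path) / (1024 * 1024)


def check_image_size(path, limit_mb):
    """Return a one-line report; limit_mb is an assumed threshold, not a known one."""
    size = image_size_mb(path)
    status = "OK" if size <= limit_mb else "TOO LARGE"
    return f"{status}: {path} is {size:.1f} MB (limit {limit_mb} MB)"
```

For scale, the squeaksource.6.image in the earlier listing was 814463464 bytes, which is about 777 MB by this measure.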
Dave
On 17.02.2017, at 02:15, David T. Lewis <lewis@mail.msen.com> wrote:
Hi Dave, thanks for taking this on :)
Ugh.
I am away with no system access for the next 8-10 hours. If it is still down when I am back, I will fix it then.
Background:
A previous failure, which is very likely recurring, was a problem related to the image trying to send mail, with failed processes consuming image memory until things locked up. I changed an smtp parameter in hopes of correcting it; apparently that did not work, or I just misunderstood the problem.
How I fixed it last time:
Zip up the image/changes files and preserve them. Download to my PC, run on a locally built interpreter VM, which was able to open the image without running out of memory (as was happening on the server). Terminate runaway processes and resave image. Copy it back to the server, then kill the running squeaksource VM to restart things.
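The first step (zip up the image/changes pair and keep it) might look like this in Python. It is a sketch only: the archive naming scheme and the `preserve_image` helper are assumptions, and the remaining steps (running the image locally, terminating processes, copying it back) are manual and not covered here.

```python
import os
import time
import zipfile


def preserve_image(image_path, changes_path, archive_dir):
    """Zip the image/changes pair into archive_dir and return the archive path.

    The timestamped archive name is an assumed convention, chosen so that
    repeated runs do not overwrite earlier copies.
    """
    os.makedirs(archive_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime())
    archive_path = os.path.join(archive_dir, f"squeaksource-{stamp}.zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in (image_path, changes_path):
            # Store only the file name so the archive unpacks into one directory.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

Keeping the image and changes files in the same archive matters because they are only useful as a matching pair.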
If all else fails:
Reasonably recent backups are in a ./BACKUPS directory, so restart using the most recent image from backup. Check file dates and look in the ./ss/ss.log file to see what may have been lost. Most stuff just comes back automatically from the file system, but things like user account changes live in the image.
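Picking the most recent image from the backup directory can be done by modification time, as in this sketch. The `*.image` pattern is an assumption about how the backups are named, and `newest_backup` is a hypothetical helper.

```python
import glob
import os


def newest_backup(backup_dir, pattern="*.image"):
    """Return the most recently modified file matching pattern, or None if none match.

    The default pattern is an assumption about how the backups are named.
    """
    candidates = glob.glob(os.path.join(backup_dir, pattern))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

Calling `newest_backup("./BACKUPS")` would return the candidate to restart from. Its file date, compared against ./ss/ss.log, then shows how much may have been lost.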
Dave
box-admins@lists.squeakfoundation.org