Has anyone had an issue with a Seaside deployment locking up and killing network access to the box when hosted on Windows Server 2003? I'm running a 3.9 image in prod, and on occasion it stops working. When it dies, it seems to take the network out with it: other scheduled processes that copy files across the network start failing as well. Killing Squeak and restarting it fixes the problem and gets the network working again. I haven't a clue what's going on, and Squeak itself goes black and I lose the UI, so I can't even debug the issue. I've had to resort to running Squeak as a service and resetting it on a schedule to give the appearance of stability, but I'm not too happy with that solution. Anyone have any ideas?
Ramon Leon http://onsmalltalk.com
From: Ramon Leon
Has anyone had an issue with a deployment of Seaside locking up, and killing access to the box via networking when hosting on windows server 2003?
[...]
I've had to resort to running Squeak as a service and resetting it on a schedule to give the appearance of stability, but I'm not too happy with that solution. Anyone have any ideas?
Sounds like exhaustion of some OS network-related resource, that is then released when the process exits. This is reinforced by your observation that regular restarts of the process remove the problem. Naively I'd suggest monitoring the handle count for the Squeak process as a first step, but Andreas probably has some much better ideas for monitoring!
- Peter
Looking at the handle count, I'm seeing a fresh image start with around 40 or so, while an image that's been up a bit is pushing 5000, much more than any other process on the box. I'm still waiting for another crash, but this seems a likely suspect. Any more ideas?
From: Ramon Leon
Looking at the handle count, I'm seeing a fresh image start with around 40 or so, and an image that's been up a bit, pushing 5000, much more than any other process on the box.
OK, this feels like a promising source.
I'm still waiting for another crash, but this seems a likely suspect, any more ideas?
If it's a networking issue, copies of handle.exe and tdimon.exe (both from www.sysinternals.com) may be useful - if I recall correctly, they may tell you for what the handles are being used. Then it's a case of reviewing the code that opens that kind of thing and seeing whether it disposes of the object correctly afterwards - could be a VM issue, could be an image issue.
- Peter
Appreciate the tips, I'll try them out and see if it leads anywhere when I go to work tomorrow.
OK, playing with handle, and seems they are thread handles.
Squeak.exe in task manager has 14,702 handles, 7 threads.
Yet handle -s shows 15000+ Thread handles
Locally, I can see both the handle count and the thread count spike when I do a SOAP call in a loop, forking each call. That roughly simulates my live environment, where the Seaside app does SOAP calls on a forked process and polls for the result. It seems I'm somehow leaving thread handles hanging around. Any idea what might cause this or how I can track it down?
Ramon Leon http://onsmalltalk.com
OK, I've found the offending line of code. I'm using
NetNameResolver localHostName
to print the web server name in the HTML source for debugging purposes, and it turns out that each time it's called, it leaves a handle hanging.
10000 timesRepeat: [NetNameResolver localHostName]
confirms to me that this is my bug. So... does anyone know a reliable method of getting the computer's name that doesn't leak like a sieve?
Ramon Leon http://onsmalltalk.com
Ramon Leon wrote:
10000 timesRepeat: [NetNameResolver localHostName]
Confirms to me that this is my bug. So... Anyone know a reliable method of getting the computers name that doesn't leak like a sieve?
What Windows version are you running? I've just run the above code on XP and everything went fine, i.e., no handles were leaked.
Cheers, - Andreas
I'm using XP Professional Service Pack 2, and I just reconfirmed that this leaks handles for both Squeak 3.8.1 and Squeak 3.9 for me.
Ramon Leon http://onsmalltalk.com
And in production, I'm using Windows Server 2003, which also leaks.
Ramon Leon http://onsmalltalk.com
Ramon Leon wrote:
I'm using XP Professional Service Pack 2, and I just reconfirmed that this leaks handles for both Squeak 3.8.1 and Squeak 3.9 for me.
Then I need you to debug some more stuff:
1) Which VM are you using?
2) Does the following leak?
1000 timesRepeat: [NetNameResolver localHostAddress]
3) Does the following leak? How fast does it execute? How many stars do you get in the transcript?
addr := NetNameResolver localHostAddress.
[1000 timesRepeat: [
	(NetNameResolver nameForAddress: addr timeout: 5)
		ifNil: [Transcript show: '*']]] timeToRun.
4) If test 3) does leak, increase the timeout to 50 and re-run. Does it still leak? How fast does it execute?
Cheers, - Andreas
On Tue, 19 Dec 2006 10:26:25 -0800, Andreas Raab andreas.raab@gmx.de wrote:
1) Which VM are you using?
3.7.1 (release) from Sept 23, 2004 Compiler: gcc 2.95.2 19991024 (release)
2) Does the following leak?
1000 timesRepeat: [NetNameResolver localHostAddress]
No.
3) Does the following leak?
Yes.
How fast does it execute?
283 ms
How many stars do you get in the transcript?
None.
addr := NetNameResolver localHostAddress.
[1000 timesRepeat: [
	(NetNameResolver nameForAddress: addr timeout: 5)
		ifNil: [Transcript show: '*']]] timeToRun.
4) If test 3) does leak, increase the timeout to 50 and re-run. Does it still leak? How fast does it execute?
Same results, 1000 more handles, no stars on the transcript, run time was 265 ms.
This is on Squeak 3.8 (6665) on XP Pro.
Later, Jon
-------------------------------------------------------------- Jon Hylands Jon@huv.com http://www.huv.com/jon
Project: Micro Raptor (Small Biped Velociraptor Robot) http://www.huv.com/blog
Then I need you to debug some more stuff:
1) Which VM are you using?
3.7.1
2) Does the following leak?
1000 timesRepeat: [NetNameResolver localHostAddress]
Nope.
3) Does the following leak?
Yup
How fast does it execute?
387
How many stars do you get in the transcript?
None.
4) If test 3) does leak, increase the timeout to 50 and re-run. Does it still leak? How fast does it execute?
Still leaks, finishes in about the same on average.
Cheers,
- Andreas
Ramon Leon wrote:
Then I need you to debug some more stuff:
- Which VM are you using?
If the millisecondClock is close to rolling over, then the deadline may be set to a number greater than SmallInteger maxVal, which the clock never reaches, so timeouts will never complete.
For Socket the default timeout is 45 seconds. So, thinking about it, this error is only likely to occur for 45 seconds every 12 days or so, but it will occur if there is code which relies upon the timeout itself.
Now this is a complete guess, but it might be your explanation: what if your millisecond clock has somehow failed to roll over and is stuck in that dangerous region, i.e. at SmallInteger maxVal?
Keith
A fix for the Smalltalk bug is to make sure that the timeout calculation similarly rolls over, i.e.
deadlineSecs: secs
	"Return a deadline time the given number of seconds from now."
	^ (Time millisecondClockValue + (secs * 1000) truncated) \\ SmallInteger maxVal
The code in question does not use Socket-#deadlineSecs:, and there are many, many places in the image that could be caught by this. The solution is to put this code on Time and encourage its use:
Time-deadlineSecs:
Time-pastDeadline: deadline
Of course, if your millisecondClock has got stuck, then it's a VM problem.
Keith
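To make the rollover arithmetic above concrete, here is a small workspace sketch. It is only an illustration: plain integers stand in for Time millisecondClockValue, and the wrap point is assumed to be 1073741823, the SmallInteger maxVal of a 32-bit image of that era (other images may differ). The temporaries wrap, now, timeoutMs, naive and wrapped are names made up for this example.

```smalltalk
"Sketch only: plain integers stand in for Time millisecondClockValue."
| wrap now timeoutMs naive wrapped |
wrap := 1073741823.          "assumed rollover point (32-bit SmallInteger maxVal)"
now := wrap - 10000.         "the clock is 10 seconds away from rolling over"
timeoutMs := 45 * 1000.      "the 45-second Socket default mentioned above"
naive := now + timeoutMs.    "larger than wrap, so the clock can never reach it"
wrapped := naive \\ wrap.    "rolls over together with the clock"
Transcript show: naive printString; cr.    "1073776823"
Transcript show: wrapped printString; cr.  "35000"
```

Note that the wrapped deadline is numerically smaller than the current clock value, so a plain "now >= deadline" comparison would report it as already expired. That is presumably why a wrap-aware pastDeadline: check is proposed alongside deadlineSecs:.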
Ramon, Jon -
This is really confusing. Let's make sure we're measuring the same thing (and not some random numbers that have no relation to reality ;-). When I was running the test I used the Windows Task Manager which, under the Performance tab, displays the total number of handles, threads, processes, memory etc.
When running the test I saw no relevant change in handles, threads, memory, or commit charge. Did you use the same mechanism or did you use something else? If you didn't use the Windows Task Manager, what did you use? And what does the Windows Task Manager report? If you see a change, can you send me the before/after values when running our little "benchmark"?
Thanks, - Andreas
Actually, that works as well, but I was using the Processes tab, after choosing View/Select Columns and adding handles and threads to the display. This lets me see that it's Squeak eating those handles.
However, on my machine at home, a totally different box but also Windows XP Pro, I see the same behavior: each time I call NetNameResolver localHostName, the handle count goes up by 1 on the Squeak process. Calling 1000 timesRepeat: [NetNameResolver localHostName] naturally kicks it up by exactly 1000, quite reliably.
If you have more tests, send 'em and I'll run 'em. If you want my image, here's my image, VM, and all: http://onsmalltalk.com/downloads/DevImage.zip. If there's anything I can do to assist, let me know. This has been a pain in my butt for weeks already; I'm just glad I found out what was causing my image to crash.
Ramon Leon http://onsmalltalk.com
Ramon Leon wrote:
Actually, that works as well, but I was using the process tab, after choosing view/select columns and adding handles and threads to the display. This lets me see that it's Squeak eating those handles.
Ah, thanks. I didn't even know about that ;-) And after turning it on, I can see it, too. I'm not sure what I saw before, because what happened reminded me of the handle leak that used to be in sockets, and sure enough, the lookup code has the same problem (a missing CloseHandle() call for the thread created).
Well, the good news is that I understand the problem and know how to fix it. The bad news is that I currently have no definitive VMMaker version against which to compile. I might do a cheap respin of 3.7 (e.g., a 3.7.2) with *just* that fix included. Would this be helpful for you?
Cheers, - Andreas
On 19-Dec-06, at 9:42 PM, Andreas Raab wrote:
Well, the good news is that I understand the problem and know how to fix it. The bad news is that I currently have no definitive VMMaker version against which to compile - I might do a cheap respin of 3.7 (e.g., a 3.7.2) with *just* that fix included. Would this be helpful for you?
Is there any reason not to use 3.8b6? I don't recall any changes that make it not a good candidate but I haven't had time available to even think about it in 12 months.
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim How many of you believe in telekinesis? Raise my hands....
Honestly, I don't even need it. I was just using the machine name in some debug statements to tell the machines in the web farm apart. I've started printing the IP address instead; it doesn't leak and works just as well. But thanks for the offer.
Honestly, I don't even need it, I was just using the machine name in some debug statements for separating the machine in the web farm, I just started printing the IP address instead, it doesn't leak, and works just as well. But thanks for the offer.
Is there a good reason why this value should not be cached on startUp?
Keith
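As a rough illustration of the caching idea raised here, a sketch for a Squeak image follows. The class HostNameCache, its class variable CachedName and the category name are hypothetical and exist only for this example; NetNameResolver localHostName and Smalltalk addToStartUpList: are the existing image facilities it relies on. The "HostNameCache class >> selector" headers are browser notation for class-side methods, not workspace code. With this arrangement the leaking lookup would run once per image start rather than once per request.

```smalltalk
"Hypothetical helper class; evaluate the definition in a workspace."
Object subclass: #HostNameCache
	instanceVariableNames: ''
	classVariableNames: 'CachedName'
	poolDictionaries: ''
	category: 'Hypothetical-Debugging'.

HostNameCache class >> startUp
	"Run by the image at startup once the class is registered below.
	 Refreshes the cached name a single time per launch."
	CachedName := NetNameResolver localHostName

HostNameCache class >> hostName
	"Answer the cached name, looking it up lazily if startUp has not run yet."
	^ CachedName ifNil: [CachedName := NetNameResolver localHostName]

"Register once, e.g. from a workspace:"
Smalltalk addToStartUpList: HostNameCache.
```

Callers would then use HostNameCache hostName in place of NetNameResolver localHostName.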
Had I known it leaked, I certainly would have, but like I said, I don't even need it; it's already been removed.
Ramon Leon http://onsmalltalk.com
2006/12/19, Ramon Leon ramon.leon@allresnet.com:
I'm using XP Professional Service Pack 2, and I just reconfirmed that this leaks handles for both Squeak 3.8.1 and Squeak 3.9 for me.
You don't happen to have the Windows "firewall" on, do you?
Philippe
Nope, I hate that thing.
Ramon Leon http://onsmalltalk.com
2006/12/19, Andreas Raab andreas.raab@gmx.de:
Philippe Marschall wrote:
You don't happen to have the Windows "firewall" on, do you?
I don't, but have you seen a problem when Windows firewall was turned on?
No, I just wanted to make sure it's none of the "simple" problems.
Philippe
On Tue, 19 Dec 2006 10:08:10 -0800, Andreas Raab andreas.raab@gmx.de wrote:
What Windows version are you running? I've just run the above code on XP and everything went fine, i.e., no handles were leaked.
I am running Windows XP Pro (Version 5.1.2600 Service Pack 2 Build 2600) and confirmed that the same thing happens to me as what happened to Ramon.
The handle count went from 142 to 10,142...
Later, Jon
On Tue, 19 Dec 2006 13:23:04 -0500, Jon Hylands jon@huv.com wrote:
I am running Windows XP Pro (Version 5.1.2600 Service Pack 2 Build 2600) and confirmed that the same thing happens to me as what happened to Ramon.
I also tried the same thing on my other laptop, which is running XP Home Edition (also Version 5.1.2600 Service Pack 2 Build 2600).
Same results - handle count went up by 10,000.
Later, Jon
squeak-dev@lists.squeakfoundation.org