[Box-Admins] Re: [squeak-dev] SqueakSource.com home page (was: Fix for OSProcess - Where to commit?)

Fri Nov 15 18:11:43 UTC 2013

On Fri, Nov 15, 2013 at 09:31:18AM -0800, Eliot Miranda wrote:
> Hi David,
> 
> On Thu, Nov 14, 2013 at 6:00 PM, David T. Lewis <lewis at mail.msen.com> wrote:
> 
> > Attached is a screen shot of the process browser in the squeaksource.com
> > image, showing the excess SSSession processes. They are deadlocked on
> > accessing DateAndTime now, which contains a critical section using the
> > LastTickSemaphore in class DateAndTime.
> >
> > In the squeaksource.com image, LastTickSemaphore has 0 excess signals,
> > whereas in other images I look at, it has 1 excess signal. This looks
> > to me like a mutex that has gotten confused.
> >
> > I sent a signal to LastTickSemaphore in class DateAndTime, and now it
> > looks like a mutex again. Let's see if that clears the problem.
> >
> > This certainly has a bad smell about it :-(  But I note also that
> > we are running our SqueakSource services on older images, and a number
> > of changes have been made to DateAndTime since then.
> >
> > Nicolas, I will send private email to give you the VNC password for access
> > to the squeaksource.com image in case you need it (I am going to get some
> > sleep soon).
> >
> 
> All this LastTickSemaphore stuff is complete nonsense, wasting on average
> 1/2 a second on startup spinning until the clock rolls over.  If we move to
> the 64-bit microsecond timebase which is provided by the Cog time
> primitives we don't need to sync the second and the millisecond clocks
> because they are replaced by a single microsecond clock.  If the current
> Interpreter VMs do support the 64-bit microsecond primitives I suggest we
> move ASAP.  We've already done this in our images at Cadence and been
> running happily with it for several months.  Would this help?
> 

Yes, it probably would help, in the sense that it would make this particular
failure scenario impossible.

But I think that something else must be going on here, and it would be
worth getting to the bottom of it. The particular deadlock we are seeing
here should be impossible, regardless of how nonsensical the LastTickSemaphore
stuff may be. We are looking at a small section of code evaluated within
a critical section. If semaphores and process scheduling are working
correctly, it should be impossible for two different processes to deadlock
on that section.

I recall that Andreas made an important fix to process scheduling perhaps
a couple of years ago, but I can't remember the details. I wonder if our
SqueakSource images may be lacking that fix?

Also, Chris Muller pointed out that he has seen similar symptoms related
to Seaside:

  This might be a problem I think I observed with using Seaside's
  #returnResponse: inside a Mutex's critical: block.  The block is
  entered, the Sema waited but never resignaled, leaving all subsequent
  processes stuck waiting.

So I am concerned about the following: How is it possible that a semaphore
that is used privately by a small section of uncomplicated code could ever
get itself into a state where it has missed a signal and no longer
functions as a mutex? In normal operation this never happens, but is
there some scenario related to Seaside operations, socket timeouts,
process scheduling, or image save and restart that might lead to this
condition? 
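
To make the "missed signal" state concrete, here is a minimal sketch in
Python rather than Squeak. It is only a toy model, not the DateAndTime or
Semaphore code from the image: the class SemaphoreMutex and its method names
are hypothetical, and the abnormal exit is simulated with an exception. It
shows one way a semaphore used as a mutex can end up with 0 excess signals
(the wait happened but the matching signal never did), and that a single
manual signal restores it.

```python
import threading


class SemaphoreMutex:
    """Toy model (hypothetical) of a counting semaphore used as a mutex.

    When the mutex is free the semaphore holds exactly one excess signal.
    """

    def __init__(self):
        self._sema = threading.Semaphore(1)

    def critical_unprotected(self, block):
        # Wait, run the block, signal. If the block exits abnormally,
        # the signal is skipped and the excess signal is lost.
        self._sema.acquire()
        result = block()
        self._sema.release()
        return result

    def critical(self, block):
        # Same, but the signal is guaranteed even on abnormal exit.
        self._sema.acquire()
        try:
            return block()
        finally:
            self._sema.release()

    def is_free(self):
        # Non-blocking probe: take the excess signal if there is one,
        # then put it straight back.
        if self._sema.acquire(blocking=False):
            self._sema.release()
            return True
        return False

    def signal(self):
        # Manual repair: add one excess signal.
        self._sema.release()


def failing_block():
    # Stands in for any abnormal exit from the critical section.
    raise RuntimeError("simulated abnormal exit")


mutex = SemaphoreMutex()

try:
    mutex.critical_unprotected(failing_block)
except RuntimeError:
    pass
wedged = not mutex.is_free()      # every later waiter would now block

mutex.signal()                    # the manual fix: signal once
recovered = mutex.is_free()

safe = SemaphoreMutex()
try:
    safe.critical(failing_block)
except RuntimeError:
    pass
still_free = safe.is_free()       # protected variant keeps its signal

print("wedged:", wedged)
print("recovered:", recovered)
print("protected variant still free:", still_free)
```

The try/finally here plays the role that an ensure: block would play in
Smalltalk. The sketch does not say which path led to the lost signal in the
squeaksource.com image; it only illustrates what the state looks like once
a wait has gone unmatched.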

BTW, squeaksource.com seems to be working nicely since I signalled
that semaphore yesterday to break things loose. I uploaded a few packages
today without problems. But the problem will be back, I am certain
of that.

Dave
