SqueakSource is down again

Sun Dec 23 23:12:43 UTC 2007

2007/12/24, Andreas Raab <andreas.raab at gmx.de>:
> Lukas Renggli wrote:
> >>> So which parts do we need to fix to make the Semaphore, Socket and
> >>> image freezing problems go away?
> >> For semaphores I'd recommend the fixes that I've posted over the year.
> >
> > I loaded all your semaphore related patches a couple of months ago and
> > squeaksource.com ran quietly and happily up to a few weeks ago. Then
> > suddenly we got many processes hanging in Semaphore>>#critical:.
>
> If you could send a couple of complete stack dumps from the affected
> image it might be interesting. There is a possibility you were affected
> by the problem of primitiveSuspend (which we discussed earlier) but
> that's difficult to tell from a stack dump. Much much easier if you can
> go into the image and check whether the doIt I sent comes up empty or not.
>
> >> For image freezes -in particular in
> >> Squeaksource- you probably need to fix the concurrency issues in
> >> Squeaksource itself.
> >
> > What kind of concurrency issues in squeaksource.com itself could cause
> > these problems? I know that the code is far from perfect, but I must
> > also point out that we didn't lose a single one of the more than 71'000
> > versions during the past 4 years. We also never experienced a
> > corrupted data model.
>
> What we've experienced was basically that after the first commit, when
> our image went to saving the data model in a reference stream (via
> SSFileSystem; takes about two minutes or so), a second commit would
> wreak havoc on the system. You can probably simulate this by generating
> enough load from different clients on the network with or without
> SSFileSystem. And I don't like the idea of saving the image very much
> because it's probably not feasible to save multiple versions of that
> image which ultimately means that any data corruption kills the whole
> data model.

We don't use reference streams anymore. We are at the point where it
takes more than 30 minutes to write the model to disk. We only save
the image. We are aware how suboptimal this is but until now we have
been very lucky to get away with this.

Cheers
Philippe

> > I wonder how it can happen that semaphores are suddenly blocked? Might
> > this be related to image saving happening while being within a
> > critical section?
>
> Interesting thought. It may be possible for some strange things to
> happen if Seaside doesn't take the precaution of not accepting connections
> while in the midst of a save. The problem is that the image save/startup
> runs with whatever priority it's being issued at, so if there's another
> process running at the same time there is a chance this process
> interrupts the image save with the potential for strange things
> happening. Here is one way in which I could see this happening: A
> critical lock is held by a process waiting for network traffic to occur
> when the image is saved. When the image is restored later on, that
> socket is no longer valid but the process could still wait on the
> semaphore, blocking the critical section for all other uses.
>
> Cheers,
>    - Andreas
>
>
>
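
The scenario Andreas describes above (a process holding a critical lock while it waits for network traffic that never arrives) can be illustrated with a small sketch. This is not Squeak code and not taken from SqueakSource or Seaside; it is a Python analogy under stated assumptions: `lock` stands in for the Semaphore used with critical:, `data_ready` stands in for the signal of a socket that became invalid after the image was restored, and all function names are hypothetical.

```python
import threading

# Stand-in for the Semaphore guarding the critical section (assumption).
lock = threading.Lock()
# Stand-in for "data arrived on the socket"; after an image restore the
# socket is invalid, so in the failure case this is never signalled.
data_ready = threading.Event()

def stuck_holder(started):
    """Enter the critical section, then wait for network data inside it."""
    with lock:
        started.set()          # tell the demo that the lock is now held
        data_ready.wait()      # blocks while still holding the lock

def try_enter(timeout):
    """Try to enter the critical section; report whether it succeeded."""
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()
    return acquired

def demo():
    started = threading.Event()
    holder = threading.Thread(target=stuck_holder, args=(started,), daemon=True)
    holder.start()
    started.wait()
    # Every other user of the critical section is blocked at this point.
    blocked = not try_enter(0.2)
    # Simulate the missing signal so the holder leaves the critical section.
    data_ready.set()
    holder.join()
    recovered = try_enter(0.2)
    return blocked, recovered
```

In this sketch `demo()` reports that the second caller is locked out for as long as the holder waits, and can enter again once the holder is released. Whether the hanging processes on squeaksource.com follow this pattern can only be confirmed from the stack dumps requested above.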