[squeak-dev] Re: Suspending process fix

Andreas Raab andreas.raab at gmx.de
Tue Apr 28 09:21:55 UTC 2009


Igor Stasenko wrote:
> 2009/4/28 Andreas Raab <andreas.raab at gmx.de>:
>> Igor Stasenko wrote:
>>> 2009/4/28 Andreas Raab <andreas.raab at gmx.de>:
>>>> One thing I'm curious about is what use case you are looking at? I
>>>> have never come across this particular problem in practice, since explicit
>>>> suspend and resume operations are exceptionally rare outside of the
>>>> debugger.
>>> Well, I discovered this issue when I wanted to make the current process
>>> the only process that can run during a certain period of time
>>> - without any chance of being interrupted by another, higher-priority
>>> process.
>> I'm missing some context here. How does this issue relate to sending a
>> process suspend; resume and expecting it to keep waiting on a semaphore? If I
>> had to solve this problem I would just bump the process's priority
>> temporarily.
>>
> 
> Process suspension is a STRONG guarantee that a given process will not
> perform any actions until it receives #resume.

This is not an answer to my (honest) question. I don't understand how 
suspending and resuming processes allows you to solve your 
originally stated problem, which is "running during a certain period of 
time without any chance of being interrupted by another, higher-priority 
process". I can do what you are stating as the problem by 
changing the process priority; I don't see how suspending a process 
would even begin to address it. In short, I don't see what solution 
you are proposing for your stated problem, or why the issue with 
suspend and resume arises from it.
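
For the record, the priority-based approach can be sketched in a few lines of workspace code (a hedged sketch, not from the original thread; it assumes the stock Squeak scheduler, and the comment stands in for the work to be protected):

```smalltalk
"Run a block without preemption by other user-level processes by
temporarily raising the active process's priority. The ensure: block
restores the original priority even if the work signals an error."
| oldPriority |
oldPriority := Processor activePriority.
[Processor activeProcess priority: Processor highestPriority.
 "... the work that must not be interrupted goes here ..."]
	ensure: [Processor activeProcess priority: oldPriority]
```

Of course this only prevents preemption under the standard scheduling policy, which is exactly the contract being debated below.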

> Priority is the wrong way to ensure this: different VMs could break this
> contract easily, while breaking a #suspend contract is something that
> doesn't fit in my mind :)

I'll fix that when I have a reason to switch to a VM that has a 
different scheduling policy. I'm not into discussing problems I don't have.

>>> IMO, suspend/resume should guarantee that a suspended process will
>>> not perform any actions under any circumstances and will be able to
>>> continue normally after the corresponding #resume.
>>> As my issue illustrates, this is not true for processes which are
>>> waiting on a semaphore.
>> Yes, but even with your proposal this wouldn't be true since a process
>> suspended from some position in the list wouldn't be put back on in the same
>> position. In practice, there are *severe* limits to that statement about how
>> the system "ought" to behave when you run hundreds of processes with some
>> 50,000 network interrupts per second behind a Tweak UI ;-) I think I can
>> prove that your implied definition of "continuing normally" is impossible to
>> achieve in any system that has to deal with asynchronous signals.
>>
> 
> Do not try to scare me with numbers: if things work correctly for
> 2-3 processes, why should they fail for 50,000? ;)

Because your assumption that the code is "working correctly" is just 
that - a wild guess about what might happen in what circumstances. Since 
there is no mathematically sound proof that the code is indeed "working 
correctly" (in fact you probably don't even have a definition of what it 
means to "work correctly"), running that code at 10,000 times the 
frequency allows you to find problems with a much higher probability 
than you would otherwise be able to. Seriously, we didn't find the 
problems we fixed over the last years by reasoning - we found them 
because the system came to a screeching halt often enough to allow us to 
find the issues.

> Certainly, the problem is to correctly identify the set of operations
> which require atomicity (at the language side and at the VM side, if it's
> using many native threads). But if it's done right, then who cares about
> numbers?

Numbers show that your code *actually* works, instead of you just thinking 
it works. Do you really believe that the people who wrote the code we 
fixed knew that their code was buggy and were just too lazy to write 
correct code? Come on, get serious.

>>> Ask yourself: why should a developer who wants to suspend any process
>>> (regardless of their intent) to resume it later have to make
>>> assertions like "what will be broken if I suspend it?".
>> Thus my question about use cases. I haven't seen many uses of suspend
>> outside of the debugger. And I don't think that's by accident - suspend is
>> very tricky in a realistic setting that has to deal with
>> asynchronous signals. Most of the time it is a last-resort solution (i.e.,
>> don't care too much about what happens afterwards), not something that you
>> would do casually and expect to be side-effect free.
> 
> Suspending a process is an explicit way to control what happens in your system.
> Many facilities can benefit from it, if we guarantee that certain
> contracts will be fulfilled.
> Actually, we are using suspend/resume every day, even without noticing
> it - consider an image snapshot/startup :)

But that a) suspends *all* processes (in effect, it stops time) and b) is 
not side-effect free, and nobody expects it to be. In fact, image snapshot 
is a great example of cooperating processes (including shutdown / 
startup processing) and of why one cannot assume that external suspend / 
resume can be completely side-effect free.

> Do processes which were waiting on a semaphore when the image was saved
> in such a state start working after startup as if the semaphore had been
> signalled? Do such processes lose their 'wait' state?

In the non-trivial cases, they do. That's because they are shut 
down during the system shutdown and startup sequence. Check out what 
(for example) web servers do - they stop the listener process and 
restart it after the system has been restarted. But even that is beside 
the point, because you are cherry-picking one particular aspect of image 
saving without looking at the parts that make it possible. Including, 
for example, the *great* care Delay takes when adjusting wakeup 
times, or the resource management necessary for sockets and files just 
so the rest of the system can *pretend* nothing of importance just 
happened. Or are you now suggesting that a process that is waiting on a 
delay must also adjust the wakeup time for that delay when it is 
suspended and resumed later? Close and reopen files? Shut down and 
reopen sockets? ;-)
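
The cooperation I'm describing can be sketched roughly as follows (all class, variable, and method names here are hypothetical, modeled on the registered startUp:/shutDown: pattern; this is not code from any actual server):

```smalltalk
"Sketch: instead of trusting a blindly resumed listener, the server
terminates its listener process before the snapshot and forks a fresh
one, on a fresh socket, when the image comes back up."
MyWebServer class >> shutDown: quitting
	quitting ifTrue:
		[listenerProcess ifNotNil: [listenerProcess terminate. listenerProcess := nil].
		 listenerSocket ifNotNil: [listenerSocket destroy. listenerSocket := nil]]

MyWebServer class >> startUp: resuming
	resuming ifTrue:
		[listenerSocket := Socket newTCP.
		 listenerSocket listenOn: 8080.
		 listenerProcess := [self acceptLoop] forkAt: Processor highIOPriority]
```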

>> The problem is that in a "real" environment signals are asynchronous. Unless
>> you have some way of stopping time and other external interrupts at the same
>> time you simply cannot guarantee that after the #suspend there isn't an
>> external signal which causes some other process waiting on the semaphore to
>> execute before the process that "ought" to be released.
>>
> If you speaking about Squeak VM, and its green threading model then
> this is certainly doable, because at primitive (VM) level there is no
> other activity at language side, other than VM does.

I don't understand that sentence.

> It should stall, of course, because the first process waiting on the mutex
> should obtain it first. The fact that it's suspended is not relevant.

Well, I guess that's the end of this discussion for me. I have little 
interest to discuss hypotheticals. All of my needs are very practical.

> This is what I'm trying to say: waiting semantics should be kept
> separate from scheduling.
> A proof case is:
> 
> mutex critical: [
>   proc := Processor activeProcess.
>   [ proc suspend.  "do something here"  proc resume ] fork.
>   Processor yield.
> ].
> 
> It shows that a process which has obtained a mutex can be suspended at any
> point in time. And it gives you the right answer: any other processes
> waiting on the same mutex will stall forever until, eventually, the
> suspended process is resumed and releases the mutex.

I don't get what you are trying to show with the above. Yes, a process 
which holds a mutex can be suspended while it holds the mutex. This has 
been true forever, and it has nothing to do with the case I was 
describing, which was about the order in which processes arrive at and 
leave a mutex.
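
The case I was describing is about arrival order, which can be sketched like this (hypothetical workspace code; it assumes the suspend/resume behavior reported at the start of this thread):

```smalltalk
"Two processes queue on the same semaphore in arrival order. #suspend
takes the first waiter off the semaphore's waiting list; as reported
in this thread, #resume then makes it runnable as if the semaphore had
been signalled, and a later #signal releases the second waiter instead."
| sem a b |
sem := Semaphore new.
a := [sem wait. Transcript showln: 'A released'] fork.
b := [sem wait. Transcript showln: 'B released'] fork.
Processor yield.   "let both processes block on the semaphore"
a suspend.         "removes A from the semaphore's waiting list"
a resume.          "A runs on, without the semaphore being signalled"
sem signal         "releases B, even though A arrived first"
```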

> I do not agree. As I said before, priorities are a fluid notion, which
> simply tells the VM which process has a better chance to take
> control of computing resources (other VMs can treat priority
> differently - like a percentage of computing resources which can be
> allocated to a given process, with a guarantee that all active processes
> will not starve during a certain period of time).
> I don't like implicit control; explicit is much better, because
> it guarantees that under any circumstances your code will work the same as
> before.
> 
> I will try to implement VM-side primitives which will guarantee
> atomicity for Semaphore wait/signal operations. Then we can continue
> our discussion using more grounded arguments. :)

Good luck with that. I'll settle for a definition of "correct" behavior 
in the presence of asynchronous interrupts since it is really not clear 
to me what you mean when you say "correct".

My problem here is that I've been knee-deep for years now in all of 
the things that *actually* happen, so I have no illusions left about 
hypotheticals. I'll settle for a least-surprise approach that gives me 
"a" result consistently. I'd rather take a wrong result 100% of the 
time than a right result 95% of the time, because the former you can 
learn to work around quickly, whereas the latter may work for weeks and 
then fall over three times in a day.

But you know all that already - just compare our discussion about the 
expected behavior of forking processes at the same priority. Obviously 
we have different opinions about these issues but in my experience with 
these (and related) issues whenever we removed uncertainty we improved 
robustness because it's the rare and unexpected things that really get 
you, not the obvious ones.

Cheers,
   - Andreas


