[squeak-dev] Re: [Pharo-project] SUnit Time out

Chris Muller asqueaker at gmail.com
Thu Jun 3 00:09:42 UTC 2010


Thanks for clarifying your goals w.r.t. introducing the timeout.  I
think that's important because, as I've said, legacy tests that live
in external packages are affected.

I read your whole note a few times, and one part in particular stood
out to me as a potentially useful use case for a test-case timeout:

> These changes are largely intended for automated integration testing. I am
> hoping to automate the tests for community supported packages to a point
> where there will be no user in front of the system.

If, by this, you mean you simply want to have a headless Squeak image
running something like:

  [ true ] whileTrue:
    [ self loadLatestPackageCombinations.
      self runTestSuite.
      self mailResultsToSqueakDev ]

THEN, that brings us down to only haggling over the default timeout,
although I would still prefer to handle the timeout at a higher level.
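
To be concrete about "a higher level": I mean one budget around the
whole run rather than one per test.  A minimal sketch, assuming
BlockClosure>>valueWithin:onTimeout: is available (as in recent Squeak
images) and reusing the made-up method names from the loop above:

  | result |
  [ result := self runTestSuite ]
      valueWithin: 30 minutes    "one budget for the entire suite"
      onTimeout:
        [ Transcript show: 'Suite exceeded its budget'; cr.
          result := nil ]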

If, however, this isn't the goal, then I still haven't grasped what I
sense is some key point, or else my own concerns weren't properly
understood.  If so, let me try one more time.  :)

> done" but the reality is that regardless of what the operation is we never
> actually wait forever. At some point we *will* give up no matter what you
> may think. This is THE fundamental point here. Everything else is basically
> haggling about what the right timeout is.

Of course we would "give up" after an unreasonable amount of time.  In
either case, there is something to interrogate: either a live, looping
test-runner machine, or a static report of test results with one or
more entries that say "timed out".

In the former case, we have a bevy of useful information (e.g., which
test is it trying to run?  How much memory is the test image using
right now?  Can I interrupt it with Alt+. and get even more information?)

In the latter case, there is no choice but to start at square one: try
to recreate the problem.  (And what if it works?)

Personally, I would always prefer to deal with the former case rather than the latter.

> For the right timeout the second fundamental thing to understand is that if
> there's a question of whether the operation "maybe" completed, then your
> timeout is too short. Period. The timeout's value is not to indicate that
> "maybe" the operation completed, it is there to say unequivocally that
> something caused it to not complete and that it DID fail.

I didn't understand this.  There is no question about "maybe
completed"; we know that if a test times out, it _didn't_ complete.
The "maybe" I referred to was about the core question: whether the
underlying software being tested can be used or not.  "Maybe" it
could; then again, maybe it shouldn't be.  It sounds like we agree: a
timeout would *have* to be regarded as a failure.

> Obviously, introducing timeouts will create some initial false positives.

You mean false negatives?  If we are saying that we must treat a
timeout as a failure, and failure is "negative", then a timeout would
be either a false negative or a true negative...?

> But it may be interesting to be a bit more precise about what we're talking
> about. To do this I instrumented TestRunner to measure the time it takes to
> run each test and then ran all the tests in 4.2 to see where that leads us.
> As you might expect, the distribution is extremely uneven. Out of 2681 tests
> run, 2588 execute in < 500 msecs (approx. 1800 execute with no measurable
> time); 2630 execute in less than one second, leaving a total of 51 that
> take more than a second, and only three tests actually take longer than 5
> seconds, and they are all tagged as such.

That's fine for the 4.2 tests, but there are hundreds of tests in
external packages.  With a mere 5-second default, many will need to be
updated with a pragma.  But then we're talking about a branch in the
package because that won't be backward compatible with 3.9, will it?
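
For reference, here is my understanding of what such an update would
look like; a hypothetical Magma-style test method (the name and
operation are made up), assuming the new SUnit reads a <timeout:>
pragma giving the budget in seconds:

  testCommitLargeObjectGraph
      "Needs far more than the 5-second default."
      <timeout: 300>
      self commitGraphOfAMillionObjects

Whether 3.9's compiler will even accept that pragma is exactly my
question.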

> As you can see the vast majority of tests have a "safety margin" of 10x or
> more between the time the test usually takes and its timeout value.
> Generally speaking, this margin is sufficient to compensate for "other"
> effects that might rightfully delay the completion of the test in time.

I can see that jacking up the timeout may tend to reduce the number of
false negatives (at the expense of potentially longer wait times!),
but when they do occur, we have no useful information whatsoever; not
even certainty about whether the underlying software is usable or not,
because the result could be a false negative.

> If
> you have tests that commonly vary by 10x I'd be interested in finding out
> more about what makes them so unpredictable.

Well, again, it's not just about randomness in the tests but also
about external factors: CPU speed, current system load, etc.

> So if your question is "are my timeouts too tight" one thing we could do is
> to introduce the 10x as a more or less general guideline for executing
> tests,

Ok, with that kind of margin, the message I'm getting from you is that
this isn't really about how long a human has to wait; we just want to
make sure we "get some kind of report"?

>> But, the reason given for the change was not for running tests
>> interactively (the 99% case); rather, all tests from the beginning of
>> time are now saddled with a timeout for the 1% case:
>
> As the data shows, this is already the case. It may be interesting to note
> that so far there were a total of 5 (five) places that had to be adjusted in
> Squeak.

I'm not worried about the built-in tests; recall I acknowledged that I
can "almost understand" a forced timeout in the context of an
open-source project where people are all contributing their portions
and no one else wants to be "held up" because of one person's tests
looping.

My concern is more about the impact on legacy external packages.

>  One was a general place (the default timeout for the decompiler
> tests) and four were individual methods. Considering that computers usually
> don't become slower over time, it seems unlikely that further adjustments
> will be necessary here.

Well, they do.  It's not just a function of time, but of who's running
it and on which machine.  We all have different machines.  Maybe
someone wants to test on an iPhone that might be considerably slower
than the original desktop on which the timeout was specified...

> So the bottom line is that the changes required
> aren't exactly excessive.

That depends on how many test methods I have, whether I also want the
package to run in 3.9, and whether, to have a Community Supported
Package included, I have to put a pragma in every one of them (unless
I'm mistaken about pragmas working in 3.9).

Bottom line:  Today Magma runs on 3.9 - 4.2 + Pharo.  Some of Magma's
tests necessarily take several minutes.

Question:  Can Magma be a CSP and still retain this wide compatibility?

> These changes are largely intended for automated integration testing. I am
> hoping to automate the tests for community supported packages to a point
> where there will be no user in front of the system.
>
> Even if there were, it's
> not clear whether that person can fix the issue immediately, or whether the
> entire process is stuck because someone momentarily cannot fix the problem
> at hand, and the tests will never run to completion and produce any useful
> result.

Who is "that person" and what is their role?

> begin with. The whole idea of running the tests is to catch *unexpected*
> situations, and as a consequence there is value in capturing these situations
> instead of hanging and producing no useful result.

To me, "timed out" is what is not useful.  To find a hanging machine
that can be interrogated is much more useful.

>> In that case, the high-level test-controller which spits out the
>> results could and should be responsible for handling "unexpected user
>> input" and/or putting in a timeout, not each and every last test
>> method..
>
> Do you have such a "high-level test-controller"? Or do you mean a human
> being spending their time watching the tests run to completion? If the
> former, I'm curious as to how it would differ from what I did. If the
> latter, are you volunteering? ;-)

I meant the former.  It differs from what you did in that it preserves
legacy compatibility, and the legacy deterministic property of testing.
To handle an automated test server, I would handle the on-timeout: from
a much higher place, and therefore it would apply not to individual
tests but to the whole suite.  Information about the last running test
would be sufficient for me, especially considering all of the other
disadvantages I've mentioned for fine-grained timeouts.
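
Concretely, something along these lines; only a sketch, with
illustrative names and budgets, again assuming valueWithin:onTimeout:
is available:

  | suite lastCase |
  suite := MagmaTestCase buildSuite.    "illustrative; any TestSuite"
  lastCase := nil.
  [ suite tests do:
      [ :each |
        lastCase := each.               "remember what is running"
        each run ] ]
    valueWithin: 2 hours
    onTimeout:
      [ Transcript
          show: 'Suite timed out while running ' , lastCase printString;
          cr ]

That single record of the last running test is enough to know where to
start digging, without a pragma on every test method.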

 - Chris


