[squeak-dev] Re: [Pharo-project] SUnit Time out

Casey Ransberger casey.obrien.r at gmail.com
Thu Jun 3 03:35:58 UTC 2010


Usually in a test, a "false positive" is when the test thinks it found a bug, but there's actually something wrong with the test. A "false negative" usually means that a test erroneously passed when it shouldn't have. Of course, I am probably speaking a regional dialect which may be somewhat rooted in Seattle, WA test culture. :)



On Jun 2, 2010, at 5:09 PM, Chris Muller <asqueaker at gmail.com> wrote:

> Thanks for clarifying your goals w.r.t. introducing the timeout.  I
> think that's important because, as I've said, legacy tests that live
> in external packages are affected.
> 
> I read your whole note a few times, and one part in particular stuck
> out to me as a potentially useful use-case for test-case timeout:
> 
>> These changes are largely intended for automated integration testing. I am
>> hoping to automate the tests for community supported packages to a point
>> where there will be no user in front of the system.
> 
> If, by this, you mean you want to simply have a headless Squeak image
> running something like:
> 
>  [ true ] whileTrue:
>    [ self loadLatestPackageCombinations.
>    self runTestSuite.
>    self mailResultsToSqueakDev ]
> 
> THEN, that brings us down to only haggling about the default timeout,
> although I still would prefer to handle the timeout at a higher level..
> 
> If, however, this isn't the goal, then I still don't seem to have
> grasped what I sense is some key point.. or my own concerns weren't
> properly understood.  If so, let me try one more time.  :)
> 
>> done" but the reality is that regardless of what the operation is we never
>> actually wait forever. At some point we *will* give up no matter what you
>> may think. This is THE fundamental point here. Everything else is basically
>> haggling about what the right timeout is.
> 
> Of course we would "give up" after an unreasonable amount of time.  In
> either case, there is something to interrogate, either a live looping
> test-runner machine, or a static report of test results with one or
> more that say, "timed out".
> 
> In the former case, we have a bevy of useful information (e.g., which
> test is it trying to run?  How much memory is the test image using
> right now?  Can I Alt+. interrupt it and get even more information?)
> 
> In the latter case, there is no choice but to start at square 1:  Try
> to recreate the problem.  (What if it works?)
> 
> Personally, I would always prefer to deal with the former case than the latter..
> 
>> For the right timeout the second fundamental thing to understand is that if
>> there's a question of whether the operation "maybe" completed, then your
>> timeout is too short. Period. The timeout's value is not to indicate that
>> "maybe" the operation completed, it is there to say unequivocally that
>> something caused it to not complete and that it DID fail.
> 
> I didn't understand this.  There is no question about "maybe
> completed".  We know if a test times out then it _didn't_ complete.
> The "maybe" I referred to was about the core question:  whether the
> underlying software being tested can be used or not.  "Maybe" it
> could, then again, maybe it shouldn't.  It sounds like we agree: a
> timeout would *have* to be regarded as a failure.
> 
>> Obviously, introducing timeouts will create some initial false positives.
> 
> You mean false negatives?  If we are saying that we must treat a
> timeout as failure, and failure is "negative", then a timeout would be
> a false negative or a true negative...?
> 
>> But it may be interesting to be a bit more precise about what we're talking
>> about. To do this I instrumented TestRunner to measure the time it takes to
>> run each test and then ran all the tests in 4.2 to see where that leads us.
>> As you might expect, the distribution is extremely uneven. Out of 2681 tests
>> run, 2588 execute in < 500 msecs (approx. 1800 execute with no measurable
>> time); 2630 execute in less than one second, leaving a total of 51 that
>> take more than a second, and only three tests actually take longer than 5
>> seconds, and they are all tagged as such.
> 
> That's fine for the 4.2 tests, but there are hundreds of tests in
> external packages.  With a mere 5-second default, many will need to be
> updated with a pragma.  But then we're talking about a branch in the
> package because that won't be backward compatible with 3.9, will it?
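> 
> For reference, the per-method adjustment would presumably look something
> like the sketch below (the method name and body are made up, and I'm
> assuming the <timeout:> pragma takes seconds):
> 
>   testLargeCommit
>     "Hypothetical long-running test; the pragma raises its limit
>     to 300 seconds.  commitLargeGraph is an invented helper."
>     <timeout: 300>
>     self assert: self commitLargeGraph notNil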
> 
>> As you can see the vast majority of tests have a "safety margin" of 10x or
>> more between the time the test usually takes and its timeout value.
>> Generally speaking, this margin is sufficient to compensate for "other"
>> effects that might rightfully delay the completion of the test in time.
> 
> I can see that jacking up the timeout may tend to reduce the number of
> false negatives (at the expense of potentially longer wait times!),
> but when one does occur, we have no useful information whatsoever.  Not
> even certainty about whether the underlying software is usable, because
> it could be a false negative.
> 
>> If
>> you have tests that commonly vary by 10x I'd be interested in finding out
>> more about what makes them so unpredictable.
> 
> Well, again, it's not just about randomness in the tests but also
> about external factors: CPU speed, current system load, etc.
> 
>> So if your question is "are my timeouts too tight", one thing we could do is
>> to introduce the 10x as a more or less general guideline for executing
>> tests,
> 
> Ok, with that kind of margin, the message I'm getting from you is that
> this isn't about making a human have to wait.  We just want to make
> sure we "get some kind of report"?
> 
>>> But, the reason given for the change was not for running tests
>>> interactively (the 99% case), rather, all tests from the beginning of
>>> time are now saddled with a timeout for the 1% case:
>> 
>> As the data shows, this is already the case. It may be interesting to note
>> that so far there were a total of 5 (five) places that had to be adjusted in
>> Squeak.
> 
> I'm not worried about the built-in tests; recall I acknowledged that I
> can "almost understand" a forced timeout in the context of an
> open-source project where people are all contributing their portions
> and no one else wants to be "held up" because of one person's tests
> looping.
> 
> My concern is more about the impact to legacy external packages..
> 
>> One was a general place (the default timeout for the decompiler
>> tests) and four were individual methods. Considering that computers usually
>> don't become slower over time, it seems unlikely that further adjustments
>> will be necessary here.
> 
> Well, they do..  It's not just a function of time, but who's running
> it, and on which machine.  We all have different machines.  Maybe
> someone wants to test on an iPhone that might be considerably slower
> than the original desktop on which the timeout was specified...
> 
>> So the bottom line is that the changes required
>> aren't exactly excessive.
> 
> That depends: to have a Community Supported Package included, how
> many test methods do I have, do I also want them to run in 3.9, and,
> to do that, do I have to put in a pragma..  (unless I'm mistaken
> about pragmas working in 3.9).
> 
> Bottom line:  Today Magma runs on 3.9 - 4.2 + Pharo.  Some of Magma's
> tests necessarily take several minutes.
> 
> Question:  Can Magma be a CSP and still retain this wide compatibility?
> 
>> These changes are largely intended for automated integration testing. I am
>> hoping to automate the tests for community supported packages to a point
>> where there will be no user in front of the system.
>> 
>> Even if there were, it's
>> not clear whether that person can fix the issue immediately or whether the
>> entire process is stuck because someone cannot fix the problem at hand
>> right away and the tests will never run to completion and produce any useful
>> result.
> 
> Who is "that person" and what is their role?
> 
>> begin with. The whole idea of running the tests is to catch *unexpected*
>> situations, and as a consequence there is value in capturing these situations
>> instead of hanging and producing no useful result.
> 
> To me, "timed out" is what is not useful.  To find a hanging machine
> that can be interrogated is much more useful.
> 
>>> In that case, the high-level test-controller which spits out the
>>> results could and should be responsible for handling "unexpected user
>>> input" and/or putting in a timeout, not each and every last test
>>> method..
>> 
>> Do you have such a "high-level test-controller"? Or do you mean a human
>> being spending their time watching the tests run to completion? If the
>> former, I'm curious as to how it would differ from what I did. If the
>> latter, are you volunteering? ;-)
> 
> I meant the former.  It differs from what you did in that it preserves
> legacy compatibility, and the legacy deterministic property of testing.
> To handle an automated test server, I would handle the on-timeout: from
> a much higher place, and therefore it would apply not to individual
> tests but to the whole suite.  Information about the last running
> test would be sufficient for me, especially given all of the other
> disadvantages I've mentioned for fine-grained timeouts..
> 
> - Chris
> 


