Giving tests a second chance

If you’ve ever maintained a large suite of automated GUI tests, you’re probably familiar with this scenario: you run your suite overnight and everything passes except for a couple of tests that fail unexpectedly. When you rerun them, they pass. These “random failures” make your suite look less than reliable, and you find yourself rerunning the whole suite every morning, or at least the failed tests. Your team may start to lose confidence in the quality of the tests. After all, they seem to fail randomly for no good reason.

Well, there’s always a reason. In our case, the tests were hitting occasional problems such as server timeouts, network glitches, browser flakiness and a myriad of other environmental issues that, quite frankly, weren’t worth fixing at the time. I don’t like pandering to the diva that is GUI automation: every time I have to fix something just to make the automated tests work, the maintenance cost of automated testing goes up and its value goes down. Automated testing is supposed to save us time, not create more work for us.

So here’s a cheap solution: just run the tests again. Each test gets one more chance to pass, and if it fails twice, it fails for good. Since GUI automation generally produces more false failures than false passes, I’m not too concerned about the risk of a test passing incorrectly. Even so, it’s good to keep a record of the tests that failed the first time, just to glance over. Just in case.
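The retry idea itself doesn’t depend on any particular test framework. Here is a minimal sketch in C# (assuming C# 9 or later, since it uses top-level statements); `RunWithSecondChance` is a hypothetical helper, not part of Gallio, MbUnit or any other library. It gives a test body a fixed number of attempts, stops at the first pass, and keeps every failure message so that first-time failures can still be reviewed:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Gallio/MbUnit: runs `test` up to
// `maxAttempts` times and stops at the first pass. Every failure is logged,
// so tests that only passed on a retry can still be reviewed later.
static bool RunWithSecondChance(Action test, int maxAttempts, List<string> failureLog)
{
    if (maxAttempts < 1)
        throw new ArgumentOutOfRangeException(nameof(maxAttempts),
            "The maximum number of attempts must be at least 1.");

    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            test();
            return true; // Passed: don't run the test again.
        }
        catch (Exception ex)
        {
            // Keep a record of this failure even if a later attempt passes.
            failureLog.Add($"Attempt {attempt} of {maxAttempts} failed: {ex.Message}");
        }
    }

    return false; // Failed on every attempt, so it fails for good.
}

// Usage: a simulated flaky step that times out once and then succeeds.
int calls = 0;
var failures = new List<string>();
bool passed = RunWithSecondChance(() =>
{
    calls++;
    if (calls == 1)
        throw new TimeoutException("Server did not respond");
}, maxAttempts: 2, failureLog: failures);

Console.WriteLine($"Passed: {passed}, attempts: {calls}");
foreach (string failure in failures)
    Console.WriteLine(failure);
```

In this usage, the simulated timeout on the first attempt is logged and the second attempt passes, so the helper returns true. Returning as soon as a test passes mirrors the “second chance” rule; a real suite would probably also need to reset shared state (such as a browser session) between attempts, which this sketch leaves out.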

My colleague Pete came up with this solution for our test suite; it works for tests run with Gallio (MbUnit). It’s a modification of an attribute that runs a test multiple times: he changed it so that a test is only run again if it fails, up to a maximum number of attempts. Unfortunately, the standard test output hides the failure message, but you can still see it in the Gallio test report, which we use to view screenshots of test failures anyway.

It’s certainly helped the reliability of our test suites, and increased our team’s confidence in the value of the automated tests.

```csharp
using System;
using Gallio.Framework;
using Gallio.Framework.Pattern;
using Gallio.Model;
using Gallio.Common.Reflection;
using MbUnit.Framework;

namespace SecondChanceExample
{
    /// <summary>
    /// Runs a test up to a given number of times, stopping as soon as it passes.
    /// </summary>
    public class RepeatOnFailureAttribute : TestDecoratorPatternAttribute
    {
        private readonly int _maxNumberOfAttempts;

        public RepeatOnFailureAttribute(int maxNumberOfAttempts)
        {
            if (maxNumberOfAttempts < 1)
                throw new ArgumentOutOfRangeException("maxNumberOfAttempts",
                    @"The maximum number of attempts must be at least 1.");

            _maxNumberOfAttempts = maxNumberOfAttempts;
        }

        protected override void DecorateTest(IPatternScope scope, ICodeElementInfo codeElement)
        {
            scope.TestBuilder.TestInstanceActions.RunTestInstanceBodyChain.Around(
                delegate(PatternTestInstanceState state,
                         Gallio.Common.Func<PatternTestInstanceState, TestOutcome> inner)
                {
                    TestOutcome outcome = TestOutcome.Passed;
                    int failureCount = 0;

                    // We will try up to 'max' times to get a pass;
                    // if we do, break out and don't run the test any more.
                    for (int i = 0; i < _maxNumberOfAttempts; i++)
                    {
                        string name = String.Format("Repetition #{0}", i + 1);

                        // Each attempt runs as a separate child test step.
                        TestContext context = TestStep.RunStep(name, delegate
                        {
                            TestOutcome innerOutcome = inner(state);

                            // If we get a fail, throw an error
                            if (innerOutcome.Status != TestStatus.Passed)
                            {
                                throw new SilentTestException(innerOutcome);
                            }
                        }, null, false, codeElement);

                        outcome = context.Outcome;

                        // Escape the loop if the test has passed,
                        // otherwise increment the failure count
                        if (context.Outcome.Status == TestStatus.Passed)
                            break;

                        failureCount++;
                    }

                    TestLog.WriteLine(String.Format(
                        failureCount == _maxNumberOfAttempts
                            ? "Tried {0} times to get a passing test result but didn't get one"
                            : "The test passed on attempt {1} out of {0}",
                        _maxNumberOfAttempts, failureCount + 1));

                    return outcome;
                });
        }
    }
}
```

4 thoughts on “Giving tests a second chance”

David Allen says:

July 20, 2011 at 11:17 am

We’ve had that same problem. We use Microsoft Visual Studio, TFS and the Microsoft Test engine. I don’t know whether, or how, those could be extended to do something similar. But I totally get your point about the ROI of fixing it. Everything is a judgment call. If I can make the test more robust at a reasonable cost, I will. But I am thinking: “How much value does THIS test add, and how much is it worth investing to make it ‘better’?” If it’s REALLY bad and REALLY hard to fix, I may even disable it entirely.
Jeff Lucas says:

August 28, 2011 at 4:08 am

Hi Trish, I am currently using TestComplete to smoke test a Java thick-client application. A more difficult problem that crops up is what I call the “bait and switch”: a GUI object appears but is not ready to respond to the test tool (i.e., it is disabled). As a result, the “click” method (for example) doesn’t exist, which causes a fatal exception in the tool and shuts the test run down completely. That makes overnight automated runs nearly impossible.

Like you, I have some tricks for overcoming those issues, but it takes a lot of work. It makes me pine for the days when I was web testing!
Trish Khoo says:

August 28, 2011 at 1:08 pm

Wow, Jeff, that’s a pretty awful way for that test framework to react to not being able to click on something! Not being able to find a control is a pretty common scenario in most tests, so you’d think it would handle that a little more gracefully. I’ve only used TestPartner for testing desktop applications, and it had its own peculiarities too. At least driving the DOM in web testing is a little more predictable, but with more complex web technologies emerging, it’s only going to get harder.
Ivo Grootjes says:

September 30, 2011 at 6:24 pm

Hi, this is a really helpful attribute for UI tests. Did you consider submitting a patch to Gallio and/or mentioning this attribute on the Gallio wiki?

Great stuff!
