Failing tests: When are they okay?

As a developer on a team, when (if ever) is it reasonable to check in something which breaks one or more tests and causes the build to fail?

The most important aspect of a build seems to be that it accurately represents the health of your product. In an ideal world, if all the tests pass, I should be comfortable deploying that code. In a perhaps more realistic world, I should either be comfortable deploying to a part of the system (small random %, special testers group, etc), or at least be comfortable sending it off to QA with the expectation that it will pass (and be surprised if it doesn’t). Conversely if the build fails I shouldn’t be suspecting that one of the tests must have randomly failed or that someone forget to update a test to reflect application changes; I should be wondering what is wrong with the product.

But if you have set release dates, say every two weeks, is a broken build in week 1 a problem? Is it okay for developers to be using TDD and checking in the tests first, or checking in incomplete or broken code so that others can collaborate on it? In some ways it definitely seems reasonable to allow for this. After all, the release date is in the future and it is quite expected that features aren’t completed yet or that bugs aren’t fixed yet. You should be encouraged to commit early and often and you don’t want to have to jump through hoops to collaborate.

However there seem to be disadvantages to this type of workflow. First of all, if a build is broken, it can’t break. What am I talking about? If I check in my expected-to-fail test or my half-finished code and the build is failing, the next time someone unexpectedly breaks something, it isn’t nearly as obvious. A passing test suite changing to a failing suite, in a good environment, should be blindingly visible. But what about a failing test suite continuing to fail, albeit in more ways? That’s more subtle. From the second that happens you’re accumulating debt and the faster you find it, the easier it will be to fix. But if you can’t check in broken code, how do you easily collaborate with someone else on it over time?

Another problem that can arise is the accumulation of potential debt by replacing functional code with incomplete code. If you aren’t able to make the timeline or the feature gets dropped for release, you now have a reverse merge on your hands. This could be particularly time consuming if others have been working in the same files that your work touches. On the other hand if you had been implementing it side-by-side and waiting to actually replace it until it worked, it would be no problem to ship the code in that state at any point if need be. Your deployment is more flexible and less error prone.

So how can you balance these tensions in an agile environment? I’d love any feedback that anyone has to provide. What are the reasons for prioritizing a passing build that I missed, and what other drawbacks exist?

Comments

bobthevirus

I’d suggest that this is one of the great advantages of dvcs. The cheap branching means that you can have your non-passing tests for undeveloped features in separate branches, merging each branch when, and only when, it goes green.

Marius Gedminas

I started writing an answer, and then it became long and transformed into a blog post: keep the buildbot green!.

Also, Blogger ate my first attempt to post a comment here, and I felt somewhat better about deciding to do a post rather than answer here.

Andy Friesen

I agree with Hitchens.

If your build is red, your first priority should be doing whatever it takes to make the build green.

If the bug you have found is serious, then that probably means that your team’s highest priority should be fixing the bug.

If it is not so serious, then you should back out the test until you have reason to believe it will pass.

Either way, you do not want to allow unrelated commits while the build is red. Red means that your application has burst beyond your control, and building further functionality atop a system that is already out of control is simply not a sustainable way to develop.

oliver

I think it’s important to always have a “stable” development version in VCS which is known to build and pass all tests and which most likely works correctly. This makes it easier to quickly make a release, or to give somebody a “preview release” which contains the latest features.

For this, the possibly-broken code must be separated from the stable version. Possible ways:
- develop on a branch until all tests pass, then merge it into HEAD
- set a “stable” tag in VCS which is moved up whenever a new feature is deemed stable
- add Make system switches to disable the possibly-buggy features; that way, you can have a “stable” build config which always works (also useful as autobuild/autotest target), and a “bleeding-edge” build config which is probably broken for quite a while.

IMO the latter two ways have the big advantage of making it easy for all developers to test some new unstable feature without having to manually merge its branch first.

Hitchens

Hey Rooney! Since you asked for feedback, here’s my take: it should never be ok for a build to be red. If you check in code and it goes red, you should back it out immediately. If people are willing to collaborate on finding out why it was red, well, the code’s still in source control, right? They can get it by reverting your revert.

If a test goes red but there’s nothing wrong with your production, then that’s an opportunity for you to learn how to write tests better! It’s obviously testing something you don’t care about, so you go in and rewrite it so that it only asserts on the things that matter.

Mike Rooney

programming and philosophy

Failing Tests: When Are They Okay?

Comments