02 September 2015

How to Know if TDD is Working

How will you know if TDD is working for your teams, program, or organization?

I've noticed that small, independent teams typically don't ask this.  They are so close to the end-points of their value-stream that they can sense whether a new discipline is helping or hindering.

But on larger programs with multiple teams, or a big "roll-out" or "push" for quality practices, leaders want to know whether they're getting a return on investment. Sometimes they ask me, point-blank: "How long before I recoup the cost of your TDD training and coaching?" There are a lot of variables, of course, and knowing when you've reached break-even depends on what you've already been measuring. Frankly, you're not going to be able to measure the change in a metric you're not already measuring. Nevertheless, you may be able to tell simply by the morale on the teams. In my experience, there's a direct correlation between happy employees and happy customers, and another between happy customers and happy stakeholders. That's the triple-win: what's truly good for customers and employees is good for stakeholders.

So I've assembled a few notes about quality metrics.

Metrics I like


(Disclaimer: I may have my "lead" and "cycle" terminology muddled a little.  If so I apologize. Please focus on the simplicity of these metrics.  I'll fix this post as time allows.)

Here are some metrics I've recommended in the past.  I'm not suggesting you must track all of these.
  • Average lead time for defect repair: Measure the time between defect-found and defect-fixed, by collecting the dates of those two events. Graph the average over time.
  • Average cycle time for defect repair: Measure the time between decide-to-fix-defect and defect-fixed, again by collecting the dates of those events. Graph the average over time.
  • A simple count of unfixed, truly high-priority defects: Show-stoppers and criticals, that sort of thing. Graph the count over time. (There's a small sketch of all three just after this list.)
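Here's a minimal sketch (mine, not from any client's tooling) of how those three could be computed, assuming you record the relevant event dates on each defect. The record layout is entirely hypothetical:

```python
# Hedged sketch: the three defect metrics above, from hypothetical records.
from datetime import date
from statistics import mean

defects = [
    # found, decided-to-fix, fixed (None = still open), priority
    {"found": date(2015, 6, 1), "decided": date(2015, 6, 8),
     "fixed": date(2015, 6, 10), "priority": "critical"},
    {"found": date(2015, 6, 3), "decided": date(2015, 7, 1),
     "fixed": date(2015, 7, 2), "priority": "minor"},
    {"found": date(2015, 7, 15), "decided": None,
     "fixed": None, "priority": "critical"},
]

fixed = [d for d in defects if d["fixed"]]

# Lead time: defect-found to defect-fixed.
avg_lead = mean((d["fixed"] - d["found"]).days for d in fixed)

# Cycle time: decide-to-fix to defect-fixed.
avg_cycle = mean((d["fixed"] - d["decided"]).days for d in fixed)

# Count of unfixed, truly high-priority defects.
open_criticals = sum(1 for d in defects
                     if d["fixed"] is None and d["priority"] == "critical")

print(avg_lead, avg_cycle, open_criticals)  # graph these over time
```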
Eventually, other quality metrics could be introduced. Once a team is doing well, Mean Time Between Failures (MTBF), which assumes a very short (near-zero) defect lead time, becomes a reasonable choice.

On one high-performing team I worked on way back in 2001, we eventually focused on one metric:  "Age of Oldest Defect."  It really got us to dig into one old, ornery, hard-to-reproduce defect with a ridiculously simple work-around (i.e., "Please take a deep breath and resubmit your request" usually did the trick, which explains why we weren't compelled to fix it for quite some time).  This bug was a great representation of the general rule of bug-fixing:  Most bugs are easy to fix once found, but very difficult to locate!  (Shout out to Al Shalloway of Net Objectives for teaching me that one.)
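Incidentally, "Age of Oldest Defect" is about as simple as a metric gets to compute. A quick sketch, with the same hypothetical record layout as before:

```python
# "Age of Oldest Defect": the age, as of today, of the oldest still-open
# defect. The record layout is hypothetical, matching the earlier sketch.
from datetime import date

def age_of_oldest_defect(defects, today):
    open_found = [d["found"] for d in defects if d["fixed"] is None]
    return (today - min(open_found)).days if open_found else 0

defects = [
    {"found": date(2015, 5, 20), "fixed": None},              # still open
    {"found": date(2015, 6, 1), "fixed": date(2015, 6, 3)},   # fixed
]
print(age_of_oldest_defect(defects, today=date(2015, 9, 2)))  # 105 days
```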

I also suggest that all teams keep an eye on this one: average cycle and/or lead times for User Stories, or Minimal Marketable Features. On the surface, this sounds like a performance metric. I suppose if the work-items are reliably arriving in a most-important-thing-first order, then it's a reasonable proxy for "performance." But its real purpose is to help diagnose and resolve systemic (i.e., "process") issues.

What’s truly important about measuring these:
  1. Start measuring as soon as possible, preferably gaining some idea of what things look like before making broad changes, e.g., before I deliver my Essential Test-Driven Development course, and follow-on TDD coaching, to your teams.
  2. The data should be collected as easily as possible: automatically, or by an unobtrusive, non-managerial third party. Burdening the team with a lot of measurement overhead is often counterproductive: the measurement data suffers, productivity suffers, morale suffers.
  3. The metrics must be used as "informational" and not "motivational": They should be available to the team, first and foremost, so that the team can watch for trends. Metrics must never be used to reward or punish the team, or to pit teams within the same program or organization against each other.
If you want (or already have) highly competitive teams, then consider estimating Cost of Delay and CoD/Duration (aka CD3, estimated by all involved "levels" and "functions"), along with customer conversions, customer satisfaction, and other Lean Startup metrics. Then have your whole organization compete against itself to improve the throughput of real value, and compete against your actual competitors.
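CD3 is simply Cost of Delay divided by Duration. A minimal sketch, with entirely made-up numbers, just to show the arithmetic and how it orders the work:

```python
# Hedged sketch of CD3 (Cost of Delay / Duration) prioritization.
# Features, dollar figures, and durations are invented for illustration.
features = [
    {"name": "Feature A", "cost_of_delay": 50_000, "duration_weeks": 10},
    {"name": "Feature B", "cost_of_delay": 20_000, "duration_weeks": 2},
    {"name": "Feature C", "cost_of_delay": 40_000, "duration_weeks": 8},
]

for f in features:
    f["cd3"] = f["cost_of_delay"] / f["duration_weeks"]

# Highest CD3 first: the shorter, costlier-to-delay work wins.
for f in sorted(features, key=lambda f: f["cd3"], reverse=True):
    print(f"{f['name']}: CD3 = {f['cd3']:,.0f} per week of delay")
```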



A graph sent (unsolicited) to me by one client. Yeah, it'd be great if they had added a "value" line. Did I mention it was unsolicited? Anyway, there's the obvious benefit of fewer defects. Also note that bugs-found is no longer oscillating at release boundaries. Oscillation is what a system does before tearing itself apart.

Metrics I didn't mention

Velocity:

Estimation of story points and the use of velocity may be necessary on a team where the User Stories vary considerably in size.  Velocity is an important planning tool that gives the team an idea of whether the scope they have outlined in the release plan will be completed by the release date.

Story points and velocity (SPs/sprint) give information similar to cycle time, just inverted.

To illustrate this:  Often Scrum teams who stop using sprints and release plans in favor of continuous flow will switch from story points per sprint to average cycle time per story point. Then, if the variation in User Story effort diminishes, they can drop points entirely and measure average cycle time per story.
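Back-of-the-envelope, and with numbers I'm inventing purely for illustration, the inversion looks like this (ignoring work-in-progress effects, which make real cycle times messier):

```python
# Rough sketch of the inversion: velocity is throughput (points completed
# per sprint), and its reciprocal is average elapsed working time per point.
sprint_working_days = 10   # a two-week sprint
velocity = 30              # hypothetical: story points completed per sprint

days_per_point = sprint_working_days / velocity
print(days_per_point)      # ~0.33 working days per point, on average
```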

The problem with using velocity as a metric to track improvements (e.g., the use of TDD) is this:  As things improve, story-point estimates (an estimate of effort, not time) may actually drop for similar stories.  We expect velocity to stabilize, not increase, over time.  Velocity is for planning; it's a poor proxy for productivity.

Code coverage:

You could measure code coverage (how much of the code is exercised via tests, particularly unit tests) and watch the trends, similar to the graph above (which tracked number-of-tests). This is fine, again, if it's used as an informational metric and not a motivational one. Keep in mind that it's easy for an informational metric to be perceived as motivational, which makes it motivational. The trouble with code coverage is that it is too much in the hands of those who feel motivated to improve it, and they may subconsciously "game" the metric.

About 10 years ago, I was working with a team who had been given the task of increasing their coverage by 10% each iteration. When I got there, they were at 80%, and very pleased with themselves. But as I looked at the tests, I saw a pattern: no assertions (aka expectations)! In other words, the tests literally exercised the code but didn't test anything. When I asked the developers, they looked me in the eye, straight-faced, and said, "Well, if the code doesn't throw an exception, it's working."
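Here's that pattern in miniature; a made-up Python/pytest example, not the client's actual code:

```python
# The function under "test" is invented for illustration.
def apply_discount(price, percent):
    return price - price * percent / 100

# This test merely exercises the code; it passes as long as nothing raises,
# so it inflates coverage without verifying anything.
def test_apply_discount_no_assertion():
    apply_discount(100, 25)   # "if it doesn't throw, it's working"

# This test actually pins down the behavior.
def test_apply_discount_checks_result():
    assert apply_discount(100, 25) == 75
```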

Of course, these junior developers soon understood otherwise, and many went on to do great things in their careers. But they really did think, at the time, that they were correctly doing what was required!

The metrics that I do recommend are more difficult to "game" by an individual working alone.  Cycle-times are a team metric.  (Yes, it's possible a team could conspire to game those metrics, but they would have to do so consciously, and nefariously.  If you don't, or can't, trust your team to behave as professionals, no metric or engineering practice is going to help anyway.  You will simply fail to produce anything of value.)

Please always remember:  You get what you measure!