Essential Test-Driven Development: 2015

How will you know if TDD is working for your teams, program, or organization?

I've noticed that small, independent teams typically don't ask this. They are so close to the end-points of their value-stream that they can sense whether a new discipline is helping or hindering.

But on larger programs with multiple teams, or a big "roll-out" or "push" for quality practices, leaders want to know whether or not they're getting a return on investment. Sometimes they ask me, point-blank: "How long before I recoup the cost of your TDD training and coaching?" There are a lot of variables, of course; and knowing when you've reached break-even is going to depend on what you've already been measuring. Frankly, you're not going to be able to measure the change in a metric you're not already measuring. Nevertheless, you may be able to tell simply by the morale on the teams. In my experience, there's always a direct correlation between happy employees and happy customers. Also, a direct correlation between happy customers and happy stakeholders. That's the triple-win: What's truly good for customers and employees is good for stakeholders.

So I've assembled a few notes about quality metrics.

Metrics I like

(Disclaimer: I may have my "lead" and "cycle" terminology muddled a little. If so I apologize. Please focus on the simplicity of these metrics. I'll fix this post as time allows.)

Here are some metrics I've recommended in the past. I'm not suggesting you must track all of these.

Average lead time for defect repair: Measure the time between defect-found and defect-fixed, by collecting the dates of these events. Graph the average over time.
Average cycle time for defect repair: Measure the time between decide-to-fix-defect and defect-fixed, by collecting the dates of these events. Graph the average over time.
A simple count of unfixed, truly high-priority defects. Show-stoppers and criticals, that sort of thing. Graph the count over time.
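
To make the arithmetic concrete, here is a minimal sketch of how these three numbers could be computed from a defect log. The record layout, the field names ("found", "decided", "fixed"), the severity labels, and the sample dates are all assumptions made up for illustration; use whatever your tracker already records.

```python
from datetime import date
from statistics import mean

# Hypothetical defect log. "fixed" stays None while the defect is still open.
defects = [
    {"severity": "critical", "found": date(2015, 6, 1),
     "decided": date(2015, 6, 3), "fixed": date(2015, 6, 5)},
    {"severity": "minor", "found": date(2015, 6, 2),
     "decided": date(2015, 6, 10), "fixed": date(2015, 6, 12)},
    {"severity": "critical", "found": date(2015, 6, 8),
     "decided": date(2015, 6, 9), "fixed": None},
    {"severity": "show-stopper", "found": date(2015, 6, 10),
     "decided": date(2015, 6, 12), "fixed": date(2015, 6, 17)},
]

# Assumed labels for "truly high-priority" defects.
HIGH_PRIORITY = {"show-stopper", "critical"}


def average_days(defects, start_key):
    """Mean days from the start_key date to the fixed date, over fixed defects.

    Returns None when nothing has been fixed yet.
    """
    spans = [(d["fixed"] - d[start_key]).days
             for d in defects if d["fixed"] is not None]
    return mean(spans) if spans else None


def open_high_priority_count(defects):
    """Number of unfixed show-stopper or critical defects."""
    return sum(1 for d in defects
               if d["fixed"] is None and d["severity"] in HIGH_PRIORITY)


if __name__ == "__main__":
    print("average lead time (days):", average_days(defects, "found"))
    print("average cycle time (days):", average_days(defects, "decided"))
    print("open high-priority defects:", open_high_priority_count(defects))
```

Recomputing these on a regular cadence (say, once per iteration) and plotting the results gives the "graph over time" for each metric.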

Eventually, other quality metrics could be used. Once a team is doing well, Mean Time Between Failures (MTBF), which assumes a very short (near-zero) defect lead time, can be used.

On one high-performing team I worked on way back in 2001, we eventually focused on one metric: "Age of Oldest Defect." It really got us to dig into one old, ornery, hard-to-reproduce defect with a ridiculously simple work-around (i.e., "Please take a deep breath and resubmit your request" usually did the trick, which explains why we weren't compelled to fix it for quite some time). This bug was a great representation of the general rule of bug-fixing: Most bugs are easy to fix once found, but very difficult to locate! (Shout out to Al Shalloway of Net Objectives for teaching me that one.)
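
"Age of Oldest Defect" is about as simple as a metric gets. A small sketch, assuming you can list the found-dates of all still-open defects (the function name and sample dates are hypothetical):

```python
from datetime import date


def age_of_oldest_defect(open_found_dates, today):
    """Days since the earliest still-open defect was found; 0 if none are open."""
    return max(((today - found).days for found in open_found_dates), default=0)


if __name__ == "__main__":
    # Hypothetical found-dates for the defects that are still open.
    open_found_dates = [date(2015, 9, 1), date(2015, 8, 12)]
    print(age_of_oldest_defect(open_found_dates, today=date(2015, 9, 11)))
```

The number only drops when the team fixes the oldest open defect, which is exactly the behavior the metric is meant to encourage.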

I also suggest that all teams keep an eye on this one: Average cycle and/or lead times for User Stories, or Minimal Marketable Features. On the surface, this sounds like a performance metric. I suppose if the work-items are surely arriving in a most-important-thing-first order, then it's a reasonable proxy for "performance." But its real purpose is to help diagnose and resolve systemic (i.e., "process") issues.

What’s truly important about measuring these:

Start measuring as soon as possible, preferably gaining some idea of what things look like before making broad changes, e.g., before I deliver my Essential Test-Driven Development course, and follow-on TDD coaching, to your teams.
The data should be collected as easily as possible: Automatically, or by an unobtrusive, non-managerial, third party. Burdening the team with a lot of measurement overhead is often counterproductive: The measurement data suffers, productivity suffers, morale suffers.
The metrics must be used as "informational" and not "motivational": They should be available to the team, first and foremost, so that the team can watch for trends. Metrics must never be used to reward or punish the team, or to pit teams within the same program or organization against each other.

If you want (or already have) highly-competitive teams, then consider estimating Cost of Delay and CoD/Duration (aka CD3, estimated by all involved "levels" and "functions"), customer conversions, customer satisfaction, and other Lean Startup metrics; and have your whole organization compete against itself to improve the throughput of real value, and compete against your actual competitors.

A graph sent (unsolicited) to me from one client. Yeah, it'd be great if they had added a "value" line. Did I mention unsolicited? Anyway, there's the obvious benefit of fewer defects. Also note that bugs-found is no longer oscillating at release boundaries. Oscillation is what a system does before tearing itself apart.

Metrics I didn't mention

Velocity:

Estimation of story points and the use of velocity may be necessary on a team where the User Stories vary considerably in size. Velocity is an important planning tool that gives the team an idea of whether the scope they have outlined in the release plan will be completed by the release date.

Story points and velocity (SPs/sprint) give information similar to cycle time, just inverted.

To illustrate this: Often Scrum teams who stop using sprints and release plans in favor of continuous flow will switch from story points per sprint to average cycle time per story point. Then, if the variation in User Story effort diminishes, they can drop points entirely and measure average cycle time per story.
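
The "just inverted" relationship can be shown with a rough calculation. This is only a sketch under simplifying assumptions: it treats the sprint length divided by velocity as the average time per story point and ignores how much work is in progress at once. The function name and the sample numbers are made up.

```python
def days_per_story_point(velocity_points_per_sprint, sprint_length_days):
    """Invert velocity (points per sprint) into average working days per point."""
    return sprint_length_days / velocity_points_per_sprint


if __name__ == "__main__":
    # Hypothetical team: 20 story points per 10-working-day sprint.
    print(days_per_story_point(20, 10))
```

Both forms carry the same information; the time-per-point form simply keeps working once the sprint boundary goes away.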

The problem with using velocity as a metric to track improvements (e.g., the use of TDD) is this: As things improve, story-point estimates (an estimate of effort, not time) may actually drop for similar stories. We expect velocity to stabilize, not increase, over time. Velocity is for planning; it's a poor proxy for productivity.

Code coverage:

You could measure code-coverage, how much of the code is exercised via tests, particularly unit-tests, and watch the trends, similar to the graph above (they measured number-of-tests). This is fine, again, if used as an informational metric and not a motivational metric. Keep in mind that it's easy for an informational metric to be perceived as motivational, which makes it motivational. The trouble with code-coverage is that it is too much in the hands of those who feel motivated to improve it, and they may subconsciously "game" the metric.

About 10 years ago, I was working with a team who had been given the task of increasing their coverage by 10% each iteration. When I got there, they were at 80%, and very pleased with themselves. But as I looked at the tests, I saw a pattern: No assertions (aka expectations)! In other words, the tests literally exercised the code but didn't test anything. When I asked the developers, they looked me in the eyes, straight-faced, and said, "Well, if the code doesn't throw an exception, it's working."

Of course, these junior developers soon understood otherwise, and many went on to do great things in their careers. But they really did think, at the time, they were correctly doing what was required!

The metrics that I do recommend are more difficult to "game" by an individual working alone. Cycle-times are a team metric. (Yes, it's possible a team could conspire to game those metrics, but they would have to do so consciously, and nefariously. If you don't, or can't, trust your team to behave as professionals, no metric or engineering practice is going to help anyway. You will simply fail to produce anything of value.)

Please always remember: You get what you measure!

Let's take a fresh look at pair programming, with an eye towards systems optimization.

Pair programming is perhaps the most controversial of the many agile engineering practices. It appears inefficient on the surface, and may often measure as such based on code output (the infamous KLOC metric) or number of coding tasks completed. But dismissing it without exploring the impact to your overall value-stream could actually slow down throughput. We'll take a look at the benefits of pair programming--some subtle, some sublime--so you are well-equipped to evaluate the impact.

Definition

Pair programming is the practice of having two people collaborating simultaneously on a coding task. The pair can be physically co-located, as in the picture below, or they can use some form of screen-sharing and video-chat software. The person currently using the keyboard and mouse is called the "Driver" and the other person is the "Navigator." I say "currently" because people will often switch roles during a pairing session.

Low-stress, co-located pair programming looks like this: Neither Driver nor Navigator has to lean sideways to see the screen, or to type on the keyboard. The font is large enough so both can read the code comfortably. We're not sitting so close that our chairs collide, but not so far that we need to mirror the displays.

Misconceptions

There are many misconceptions about pair programming, leading people to conclude that "it's two doing the work of one." Here are a few of the more common misapprehensions...

Navigator as Observer

The Navigator is not watching over your shoulder, per se.

The Navigator is an active participant. She discusses design and implementation options where necessary; keeps the overall task, and the bigger design picture, in mind; manages the short list of emerging sub-tasks; selects the next step, or test, or sub-task to work on; and catches many things that the compiler may not catch. (If there is a compiler!) There isn't really time for boredom.

If she is literally looking over your shoulder, then you're not at a workstation suitable for pairing. Cubes are generally bad for pairing, because the desks are small or curved inwards. Co-located pairing is side-by-side, at a table that has a straight edge or is curved outwards.

Often only one wireless keyboard and one wireless mouse are used. Wireless devices make switching Drivers much easier.

When the pair is co-located, one screen for the code is sufficient. Other monitors can be used for research, test results, whatever. You may want to avoid having the same image displayed on two adjacent screens. It may seem easier at first, but eventually one of you will gesture at the screen and the other will have to lean over to see the target of the gesture.

Pairing as Just Sitting Together

Pairing is not two people working on two separate files, even if one file contains the unit tests and the other contains the implementation code. Both people agree on the intent of the next test, and on the shape of the implementation. They are collaborating, and sharing critical information with each other.

The Navigator may occasionally turn to use a second, nearby computer to do some related research (e.g., the exact syntax for the needed regular expression). This is always in response to the ongoing conversation and task. It is not "oh, look, I got e-mail from Dr. Oz again...!"

Navigator as Advisor

Pair programming is not a master/apprentice relationship. It's the meeting of two professionals to collaborate on a programming task. They both bring different talents and experience to the task. Both people tend to learn a lot, because everyone has various levels of experience with various tools and technologies.

In 2003 I was tech lead and XP coach on a growing team. We had just hired an excellent candidate nearly fresh out of college. He and I sat down to pair on the next programming task. I described how we were planning to redo the XML-emitting Java code to meet the schema of our biggest client, instead of supporting our long-lived internal schema. I explained that we expected to have to change quite a bit of code, and perhaps the database schema as well, and that we'd be scheduling it in bits and pieces over the upcoming months. I reassured him that we had plenty of test coverage to feel confident in our drastic overhaul.

He frowned, and said, "Why not just run the data through an XSLT transformation?!" (XSLT is a rich programming language written as XML, designed for such transformations. Until this point, I hadn't given it much consideration.)

He saved us months of rework! To my delight, I learned a new technology (new for me anyway). My contribution to the efforts was to show him how we could use JUnit to test-drive the XSLT transformations. Both parties learned a great deal from each other.

In software development, there are no "juniors" or "seniors," just people with varying degrees of knowledge and experience with a wide variety of technologies and techniques.

Systemic Benefits

Fewer Defects

This is the most-cited benefit of pair programming. It's relatively easy to measure over time.

In my own experience, it's not clear that this is the main benefit. I've always combined pair programming with TDD, and TDD catches a broad variety of defects very quickly. In that productive but scientifically uncontrolled environment, measuring defects caught by pairing becomes much more difficult.

But this is where systems thinking comes in: Pair programming reduces rework, allowing a task or story that is declared "done" to be done, without having to revisit the code excessively in the future. Pair programming may be slower, the way quality craftsmanship always appears slower, but the payoff is the same: The code remains as valuable in the future.

The benefits that follow reflect this. Pair programming is an investment.

Better Design

I've noticed that even the most experienced developer, when left to himself, will on occasion write code that only one person can quickly understand: Himself. And often even he won't understand it months from now.

But if two people write the code together, it instantly breaks that lone-wolf barrier, resulting in code that is understandable to, and changeable by, many.

Continuous Code Review

Because most defects and design issues (around 80%, in my experience) are caught and fixed immediately with pair programming, the need for code reviews diminishes.

Many times I've seen this nasty cycle:

All code must be reviewed by the Architect. The Architect is overburdened with code reviews. The Architect rubber-stamps your changes in order to keep up with demand for code reviews. Defects slip past the code-review gate and make their way into production. All code must be reviewed by the Architect.

This shows up in the value-stream (or your Kanban wall) as an overloaded code-review queue.

Also, if the Architect does catch a mistake, the task is sent back to the developers for repair and rework. This shows up in the value-stream as a loop backwards from code-review to development. And rework is waste. The longer the delay between the time a defect is introduced and the time it is fixed, the greater the waste.

From a Lean, systems, or Theory of Constraints standpoint, the removal of a gated activity (the code review) in favor of a parallel or collaborative activity (pair programming) at the constraint (the most overburdened queue) may improve throughput.

Enhanced Development Skills

The educational value of pair programming is immense, ongoing, and narrowed in scope to actual tasks that the team encounters.

An individual who encounters a challenging technological hurdle may assume he knows the right thing to do when he doesn't, or spend a great deal of time in detailed research, or succumb to feelings of inadequacy and try to find a circuitous, face-saving route around the hurdle.

When a pair encounters a hurdle that neither has seen before, they know immediately that it's a significant challenge rather than a perceived inadequacy, and that they have a number of options. Those options are explored in just enough detail to overcome the hurdle efficiently.

People don't often pair with the same person for an extended period of time, so there's opportunity for a broad, and just-deep-enough, continuous education in relevant technologies, tools, and techniques.

Through this ongoing process of shared learning and cross pollination, the whole development team becomes more and more capable.

For example, perhaps your SQL-optimization expert pairs with someone who is interested in the topic today. Tomorrow, the SQL-optimization expert can go on vacation, without bringing development to a halt, and without having a queue of unfinished work pile up on her desk while she's in Hawai'i.

Not everyone has to be an expert in everything. The task can be completed sufficiently to complete the story, and perhaps a more specific story or task will bring the tanned-and-rested expert's attention to the mostly-optimized SQL query at a later time.

This is an important risk-mitigation strategy, because having too few people who know how to perform a critical task is asking for trouble.

Improved Flow

Imagine you are the leader of a development team. You walk in after a nice relaxing weekend and see one of your developers hard at work. "Hey, Travis, how was your weekend?"

Travis gets this frustrated look on his face (generally, developers should not play poker), "Uh...what? Oh. Fine!" and he waves you away dismissively. You've pulled him from The Zone.

What if, instead, you had walked in to see Travis and Fred sitting together, conversing quietly, and looking intently at a single computer screen? Wouldn't you save your greeting for later?

Or, what if you had something important to ask? "Hey guys, are you going to make the client meeting at 3pm today?"

Travis continues to stare intently at the screen, and types something; but Fred spins his chair, "Oh, right! I'll add that to our calendars." He writes a brief note on a 3x5 card that was already in his hand, and smiles reassuringly, "We'll be there!"

See the difference? Fred has handled the priority interruption without pulling Travis out of The Zone, without forcing the pair to task-switch (another popular form of waste). And Travis will be able to get Fred caught up very quickly, and they'll be on their way to finishing the task.

Mutual Encouragement

"Encouragement" contains the root word "courage." With two, it's always easier to face our fears, our fatigue, and our foibles.

Even if both Driver and Navigator are fatigued (e.g., after a big team lunch, or a long morning of meetings), together they may muster enough energy and enthusiasm to carry on and complete a task.

Enhanced Self-Control

Have you ever taken a "brief" personal break, only to discover 90 minutes later that you're still involved in that phone call with Mom, Facebook update, or silent daydream?

Don't feel bad. It's natural.

If you and your pair-programming partner agree to a 15 minute break, however, you will be more likely to time your activities to return to the pairing workstation in 15 minutes, and you're more likely to engage in actual restful activities, rather than checking e-mail for 13 minutes before walking to the coffee shop.

Also, while writing code, neither Driver nor Navigator will allow themselves to become repeatedly distracted by e-mail pop-ups or cell phone ringtones. If it's not urgent, it can wait. Or, either person can call for a break.

Human Systems

We have to remember that humans make up this complex adaptive system we use to build things, and so human nature has an extremely large impact on how we build things.

Pairing helps alleviate distraction, fatigue, brittle code, skills-gaps, embarrassment over inadequacy, communication issues, and fear of failure. Pairing thus improves overall throughput by decreasing rework and hidden, "deferred" work.

I find that pair programming is usually faster when measured by task completion over time. On average, if you give two people two dev tasks, A and B, they will be done with both tasks sooner if they pair up and complete A and B sequentially, rather than if one takes A and the other takes B.

On the surface, this may seem to contradict my earlier systems-related advice about replacing a gate with collaboration. But there is no gate, explicit or implicit, between these developers or between most software development tasks.

Also, much depends on where a change is applied relative to the actual constraint. If you optimize upstream from the constraint, you'll likely gum up the system even more. (You didn't think scaled agility was going to be delivered in a pretty, branded, simple, mass-produced, gift-wrapped box...did you?! ;-)

But if you discover that your current constraint is within the development process, then allowing the team to use pair programming may considerably improve overall throughput of value. (Emphasis on value. "Velocity" isn't value. Lines of code are not necessarily valuable either. Working, useable, deployed code is valuable.)

Try It

I've used pair programming on a number of teams since 1998, and I've always found it beneficial in the ways described above, and many other ways.

All participants in my technical classes, since 2002, pair on all coding labs. It's a great experience for participants: they often learn a great deal from each other as well as from the labs. It also benefits me, the instructor: I can tell when a pair is in The Zone, stuck, finished, or not getting along; all by listening carefully to the levels and tone of conversations in the room.

I recommend, as with all new practices, you and your team outline a simple plan (or "test") to determine whether or not the new practice is impacting the working environment for the better. Then try it out in earnest for 30 days, and re-evaluate at a retrospective. Pair programming, as with many seemingly counter-intuitive agile practices, may just surprise you.

Essential Test-Driven Development

02 September 2015

How to Know if TDD is Working