08 July 2025

Ten YEARS!

 Someone just asked me how long I've been working on this book. 

It was greenlit on 06 July 2015, and I signed the contract on 04 September 2015. 

Since I started working on it while negotiating the contract, it's fair to say I started it on the 8th of July 2015. At least, that will be my headcanon! Why? Because today is the 10th anniversary of that start-date, and...

...today is the day I'm handing off the final draft to the development editor!

So...10 years "exactly." (With a margin of error +- a few days.)

26 June 2025

Sooo close!

 Wrapping up the Introduction! (Often the last chapter to be written, interestingly enough.) Sent chapter word-counts and sample chapter to my editor so she can shop it to a few development editors. I think this means we're on-track for a Q4 2025 pub date!

02 September 2015

How to Know if TDD is Working

How will you know if TDD is working for your teams, program, or organization?

I've noticed that small, independent teams typically don't ask this.  They are so close to the end-points of their value-stream that they can sense whether a new discipline is helping or hindering.

But on larger programs with multiple teams, or a big "roll-out" or "push" for quality practices, leaders want to know whether or not they're getting a return on investment.  Sometimes they ask me, point-blank: "How long before I recoup the cost of your TDD training and coaching?" There are a lot of variables, of course; and knowing when you've reached break-even is going to depend on what you've already been measuring.  Frankly, you're not going to be able to measure the change in a metric you're not already measuring.

Nevertheless, you may be able to tell simply by the morale on the teams. In my experience, there's always a direct correlation between happy employees and happy customers, and a direct correlation between happy customers and happy stakeholders.  That's the triple-win:  What's truly good for customers and employees is good for stakeholders.

So I've assembled a few notes about quality metrics.

Metrics I like


(Disclaimer: I may have my "lead" and "cycle" terminology muddled a little.  If so I apologize. Please focus on the simplicity of these metrics.  I'll fix this post as time allows.)

Here are some metrics I've recommended in the past.  I'm not suggesting you must track all of these.
  • Average lead time for defect repair: Measure the time between defect-found and defect-fixed, by collecting the dates of these events.  Graph the average over time.
  • Average cycle time for defect repair: Measure the time between decide-to-fix-defect and defect-fixed, by collecting the dates of these events. Graph the average over time.
  • A simple count of unfixed, truly high-priority defects.  Show-stoppers and criticals, that sort of thing.  Graph the count over time.
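Once those event dates are being captured, the first two averages are trivial to compute.  Here's a minimal sketch in Python, using hypothetical defect records (the dates are made up for illustration):

```python
from datetime import date
from statistics import mean

# Hypothetical defect records: (date_found, date_decided_to_fix, date_fixed)
defects = [
    (date(2015, 8, 3),  date(2015, 8, 10), date(2015, 8, 12)),
    (date(2015, 8, 5),  date(2015, 8, 6),  date(2015, 8, 7)),
    (date(2015, 8, 20), date(2015, 8, 24), date(2015, 9, 1)),
]

# Lead time: defect-found to defect-fixed.
lead_days = [(fixed - found).days for found, _, fixed in defects]

# Cycle time: decide-to-fix to defect-fixed.
cycle_days = [(fixed - decided).days for _, decided, fixed in defects]

print(f"Average lead time:  {mean(lead_days):.1f} days")
print(f"Average cycle time: {mean(cycle_days):.1f} days")
```

Graphing those two averages sprint-over-sprint is all the "dashboard" most teams need.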
Eventually, other quality metrics can be added.  Once a team is doing well, Mean Time Between Failures (MTBF), which assumes a very short (near-zero) defect lead time, becomes meaningful.
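For completeness: MTBF is just the average gap between consecutive failures.  A tiny sketch, with made-up failure timestamps:

```python
from datetime import datetime
from statistics import mean

# Hypothetical production-failure timestamps, in chronological order.
failures = [
    datetime(2015, 9, 1, 3, 0),
    datetime(2015, 9, 11, 3, 0),
    datetime(2015, 9, 25, 15, 0),
]

# Mean Time Between Failures: average gap between consecutive failures.
gaps_hours = [(b - a).total_seconds() / 3600
              for a, b in zip(failures, failures[1:])]
mtbf_hours = mean(gaps_hours)

print(f"MTBF: {mtbf_hours:.0f} hours")
```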

On one high-performing team I worked on way back in 2001, we eventually focused on one metric:  "Age of Oldest Defect."  It really got us to dig into one old, ornery, hard-to-reproduce defect with a ridiculously simple work-around (i.e., "Please take a deep breath and resubmit your request" usually did the trick, which explains why we weren't compelled to fix it for quite some time).  This bug was a great representation of the general rule of bug-fixing:  Most bugs are easy to fix once found, but very difficult to locate!  (Shout out to Al Shalloway of Net Objectives for teaching me that one.)
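"Age of Oldest Defect" is equally simple to track: it's just today's date minus the report date of the oldest still-open defect.  A sketch with hypothetical ticket data:

```python
from datetime import date

# Hypothetical open defects, keyed by ticket ID, with report dates.
open_defects = {
    "BUG-101": date(2001, 2, 14),
    "BUG-187": date(2001, 6, 2),
    "BUG-412": date(2001, 9, 30),
}

today = date(2001, 10, 15)

# The oldest open defect is the one with the earliest report date.
oldest_id, oldest_reported = min(open_defects.items(), key=lambda kv: kv[1])
age_days = (today - oldest_reported).days

print(f"Age of oldest defect ({oldest_id}): {age_days} days")
```

When that one number stops shrinking, the team knows exactly which bug to dig into next.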

I also suggest that all teams keep an eye on this one:  Average cycle and/or lead times for User Stories, or Minimal Marketable Features. On the surface, this sounds like a performance metric.  I suppose if the work-items are surely arriving in a most-important-thing-first order, then it's a reasonable proxy for "performance."  But its real purpose is to help diagnose and resolve systemic (i.e., "process") issues.

What’s truly important about measuring these:
  1. Start measuring as soon as possible, preferably gaining some idea of what things look like before making broad changes, e.g., before I deliver my Essential Test-Driven Development course, and follow-on TDD coaching, to your teams.
  2. The data should be collected as easily as possible: Automatically, or by an unobtrusive, non-managerial, third party. Burdening the team with a lot of measurement overhead is often counterproductive:  The measurement data suffers, productivity suffers, morale suffers.  
  3. The metrics must be used as "informational" and not "motivational": They should be available to the team, first and foremost, so that the team can watch for trends. Metrics must never be used to reward or punish the team, or to pit teams within the same program or organization against each other. 
If you want (or already have) highly-competitive teams, then consider estimating Cost of Delay and CoD/Duration (aka CD3, estimated by all involved "levels" and "functions"), customer conversions, customer satisfaction, and other Lean Startup metrics; and have your whole organization compete against itself to improve the throughput of real value, and compete against your actual competitors.



A graph sent (unsolicited) to me by one client. Yeah, it'd be great if they had added a "value" line. Did I mention unsolicited? Anyway, there's the obvious benefit of fewer defects.  Also note that bugs-found is no longer oscillating at release boundaries. Oscillation is what a system does before tearing itself apart.

Metrics I didn't mention

Velocity:

Estimation of story points and the use of velocity may be necessary on a team where the User Stories vary considerably in size.  Velocity is an important planning tool that gives the team an idea of whether the scope they have outlined in the release plan will be completed by the release date.

Story points and velocity (SPs/sprint) give information similar to cycle time, just inverted.

To illustrate this:  Often Scrum teams who stop using sprints and release plans in favor of continuous flow will switch from story points per sprint to average cycle time per story point. Then, if the variation in User Story effort diminishes, they can drop points entirely and measure average cycle time per story.
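The inversion is easy to see with a little arithmetic (all numbers below are made up):

```python
# Hypothetical sprint: two weeks (10 working days), 25 story points completed.
sprint_days = 10
points_completed = 25

# Velocity: points per unit of time.
velocity = points_completed / sprint_days

# Cycle time per point: time per unit of work -- the same ratio, inverted.
cycle_time_per_point = sprint_days / points_completed

print(f"Velocity: {velocity:.1f} points/day")
print(f"Cycle time: {cycle_time_per_point:.2f} days/point")
```

Same data, two reciprocal views; which one you watch depends on whether you're planning releases or diagnosing flow.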

The problem with using velocity as a metric to track improvements (e.g., the use of TDD) is this:  As things improve, story-point estimates (an estimate of effort, not time) may actually drop for similar stories.  We expect velocity to stabilize, not increase, over time.  Velocity is for planning; it's a poor proxy for productivity.

Code coverage:

You could measure code coverage (how much of the code is exercised via tests, particularly unit tests) and watch the trends, similar to the graph above (they measured number-of-tests).  This is fine, again, if used as an informational metric and not a motivational metric.  Keep in mind that it's easy for an informational metric to be perceived as motivational, which makes it motivational.  The trouble with code coverage is that it is too much in the hands of those who feel motivated to improve it, and they may subconsciously "game" the metric.

About 10 years ago, I was working with a team who had been given the task of increasing their coverage by 10% each iteration.  When I got there, they were at 80%, and very pleased with themselves.  But as I looked at the tests, I saw a pattern:  No assertions (aka expectations)!  In other words, the tests literally exercised the code but didn't test anything.  When I asked the developers, they looked me in the eyes, straight-faced, and said, "Well, if the code doesn't throw an exception, it's working."

Of course, these junior developers soon understood otherwise, and many went on to do great things in their careers. But they really did think, at the time, they were correctly doing what was required!

The metrics that I do recommend are more difficult to "game" by an individual working alone.  Cycle-times are a team metric.  (Yes, it's possible a team could conspire to game those metrics, but they would have to do so consciously, and nefariously.  If you don't, or can't, trust your team to behave as professionals, no metric or engineering practice is going to help anyway.  You will simply fail to produce anything of value.)

Please always remember:  You get what you measure!


18 February 2015

Benefits of Pair Programming Revisited


Let's take a fresh look at pair programming, with an eye towards systems optimization.

Pair programming is perhaps the most controversial of the many agile engineering practices. It appears inefficient on the surface, and may often measure as such based on code output (the infamous KLOC metric) or number of coding tasks completed. But dismissing it without exploring the impact on your overall value-stream could actually slow down throughput.  We'll take a look at the benefits of pair programming--some subtle, some sublime--so you are well-equipped to evaluate the impact.


Definition


Pair programming is the practice of having two people collaborating simultaneously on a coding task. The pair can be physically co-located, as in the picture below, or they can use some form of screen-sharing and video-chat software.  The person currently using the keyboard and mouse is called the "Driver" and the other person is the "Navigator."  I say "currently" because people will often switch roles during a pairing session.

Low-stress, co-located pair programming looks like this:  Neither Driver nor Navigator has to lean sideways to see the screen, or to type on the keyboard. The font is large enough so both can read the code comfortably. We're not sitting so close that our chairs collide, but not so far that we need to mirror the displays.

Misconceptions


There are many misconceptions about pair programming, leading people to conclude that "it's two doing the work of one." Here are a few of the more common misapprehensions...

Navigator as Observer

The Navigator is not watching over your shoulder, per se.

The Navigator is an active participant. She discusses design and implementation options where necessary; keeps the overall task, and the bigger design picture, in mind; manages the short list of emerging sub-tasks; selects the next step, or test, or sub-task to work on; and catches many things that the compiler may not catch. (If there is a compiler!)  There isn't really time for boredom.

If she is literally looking over your shoulder, then you're not at a workstation suitable for pairing. Cubes are generally bad for pairing, because the desks are small or curved inwards. Co-located pairing is side-by-side, at a table that has a straight edge or is curved outwards.

Often only one wireless keyboard and one wireless mouse are used.  Wireless devices make switching Drivers much easier.

Unless the pair is working remotely, one screen for the code is sufficient. Other monitors can be used for research, test results, whatever. You may want to avoid having the same image displayed on two adjacent screens.  It may seem easier at first, but eventually one of you will gesture at the screen and the other will have to lean over to see the target of the gesture.

Pairing as Just Sitting Together

Pairing is not two people working on two separate files, even if one file contains the unit tests and the other contains the implementation code.  Both people agree on the intent of the next test, and on the shape of the implementation.  They are collaborating, and sharing critical information with each other.

The Navigator may occasionally turn to use a second, nearby computer to do some related research (e.g., the exact syntax for the needed regular expression). This is always in response to the ongoing conversation and task. It is not "oh, look, I got e-mail from Dr. Oz again...!"

Navigator as Advisor

Pair programming is not a master/apprentice relationship. It's the meeting of two professionals to collaborate on a programming task.  They both bring different talents and experience to the task. Both people tend to learn a lot, because everyone has various levels of experience with various tools and technologies.

In 2003 I was tech lead and XP coach on a growing team.  We had just hired an excellent candidate nearly fresh out of college. He and I sat down to pair on the next programming task.  I described how we were planning to redo the XML-emitting Java code to meet the schema of our biggest client, instead of supporting our long-lived internal schema.  I explained that we expected to have to change quite a bit of code, and perhaps the database schema as well, and that we'd be scheduling it in bits and pieces over the upcoming months.  I reassured him that we had plenty of test coverage to feel confident in our drastic overhaul.

He frowned, and said, "Why not just run the data through an XSLT transformation?!"  (XSLT is a rich programming language written as XML, designed for such transformations. Until this point, I hadn't given it much consideration.)

He saved us months of rework! To my delight, I learned a new technology (new for me anyway). My contribution to the efforts was to show him how we could use JUnit to test-drive the XSLT transformations.  Both parties learned a great deal from each other.

In software development, there are no "juniors" or "seniors," just people with varying degrees of knowledge and experience with a wide variety of technologies and techniques.

Systemic Benefits

Fewer Defects

This is the most-cited benefit of pair programming.  It's relatively easy to measure over time.

In my own experience, it's not clear that this is the main benefit.  I've always combined pair programming with TDD, and TDD catches a broad variety of defects very quickly.  In that productive but scientifically uncontrolled environment, measuring defects caught by pairing becomes much more difficult.

But this is where systems thinking comes in:  Pair programming reduces rework, allowing a task or story that is declared "done" to be done, without having to revisit the code excessively in the future.  Pair programming may appear slower, the way quality craftsmanship always appears slower, but the code remains valuable far into the future.

The benefits that follow reflect this.  Pair programming is an investment.

Better Design

I've noticed that even the most experienced developer, when left to himself, will on occasion write code that only one person can quickly understand:  Himself.  And often even he won't understand it months from now.

But if two people write the code together, it instantly breaks that lone-wolf barrier, resulting in code that is understandable to, and changeable by, many.

Continuous Code Review

Because most (around 80%) of defects and design issues are caught and fixed immediately with pair programming, the need for code reviews diminishes.

Many times I've seen this nasty cycle:
All code must be reviewed by the Architect. The Architect is overburdened with code reviews.  The Architect rubber-stamps your changes in order to keep up with demand for code reviews. Defects slip past the code-review gate and make their way into production. All code must be reviewed by the Architect.
This shows up in the value-stream (or your Kanban wall) as an overloaded code-review queue.

Also, if the Architect does catch a mistake, the task is sent back to the developers for repair and rework. This shows up in the value-stream as a loop backwards from code-review to development. And rework is waste.  The longer the delay between the time a defect is introduced and the time it is fixed, the greater the waste.

From a Lean, systems, or Theory of Constraints standpoint, the removal of a gated activity (the code review) in favor of a parallel or collaborative activity (pair programming) at the constraint (the most overburdened queue) may improve throughput.

Enhanced Development Skills 

The educational value of pair programming is immense, ongoing, and narrowed in scope to actual tasks that the team encounters.

An individual who encounters a challenging technological hurdle may assume he knows the right thing to do when he doesn't, or spend a great deal of time in detailed research, or succumb to feelings of inadequacy and try to find a circuitous, face-saving route around the hurdle.

When a pair encounters a hurdle that neither has seen before, they know immediately that it's a significant challenge rather than a perceived inadequacy, and that they have a number of options. Those options are explored in just enough detail to overcome the hurdle efficiently.

People don't often pair with the same person for an extended period of time, so there's opportunity for a broad, and just-deep-enough, continuous education in relevant technologies, tools, and techniques.

Through this ongoing process of shared learning and cross pollination, the whole development team becomes more and more capable.

For example, perhaps your SQL-optimization expert pairs with someone who is interested in the topic today.  Tomorrow, the SQL-optimization expert can go on vacation, without bringing development to a halt, and without having a queue of unfinished work pile up on her desk while she's in Hawai'i.

Not everyone has to be an expert in everything.  The task can be completed sufficiently to complete the story, and perhaps a more specific story or task will bring the tanned-and-rested expert's attention to the mostly-optimized SQL query at a later time.

This is an important risk-mitigation strategy, because having too few people who know how to perform a critical task is asking for trouble.

Improved Flow

Imagine you are the leader of a development team.  You walk in after a nice relaxing weekend and see one of your developers hard at work. "Hey, Travis, how was your weekend?"

Travis gets this frustrated look on his face (generally, developers should not play poker), "Uh...what? Oh.  Fine!" and he waves you away dismissively.  You've pulled him from The Zone.

What if, instead, you had walked in to see Travis and Fred sitting together, conversing quietly, and looking intently at a single computer screen.  Wouldn't you save your greeting for later?

Or, what if you had something important to ask? "Hey guys, are you going to make the client meeting at 3pm today?"

Travis continues to stare intently at the screen, and types something; but Fred spins his chair, "Oh, right!  I'll add that to our calendars." He writes a brief note on a 3x5 card that was already in his hand, and smiles reassuringly, "We'll be there!"

See the difference? Fred has handled the priority interruption without pulling Travis out of The Zone, without forcing the pair to task-switch (another popular form of waste). And Travis will be able to get Fred caught up very quickly, and they'll be on their way to finishing the task.

Mutual Encouragement

"Encouragement" contains the root word "courage."  With two, it's always easier to face our fears, our fatigue, and our foibles.

Even if both Driver and Navigator are fatigued (e.g., after a big team lunch, or a long morning of meetings), together they may muster enough energy and enthusiasm to carry on and complete a task.

Enhanced Self-Control

Have you ever taken a "brief" personal break, only to discover 90 minutes later that you're still involved in that phone call with Mom, Facebook update, or silent daydream?

Don't feel bad. It's natural.

If you and your pair-programming partner agree to a 15-minute break, however, you will be more likely to time your activities to return to the pairing workstation in 15 minutes, and you're more likely to engage in actual restful activities, rather than checking e-mail for 13 minutes before walking to the coffee shop.

Also, while writing code, neither Driver nor Navigator will allow themselves to become repeatedly distracted by e-mail pop-ups or cell phone ringtones.  If it's not urgent, it can wait. Or, either person can call for a break.


Human Systems


We have to remember that humans make up this complex adaptive system we use to build things, and so human nature has an extremely large impact on how we build things.

Pairing helps alleviate distraction, fatigue, brittle code, skills-gaps, embarrassment over inadequacy, communication issues, fear of failure. Pairing thus improves overall throughput by decreasing rework and hidden, "deferred" work.

I find that pair programming is usually faster when measured by task completion over time.  On average, if you give two people two dev tasks, A and B, they will be done with both tasks sooner if they pair up and complete A and B sequentially, rather than if one takes A and the other takes B.

On the surface, this may seem to contradict my earlier systems-related advice about replacing a gate with collaboration.  But there is no gate, explicit or implicit, between these developers or between most software development tasks.

Also, much depends on where a change is applied relative to the actual constraint.  If you optimize upstream from the constraint, you'll likely gum up the system even more.  (You didn't think scaled agility was going to be delivered in a pretty, branded, simple, mass-produced, gift-wrapped box...did you?! ;-)

But if you discover that your current constraint is within the development process, then allowing the team to use pair programming may considerably improve overall throughput of value. (Emphasis on value. "Velocity" isn't value. Lines of code are not necessarily valuable either. Working, usable, deployed code is valuable.)

Try It


I've used pair programming on a number of teams since 1998, and I've always found it beneficial in the ways described above, and many other ways.

All participants in my technical classes, since 2002, pair on all coding labs.  It's a great experience for participants: they often learn a great deal from each other as well as from the labs. It also benefits me, the instructor:  I can tell when a pair is in The Zone, stuck, finished, or not getting along; all by listening carefully to the levels and tone of conversations in the room.

I recommend, as with all new practices, you and your team outline a simple plan (or "test") to determine whether or not the new practice is impacting the working environment for the better.  Then try it out in earnest for 30 days, and re-evaluate at a retrospective.  Pair programming, as with many seemingly counter-intuitive agile practices, may just surprise you.


20 March 2014

An open letter to the Editor in Chief of Dr. Dobb's

Here is the letter I just sent to Andrew Binstock at Dr. Dobb's regarding this editorial:

http://www.drdobbs.com/architecture-and-design/the-corruption-of-agile/240166698

Hello Mr. Binstock,

I agree with your conclusions regarding the productization of Agile, and the zealotry regarding various sets of practices.  Within every community or body of knowledge, it seems, strong practices become the dogma of the enthusiastic, and then the core value gets lost in illogical debates where everyone is trying to be "right."

I’d like to take a moment of your time to provide some of my experience with the practice that you selected (likely not arbitrarily) as an example of the zealotry:  Test-Driven Development (TDD). Indeed, there is a lot of confusion and passion and zealotry regarding this practice.  I feel no need to defend, exactly, but I believe I understand much of the enthusiasm, and much of why most people can’t really articulate their love for TDD.  One reason why it’s so confusing even for those passionately in favor of TDD is that the real value appears as a “sub-practice” of TDD. (I’ll come back to that.)

I’ve been writing code since 1976, and in the “Agile” space since 1998, when I first experienced TDD (or “test-first” as it was known then).  Unlike many of my bright-eyed young Agile colleagues, I spent half my career crafting quality, nearly-defect-free code before that point, without TDD.  Here’s what I’ve noticed over time:


1. From "All in your head" to "fast feedback."

Pre-TDD, I spent a lot more time running various branches and permutations through my mind, or in a flowchart, or in a sequence diagram, and in review with others on my team.  All great practices, and necessary in the days when it took an hour to compile the code.

What I do with test-first coding is describe the outcomes of small pieces of behavior in my code, then allow those passing tests to provide confirmation of my investment in that behavior.  I no longer have to keep the permutations and possibilities in my head.  Great burden lifted.  We’ve long had compilers to provide a red flag when I make a syntactic mistake, and now fast tests provide similar near-instantaneous feedback when I mistakenly change run-time behavior.  To me, TDD feels like a natural progression of good practices for the computer scientist, and the professional developer.


2. From slow and expensive to fast and cheap.

The code I was building pre-TDD was probably a little more complex than Internet protocols, but not much. We did supercomputer protocol stacks for a variety of machines, some of them with 36-bit words and 9-bit bytes. Two modes: Character and binary, but character was also subdivided into translations between any two of ASCII, EBCDIC, and BSD.  Oh, how I would have loved a (really fast) implementation of something like Strategy or maybe there’s even a Bridge hiding in there?

Nowadays I find our products to be, in the aggregate, far more complex than in the 80's: Voice-recognition, search, security, multi-platform, interesting business rules regarding financial transactions, regional laws, internationalization…  Far too much changing (or potentially changing) far too quickly for us to do more than paint broad-stroke architectures at the high level; and far too complex and dynamic for our developers to be doing code-and-fix-and-update-the-UML-that-no-one-reads.

Overall, TDD now feels far more natural to me.  I play a game with my code:  “You can’t do this yet,” I say. My code replies “You’re right, I fail at that.” I then tell my code how to do that, and together we confirm that it works, and simultaneously confirm that I haven’t broken anything my code has learned over the years. Then I see if the introduction of that implementation is asking to be blended more cleanly into the code: Refactoring is the oft-neglected sub-practice of TDD. Each time I reshape the design to accommodate this new behavior or the next, I run the tests to let my code confirm that it hasn’t lost any capability.

In fact, "relentless refactoring," i.e., the constant application of good design practices, is the real "must have" for Agile software development. How else can we adapt to fresh, unexpected requirements without being able to reshape code? But without some form of fast, automated, deterministic safety-net of tests, relentless refactoring is mere dangerous, unprofessional hacking. Ergo, test-first coding.

I’m not attached to TDD or even “Agile" as static, dogmatic sets of practices for the very reason that I already know it was once less efficient than pure human brain-power. Someday we will look back and laugh at our zeal. But I think TDD has some good life left to it, because of the outcomes:  Comprehensive, executable engineering documentation describing all the discrete behaviors of the system in a non-combinatorial fashion; and the ability to change the code rapidly and confidently.

Thank you for your time and kind consideration.

Rob Myers

08 October 2013

Solving the Legacy Code Problem

What is Legacy Code?


In his book, Working Effectively with Legacy Code, Michael Feathers defines legacy code very simply:  Code without tests.  He isn't trying to be inflammatory, or suggest that you are a bad person because you have written legacy code (perhaps yesterday). He's drawing a clear boundary between two types of code.  James Shore (co-author of Art of Agile) has an equivalent definition:
"Code we're afraid to change."
Simply put, if it has tests that will prevent us from breaking behavior while adding new behavior or altering design, we can proceed with confidence.  If not, we feel appropriately uneasy.

Of course, if you plan to never release a new version of your code, you won't need tests.  In my almost 40 years of programming, I've not seen that happen even once.  Code needs to change. Ergo, code needs tests.

Build Your Team's Safety Net


Your team may want to adopt this metaphor: Think of your whole suite of automated regression tests as a safety-net that the team builds and maintains to keep themselves from damaging all prior investment in behavior.

If it takes two hours to run them all, you'll run them once per day, and if they catch something, you know that someone on the team broke something within the last 24 hours.  If it takes one hour, you'll run them twice per day (once at lunch) and you've narrowed down the time by half.  That's probably better than 80% of the teams in the world, but it can be even better. 

I'll give you a real-world example: I worked on a life-critical application in 2002.  After two years of development, this product had a comprehensive suite of 17,000 tests.  They ran in less than 15 minutes.  That team often took on new developers, and we gave them this simple message:  "You break something, someone may die.  But you have this safety net.  You can make any change you believe is appropriate, but don't commit the change until you run the tests." In the time it took to walk to the cafe and buy a latte, I would know whether or not I was making a mistake that could cause someone to die.

We made changes up to a day before going live.

It can be that good.  Of course, it takes effort to "pay down" the legacy code debt (and a lot of mock/fake objects…another topic for another day.)  But the longer you wait, the worse the debt becomes.

Characterization Tests


The product mentioned above was developed from the ground up with unit tests written by a team who embraced unit-test-level Test-Driven Development (TDD).  Nice work if you can get it.  The rest of the world faces legacy code debt. 

You don't have to pay it all down before you proceed. In fact, you mustn't.  You have to be thoughtful about selecting high-risk areas:  An area of code that breaks frequently, or is changed frequently, should first be "covered" with "characterization" tests.

"Characterization test" is not defined by any particular type of tool. We often use the unit-testing framework, but we're not limited to it.

Like unit-tests, these tests must be deterministic, independent, and automated.  Unlike unit-tests, we want to "cover" the most system behavior with the fewest tests and the least effort.  When you write these tests, you are not bug-hunting, but rather "characterizing" or locking down existing behavior, for better or worse.  It's tempting to fix production bugs as you go, but fixing a bug that's escaped into the wild could introduce another bug, or break a hidden "feature" that a customer has come to rely on.  It's fine to note or log the bug, but your characterization test should pass when the bug manifests itself.  Name the test with a bug description or ticket number, so the team can easily find it later.
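Here's what that looks like in practice: a sketch in Python's unittest, with a hypothetical legacy function and a made-up ticket number (LEG-142).  Note that the characterization test passes when the quirk manifests:

```python
import unittest

def format_customer_name(first, last):
    # Hypothetical legacy code: note the leading space when 'first'
    # is empty -- a long-shipped quirk, logged as ticket LEG-142.
    return f"{first} {last}".upper()

class CustomerNameCharacterization(unittest.TestCase):
    def test_formats_full_name(self):
        self.assertEqual("ADA LOVELACE",
                         format_customer_name("Ada", "Lovelace"))

    def test_LEG_142_leading_space_when_first_name_empty(self):
        # Not what we'd design today, but it's what ships.
        # Lock it down: this test PASSES on the current behavior.
        self.assertEqual(" LOVELACE",
                         format_customer_name("", "Lovelace"))
```

If a refactoring ever changes that leading-space behavior, the safety net will say so, and the team can decide (deliberately, this time) what to do about LEG-142.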

Why not fix the production defect? Because the point of creating this safety net is to give you the freedom to refactor. You may be refactoring so you can add new behavior more easily, or even so you can fix a bug more easily later, but refactoring and adding behavior are two distinct activities. Using TDD, they are two separate steps in the cycle.  (Aside:  Fixing a bug is effectively adding new behavior, because the system wasn't actually behaving the way we expected. You can use TDD for that.)

The unit-testing framework and developer IDE usually give us the most flexibility, plus the ability to mock dependencies and use built-in refactorings for safety. But in order to lock down large swaths of behavior, teams should think creatively. I've worked with teams who compared whole HTML reports, JPEG images, or database tables; or who have rerouted standard input and output streams. The nature of the product and the size of the mess may dictate the best approach.
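One common shape for those whole-output comparisons is the "golden master": capture the entire output once, warts and all, then diff against it on every run.  A minimal sketch, with a made-up report generator standing in for the legacy routine:

```python
import tempfile
from pathlib import Path

def generate_report(orders):
    # Stand-in for a legacy routine that emits a whole HTML report.
    rows = "".join(f"<tr><td>{name}</td><td>{qty}</td></tr>"
                   for name, qty in orders)
    return f"<table>{rows}</table>"

orders = [("widget", 3), ("gadget", 7)]
golden = Path(tempfile.gettempdir()) / "report.golden.html"

actual = generate_report(orders)
if not golden.exists():
    # First run: record current behavior, for better or worse.
    golden.write_text(actual)
    print("Golden master recorded.")
else:
    # Later runs: any byte-level difference means behavior changed.
    assert actual == golden.read_text(), "Report differs from golden master!"
    print("Report matches golden master.")
```

One such test can lock down thousands of lines of behavior; the trade-off is that when it fails, it tells you *that* something changed, not *what*.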

And don't aim for a duration target, e.g., "15-minute test runs." Teams sometimes respond to arbitrary targets by sabotaging their own future in order to make the numbers.  For example, by deleting existing tests! Instead, aim for improvement by looking for the greatest delay in testing.  Weigh a "huge refactoring" of the persistence layer against using an in-memory database.  No in-memory version of your database software exists?  Use a solid-state drive. Developers are naturally creative problem-solvers, particularly when they collaborate.

Resistance is Futile


Code written without tests often resists testing. When you write unit-tests test-driven, they tend to be very tiny, compact, isolated, and simple (once you get the hang of it). It's actually easier and faster to write them with the code using TDD, even though you end up with more of them.  Interestingly, if you write your unit-tests after the code has been written, you are really writing characterization tests: They're harder to write, they're often a compromise that tests a number of behaviors, and they often give you the bad news that you made a mistake while coding. This is why most developers hate writing "unit-tests" (me, included). We were doing it backwards.

That may make writing characterization tests seem unbearably painful, but it's really not.  Once you collect a handful of simple, "surgical refactorings" for creating testable entry-points into your behaviors, the legacy code problem becomes a bit of an archeological expedition: Find the important behaviors, carefully expose them, then cover them with a protective tarp.  It can be rewarding all by itself. But the big payoff comes later, when it's time to change something.
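As an illustration of one such "surgical refactoring," here's a hypothetical extract-method in Python: pricing logic buried in an I/O-entangled method is pulled out into a pure function, which becomes a testable entry-point. The before/after code and the bulk-discount rule are invented for the example.

```python
# BEFORE (hypothetical): the calculation is tangled up with database
# and email side effects -- unreachable by a fast, isolated test.
#
#   def checkout(self, cart):
#       total = 0
#       for price, qty in cart:
#           total += price * qty
#           if qty >= 10:
#               total -= price  # bulk discount, buried in I/O code
#       self.db.save_order(cart, total)
#       self.mailer.send_receipt(total)

# AFTER: the same calculation, extracted into a pure function.
# checkout() now just delegates to it, and tests can reach it directly.
def order_total(cart):
    total = 0
    for price, qty in cart:
        total += price * qty
        if qty >= 10:
            total -= price  # same bulk-discount rule, now exposed to tests
    return total
```

Once the entry-point exists, covering it with characterization tests is cheap, and that's the "protective tarp" over this particular behavior.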

02 January 2013

The Sportscar Metaphor: TDD, ATDD, and BDD Explained

 

Your Mission, Should You Accept...


You've been tasked with building a sports car.  Not just any sports car, but the Ultimate Driving Machine.


The Ultimate Driving Machine

Let's take a look at how an Agile team might handle this...

Acceptance Test Driven Development


What would a customer want from this car?  Excitement! And perhaps a degree of safety.  Let's create a few user stories or acceptance criteria for this (the line between those two will remain blurred for this post):
  • When I punch the accelerator, I'm pushed back into my comfortable seat with satisfactory acceleration.
  • When I slam on the brakes, the car stops quickly, without skidding, spinning, or flipping, and drivers behind me are warned of the hard braking.
  • When I turn a sharp corner, the car turns without rocking like a boat, throwing me against the door, skidding, spinning, or making a lot of silly tire-squealing noises.
These are good sample acceptance criteria for the BMW driving experience.  We can write these independently of having a functioning car to test. That's what makes this "Test Driven" from an Agile perspective:  The clear, repeatable, and small-grained tests, or specifications, come before we would expect them to pass.  This is fairly natural, if you consider each small collection of new tests to be Just-In-Time analysis of a user story. That's "Acceptance Test Driven Development," or ATDD, in a nutshell.

In order for us to write truly clear, repeatable "acceptance tests" for a BMW, we would need to get much more specific about what we mean by "punch," "satisfactory," "slam," and "sharp." In the software world, this would involve the whole team: particularly QA/Test and Product/BA/UX, but with representation from Development to be sure no one expects Warp Drive. The team discusses each acceptance criterion to determine realistic measurements for each vague, subjective word or phrase.

DONE Done

What levels of fast, quick, fun, exciting, and safe are acceptable? What tests can we run to quickly assess whether or not our new car is ready for a demo? How will we know we have these features of the car fully completed, with acceptable levels of quality, so that we don't have to return to them and re-engineer them time and time again?

Once an acceptance test passes (and, on a Scrum team, once the demo has been completed and the stories accepted by the Product Owner), it becomes part of the regression suite that prevents us from ever allowing these "Ultimate Driving Machine" qualities to degrade.

Test-Driven Development 


Now the engineers start to build features into the car.  A quick architectural conversation at the whiteboard identifies the impact upon various subsystems, such as the chassis, engine, transmission, environmental/comfort controls, and safety features.

What would some unit tests (aka "microtests") look like?  Perhaps something like these (keep in mind that I'm a BMW customer, not a BMW engineer, and have little idea of what I'm talking about):
  • When the piston reaches a certain height, the spark plug fires.
  • When the brake pedal is pressed 75% of the way to the floor, the extra-bright in-your-face LED brake lights are activated.
  • When braking, and a wheel notices a lack of traction, it signals the Anti-Lock Braking system.
See the difference in focus?  Acceptance Tests are business-facing as well as team-guiding.  Microtests are tools that developers use to move towards completion of the Acceptance Tests.
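If that brake-light microtest were software, it might look like this sketch in Python. The controller class, its API, and the 75% threshold are all invented from the bullet above, purely for illustration:

```python
import unittest

class BrakeLights:
    """Hypothetical controller: hard braking (pedal pressed at least
    75% of its travel) activates the extra-bright LED brake lights."""
    THRESHOLD = 0.75

    def __init__(self):
        self.led_on = False

    def pedal_position(self, fraction):
        # fraction is 0.0 (released) .. 1.0 (floored)
        self.led_on = fraction >= self.THRESHOLD

class BrakeLightMicrotest(unittest.TestCase):
    def test_hard_braking_activates_leds(self):
        lights = BrakeLights()
        lights.pedal_position(0.80)
        self.assertTrue(lights.led_on)

    def test_gentle_braking_leaves_leds_off(self):
        lights = BrakeLights()
        lights.pedal_position(0.30)
        self.assertFalse(lights.led_on)
```

Notice the developer-facing granularity: the test pins down one technical rule inside one subsystem, not a customer-visible driving experience.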

I used to own a BMW. I couldn't do much to maintain it myself, except check the oil.  I would lift the hood and admire the shiny engine, noting wistfully that cars no longer have carburetors, and that I would probably never again perform my own car's tune-up.

Much of what makes a great car great is literally under the hood.  Out of sight. Conceptually inaccessible to Customers, Product Managers, Marketers...even most Test-Drivers. What makes the Ultimate Driving Machine work so well is found in the domain of the expert and experienced Engineer.

In the same way, unit tests are of, by, and for Software Developers.

What's the Difference?

In both cases, we write the tests before we write the solution code that makes the tests pass.  Though they look the same on the surface, and have similar names, they are not replacements for each other.

For TDD:
  • Each test pins down technical behavior.
  • Written by developers.
  • Intended for an audience of developers.
  • Run frequently by the team.
  • All tests pass 100% before commit and at integration.
For ATDD:
  • Each test pins down a business rule or behavior.
  • Written by the team.
  • Intended for the whole team as audience.
  • Run frequently by the team.
  • New tests fail until the story is done.  Prior tests should all pass.
Which practice, ATDD or TDD, should your team use? Your answer is embedded in this Sportscar Metaphor.*
 

Behavior Driven Development


For a long time no one could clearly express what "Behavior Driven Development" or BDD was all about. Dan North coined the term to try to describe TDD in a way that expressed what Ward Cunningham really meant when he said that TDD wasn't a testing technique.

Multiple coaches in the past (me, included) have said that BDD was "TDD done right." This is unnecessarily narrow, and potentially insulting to folks who have already been doing it right for years, and calling it TDD.  The fact that many people join Kung Fu classes and spend many months doing the forms poorly doesn't mean we need to rename Kung Fu. (Nor should we say that "Martial Arts" captures the uniqueness of Kung Fu.)

I witnessed a pair of courageous young developers who offered to provide a demo of BDD for a meetup.  They used rspec to write Ruby code test-first.  They didn't refactor away their magic numbers or other stink before moving on to other randomly-chosen functionality. "This can't be BDD," I thought, "because BDD is TDD done well."

TDD is TDD done well.  Nothing worth doing is worth doing incorrectly.  I had been using TDD to test, code, and design elegant software behaviors since 1998. I wanted to know what BDD adds to the craft of writing great software.

I can say with certainty that I'm a big fan of BDD, but I'm still not satisfied with any of the definitions (and I'm okay with that, since defining something usually ruins it).  A first-order approximation might be "BDD is the union of ATDD and TDD."  This still seems to be missing something subtle. Or, perhaps there is so much overlap that people will come up with their own myriad pointless distinctions.

However we try to define it in relation to TDD, BDD's value is in the attention, conversations, and analysis it brings to bear on software behaviors.

In hindsight, I have already seen a beautiful demo, by Elisabeth Hendrickson, of TDD, ATDD, and (presumably the spirit of) BDD techniques combined into one whole Agile engineering discipline.

She played all roles (Product, Quality, Development) on the Entaggle.com product, and walked us through the development and testing of a real user story.  She broke the story down into a small set of example scenarios, or Acceptance Tests. She wrote these in Cucumber, and showed us that they failed appropriately.  She then proceeded to develop individual pieces of the solution using TDD with rspec.

Then, once all the rspecs and "Cukes" were passing, she did a brief exploratory testing session (which, by definition, requires an intelligent and well-trained human mind, and thus cannot be automated). And she found a defect!  She added a new Cuke, and a new microtest, for the defect; got all tests to pass; and demonstrated the fully functioning user story for us.

All that without rehearsal, and all within about 45 minutes.  Beautiful!

* I have a draft post that further describes, compares, and contrasts the detailed practices that make up ATDD and TDD, along with a little historical perspective on the origins of each. For today, I wanted to share just the Sportscar Metaphor. It's useful for outlining which xDD practices to use, and how they differ.