How to Maintain a Regression Test Suite

“Being busy does not always mean real work… seeming to do is not doing.”

-Thomas Edison

If you’ve recently gone through the gargantuan task of spinning up an automated browser testing suite from scratch, congratulations. It’s not easy. It’s a huge step toward deploying better code with fewer bugs, and toward achieving that coveted holy grail of software testing—Continuous Development (CD) with Continuous Testing (CT).

Now that you’re done, the real work begins. You need to maintain this test suite over time. It needs to keep up with your developers; it needs to evolve with your application. If left alone, it will degrade, it will break, it will throw lots of false positives, and it will stop being used. All that hard work will fall apart.

This need to maintain and expand an automated test suite is why QA automation teams tend to grow alongside the dev team: there is a roughly fixed amount of maintenance work per hour of development. A common rule of thumb is one QA automation engineer for approximately every four developers.

There are three separate ways to think about maintenance, all of which are important and are approached differently:

  1. Build-to-build
  2. For stability
  3. Over time

1. Build-to-Build Maintenance

With each new build, there is a risk of a user interface (UI) change breaking your test suite. If anything substantial about the UI changes, or the workflow being tested changes, the test will fail because it cannot find an element it’s trying to interact with.

To keep these well-maintained, the fastest and simplest approach is to break the tests as early as possible. Sometimes teams attempt to have developers consistently communicate to QA engineers the UI changes coming down the pipe, but it’s difficult to ensure consistency here. Furthermore, it’s difficult to discern which hooks to use for a test until you see the application itself in a browser. Instead, if you simply test the application as early as possible in the development cycle, QA teams can investigate failed tests and fix them before the new code is deployed to production.
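
To make the idea of “hooks” concrete, here is a minimal sketch of what such a test might look like in TestCafe (the framework mentioned later in this guide). The URL, credentials, and data-test attributes are hypothetical, not references to a real application:

    import { Selector } from 'testcafe';

    // Hypothetical login workflow; the page and attributes are assumptions.
    fixture('Login workflow')
        .page('https://staging.example.com/login');

    test('user can log in and reach the dashboard', async t => {
        await t
            // Dedicated hooks (data-test attributes) survive visual redesigns far
            // better than CSS classes or element positions do.
            .typeText(Selector('[data-test="login-email"]'), 'qa@example.com')
            .typeText(Selector('[data-test="login-password"]'), 'not-a-real-password')
            .click(Selector('[data-test="login-submit"]'))
            .expect(Selector('[data-test="dashboard-header"]').exists).ok();
    });

If the front end later renames its CSS classes or rearranges the page, a test like this keeps passing as long as those data-test hooks stay in place.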

2. Maintenance for Stability

Test instability (or flappiness, or flakiness) simply means that a test won’t pass or fail consistently against the same application. This can happen for many reasons: a test frequently times out due to network instability or a slow testing environment; test data can change; and browsers are their own kettle of fish, so tests can fail on one browser but not another even when there are no bugs.

You don’t want these tests breaking when it’s time to find bugs. To monitor them for stability, it’s best to run them hourly (or at a similar frequency) in an environment parallel to the deployment cycle, sending results only to the QA team. If they pass and then fail without a new build having been deployed, you’ll know there are stability issues, and you can fix them before they raise alarms with the dev team.
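
One possible way to wire this up, sketched with TestCafe’s programmatic API. The hourly schedule itself would live in your CI system; the test paths, browser choice, and notifyQaTeam helper are assumptions for illustration:

    import createTestCafe from 'testcafe';

    // Hypothetical hourly stability run, triggered by a CI scheduler rather than
    // by a new build, so flaky failures surface to QA without alarming the dev team.
    async function stabilityRun(): Promise<void> {
        const testcafe = await createTestCafe('localhost');
        try {
            const failedCount = await testcafe
                .createRunner()
                .src(['tests/regression/**/*.ts'])   // assumed location of the suite
                .browsers(['chrome:headless'])
                // Quarantine mode re-runs failing tests, separating genuinely broken
                // tests from ones that only fail intermittently.
                .run({ quarantineMode: true });

            await notifyQaTeam(failedCount);         // hypothetical QA-only reporting hook
        } finally {
            await testcafe.close();
        }
    }

    // Placeholder for whatever QA-facing reporting channel your team uses.
    async function notifyQaTeam(failedCount: number): Promise<void> {
        console.log(`Stability run finished with ${failedCount} failing test(s).`);
    }

    stabilityRun();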

3. Maintenance Over Time

This is where things get hard. After building your regression test suite, your application will continue to change. In addition to brand new, significant features being added—where adding tests is not easy but the need for new tests is more obvious—features morph and small new features are added. Customer usage changes in ways not directly related to new features as customers discover new happy paths throughout your application. If you’re not careful, your coverage will drop as the application and the testing suite drift from each other. There are in fact two risks: you’ll leave important features uncovered, and you’ll maintain tests that are of low value (this is called “test bloat”).

Maintaining coverage of the most pivotal use cases or user flows requires data. You need to understand factually how your customers are using your application, and compare that information to your current testing coverage. We cover the data-driven testing process in this post; as a business, we specialize in driving test coverage with data. We’re passionate about data-driven end-to-end testing because getting it right solves a multitude of problems at once: your test suite is more accurate, more stable, and costs far less time and energy to maintain.

Test maintenance is an ongoing task that requires substantial attention; without it, all the hard work of building the browser testing suite will yield diminishing value until the suite stops being used. The good news is that the hardest work is behind you, and with rigor and a good plan for each form of maintenance, the testing suite you just built will yield useful results for years to come.

Picking the Optimal Number of End-to-End Tests

“Virtue is the golden mean between two extremes.” -Aristotle

It’s not logical to develop an individual end-to-end browser test for every possible use case. “Then how many tests should our team produce?” you ask. The answer isn’t an easy one, and there are several factors at play when deciding what number of tests is just right.

Too Many or Too Few?

We’ve seen teams that have fewer than a dozen browser test cases, and teams that have 1,500. Being at either end of this spectrum brings challenges. While having too few tests might mean missing real bugs and releasing a sub-par product, having too many can strain employees and resources, not just in maintaining the tests but in monitoring them as well. Returning too many false positives causes fatigue and decreases the credibility of your test suite. We’ve discussed previously the concept of optimizing tests, and similar principles apply here.

First off, it’s important to be realistic about the complexity of your application’s UI: it’s extremely rare that an application would require 1,500 end-to-end tests, because it’s unlikely that there are 1,500 distinct ways end users interact with it. If you tracked how users navigate your application, you would more likely find less than a tenth of that: 60 or so user stories, roughly half of them core flows that occur frequently and the rest edge cases that occur rarely. Even for very complex applications, we very rarely see more than 100 use cases that more than 0.5% of users traverse. It’s typically much fewer.

Realistically, a company that has 1,500 end-to-end tests for its web application would likely be better off only running a few hundred. Not only does reducing your number of tests save money and manpower, but it also speeds up your testing and production cycle. Your teams can work on improving the features of the application rather than chasing down non-issues.

On the other hand, while there is such a thing as too few tests, it’s important to ensure that most of your well-traversed user stories are addressed in your testing suite. Otherwise, app-breaking bugs will make it into production, resulting in grumpy users and an unhappy and fatigued internal team.

Picking a Number That’s Just Right

As we’ve discussed previously, the best way to decide how many end-to-end browser tests to perform is to determine how many different ways users actually interact with your application.

For many of us, our first instinct when approaching any problem is to seek out more data. In testing, acquiring data about how users routinely and realistically interact with an application is the first step to actually choosing the right test cases. After that, it’s up to you to decide how many of the user stories that actually occur provide enough value to your business to routinely test.

If you were to graph the cumulative distribution of observed user behavior, it would look like a classic long-tail curve: a steep climb through the most common behaviors, then a bend as the curve flattens toward its asymptote. After about 60-70% of total observed user behavior, the incremental coverage of each additional test case becomes negligible. From our own research, we find that this long tail of behavior doesn’t typically represent uncommon feature usage; most of it is behavior that doesn’t align with features at all. It can be ignored.
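
To make the cutoff concrete, here is a small sketch with invented traffic numbers: rank flows by observed frequency, accumulate until you’ve covered a target share of total behavior, and ignore the long tail:

    // Invented numbers: how often each user flow appeared in recent analytics data.
    interface UserFlow {
        name: string;
        occurrences: number;
    }

    const observedFlows: UserFlow[] = [
        { name: 'login -> dashboard', occurrences: 48_000 },
        { name: 'search -> product -> checkout', occurrences: 31_000 },
        { name: 'profile -> update payment method', occurrences: 6_500 },
        { name: 'export report as CSV', occurrences: 900 },
        // ...long tail of rarely observed flows
    ];

    // Rank flows by frequency and keep adding them until the selected set covers
    // the target share of all observed behavior; the remainder is the long tail.
    function selectFlowsToTest(flows: UserFlow[], targetShare = 0.7): UserFlow[] {
        const total = flows.reduce((sum, f) => sum + f.occurrences, 0);
        const ranked = [...flows].sort((a, b) => b.occurrences - a.occurrences);

        const selected: UserFlow[] = [];
        let covered = 0;
        for (const flow of ranked) {
            if (covered / total >= targetShare) break;
            selected.push(flow);
            covered += flow.occurrences;
        }
        return selected;
    }

    console.log(selectFlowsToTest(observedFlows).map(f => f.name));

With real data the list would be far longer, but the shape of the decision is the same: the selected set stays small while covering the bulk of what users actually do.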

In the end, while your biggest obligation is to provide a quality product to your customer, your second obligation is to do so quickly and cost-effectively. There exists a “just right” space of testing what matters and not testing what doesn’t, and data is your guide to finding this golden mean. If you can identify the core use cases in your application, you are no longer facing a false choice between “high coverage” on one side and good runtime and stability on the other: the trade-off disappears and you get the best of both worlds.

What is Continuous Testing? (And Why Should I Care?)

“Move Fast and Break Things” – Facebook’s Old and Very Catchy Developer Motto

“Move Fast with Stable Infra” – Facebook’s New, Less Catchy Developer Motto

Continuous testing is the holy grail of software quality assurance. Get it right, and you get to both move fast and not break things. But it’s the holy grail not only because it promises your application everlasting quality: it’s also elusive and requires a lot of work, work that might end up killing you (think: Indiana Jones and the Last Crusade).

Continuous Testing Complements Development

More specifically, it’s a practice that pairs with continuous development (whether or not your team is deploying continuously). Briefly, continuous testing means that you’re developing continuously and testing every commit as you develop. Slightly more specifically, in continuous development, every developer should be committing fairly small changes, very frequently. Every time a developer commits work to remote, a regression testing suite (at the Unit, API, and E2E levels) runs to make sure nothing in the application has broken. This testing can either occur on a feature branch (perhaps during a pull request) or after the feature branch has been merged on a pre-production environment. If the developer is modifying or introducing workflows, each workflow also gets tested end-to-end. That’s continuous testing.

Here’s why (beyond the obvious benefits of testing generally) it’s so great: your developer has just committed code when it gets tested. Not only have they just committed it, but what they committed is necessarily a fairly small amount of code. Your developers, therefore, get instantaneous feedback while they are still contextually aware of what they just wrote—it’s fresh. And because it’s a small amount of code, there are only so many places to look to find out what went wrong. Your developers can thereby rapidly find, understand, and fix the bug. It doesn’t get to production, it doesn’t make it to a JIRA list, and it doesn’t languish for weeks. It gets fixed in minutes, before production, and generally becomes a non-event.

“Dan, how do I drink the soul-nourishing waters from this mythical cup?”

First, you need to have the infrastructure and culture to support continuous deployment. This is hard and we don’t want to trivialize it, but we won’t focus on it here. If you have the infrastructure and culture to support continuous deployment, then you’re ready for continuous testing.

The biggest challenge to continuous testing specifically is that your tests need to be comprehensive in scope yet also run quickly. These tests ideally need to cover the entire application, and they also need to finish running in minutes. It would conventionally seem like you need to choose between runtime and comprehensiveness, but focus can get you both.

Improving Browser Testing Runtime

By far the slowest part of your testing suite will be your browser tests, so this is where you need to do the most work to limit their number and parallelize their execution. We discuss how to assess which user flows to focus on for browser (E2E) testing here, but in brief: focus on the core workflows your users actually follow frequently in your application. This maximizes bang for your buck and lets you cover everything important without extending runtime so much that it gets in the way of continuous testing. After you’ve built the E2E tests you want, parallelize the ones you can, using a parallel-friendly framework such as TestCafe.
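
As a sketch of what that parallelization might look like with TestCafe’s programmatic API (the concurrency factor, test paths, and browser are assumptions):

    import createTestCafe from 'testcafe';

    // Hypothetical per-commit E2E run: the same core suite, spread across several
    // identical browser instances to keep total runtime short.
    async function runCoreSuite(): Promise<number> {
        const testcafe = await createTestCafe();
        try {
            return await testcafe
                .createRunner()
                .src(['tests/e2e/core/**/*.ts'])   // assumed location of the core flows
                .browsers(['chrome:headless'])
                .concurrency(4)                    // four parallel browser instances
                .run();
        } finally {
            await testcafe.close();
        }
    }

    // Fail the CI step if any core flow regresses.
    runCoreSuite().then(failedCount => process.exit(failedCount ? 1 : 0));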

What If My Needs Are Complex?

If you want to test for edge cases, test cross-browser and cross-device, or otherwise have a very large browser testing suite, you may need to prioritize which tests get run with every build and which tests get run out of band, perhaps every few hours. This will allow you to make sure no regressions get to production that would affect core user flows, and any edge cases or low-impact regressions are caught a few hours later at worst.
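
One way to split the suite is to tag tests with metadata and let CI decide which set to run. Here is a hedged sketch using TestCafe’s fixture/test metadata; the priorities, URLs, and selectors are hypothetical:

    import { Selector } from 'testcafe';

    // Tag fixtures so CI can decide which tests run on every build and which run
    // out of band.
    fixture('Checkout - core flow')
        .page('https://staging.example.com/checkout')
        .meta({ priority: 'critical' });               // runs on every build

    test('customer can place an order', async t => {
        await t
            .click(Selector('[data-test="place-order"]'))
            .expect(Selector('[data-test="order-confirmation"]').exists).ok();
    });

    fixture('Checkout - edge cases')
        .page('https://staging.example.com/checkout')
        .meta({ priority: 'extended' });               // runs every few hours

    test('expired coupon shows a friendly error', async t => {
        await t
            .typeText(Selector('[data-test="coupon-code"]'), 'EXPIRED2020')
            .click(Selector('[data-test="apply-coupon"]'))
            .expect(Selector('[data-test="coupon-error"]').exists).ok();
    });

    // The per-build job would then select only the critical set, for example with
    // TestCafe's --test-meta priority=critical flag, while a scheduled job runs
    // the extended set.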

All of this requires a lot of work even when you have “made it.” It can feel like a truly gargantuan task to get there if you’re not close. Read on, and we’ll be able to help you break the process into discrete steps.

Optimizing Your End-to-End Browser Test Coverage

“[Q]uality is never an accident; it is always the result of intelligent effort.”

-John Ruskin, leading art critic and Victorian era writer

The need for intelligent effort is especially acute in the world of software development, where thoughtful effort must be applied at every step of development, and particularly during testing.

It’s imperative to test a product continuously, as a form of preventative care: during development, during and after deployments when it goes live, and periodically as it’s used in production. This ensures your customers get the most value possible from your web application and walk away content with their user experience.

But when it comes to browser testing, it can be difficult to decide how much testing you need. The practice of unit testing is conceptually simple: to determine test coverage, you count the lines of code and check whether each one is exercised by a test. Even this metric quickly breaks down, since it says nothing about whether particular lines of code really matter or whether all of a line’s logic has actually been tested. At the higher levels of integration, API, or browser-level (end-to-end) testing, most organizations don’t even attempt to come up with a metric, because the number of possible uses is combinatorially vast.
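
A tiny illustration, with an invented shippingCost function, of why the line-counting metric misleads:

    import assert from 'node:assert';

    // One line, two behaviors: free shipping at $50 or more, otherwise a flat fee.
    export function shippingCost(orderTotal: number): number {
        return orderTotal >= 50 ? 0 : 5.99;
    }

    // This single check executes every line of shippingCost, so a line-coverage
    // tool reports 100%, yet the paid-shipping branch is never actually tested.
    assert.strictEqual(shippingCost(100), 0);

Branch or path coverage narrows that gap at the unit level, but no comparable metric exists once you reach end-to-end user flows.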

The Old Standard

Typically, teams (often a mix of product and engineering) determine a set of core test cases based on feature sets or a specified set of requirements. Coverage is then calculated against this list: if a test exists to cover a case, it counts toward the coverage metric. Once all of the listed cases have automated tests, teams will typically say they have full coverage, if they’re not audacious enough to claim 100% coverage.

This metric, however, can be misleading. Has the team correctly identified all of the important use cases? Have they accounted for the many permutations of possible user flows through those use cases?

This raises a further question: how many user stories is it actually realistic to cover? Which potential user stories should actually be covered? Certainly, one engineer can’t test every potential user story. Doing so would not only take an unrealistic amount of time and resources; much of that time would be spent testing scenarios that rarely happen in the real world. Trying to test more and more potential use cases leads to test bloat, in which engineer-hours and test runtime trend ever upward while test stability and usefulness trend downward.

Using this method, any subset of user stories chosen for testing is likely to suffer from an issue as old as computer science itself: garbage in, garbage out. If the user stories being tested aren’t realistic, then the test results you receive will ultimately be of little use to your team. You’ll miss testing important use cases, and critical bugs will make it into production.

A Better Way

Ideally, a team should know exactly which user stories are most representative of real user behavior, and test primarily those. When tests are written to reflect real user pathways, the “garbage” disappears: the tests are close simulations of what real users will actually do. With no garbage in, we as testers avoid producing garbage out, the kind of flawed results that degrade both our users’ experience and our own revenue. If every test is grounded in reality, every result will be genuinely valuable to the software team.

Fortunately, product analytics make discovering real user stories possible. With the right toolset, developers can track how users actually navigate a site, develop data-backed user stories accordingly, and then identify the most likely ways a real user would actually use a web application. Once the most likely user stories are identified, effective testing can correspondingly be built, providing testing efficiency and optimal coverage.
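
As a rough sketch of the idea (the event shape and path format are invented), the core of such an analysis is just grouping analytics events into per-session journeys and counting how often each distinct journey occurs:

    // Hypothetical page-view event exported from a product analytics tool.
    interface PageEvent {
        sessionId: string;
        path: string;       // e.g. '/login', '/dashboard'
        timestamp: number;
    }

    // Group events into per-session journeys, then count how often each distinct
    // journey occurs; the most frequent journeys become candidate E2E test cases.
    function rankUserFlows(events: PageEvent[]): Map<string, number> {
        const sessions = new Map<string, PageEvent[]>();
        for (const event of events) {
            const session = sessions.get(event.sessionId) ?? [];
            session.push(event);
            sessions.set(event.sessionId, session);
        }

        const flowCounts = new Map<string, number>();
        for (const session of sessions.values()) {
            const flow = session
                .sort((a, b) => a.timestamp - b.timestamp)
                .map(e => e.path)
                .join(' -> ');
            flowCounts.set(flow, (flowCounts.get(flow) ?? 0) + 1);
        }
        return flowCounts;
    }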

Ultimately this change—from determining test cases with a team in a room, to detecting test cases from what your users are doing on your application—will become an industry standard. The technology to enable this pivot is new to the scene compared to traditional testing customs, and will require many teams to alter their expectations and KPIs to allow their testing teams to win while making the shift.

We built ProdPerfect to plug product analytics directly into testing, making that transition easier and more immediately impactful. Whether or not you work with us to shepherd this transition toward a more data-driven testing suite, the seismic shift toward test case detection will mean a higher-quality software product for your business and the industry as a whole.

ProdPerfect Featured in TechCrunch for $2.6 Million Seed Round Raise

“ProdPerfect started when co-founder and CEO Dan Widing was VP of engineering at WeSpire, where he saw firsthand the pain points associated with web application QA testing. Whereas there were all kinds of product analytics tools for product engineers, the same data wasn’t there for the engineers building QA tests that are meant to replicate user behavior.

He imagined a platform that would use live data around real user behavior to formulate these QA tests. That’s how ProdPerfect was born. The platform sees user behavior, builds and delivers test scripts to the engineering team.”

Read the full TechCrunch article here.