Preventing Human Burnout: A Meaningful Approach to Measuring E2E Test Coverage

TEST COVERAGE IN A LIVING ECOSYSTEM

I like to see any company’s Quality Assurance (QA) as a living, breathing ecosystem. The ecosystem is defined by your business needs, the complexity of your application, and the innumerable ways in which you QA your system. Together, you, your developers, stakeholders, and customers all live in this ecosystem and vie for the free energy therein. In order to maintain ecosystem equilibrium, your company must balance each of these moving parts. Every ecosystem also has a rate of churn among its constituents: push your developers or customers too hard, and they will burn out and leave the ecosystem. In this way, your QA ecosystem is as much about maintaining a stable application as it is about maintaining stable humans.

The QA ecosystem is what makes designing metrics for test coverage uniquely challenging. Each business has different needs for different types of tests, just as each business has a distinct web application and a unique user base. And because there are so many types of tests and ways to test, there are countless ways to measure adequate coverage. Despite this complexity, the basis for making measurement-related decisions remains largely unspoken: Should we prioritize mere statistical coverage of code, features that are more important to our company’s needs, or areas that the internal team deems more likely to break?

There is no single solution. If we want to measure end-to-end test coverage successfully, we must first identify metrics that are truly meaningful to our individual businesses and which reflect that individual humans are part of this system. In distinguishing proper QA coverage metrics for your business, here are three key considerations to keep in mind: 1) moving fast, 2) having acceptable coverage, and 3) covering business priorities.

  1. Moving fast: Are you impeding your developers unduly? To move fast, we need to balance the cost of adding tests with the cost of repairing bugs and the cost of runtime in deployment. The more tests you write, the more stressful and time-consuming writing code becomes. The fewer tests you write, the more stressful and time-consuming maintaining code becomes.
  2. Having acceptable coverage: Since no QA system will realistically have 100% coverage for any application or cover every single use case, we must focus on the level of coverage we actually need, accounting for where the limited resources of our ecosystem are best allocated.
  3. Covering business priorities: As we measure coverage, we need to balance business-level metrics or key performance indicators (KPIs) with the ways that users are actually using the web app. If your sign-up flow breaks for a few hours but your business model focuses on another conversion metric (such as adding items to a cart), this may not be the worst possible bug for you. It’s up to each business to determine which test efforts directly yield ROI.

WHY DESIGNING QA METRICS IS PARTICULARLY HARD

  1. Engineers have to manage complex product interactions in a changing ecosystem. Worse yet, different types of testing often have different owners with different stakeholders who have different needs. This means managing variation in developer expectations, in management needs, and in test suites. In the QA realm, since there is so much variation in what applications do, there are many ways to test: Do you test at the unit and the controller level? Do you stub out calls to the database, or test them directly? Do you need to test your load balancing? And since the test ecosystem needs to remain dynamic, the test suite will change every time the product changes. Even if nothing changes other than the total number of users, engineers need to make sure there are enough tests to load test the site.

  2. There are no tools or individuals fully available and equipped to effectively measure coverage. Product Owners tend to think of high-level features and development processes, then delegate responsibilities. Product Managers and QA Managers tend to focus on regression testing and manual testing. DevOps tend to own building test environments. And Product Engineers (and most engineers) tend to be the ones writing unit, integration, and API tests. There’s no singular vantage point for fully understanding what has been covered. It gets even more complicated when you consider feature complexity. Take the example of a login: there are innumerable ways to perform the simple act of logging into a website. You can log in from the homepage, from the checkout, or even through your email provider. Today, it’s impossible for a human to navigate this complexity. When I was a Product Owner, I remember reviewing the design of my web applications and realizing, “Oh my. No one is using this.” Looking back, I’m ecstatic that I screwed up. It showed me that there is an incredible number of biases and assumptions in how we think about the applications we’ve built that simply don’t align with how they’re actually used on the ground.

  3. Product Owners often don’t have an effective means of managing incident response. Most of the time, when something breaks, the go-to response is, “let’s write a new test for the bug we found and make sure it doesn’t break again.” This put-out-every-fire-as-it-comes approach to writing tests leads to a bloated test suite with hundreds of web regression tests and thousands of unit tests. This tactic may sound sensible, but in reality, regression testing must consider speed—not just of test suite runtime, but of human developer productivity. In the industry, we talk about burnout because of too many bugs, but what we don’t talk about as often is burnout from too much testing. We still lack an organized ethos for how companies can balance test coverage with other variables, including speed and fighting employee burnout. To fully solve this problem requires fundamental shifts in how we assess our QA ecosystems.

TWO CONSIDERATIONS FOR MEANINGFUL DESIGN

  1. Set a defined framework for meaningfulness. Determine what should be tested in a way key stakeholders agree yields intrinsic value for your company. In addition to building an ecosystem that prioritizes preventing developer burnout, what’s meaningful to us at ProdPerfect is a data-driven, user-based framework: we test what happens most, based on how users are actually using a site. We’ve defined a framework for covering a critical amount of everything that happens on a given application. It ensures that the things people do most frequently on an app aren’t going to break, while avoiding overburdening developers with maintaining bloated test suites. As an organization, we’re devoted to minimizing the time needed to keep up with QA in order to prevent both developer burnout and customer burnout from exposure to too many bugs. Our principle is that just as we need to account for the burnout caused by too many bugs, we likewise need to account for the burnout caused by too many tests.

  2. Determine an acceptable level of coverage when it comes to bugs. Once the framework is defined, set expectations for what level of coverage is acceptable with respect to the impact of a bug when it reaches production. Is there a critical mass of users on certain parts of the webapp? Is some functionality hypercritical for customers getting value out of the product? For your internal KPIs? If you can answer the question of coverage acceptability in a way that allows you to have a sustainable business model and is backed by quantitative analysis, your employees will be more likely to maintain their zeal for their work, directly impacting the development of your product and the satisfaction of your customers.

A RUBRIC FOR COVERAGE

QA is a living, breathing ecosystem inhabited by humans. These humans are limited resources who can burn out from any number of factors. For this reason, the rubric should never be to have 100% coverage. The rubric always needs to consider the humans at the company and the humans using the app when deciding how much test coverage is meaningful.

When it comes to QA metrics, there are no right or wrong answers. There is simply the question: “Is your ecosystem stable and sustainable?” For some companies, stability means: Yes, we will burn out developers. They will spend a year here and leave, because we are writing so many friggin’ tests. For ProdPerfect, because we prioritize maintaining a balance between the three considerations of 1) moving fast, 2) having acceptable coverage, and 3) covering business priorities, we’re empowering both ourselves and our customers to let the right things break and to stop the right things from breaking. And we’re going to keep building on whatever parts we can in order to meet the changing needs of the changing QA ecosystem, humans included.

ProdPerfect Removes the Burden of QA Testing

There are typically three levels of quality assurance testing maturity. One is the classic waterfall approach where it takes weeks to get a deploy ready. Then, there is the continuous development and continuous delivery approach where QA engineers are put in place to handle automation. The most mature way of tackling QA is removing QA engineering as a separate practice, and making all your engineers responsible for the quality of features.

The problem is that none of these levels of maturity seem to be able to get QA right.

“No one has a good answer. Enterprises are failing in waterfall structures. Agile teams are failing or running into difficulty hiring and maintaining QA engineers. Silicon Valley is having to hire only the most senior folks, and even then it is through force of will and pain they are able to keep test suites to a point they are happy with,” said Dan Widing, founder and CEO of the automated QA testing provider ProdPerfect.

Automating QA
There is a better way. ProdPerfect removes the struggle it takes to set up a QA engineering department, and automates QA testing using live user data. This is “dramatically cheaper, dramatically faster, gets you a result faster, [and] is going to nearly guarantee that you catch bugs as part of your process,” Widing explained.

ProdPerfect is able to obtain live user data by analyzing web traffic and creating flows of common user behavior. That behavior is then built into an end-to-end testing suite that ProdPerfect maintains, expands and updates based on actual user data over time.

According to Widing, QA testing is “incredibly difficult, painstaking work that almost tends to be underappreciated by the organization itself,” and the folks who are having to deal with this are just overburdened with work. “We have a mechanism that lets us shake out the environment the customer needs us to test against… and then we are using a testing framework that lets us plug in our learnings from these steps to produce an automatically updated test suite,” he continued. “The experience the customer gets is a black box QA engineering department… What you get at the end is an auto-updated test suite that can run continuously in your CI system that just tests your application.”

ProdPerfect covers every core workflow within an application, provides 95 percent or greater test stability, regenerates broken tests in less than four hours, and delivers test coverage for new feature sets in less than 48 hours.

“You don’t need to do anything to build, maintain, or expand the testing suite. We got it. You need to respond to bug reports, of course, and keep a stable testing environment up and running for us, but that’s all. Very frequently people call this ‘magic’ or ‘too good to be true,’” the company stated on its website.

Getting the right metrics
ProdPerfect not only works to ensure QA testing is covered, but also helps teams understand which metrics are the right ones for quantifying success.

“That is something we put into our service every step of the way. What your browser automation should be doing is catching as many significant bugs as possible whatever stage it is testing at and then otherwise staying as much out of the way,” said Widing.

You will know you have a solid testing foundation in place when you don’t ship a fire-drill-style bug and have to wake up in the middle of the night to figure out how to deal with it or who is on top of it, Widing explained.

Since ProdPerfect is already analyzing what users are doing, it can project how things should be working and make sure they stay working. The solution tests features continuously, detects any significant bugs and verifies the feature set is actually working.

“We aim to stay out of the way by crafting what are the other metrics that are important to make sure you are not slowing down the software team,” said Widing.

Additionally, the solution will measure against minimum-frequency thresholds to confirm its performance.

“If you don’t set up your design and data strategy or set up the right tooling, everything falls apart and you have to work particularly hard to make sure all the pieces work together otherwise any singular improvement is not going to help you at all,” Widing said.

This article was first published on SDTimes.com.

Who Should Determine End-to-End Test Cases?

“A knife has the purpose of cutting things, so to perform its function well it must have a sharp cutting edge. Man, too, has a function…”

– Aristotle

In the distant (in software-years, which are much like dog years) past, a company’s development team would focus on new product code, and then a dedicated quality assurance (QA) team would write corresponding test code (including any unit tests). One of the pitfalls of this practice was that developers might get “lazy” about code quality, and might throw quality concerns “over the wall” to QA. This slowed down development and led to an ultimately antagonistic relationship between developers and QA teams, so it fell out of favor.

The “QA does QA” practice has mostly given way to moving testing into the hands of the developers themselves. Most of the time, developers now write their own unit tests and API tests. This makes sure developers take ownership of quality and thereby incentivizes them to put more focus on writing high quality code in the first place. How this is implemented varies: some teams use test-driven development (TDD) to write tests first and then build code to pass those tests. Some teams add peer code review. Some teams embed QA within dev teams to help them plan for quality at the onset. These practices are similarly meant to keep developers from building tests that are easy to pass.

The swing from QA-driven test-writing to developer-driven test-writing has, for some teams, crept into browser or end-to-end (E2E) testing. Contemporary dev teams either assign E2E test-writing to developers or to QA automation engineers, and different leaders can have strong opinions on who should really be taking point, us included.

At ProdPerfect, we believe that developers are the right choice to take point on writing unit and API tests, but making the right tradeoffs about what belongs in a core E2E test is nearly impossible for a developer to do alone. Developers have a strong sense (through the context of developing them) of the intent of unit-level and API-level code, so they know best how to reliably test their own code. But it’s a stretch to expect developers to bear the burden of comprehensive end-to-end testing themselves. Adequately testing the full application across the myriad probable user journeys through it involves monitoring, analyzing, and accounting for complex interactions between many code modules. Then, from that set of possibilities, they must accurately choose the right set of tests, one that deploys developer time, server resources, and runtime in a way that balances the business’s stated objectives. And they must re-evaluate those choices on a regular basis. Developers typically focus on small slices of an application at a time. To expect them to fully bear the burden of comprehensive E2E testing is to ask them to understand the entire universe of the application’s development and usage, forwards and backwards in time. Truly no one is positioned to do so.

Developers are good at doing what they’re hired to do: developing code to innovate product—and even testing that code—and should remain primarily focused on doing so. It’s a waste of resources to task developers with end-to-end testing, and they’re not positioned to do it best.

Instead, due to the complexity of effective end-to-end testing, the ideal person to determine and execute end-to-end user tests is someone whose core expertise and focus is understanding the entire user journey and its outcomes, not someone asked to tack on end-to-end testing as an afterthought. E2E testing should be driven by an independent group with a mandate to focus on it and the time invested to maintain it: this can be the product team, or it can be QA as a whole (a QA analyst, a QA automation engineering team, etc.). These groups can, with the help of tools and data, wrap their arms around the different user journeys, develop test cases for them, write tests designed to catch bugs at the user-journey level, and maintain those tests over time. This level of testing doesn’t require intimate understanding of the underlying modules of code behind the application; it’s instead meant to ensure that users can always use the application as they want to. Software teams should leave testing of the lower levels of the application to the lower levels of testing: unit and API/integration testing.

Ideally, QA teams should not simply be tasked with guessing at how users are using their applications. They can and should employ advanced product analytics to understand these user journeys and how they are evolving with time. In this way, focused testers are then able to fully understand which test cases are most relevant and write the best corresponding tests to ensure quality without bloating the testing suite.
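To make that concrete, here is a minimal sketch of what mining product analytics for user journeys might look like. It assumes you can export navigation events (a session ID, the page visited, and a timestamp) from whatever analytics tool you use; the event shape and function names below are hypothetical illustrations, not a description of any particular product, including ours.

    // Hypothetical shape of an exported analytics navigation event.
    interface NavEvent {
      sessionId: string;
      page: string;       // e.g. "/login", "/cart", "/checkout"
      timestamp: number;  // epoch milliseconds
    }

    // Group events into per-session journeys, ordered by time.
    function buildJourneys(events: NavEvent[]): string[][] {
      const bySession = new Map<string, NavEvent[]>();
      for (const e of events) {
        const list = bySession.get(e.sessionId) ?? [];
        list.push(e);
        bySession.set(e.sessionId, list);
      }
      return [...bySession.values()].map(session =>
        session.sort((a, b) => a.timestamp - b.timestamp).map(e => e.page)
      );
    }

    // Rank journeys by how often real users take them; the most frequent
    // ones become the strongest candidates for E2E test cases.
    function rankJourneys(journeys: string[][]): [string, number][] {
      const counts = new Map<string, number>();
      for (const j of journeys) {
        const key = j.join(" -> ");
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
      return [...counts.entries()].sort((a, b) => b[1] - a[1]);
    }

Even a rough ranking like this replaces guessing about which journeys matter with covering the journeys users demonstrably take most often.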

In any successful business, different roles are designed to allow talented individuals to specialize and focus. Whether it’s specializing in sales operations vs. selling and closing, marketing content vs. advertising strategy, or development and testing, specialization allows teams to operate with focus and excellence. With E2E, it follows that a specialized and complex need should be filled by a designated individual with a specialized focus and toolset in order to get the highest quality result without wasting resources.

What is End-to-End Testing?

“You can test every servo, every wire, every component; the hardware, the firmware, the software. But you have no idea whether your robot is going to work, and whether it’s going to work in the wild, until you let it out into the wild and let it fail. And it will fail, even when it seems like every component works individually. Even well-built things get hairy when you connect them all together.”

We didn’t have a truly applicable analogy for end-to-end testing until we heard it from a customer who had built robots before moving over to web software. Sure, there must be some sort of theoretical, Platonic ideal of testing in which exhaustive testing of components and subsystems will guarantee that the robot—or your web application—will work without needing to be run “in the wild.” But, we’ll wager, nobody’s found it yet.

The Advantage of End-to-End Testing

This is the essence of end-to-end (E2E) testing, and why it’s so important. Your web application is probably less complex than a Mars rover, but you similarly won’t know whether it’s going to work once you put all the bits together until you’ve done just that. Your unit tests will test individual blocks of code for their core functionality. API/integration tests will make sure that your “subsystems” are working as intended. But, E2E tests are intended to test the entire application as real users would use it, in conditions similar to how real users will use it.

Therefore, an E2E test will actually launch the application in a browser and interact with it in a way that tests every layer of the application: the user interface itself, the browser (and compatibility with it), the network, the server, the APIs, the codebase, any 3rd-party integrations, and any hardware—the whole kit. As with the robot, you don’t really know how all of these components and layers will work together until they’re doing just that—working together. You therefore don’t want to ship changes to your application without testing them end-to-end (unless you don’t mind bugs sneaking through).

E2E tests can assume many names, in part depending on their level of rigor. They can be called browser tests, smoke tests, user acceptance tests (UATs), or (less accurately) UI tests. Typically these all mean the same thing: you launch a browser to interact with the application, and check that specific behaviors still work as intended.

There are two ways to launch this browser and interact with the whole application: the first is with a human who checks for failures by clicking around, referred to as manual testing. The second is by having a machine virtually simulate a human, using predetermined validations to check for failures, referred to as automated testing.
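To make the automated flavor concrete, here is a minimal sketch of an automated E2E test written with the open-source Playwright test runner. The URL, form labels, and login workflow are hypothetical stand-ins for whatever your own application exposes.

    import { test, expect } from '@playwright/test';

    // A simulated user logs in from the homepage and lands on the dashboard.
    // The test exercises the UI, the browser, the network, the server, and
    // the backing services together, which no unit or API test can do.
    test('user can log in from the homepage', async ({ page }) => {
      await page.goto('https://app.example.com/');  // hypothetical URL
      await page.getByLabel('Email').fill('user@example.com');
      await page.getByLabel('Password').fill('correct-horse-battery');
      await page.getByRole('button', { name: 'Log in' }).click();

      // The validations are predetermined and user-visible: where did the
      // user end up, and can they see what they came for?
      await expect(page).toHaveURL(/\/dashboard/);
      await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
    });

A human tester following a written script and this machine-driven test check the same thing; the difference is that the machine can re-check it on every single deploy.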

Data-Driven Testing

And as with our Mars rover, it’s ideal to test the application by simulating real-world usage as precisely as possible: testing the application in the same way that your users are using it, or are going to use it. This requires having data which tells you how your users are in fact using your application. Utilizing real user data is always possible when testing for regressions. But, user behavior needs to be estimated (or, frankly, guessed) when testing brand new features because you don’t have data about real usage quite yet.

Some teams might be tempted to do “kitchen sink” testing and try to test the application in every possible way, rather than in a way that reflects user behavior. We discourage this in more detail elsewhere, but the primary consideration is that E2E tests are the most expensive, least stable, and slowest tests you’ll run. Having too many will incur dramatically increased costs for steeply diminishing returns.

Limitations of E2E Testing

Finally, a word of caution: E2E testing has limitations. It’s great at testing that the application will generally function: that users can always move through workflows without errors or breakages. It’s great at ensuring that all of the gnarly bits of code are working together when a user checks out, logs in, or executes an analysis. But E2E testing isn’t great (or efficient) at testing that the right data is going to be displayed for a user or stored in the application—this is a place for unit tests to thrive. E2E testing also isn’t great at showing you where in your codebase your bug is hiding—just where in the user’s journey they’re going to find that the application is broken. Finally, E2E testing isn’t great at telling you whether your page is formatting incorrectly or “looks wrong.” It can do a bit of this, but it’s a heck of an expensive way to do so. We recommend using a tool such as Percy.io for testing visual regressions instead.

In short: ignore E2E testing at your own peril, but over-relying on it won’t do you any favors, either.

How to Design End-to-End Testing for Continuous Development

“Even if you’re on the right track, you’ll get run over if you just sit there.”

– American Humorist Will Rogers

Continuous development (CD) has become the new gold standard in today’s tech marketplace, and with good reason. Continuous development is fast, efficient, and allows a company to get the latest version of its product to customers as quickly as possible. It furthermore allows developers to get rapid feedback on small chunks of code by continuously testing that code as it’s developed, as long as continuous testing is also in place.

But with any change comes new challenges, especially if a development team is transitioning from a longer (for instance, monthly versioning) development cycle to a continuous development cycle. As part of this shift, a team must rearrange its end-to-end testing cycles accordingly, moving from monthly testing to (ideally) continuous testing on every build. Every release must be adequately tested, and with continuous development multiplying the number of releases, the quantity of required test runs rises in turn. Each new change or release carries the risk of breaking the application, and each new change or release must be tested.

Because of this challenge, many testers turn to automated testing solutions for help. With regular continuous updates, manually testing an app prior to each release is infeasible (not to mention unbearably expensive). Automation makes testing each deploy possible, removes the primary testing burden from the development team, and allows the team to focus instead on other priorities like product improvement. Though automation is relatively simple to implement in unit or API testing, a proper approach to automating end-to-end testing requires further consideration.

Outsourced Testing

If a team’s end-to-end testing is outsourced to a manual testing company (crowd-sourced or otherwise), they are most likely to pay per test. This cost might be bearable for a team shipping and testing code once per month. However, a team deploying five or 10 times per business day is suddenly looking at 20 to 40 times the cost of the team deploying once per month. We discuss some other pros and cons of outsourced/crowdsourced manual testing in this blog post.

Test Code Automation

Instead of employing outsourced manual testing, a development team can use “lean and mean” testing code to automate its tests. To keep up with a CD release cycle, the team needs a core regression testing suite that covers important workflows and runs in minutes rather than hours. This means prioritizing tests that focus on the workflows that impact the most users, core functionality, and/or revenue. We at ProdPerfect have already discussed choosing the correct number of end-to-end tests, and that information is just as relevant here.

What About Less Common Flows?

Teams that test continuously will only be testing core workflows with each build; what about less common flows, edge cases, and the like? These tests should be maintained as a second suite altogether, run out of band, at timed intervals, on a stable testing/staging environment. This strategy can also be used for cross-browser/device testing, which, if extensive, will also take more than a few minutes. By splitting your testing suite in two, you get the benefit of continuous testing on continuous delivery without losing the ability to find edge-case bugs. One way to express the split is sketched below.
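As one illustration of this split (a sketch, not a prescription), Playwright’s project configuration can express a fast core suite that runs on every build and an extended suite that CI triggers out of band. The tag names, directory layout, and browser choices below are assumptions; any comparable test runner can express the same idea.

    import { defineConfig, devices } from '@playwright/test';

    export default defineConfig({
      testDir: './e2e',
      projects: [
        {
          // Core workflows, tagged @core in their test titles; run on every
          // build and expected to finish in minutes.
          name: 'core',
          grep: /@core/,
          use: { ...devices['Desktop Chrome'] },
        },
        {
          // Everything else: edge cases and cross-browser checks, run out of
          // band (for example, nightly) against a stable staging environment.
          name: 'extended',
          grepInvert: /@core/,
          use: { ...devices['Desktop Firefox'] },
        },
      ],
    });

The CI system then runs the core project (npx playwright test --project=core) on each commit and schedules the extended project separately, so the deploy pipeline stays fast without abandoning the long tail of flows.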

It may feel like a daunting task to begin a CD cycle for the first time, but with foresight and intelligent testing plans, any team can take full advantage of its benefits to improve their product and make their customers ever happier.