Is Latency Slowing Down Your E2E Tests?

In many modern software development workflows, before a developer can deploy their work, they need to run a set of automated tests as part of the build process. These tests usually involve many quick, small unit tests, some integration tests and fewer bulky end-to-end (E2E) tests.

A little while ago, I was tasked with helping a dev team improve their build times. When we drilled into these build times, we found that testing was the main bottleneck. To measure improvement, I found that there are two essential metrics to track when reducing software build times: test duration and test failure rate.

If a test takes too long to run, it will obviously delay the build. Likewise, if a test fails intermittently (a flaky test) and the build has to be rerun in order to pass, that also delays the build process.
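
To see how these two metrics interact, here is a minimal Python sketch. It rests on a simplification of my own, not on measurements from the team: a failed build is rerun from scratch until it passes, and each attempt passes independently with the same probability, so the expected number of attempts is 1 / pass rate. The function name and the example numbers are hypothetical.

```python
def expected_build_minutes(suite_minutes: float, pass_rate: float) -> float:
    """Expected total test time when a failed build is rerun until it passes.

    Assumes each attempt is independent and passes with probability
    `pass_rate`, so the expected number of attempts is 1 / pass_rate.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return suite_minutes / pass_rate


# Hypothetical numbers: a 10-minute suite that passes half the time
# costs 20 minutes of testing per successful build on average.
print(expected_build_minutes(10, 0.5))  # 20.0
```

Under these assumptions, shortening the suite and raising its pass rate both reduce the same number, which is why it is worth tracking the two together.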

Selenium offers a suite of open-source tools useful for E2E testing. In a previous HackerNoon article, Rahul Jain introduced the Selenium WebDriver for E2E testing and also described how you can use LambdaTest’s Selenium Grid to speed up the process.

LambdaTest does offer some nice tools to help speed up software tests, like running tests in parallel, showing dashboards to monitor which tests are flaky and automatically rerunning failed tests, but there is still more that can be done to improve test speeds.

Recently, LambdaTest announced their new HyperTest offering, which promised to run tests "up to 70% faster than other traditional cloud test execution platforms". I was curious to understand what the underlying improvements were, and found that much of this comes down to latency.

Traditional cloud testing platforms run different parts of the testing lifecycle on different machines, which introduces network latency when running the tests. Some of that latency is unavoidable, since the speed of light ultimately limits how fast data can travel between machines. By removing this complexity and instead performing test execution on one machine in the cloud, LambdaTest achieved these speed improvements whilst still offering their customers the benefit of the cloud.

Now, this also has an impact on build failure rates. Whilst you might think that latency can only cause a tiny percentage of tests to fail, these small failure rates compound across a test suite and can lead to much bigger consequences.

Suppose you have a test suite of 150 tests, each of which individually has a 99% pass rate. Instinctively, you’d think that the entire test suite would have a 99% pass rate too. However, if the build only passes when every test passes and the failures are independent of each other, that 99% success rate compounds 150 times to an overall success rate of just ~22% for the test suite (0.99^150).
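
The arithmetic is easy to check for yourself. The short Python sketch below reproduces the figure, assuming that every test must pass for the suite to pass and that tests fail independently of one another. The function name is my own and is only for illustration.

```python
def suite_pass_rate(per_test_pass_rate: float, num_tests: int) -> float:
    """Probability that every test in the suite passes.

    Assumes tests fail independently, so the per-test pass rates multiply.
    """
    return per_test_pass_rate ** num_tests


# 150 tests at a 99% pass rate each, as in the example above.
print(f"{suite_pass_rate(0.99, 150):.1%}")  # 22.1%

# The same suite with each test at a 99.9% pass rate.
print(f"{suite_pass_rate(0.999, 150):.1%}")  # 86.1%
```

Note how sensitive the result is: under the same assumptions, raising each test from 99% to 99.9% lifts the whole suite from roughly 22% to roughly 86%.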

Cutting latency isn’t just something for LambdaTest to do, though. In my own experience helping developers cut build times, I found that latency often plays a big role in E2E tests. Slow third-party API requests or unnecessary third-party assets can also introduce the latency that causes slow and flaky tests.

Whilst cutting latency can improve your software builds, I think this story also has a lesson for all software engineers about the harm that unnecessary complexity introduced now can cause later.