Performance Testing in a Distributed Computing World — From Mobile and Beyond

Written by _arjunv | Published 2017/06/05
Tech Story Tags: devops | performance | testing | mobile-apps | continuous-integration


Computing has changed a lot since the days of digital watches, televisions, telephones, and the Apollo spacecraft, where the software component (called embedded software) existed merely as a means to control specialized hardware that was typically not thought of as a computer.

Testing in the pre-browser PC days

Just a few decades back, computing was a term restricted to personal computers (PCs) that people used for spreadsheets (VisiCalc, Lotus) and other desktop applications such as email clients, occasionally connecting over the local network or the internet. Software applications were lightweight to account for the limited memory and computing power. This software was meant to run on a few dedicated machines, so testing mostly meant functional testing rather than performance testing, though developers had to treat performance as an integral part of development just to make the software work on limited resources.

Testing in the Internet era

The introduction of the browser and the growth of internet connectivity opened up client-server computing at a massive scale. New SaaS companies were created for every vertical: Salesforce for CRM, Dropbox for file sharing, Gmail instead of email clients. Further maturation, more affordable mass storage, and the advent of service-oriented architecture saw cloud computing formally established across domains.


Testing and analytics tools grew in proportion to the growth of the internet.

More centralization on the server side and lightweight computing on the client side (typically web browsers executing JavaScript code) meant more sophisticated tools became available at the server end, ranging from application monitoring (AppDynamics, New Relic) and machine data analytics (Splunk) to cloud monitoring (Amazon CloudWatch). With such rich data from these tools, and with decoupling from client-side interaction, it was only a matter of time before this was automated through continuous integration (CI) builds and deployments to understand how code changes would perform in the real world (the production environment). Continuous delivery meant reliability on top of the added advantages of cost and efficiency.

CI helps with root cause analysis and precision backed by data. The idea is to reduce the variables in a controlled environment to see the effects of new code changes. Facebook, for instance, developed a tool that could accurately measure changes as small as 20 ms in page load time with 95 percent confidence. It is important to note that the clients are typically web browsers simply rendering JavaScript and hence can be assumed to behave in a deterministic way.
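
As a rough sketch of that kind of comparison (this is not Facebook's actual tool, and the load-time samples below are invented), one can bootstrap a confidence interval on the difference in mean page load time between runs with and without a change:

```python
import random
import statistics

def bootstrap_mean_diff_ci(control, treatment, n_boot=10_000, alpha=0.05):
    """Bootstrap a confidence interval for mean(treatment) - mean(control)."""
    diffs = []
    for _ in range(n_boot):
        c = [random.choice(control) for _ in control]
        t = [random.choice(treatment) for _ in treatment]
        diffs.append(statistics.mean(t) - statistics.mean(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical page-load samples (ms) from repeated runs of each build.
control = [512, 498, 505, 520, 501, 509, 515, 503, 507, 511]
treatment = [531, 522, 528, 540, 525, 533, 537, 529, 526, 535]

lo, hi = bootstrap_mean_diff_ci(control, treatment)
if lo > 0:
    print(f"Regression detected: load time up by {lo:.1f} to {hi:.1f} ms (95% CI)")
```

If the entire interval sits above zero, the change is flagged as a regression; repeated runs in a controlled environment are what make intervals this tight achievable.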

Shift to Mobile

The introduction of mobile apps as a platform was a starting point towards distributed computing. It also unleashed a highly variable environment with limited and fragmented resource capacity. It is no longer enough to have a functional app; there is a need to understand the performance implications of code changes, unlike the way things worked on a PC or laptop.

Mobile apps get more complex with time, and even small changes that affect performance accrue innocuously until they make the app interaction feel slow and consume more data, memory, and battery.

App speed, data usage, and battery efficiency are often difficult to measure in a highly fragmented world. App dependencies on hardware components like Bluetooth, beacons, and network adapters from different manufacturers, assembled by different OEMs running different OS versions (some custom), often make it difficult even to isolate problems. Now imagine the complexities of A/B testing: is it going to be online (pulling test assets) or offline (hard-coded into the app), how easy is it to run a new set of testing scenarios, and how are conflicting ones resolved?
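
As a minimal sketch of the online-versus-offline distinction (the experiment names and the resolution rule are hypothetical), a client might overlay server-delivered assignments on top of variants compiled into the app, letting the online value win on conflict because it can be changed without shipping a new build:

```python
# Hypothetical illustration of online vs. offline A/B assignment in a mobile client.
# "Online" variants come from a config fetched at runtime; "offline" variants are
# hard-coded into the app binary as a fallback.

OFFLINE_VARIANTS = {              # compiled into the app
    "image_prefetch": "off",
    "new_feed_ranker": "control",
}

def fetch_online_variants():
    """Stand-in for a config-service call; return {} if the service is unreachable."""
    return {"new_feed_ranker": "treatment"}   # served remotely

def resolve_variants():
    variants = dict(OFFLINE_VARIANTS)         # offline defaults first
    variants.update(fetch_online_variants())  # online assignments override on conflict
    return variants

print(resolve_variants())
# {'image_prefetch': 'off', 'new_feed_ranker': 'treatment'}
```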


Building a system that maintains or improves development speed while minimizing regressions in performance metrics such as speed, data usage, battery consumption, and memory footprint requires dedicated processes.

Design Principle

While it is impossible to get an automated performance infrastructure right on day one, it is irresponsible to rely on production analytics data alone to improve perf metrics. Building a sophisticated system takes time, and it is important to design a platform that reflects detection and analysis at each stage of the release life cycle:

  1. Patterns (Detection): If the right metrics can be captured early, statistical analysis performed, and possibly some machine-learning techniques applied to predict potential issues, a great deal of time and complexity is saved.
  2. Diagnostics (Analysis): Once the patterns are detected, and with the right data in hand, it becomes easier to narrow down the root cause and attribute it to specific code changes.

Each stage of the release life cycle, namely development, staging, and production, would have an underpinning resolution process of detection and analysis.
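
As a minimal sketch of this two-step process (the metric names, baseline values, threshold, and commit IDs below are all hypothetical), a detector could flag metrics that drift from a known baseline and hand the suspect commit range to an analysis step:

```python
# Minimal sketch of the two-step resolution process: detection, then analysis.
# Metric names, thresholds, and commit IDs are invented for illustration.

BASELINE = {"cold_start_ms": 820, "feed_scroll_ms": 16.4}

def detect(metrics, threshold=0.05):
    """Flag metrics that moved more than `threshold` (5%) from the baseline."""
    flagged = {}
    for name, value in metrics.items():
        base = BASELINE.get(name)
        if base and abs(value - base) / base > threshold:
            flagged[name] = (base, value)
    return flagged

def analyze(flagged, commit_range):
    """Attribute each flagged metric to the commits landed since the last run."""
    for name, (base, value) in flagged.items():
        print(f"{name}: {base} -> {value}; suspects: {commit_range}")

new_build = {"cold_start_ms": 905, "feed_scroll_ms": 16.6}
analyze(detect(new_build), commit_range=["a1b2c3", "d4e5f6"])
```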


Mobile Release Life Cycles

Mobile apps are tightly coupled to the device hardware (network adapters, flash storage wear, and so on) beyond the obvious hardware spec or OS version implications. It is crucial to create a controlled environment by clustering devices according to how consistently the app functions across a set of devices before they are put into use; even when all of them are iPhones, for instance, they vary in small details such as SSD wear-out, which can have a significant impact on I/O-intensive apps.
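
One way to approximate such clustering, sketched below with invented device IDs and benchmark numbers, is to bucket nominally identical devices by a quick on-device measurement such as I/O latency, so that experiment runs compare like with like:

```python
# Hypothetical sketch: bucket nominally identical devices by a quick I/O benchmark
# so that control and treatment runs are scheduled on comparable hardware.
from collections import defaultdict

def io_benchmark(device_id):
    """Placeholder for a real on-device benchmark (e.g., timed small random writes)."""
    measured_ms = {"iphone7-001": 41, "iphone7-002": 44, "iphone7-003": 62,
                   "iphone7-004": 66, "iphone7-005": 43}
    return measured_ms[device_id]

def cluster_devices(device_ids, bucket_ms=20):
    """Group devices whose benchmark falls in the same latency bucket."""
    clusters = defaultdict(list)
    for d in device_ids:
        clusters[io_benchmark(d) // bucket_ms].append(d)
    return list(clusters.values())

print(cluster_devices(["iphone7-001", "iphone7-002", "iphone7-003",
                       "iphone7-004", "iphone7-005"]))
# [['iphone7-001', 'iphone7-002', 'iphone7-005'], ['iphone7-003', 'iphone7-004']]
```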

Development

This phase is meant for developers to specify a desired configuration (device type, OS version, network conditions) and schedule the experiments. The system runs A/B test-style experiments, with and without the new code revision, and notifies the developer with the results. Calibrating experiment time, prediction accuracy, and the presentation of data and results is crucial for timely and actionable feedback.

Perf experiments are generally computationally intensive and time consuming. This process typically takes about 30–60 minutes: likely affected scenarios and interactions are run in parallel, with jobs scheduled on a dynamic device pool and as many artifacts shared back as possible. This process is not meant to catch regressions or prove improvements comprehensively, since it is designed to run only for key scenarios. The objective is mainly to help developers iterate on changes quickly.
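
As a rough sketch of what such a developer-facing request might look like (the spec fields, revision IDs, and job layout are hypothetical, not a real scheduler API):

```python
# Hypothetical experiment request: each scenario runs with and without the
# candidate revision, and the expanded jobs can be scheduled in parallel
# on the device pool.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ExperimentSpec:
    revision: str                 # candidate code change
    device_type: str              # e.g. "Pixel 2"
    os_version: str               # e.g. "Android 8.1"
    network: str                  # e.g. "3G", "LTE", "WiFi"
    scenarios: tuple              # likely-affected interactions only
    runs_per_arm: int = 10        # repeated runs per build to average out noise

def expand_jobs(spec):
    """A/B-style expansion into independent, parallelizable jobs."""
    arms = ("base", spec.revision)
    return [(arm, scenario, i)
            for arm, scenario, i in product(arms, spec.scenarios,
                                            range(spec.runs_per_arm))]

spec = ExperimentSpec("rev-4f2a", "Pixel 2", "Android 8.1", "3G",
                      scenarios=("cold_start", "feed_scroll"), runs_per_arm=5)
print(len(expand_jobs(spec)))   # 2 arms x 2 scenarios x 5 runs = 20 parallel jobs
```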

Note that perf experiments are generally noisy, and it is hard to get deterministic results in a quick-turnaround environment. Distribution views and charts of the data points from each run aid understanding better than absolute stats or scores, which are drastically affected by outliers in a small sample set.
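
A tiny numerical illustration (the run times are made up): a single outlier run, say from a GC pause or thermal throttling, distorts the mean far more than the median or a high percentile.

```python
import statistics

runs_ms = [412, 405, 420, 398, 415, 409, 1650]   # last run hit a pause / throttle

print(round(statistics.mean(runs_ms)))           # 587 -> looks like a big regression
print(statistics.median(runs_ms))                # 412 -> typical run is unchanged
print(statistics.quantiles(runs_ms, n=10)[-1])   # the p90 cut captures the tail explicitly
```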

Staging

This phase entails two components — continuous experiments and triaging.

For every commit into the main branch, the system does the heavy lifting by running all the possibly affected interactions with representative configurations (device type and network condition). These runs, called continuous experiments, are computationally intensive and are typically batched to run every N diffs (resulting from code changes), depending on queued jobs and resource availability. The goal here is to attribute perf metric changes to atomic commits.
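
A rough sketch of that batching, with invented commit IDs and the actual experiment call left as a comment, might look like this:

```python
# Hypothetical sketch of batching continuous experiments: commits are grouped
# into batches of N and one experiment runs per batch, trading attribution
# granularity for device-pool capacity.
def batch_commits(commits, n):
    """Yield consecutive groups of up to n commits per continuous-experiment run."""
    for i in range(0, len(commits), n):
        yield commits[i:i + n]

landed = ["c101", "c102", "c103", "c104", "c105", "c106", "c107"]
for batch in batch_commits(landed, n=3):
    # run_continuous_experiment(batch) would execute all possibly affected
    # interactions once for this batch; any metric shift implicates the whole batch.
    print("experiment covers:", batch)
```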

Triaging and tracking regressions is an even more computationally intensive process, but it is important for ensuring the performance quality of the apps. Any significant change in the continuous runs above feeds into this stage automatically, and more experiments are run with profiling for every revision in the given range (i.e., the N diffs). Custom profiling scripts (augmented with stack traces and other metadata) are often used in conjunction to help diagnose the issue better.
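
A minimal triage sketch, assuming the metric was healthy before the flagged batch and using `is_regressed` as a stand-in for a full profiled run, is to bisect the batch until the first offending revision is found:

```python
# Hypothetical triage sketch: within a flagged batch, bisect by running the
# (expensive) profiled experiment at individual revisions until the first bad
# commit is found.
def first_bad_commit(batch, is_regressed):
    """Assumes the metric was healthy before the batch and regressed after it."""
    lo, hi = 0, len(batch) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_regressed(batch[mid]):   # profiled experiment at this revision
            hi = mid
        else:
            lo = mid + 1
    return batch[lo]

flagged_batch = ["c104", "c105", "c106"]
print(first_bad_commit(flagged_batch, is_regressed=lambda c: c >= "c105"))  # c105
```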

Production

Realistically, it is not possible to detect all perf metric changes in a controlled environment. Perf metrics need to be collected in the real world for deeper analysis. The objective here is not to optimize for every device available in the market but to determine the perf metrics that have statistical significance. An efficient and effective system works with randomized samples on a small percentage of clustered users (clustered as a function of the device and how the app functions on it), so as to capture the metrics without any visible impact on the user experience.
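
One common low-overhead approach, sketched below with hypothetical IDs and cluster names, is to hash the user or device identifier into buckets so that a small, deterministic fraction of each cluster reports detailed perf telemetry:

```python
# Hypothetical sketch of production sampling: a stable hash of the device/user ID
# places a small, randomized-but-deterministic fraction of each cluster into the
# perf-telemetry cohort.
import hashlib

def in_perf_cohort(user_id, cluster, sample_pct=1.0):
    """Deterministically sample `sample_pct` percent of users within a cluster."""
    digest = hashlib.sha256(f"{cluster}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000          # 0..9999
    return bucket < sample_pct * 100               # 1.0% -> buckets 0..99

print(in_perf_cohort("user-48213", cluster="low-end-android", sample_pct=1.0))
```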

Building a scalable and extensible system

Building a sophisticated mobile perf system is a long and hard process. There would be gradual additions to the continuous experiments and to the diagnostic and profiling runs before it could detect regressions in more holistic ways. These also need to be supported by adding different mobile devices to the device racks/farms. Machine learning algorithms have to be deployed to build smarter profiling and regression models: one that builds new flows when applicable, one that automatically compares the impact on existing flows, and one that prioritizes each flow according to how it is exercised in the real world and determines the frequency of these runs accordingly. Over time, the system needs to represent the real world as closely as possible. A continuously learning testing system not only makes every release stable but is also future-proof.
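
As a loose illustration of that prioritization idea (the flows, weights, and thresholds below are invented), the run frequency for each flow could be derived from its real-world usage share and its recent regression history:

```python
# Hypothetical flow prioritization: weight each flow by how much real traffic it
# sees and how often it has regressed recently, then derive a run frequency.
FLOWS = {
    # flow: (share of real-world sessions, regressions in the last 90 days)
    "cold_start":   (0.95, 4),
    "feed_scroll":  (0.80, 2),
    "photo_upload": (0.30, 5),
    "settings":     (0.05, 0),
}

def run_frequency(usage_share, recent_regressions):
    score = usage_share + 0.1 * recent_regressions
    if score > 1.0:
        return "every batch"
    if score > 0.5:
        return "daily"
    return "weekly"

for flow, (share, regs) in FLOWS.items():
    print(flow, "->", run_frequency(share, regs))
```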

Computing Devices of tomorrow

More and more devices are getting connected to the internet with remarkable computing abilities, such as being able to run sophisticated machine learning algorithms on GPUs. Over-the-air software updates on connected cars are already the norm today. The centralized computing power on the server side of the past decade is now slowly moving back to the client side, so that devices can act in more autonomous ways (self-driving cars, for instance) and as intelligent agents (as in vehicle-to-vehicle communication). It is one thing to simulate self-driving scenarios at a criss-crossing junction; it is altogether different to see how the car actually performs, coordinating its various functions in conjunction with other self-driving cars. Precision and reliability will have to become the baseline norm when it comes to performance, leaving absolutely no room for error. And it is not just autonomous cars: decentralized systems enabled by sophisticated sensors and hardware, AI, and blockchain technology are proliferating across domains, moving us towards a more distributed computing world.


Performance testing needs to adapt to these changes, and it will, while the non-proactive players are ultimately weeded out. It probably will not be long before we see performance-testing AI agents working in tandem with other AI agents, helping each other evolve and co-existing on these distributed computing devices.

The question you need to ask is what kind of system you have in place today. Is your company prepared to navigate the impending tectonic shift as distributed devices gain more computing capabilities?

Note: The design principle advocated above is not something I can take credit for; it is derived from CT-Scan (developed by Facebook) and other publicly available sources.

