Getting Your Head Around Performance

Written by terrycrowley | Published 2017/05/05
Tech Story Tags: programming | web-development | software-development


Performance is one of the most interesting and challenging areas in systems development. Exponential improvements in the performance of computer technology over the last 10 decades (if you include the transitions from electromechanical relays, vacuum tubes, and discrete transistors to integrated circuits) are the key reason why software development is such a dynamic and ever-changing field. Improvements in algorithms and architectures are critical, but they are almost always enabled or made necessary by underlying improvements in performance. This is similar to how virtually every breakthrough in science is a consequence of some breakthrough in technology.

Analyzing performance requires a deep understanding of the current behavior of the elements of your design, but it also needs to be informed by what is changing, what is likely to continue to change, and whether that will make your design better or worse over time. Having a framework in which to observe these changes can be helpful in developing a deeper understanding rather than seeing this changing technology as “one damn thing after another”. The framework that follows is one that I have built up over time to help understand the changes I have experienced over the last four decades.

Here are the base elements we are working with. Not all systems have all these elements.

  • Input. Some form of input — mouse, keyboard, touch, sensors.
  • Output. Some form of output — screen, printer, sound.
  • Stable storage. A way to store data persistently and the mechanisms to load it into memory for processing.
  • Memory. Active memory available to a CPU.
  • CPU. A processor for performing computations using memory.
  • Communication. A way to move data between components in a system.

We are typically concerned with the latency and bandwidth between components, the capacity of a component, and the power use and cost of a component. We may also be interested in size and weight.

OK, this seems like Computers 101 (or perhaps read Understanding Computers by my dad on its 50th anniversary)! How can we use these elements to get a deeper understanding?

As trivial as this seems, it can be useful to recognize that this is all there is. That is, everything ultimately resolves down to these simple elements. I would often see “magical thinking”, especially when using some component that another team was providing, about what the performance characteristics of that component should be. It was useful to have an end-to-end perspective and recognize that there is no magic on the other end of a wire — just more IO, memory and computation driven by the load you are placing on the system.

It is helpful to understand two additional powerful influences on design. Designers are strongly motivated to isolate components of a system — this makes it easier to manage complexity and allows for independent evolution and improvement. So improvements in performance will often be used to “buy” isolation. If you look at some hot new trend like serverless computing (e.g. the AWS Lambda service), it is likely not just a fad; more likely, some underlying change in the relative performance of components (and in requirements) has opened up a new possible design point. Serverless computing is an example of one point on a classic choice spectrum of “bringing the data to the computation” versus “bringing the computation to the data”. The design strategies that catch on are the ones that become feasible and then see performance trends make them a better and better trade-off over time.

Alternatively, it is often the case that sharing some resource across multiple consumers can result in more efficient resource management. One example might be a font service that keeps fonts loaded into memory so they can be shared across multiple processes. Or a distributed in-memory cache like redis that keeps data in memory across multiple server requests in order to reduce latency for each request.

When some new design approach or component shows up, it is usually not because someone came up with a breakthrough new design strategy; it is that a common design strategy has now become feasible or useful in a specific domain because of some underlying performance change. It is often helpful (although not always necessary) to understand what underlying change made some strategy useful or feasible now. If you do understand the underlying change, you can typically evaluate more quickly whether it is relevant to your own application.

The focus on Moore’s Law and a singular rate for exponential improvement is misleading in multiple ways. Even when the improvements are driven by the same underlying technical trend (e.g. chip fabrication for memory and CPU), exponentials diverge between components as well as between the different performance characteristics of those components (e.g. capacity, latency and bandwidth). Diverging exponentials end up being surprising over time — we just did not evolve to have a good intuitive grasp of exponential change. In other cases you have a different underlying technological basis for improvements (e.g. the technologies for stable storage) and that drives different exponentials. So the relationships between the components of the system are constantly changing and diverging, which means that performance improvements in the individual components will often enable or require deeper overall changes in design.
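
Diverging exponentials compound quickly. As a rough illustration (the annual improvement rates here are made up, not measured), a tiny Python sketch shows how far apart two components drift after a decade:

```python
def improvement(rate_per_year, years):
    """Total improvement factor after compounding an annual rate."""
    return (1 + rate_per_year) ** years

# Illustrative (not measured) annual improvement rates for two components.
capacity_gain = improvement(0.40, 10)   # ~29x after ten years
latency_gain  = improvement(0.15, 10)   # ~4x after ten years
print(f"capacity: {capacity_gain:.0f}x, latency: {latency_gain:.0f}x, "
      f"ratio between them: {capacity_gain / latency_gain:.0f}x")
```

Two curves that look similar early on end up differing by a large factor, and it is that ratio between components that quietly reshapes the right design.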

This recent paper on the effect of improving stable storage latency and internal data center latencies, and the importance of developing new strategies for dealing with microsecond-level delays, is just one example where improvements in one area demand new strategies in response.

The focus on what is improving sometimes results in a lack of focus on what is not improving (or is improving at a much slower rate). Certainly one of the biggest challenges for PC applications through the 2000s was the rapid increase in both disk capacity and memory capacity combined with much slower improvement (until the advent of SSDs) in IO rates, especially random IO. This significantly impacted startup speeds (dominated by paging in code) or — heaven forbid — performance when the device actually needed to page as an application was swapped in and out. Different techniques were developed to allow applications to discard memory (e.g. the memory storing a decompressed image or some type of cache) rather than let it get paged out. Paging it back in was significantly slower than just recomputing the information. Recomputing also gave the application much better understanding and control of its behavior and performance during the period when the information was being reconstructed. Paging that computable data back in could take literally tens of seconds on laptops with slow drives, leaving the application effectively frozen during the process.

Sedgewick had a quick and pithy comment on the impact of these different improvements. If I have an O(N log N) algorithm and double my processor speed, I can now compute this in (N log N) / 2. But my memory capacity also doubles and my problem probably expands to fill the memory — so I now have (2N log 2N) / 2 or N log N + N. So I have gotten slower despite parallel improvements in the underlying technology!
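
To make the arithmetic concrete, here is a small sketch (plain Python, with an illustrative problem size) comparing the work of an O(N log N) algorithm before and after the machine doubles in both speed and memory:

```python
import math

def work(n):
    """Work for an O(N log N) algorithm, measured in abstract 'operations'."""
    return n * math.log2(n)

n = 1_000_000
original = work(n)            # original problem on the original machine
faster   = work(n) / 2        # same problem, processor twice as fast
bigger   = work(2 * n) / 2    # problem doubled to fill the doubled memory

print(f"original:        {original:,.0f}")
print(f"2x CPU, same N:  {faster:,.0f}")
print(f"2x CPU, 2x N:    {bigger:,.0f}  (= N log N + N)")
```

The doubled problem takes more absolute time than the original did, exactly as the quip suggests.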

Latency improvements trail bandwidth improvements between components for deep fundamental reasons (definitely read the linked paper for deeper background and understanding). This means that techniques like caching and replication (which trade capacity for latency) can become more important over time. These approaches have tradeoffs and costs of course so they are more like mitigations rather than solutions.

We live in a physical universe, constrained by the speed of light and three dimensions. One fundamental consequence is that we will always have memory hierarchies — simple physical layout requires it. Economics also drives us towards memory hierarchies since we will always be willing to pay a premium for faster memory and be willing and able to buy more slower memory.

When we think about a distributed system, we are always making a choice about whether to bring the data to the computation or to bring the computation to the data. Many factors impact the choice here (size of the data, how fast the data is changing, whether it is shared between requests, how complex the computation is, what bandwidth is available, what latency is available and what latency is required for an answer). Changes in these properties or the characteristics of the problem can cause the right point on this spectrum to flip from one stable state to another, e.g. how to split rendering and layer composition between a CPU and GPU.

More recently we have seen how the design for a web or mobile application can transition between different stable states as richer functionality drives changing decisions on where to keep the bulk of the data (and computation) between service and client. This can be traumatic for the team if different members are invested in the one “right” decision. In fact, the whole team might be surprised by how their “thin front end” has fattened up despite (or perhaps because of) the lack of an explicit strategy around it.

The reality of a physical universe results in the continuing importance and surprising effectiveness of locality, locality, locality in performance. Every time a designer adds another level to the memory hierarchy (L1, L2, L3, disk caches, distributed memory caches), locality — in both time and space — just becomes more important. This is especially important to remember when capacity is also increasing exponentially — if using that capacity results in less locality, you are likely to see surprising and often hard-to-diagnose decreases in performance. The same mechanisms that computer designers use to have additional caches “automatically” improve performance of existing workloads make it hard to recognize when those caches fail to be effective.

I was struck when looking at performance research in multi-threaded algorithms and data structures by how much of the work is involved in improving the cache friendliness — the locality — of the algorithms. Of course some of this work is focused on ensuring that the multiple threads do not induce cache coherence overhead that would destroy the benefits of using multiple threads, but much of it is just focused on good cache locality in general. When you look at the roughly 100x latency difference between the L1 cache and main memory, it makes sense that an algorithm trying to squeeze some small factor-of-N improvement out of using multiple processors would also focus on the much larger benefits that result from good locality. Of course a multi-processor algorithm also allows you to leverage more of that per-processor L1 cache.

The continual importance of locality also means that data structures and approaches that were motivated by smaller machines with smaller capacities tend to exhibit good locality on bigger machines and continue to be effective for somewhat different reasons. Cache analysis always tends to be hard especially in large systems with many different components running in the same memory space. Often it is simpler to just focus on being smaller — you not only improve your own locality but you have less impact on other uses of available cache. Whether it is fitting in a single disk page or fitting in a single network packet, size is often much easier to analyze than the secondary effects of size. The same dynamics that make being smaller surprisingly effective makes being larger surprisingly costly.

To some extent, languages that make following pointers anywhere in memory look free feel like the QWERTY keyboard of computer technology. Our familiar keyboard layout was originally designed to slow down a typist so the keys would not jam. Following a pointer to some arbitrary place in memory is an incredibly expensive operation on a modern system. Just being smaller ends up being the easiest way of reasoning about it. Packing data that is accessed together (rather than just logically associated) is effective and often simpler to reason about.

This focus on locality and iterating on improving locality to improve performance often means that a data representation that starts out general and flexible gets more and more correlated over time with the code that operates over it (adding indices, denormalization, other forms of repacking tied to the dynamic interaction of the code and data). This is not necessarily avoidable but it can make the overall system and the performance impact of changes to the system harder to reason about over time — external evidence for growing complexity.

Locality also means that talking about a time/space tradeoff in performance is often misleading. Often less space and less time go hand in hand. Alternatively, if you can compute and store some small result and use that result as a proxy for walking over a larger data structure, that can be highly effective. I have used that technique for 40 years (following others before me) and it continues to pay dividends. Hashes, timestamps, change counts, and dirty flags are all mechanisms that can be used to trim a walk over a larger data structure.
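
As a sketch of the proxy technique (the Document and Summary classes here are hypothetical, purely for illustration), a change count turns “did anything change?” into a single comparison instead of a full walk:

```python
class Document:
    """Holds a large collection plus a change count used as a cheap proxy for 'did anything change?'."""
    def __init__(self, items):
        self.items = list(items)
        self.change_count = 0

    def update(self, index, value):
        self.items[index] = value
        self.change_count += 1        # every mutation bumps the counter

class Summary:
    """Recomputes an expensive walk over the document only when the change count moves."""
    def __init__(self, doc):
        self.doc = doc
        self._seen = -1
        self._total = None

    def total(self):
        if self.doc.change_count != self._seen:   # cheap check replaces a full walk
            self._total = sum(self.doc.items)     # stand-in for an expensive traversal
            self._seen = self.doc.change_count
        return self._total

doc = Document(range(1_000_000))
summary = Summary(doc)
summary.total()       # walks once
summary.total()       # nothing changed: answered from the cached result
doc.update(0, 42)
summary.total()       # change count moved, so the walk happens again
```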

Memory hierarchies make caches a fundamental part of any architecture. Caches can be very tricky to manage since a focus on optimizing a particular use path can fail to fully characterize all the costs associated with managing a cache. Caches rely on a model of the dynamic behavior of the code to work — when the behavior of the code changes that model may no longer be accurate and the cache may fail to be effective. Most experienced programmers have examples of caches developed during one product cycle that deteriorated over time. Good practice is to ensure that caches are instrumented so that their behavior and effectiveness can be actively monitored.
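
A minimal sketch of what such instrumentation might look like; the InstrumentedCache class, its crude eviction policy, and its counters are illustrative rather than any particular production design:

```python
class InstrumentedCache:
    """A bounded cache that counts hits, misses, and evictions so its effectiveness can be monitored."""
    def __init__(self, compute, max_entries=1024):
        self.compute = compute
        self.max_entries = max_entries
        self.entries = {}
        self.hits = self.misses = self.evictions = 0

    def get(self, key):
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.compute(key)
        if len(self.entries) >= self.max_entries:
            self.entries.pop(next(iter(self.entries)))   # crude eviction, fine for a sketch
            self.evictions += 1
        self.entries[key] = value
        return value

    def stats(self):
        total = self.hits + self.misses
        hit_rate = self.hits / total if total else 0.0
        return {"hits": self.hits, "misses": self.misses,
                "evictions": self.evictions, "hit_rate": hit_rate}
```

Reporting stats() into your telemetry is what lets you notice the day the cache quietly stops earning its keep.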

These trajectories of exponential change in device capability intersect with real-world requirements and constraints. Some of the most interesting constraints are based on the human that often sits at one end of the system. Screen or printer resolution, screen refresh rates, application response times, image, video and sound encodings are all influenced by constraints of the human sensory system. There is always a messy period of settling on encoding and communication standards during the period where things barely fit — and then we move past them and open up new scenarios. It is interesting to look at how digital video recorders struggled to fit things on a hard disk, then expanded to larger and larger capacities and now are reverting to designs where the video is “recorded” in the cloud because the infrastructure to deliver it on-demand enables a much more flexible design.

I was struck in looking at the design of the Exchange email service that key strategies and models around per-person CPU and IO load and the number of users each server can support are fundamentally informed by the inherent limits on the rate at which humans can view and absorb information.

One of the key strategies for graphical applications is “virtualization” (e.g. virtual list controls) where the application carefully constrains the cost of rendering and updating the view to be relative to the size of the screen and the ability of the human visual system to perceive change. The underlying application data model may be arbitrarily large and may be changing rapidly so ensuring that costs associated with the view (which can be very high) are constrained is an important optimization. Early graphical applications were designed with this very much in mind but at various times in the industry’s history the desire to use powerful layout mechanisms has made it difficult to also use virtualization. At various times framework developers have foolishly declared this optimization unnecessary because of improving performance — they have consistently been proven wrong partly because rendering also continues to get more and more expensive with higher resolutions and animations requiring faster updates as well as the fact that model sizes expand as well. Virtualization provides isolation of these concerns which is a powerful benefit.
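
Here is a rough sketch of the heart of a virtualized list (the row heights, viewport sizes, and overscan value are hypothetical): the work done per update is proportional to the viewport, not to the size of the underlying model.

```python
def visible_range(scroll_offset, viewport_height, row_height, total_rows, overscan=3):
    """Return the slice of rows a virtualized list actually needs to render."""
    first = max(0, scroll_offset // row_height - overscan)
    count = viewport_height // row_height + 2 * overscan
    last = min(total_rows, first + count)
    return first, last

def render(rows, scroll_offset, viewport_height, row_height=20):
    first, last = visible_range(scroll_offset, viewport_height, row_height, len(rows))
    # Only these rows are laid out and drawn; the rest exist only in the model.
    return [f"{i}: {rows[i]}" for i in range(first, last)]

rows = [f"item {i}" for i in range(1_000_000)]    # arbitrarily large model
print(len(render(rows, scroll_offset=500_000, viewport_height=600)))   # a few dozen rows, not a million
```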

Not all constraints are directly tied to human characteristics. In-memory databases are an example where increasing memory capacity opened up new performance scenarios as an entire database could fit in memory and no longer be constrained by disk bandwidth and latencies (especially for analysis scenarios where persistence was not required). This was also an interesting example where lots of focus was placed on compression techniques during transitionary periods as things barely fit. Column stores were used (where a table is stored column by column, using a highly efficient run-length encoding for columns with many duplicate entries — consider the US state column in a database storing addresses). This is also a good example of an encoding where many interesting operations (e.g. the filter, sort, and aggregation operations typical in a pivot table analysis) can be performed directly on the compressed representation.
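
A minimal sketch of the idea with made-up data: once a column is run-length encoded, a filter or aggregate touches one run per distinct value rather than one entry per row.

```python
def rle_encode(column):
    """Run-length encode a column into [(value, run_length), ...]."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def count_matching(runs, predicate):
    """Count matching rows directly on the compressed runs: one test per run, not per row."""
    return sum(length for value, length in runs if predicate(value))

# A sorted 'US state' column compresses to a handful of runs.
states = ["CA"] * 40_000 + ["NY"] * 25_000 + ["WA"] * 10_000
runs = rle_encode(states)                           # [['CA', 40000], ['NY', 25000], ['WA', 10000]]
print(count_matching(runs, lambda s: s == "NY"))    # 25000, computed from 3 runs
```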

Consumer cloud file storage is an interesting case study where so much effort in the early days was spent on peer-to-peer and other strategies to do anything but store the complete file data in the cloud. Exploding disk capacities (especially for many of these “write-only” scenarios with limited IOPS requirements) then made the much simpler and more functional model of storing all the data in the cloud economically feasible. This exponential change in storage capacities (with doubling times of a year or less) happened just as the demand for such a service driven by multiple devices (phone, tablet, PC) per person also exploded. However, the exploding use of larger and larger images and now video has still made the economics difficult for consumer services. This is a good example of a case where there are a lot of intersecting exponential trends required to understand how to make both a functional system and a healthy business.

The distributed nature of so many of our problems makes careful understanding and handling of asynchrony critical to performance. Virtually everything resolves at some point down to a request and response. The decision to model something as a synchronous interface and hide the asynchronous nature underlying it is often made with a focus on simplifying the code logic, but it has many secondary implications. Some direct implications are that any resources (memory, locks, etc.) held by the code are held across the underlying asynchronous call. Additional resource demands required to fully satisfy that request are implicitly throttled while waiting for the synchronous call to complete. In fact, the desire to allow multiple requests to proceed in parallel and escape this implicit throttling is often the motivation to move to an asynchronous model. As a system grows larger, managing overall resource usage (what is allocated and reserved while a request is being processed, and how overall resource demands are throttled) ends up scaling better with an asynchronous model where these concerns can be explicitly managed.
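
As a sketch of explicit throttling in an asynchronous model (illustrative asyncio code with a stand-in fetch function), a semaphore turns the limit on in-flight requests into a visible, tunable policy rather than a side effect of how many threads happen to be blocked:

```python
import asyncio

async def fetch(item):
    """Stand-in for a real network or storage request."""
    await asyncio.sleep(0.01)
    return item

async def fetch_all(items, max_in_flight=8):
    # The semaphore makes the throttling policy explicit instead of hiding it
    # behind however many threads are blocked in synchronous calls.
    gate = asyncio.Semaphore(max_in_flight)

    async def limited(item):
        async with gate:
            return await fetch(item)

    return await asyncio.gather(*(limited(i) for i in items))

results = asyncio.run(fetch_all(range(100)))
```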

Another important factor in the performance of a distributed system is how timeouts are handled. For a distributed system, timeouts are the fundamental mechanism for recognizing error. Reducing the time to error discovery (and subsequent recovery) requires being able to effectively reduce those timeouts. This is especially important when trying to compress the tail of your overall latency distribution. The challenge in being too aggressive with timeouts is that aggressive recovery (retrying the request, accessing a backup) can further overload a system in distress. It is important to have a good model for what the failures typically look like. This recognition that timeouts are fundamental to the overall system also helps demonstrate how crippling variance in performance is. Not only must the code deal with variance on the success path, but when dealing with failure it also adds delay equal to the worst successful request. If the variance cannot be fundamentally reduced, a better approach is often for the distributed component to quickly acknowledge receipt of the request and then return the final result later.
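
Here is an illustrative sketch (the call_backend function, the timeout, and the retry count are all hypothetical) of timing out early to shorten the tail while capping retries so that recovery does not pile additional load onto a backend that is already in distress:

```python
import asyncio, random

async def call_backend(request):
    """Stand-in for a request to a remote component with variable latency."""
    await asyncio.sleep(random.uniform(0.01, 0.5))
    return f"ok: {request}"

async def call_with_timeout(request, timeout=0.3, retries=1):
    """Time out early to compress the tail, but cap retries to avoid overloading a struggling backend."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(call_backend(request), timeout)
        except asyncio.TimeoutError:
            if attempt == retries:
                raise
            await asyncio.sleep(timeout * (attempt + 1))   # back off before retrying

try:
    print(asyncio.run(call_with_timeout("get /item/42")))
except asyncio.TimeoutError:
    print("backend unreachable; fail over or report the error")
```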

Flow control is fundamental to any distributed system. You essentially have the choice of over-provisioning, shedding or rejecting load, or providing backpressure that allows the clients to throttle their requests (typically a combination of the final two). How important it is for clients to have an explicit throttling mechanism is often determined by how expensive it is to send and receive a request only to subsequently reject it. The simplest example of these very different costs is comparing a read request, which involves just specifying what to read, with a write request, which requires sending the data itself.
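
A minimal sketch of backpressure using a bounded queue (the sizes and delays are illustrative): when the consumer falls behind, the producer blocks on put() rather than letting work accumulate without limit.

```python
import asyncio

async def producer(queue, items):
    for item in items:
        # put() blocks when the queue is full; that blocking is the backpressure
        # signal telling the client to slow down instead of overwhelming the server.
        await queue.put(item)
    await queue.put(None)                # sentinel: no more work

async def consumer(queue):
    handled = 0
    while (item := await queue.get()) is not None:
        await asyncio.sleep(0.001)       # stand-in for real processing cost
        handled += 1
    return handled

async def main():
    queue = asyncio.Queue(maxsize=16)    # bounded: the capacity is the flow-control knob
    _, handled = await asyncio.gather(producer(queue, range(1000)), consumer(queue))
    return handled

print(asyncio.run(main()))               # 1000
```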

In analyzing a complex system made of many layers, a strategy I have found effective is trying to first identify what is “light speed” for the scenario. That is, what are the fundamental constraints determined by looking at an end-to-end analysis of latencies and the actual information that needs to be transferred or processed through the layers of the system. This establishes a lower baseline to measure against and develop intuition around rather than immediately getting bogged down in small iterations from a potentially poor starting point. It also allows you to discover whether a team you are working with really has deep appreciation for the end-to-end issues in their problem space.
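
As a sketch of a light-speed baseline (the distance, payload, and bandwidth numbers are just examples), even a crude propagation-plus-transfer bound gives a floor to measure the real request path against:

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792              # vacuum; signal speed in fiber is roughly 2/3 of this
FIBER_KM_PER_MS = SPEED_OF_LIGHT_KM_PER_MS * 2 / 3

def light_speed_floor_ms(distance_km, payload_bytes, bandwidth_gbps):
    """A crude lower bound: propagation round trip plus raw transfer time.

    Real systems add serialization, queuing, and processing on top of this."""
    propagation = 2 * distance_km / FIBER_KM_PER_MS
    transfer = payload_bytes * 8 / (bandwidth_gbps * 1e9) * 1000
    return propagation + transfer

# Roughly a cross-country US round trip: the floor alone is tens of milliseconds.
print(f"{light_speed_floor_ms(4000, 100_000, 1):.1f} ms")
```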

Related to knowing light speed is knowing your design point. There are basically no architectures that scale across all potential dimensions, so you need to have some model of what you are designing for. For example, looking at an application like OneNote one could ask: how many notebooks will a typical user have open? How many sections with how many pages? How big and complex is a page? A corollary to “know your design point” is “your design point will change” — either because of changing usage scenarios or because of changing performance characteristics of the devices it runs on and how that drives requirements. With modern telemetry systems, a significant advantage for developers today is the ability to actively monitor and track how well real usage matches that design point and to respond proactively to the inevitable changes in usage patterns.

I was a fan of the description Ales Holecek used to describe how he organized the performance effort for Windows 7, the first version of Windows that did not significantly increase system requirements (in fact it reduced them). The team talked about “no tricks”. The focus was on less code and fewer services running over less data. That is, do not spend your effort on some super intelligent strategy to prefetch a cache or other approaches dependent on optimizing a complex model of runtime behavior. Have less code running on less data. This approach was critical in cutting through the complexity of driving an effort across a large team — a good example of the “simple messages” management technique.

Performance engineering is a vast area and touches on almost everything a developer does. You want to be able to match a deep understanding of all the elements of your design with the ability to step back and see a broader perspective. Hopefully some of the ideas discussed above can aid in developing this broader perspective.
