Async / Await is Not Going to Save Your App

I don’t usually get too worked up about syntactic arguments in language design. Most of these battles are fought over 20-line code snippets while the actual war is won or lost based on much larger design and architectural decisions (OK, bad military analogy). Typically those snippets are annotated with some text like “oh, yeah I haven’t really talked about error handling” as if error handling were a secondary concern in any real system.

I do feel like I might have something to add to all the excitement and heavy breathing about async/await showing up as a native feature in JavaScript. When Anders Hejlsberg was adding async/await to C#, I sat down with him to discuss some of my concerns about overselling this as the “solution” to asynchronous programming. Those concerns were real in C# (or C++) and are also real in JavaScript.

Syntax and the related API design decisions can percolate up and have larger design consequences at the level of application architecture. We saw this play out with the Win32 API on Windows. Win32 generally defines a synchronous API. If a caller is concerned that an API may block (e.g. because it wants to access the API from a UI thread that must not block), it is up to the caller to launch a separate thread and invoke the API from that thread. Eventually Windows defined various APIs for thread pools and task libraries to provide the infrastructure to make this simpler, but for much of Windows’ history it was “some assembly required”.

One consequence of this decision to expose these APIs with a synchronous interface was that there was no syntactic difference between an API that would complete consistently with low latency variance and an API that could block for seconds. In fact, sometimes an API’s behavior might change release-to-release as new features (distributed file systems or distributed directories) were added. Despite the syntactic similarity, the actual use of such a blocking API needed to be managed very differently inside an interactive application. The consequence of misuse was generally application hangs — far too common in Windows applications.

A further challenge is that high-variance, blocking APIs should generally provide a mechanism for cancellation and in some cases a mechanism for progress reporting. Again, there was no standard mechanism to do this and in fact most APIs provided no such capability. The lack of these basic mechanisms in core low-level APIs tended to propagate up through middleware as well. There was also a lack of standardization around thread affinity constraints that further complicated correct usage.

OLE automation (more generally COM) also suffered from this same synchronous API design. The implementation of automation is built around the core message pump also used for processing window events. (“Pumping” messages means removing messages from the queue and dispatching them to an appropriate handler.) When making a synchronous out-of-process call, the application needs to pump messages in a nested message pump in order to receive a response from the other process. Pumping messages from this nested message pump (including dispatching user input or other window events) can cause unexpected reentrancy where data being accessed by functions on the call stack can change out from under them. Many a Windows developer’s head has exploded tracking down odd reentrancy bugs.

This is a case where early design decisions ended up having very long-lived (and user-visible) consequences in the overall ecosystem.

The motivation for async/await is very clear and in fact is similar to the motivations that initially drove Win32 to a synchronous approach. The programmer would like to be able to write their algorithm in a way that looks like a simple sequential program and they would like an approach that naturally composes. The constraints are also similar. The algorithm and the data the code accesses need to be very, very isolated or the code needs to be extraordinarily careful about how it reads or writes data that is visible outside the algorithm itself. In the Win32 threading case you might have to deal with complex locking issues. In the JavaScript case, the single-threaded nature of the execution environment simplifies those locking issues but it still means that any data external to the algorithm might have changed out from under it any time an await completes. This runs counter to what async/await is trying to achieve: making this asynchronous state machine behave like a sequential synchronous program and relieving the programmer from dealing with these additional complexities.
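To make the "changed out from under it" hazard concrete, here is a minimal sketch. The `cart` object, `fetchDiscount`, and both `applyDiscount…` functions are hypothetical names invented for illustration, and the timer simply stands in for a network call.

```javascript
// Hypothetical shared application state; any other code may modify it.
const cart = { items: ['book'], total: 10 };

// Stand-in for a network request (assumption: resolves with a discount rate).
function fetchDiscount() {
  return new Promise((resolve) => setTimeout(() => resolve(0.5), 10));
}

// Reads shared state, awaits, then writes based on the earlier read.
async function applyDiscountNaively() {
  const totalBefore = cart.total;         // read before the await
  const rate = await fetchDiscount();     // other code can run while this is pending
  cart.total = totalBefore * (1 - rate);  // stale: ignores changes made meanwhile
}

// Re-reads shared state after the await instead of trusting the earlier read.
async function applyDiscountCarefully() {
  const rate = await fetchDiscount();
  cart.total = cart.total * (1 - rate);
}

// Simulates a user action that lands while the request is in flight.
async function demo(applyDiscount) {
  cart.items = ['book'];
  cart.total = 10;
  const pending = applyDiscount();
  cart.items.push('pen');                 // state changes during the await
  cart.total += 2;
  await pending;
  return cart.total;
}
```

With these numbers, `await demo(applyDiscountNaively)` yields 5 (the pen's price is silently lost), while `await demo(applyDiscountCarefully)` yields 6. Nothing in the syntax of the first function hints at the problem.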

The discussions of async/await that I’ve seen on the web tend to gloss over the fact that these constructs are used in very different execution environments when comparing a JavaScript client application to a JavaScript service implemented with Node.js. Individual service request handlers are typically short-lived and therefore maintain minimal long-lived data. They are usually isolated from other service request handlers. When encountering errors, a reasonable strategy is often abandoning the request and returning with an error result. This takes advantage of the fact that any client needs to have some kind of retry logic in place in any case. Getting a fast error return is often a better outcome than a slow (or high variance) success result. Attempts to recover within the service can often make a bad situation worse. Failure to shed excessive load, often caused by poorly tested error handling code, is a common cause of cascading failures in a service environment.

This contrasts in very significant ways with a client application. A client application typically has important long-lived data that represents the current application state. Failure is an expected outcome for virtually any asynchronous service request as transient communication failures are a common occurrence for devices in motion. Failures are often global to the application (since all communication is impacted) and might need to be presented and handled in a consistent fashion — gating how communication is reestablished and operations are retried. When transient issues subside, the application should proceed to a fully consistent state — completing operations or fetching any data to present.

The client application also has that pesky user who wants to see information as soon as possible and is poking and prodding the application state, changing their mind and generally causing things to shift out from under that asynchronous execution environment.

An alternate approach is not to concentrate on syntactic issues and compare async/await with standard callbacks or promise-based callbacks but rather to focus on the state you are managing and the operations and events that transition that state. In the article about JavaScript async/await that I linked to above, the simple version of the async/await algorithm (of course with no error handling) was:

    async function asyncAwaitIsYourFriend () {
      const api = new Api()
      const user = await api.getUser()
      const friends = await api.getFriends(user.id)
      const photo = await api.getPhoto(user.id)
      console.log('asyncAwaitIsYourFriend', { user, friends, photo })
    }

In this case we can imagine our long-lived state is a set of users and then for each user, we want their photo and their list of friends. Fetching their friends and their photo are asynchronous operations. (In the linked example there is only a single user but let’s get ambitious.)

If we view this as part of the long-lived state of the application, we immediately recognize that at any given time, we may have zero or more users and we may or may not have yet fetched their list of friends or photo. Other parts of the application need to be prepared to view these as robust long-lived states of the application rather than unexpected transitionary states — we might always have some of our users for whom their photo and friends are not yet known. The component managing this user set is responsible for fetching any missing information — dealing with errors and retrying as necessary. The quiescent state is that all information is available; the user manager is responsible for getting to this quiescent state. This makes the component state easier for the rest of the program to reason about. If I perturb this state (e.g. add another user to the set) the user manager automatically fetches the right information rather than having the code that added the user imperatively request it. This also concentrates the knowledge of how to get to the consistent quiescent state in a single place rather than having it distributed throughout the application.
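As a rough sketch of what such a user manager might look like: the class below is hypothetical, not taken from any library. It assumes an `api` object with promise-returning `getFriends(id)` and `getPhoto(id)` methods (the names come from the snippet above), and it uses a fixed retry delay purely to keep the example short.

```javascript
// Hypothetical user manager: owns the user set and drives it toward the
// quiescent state in which every user has both friends and a photo.
class UserManager {
  constructor(api, retryDelayMs = 1000) {
    this.api = api;                 // assumed: getFriends(id), getPhoto(id) return promises
    this.retryDelayMs = retryDelayMs;
    this.users = new Map();         // id -> { id, friends, photo }; null means "not yet known"
    this.inFlight = new Set();      // keys of requests already under way
  }

  // Perturb the state; the manager works out what needs fetching.
  addUser(id) {
    if (!this.users.has(id)) {
      this.users.set(id, { id, friends: null, photo: null });
    }
    this.stabilize();
  }

  isQuiescent() {
    return [...this.users.values()].every(
      (u) => u.friends !== null && u.photo !== null);
  }

  // Launch a fetch for every missing field that is not already in flight.
  stabilize() {
    for (const user of this.users.values()) {
      if (user.friends === null) {
        this.fetchField(user, 'friends', () => this.api.getFriends(user.id));
      }
      if (user.photo === null) {
        this.fetchField(user, 'photo', () => this.api.getPhoto(user.id));
      }
    }
  }

  fetchField(user, field, request) {
    const key = `${user.id}:${field}`;
    if (this.inFlight.has(key)) return;
    this.inFlight.add(key);
    Promise.resolve()
      .then(request)
      .then((value) => {
        // Only store the result if this user record is still the current one.
        if (this.users.get(user.id) === user) user[field] = value;
      })
      .catch(() => {
        // Leave the field null and try again later (fixed delay for brevity).
        setTimeout(() => this.stabilize(), this.retryDelayMs);
      })
      .finally(() => this.inFlight.delete(key));
  }
}
```

Code that adds a user just calls `manager.addUser(id)` and moves on. Other components read `manager.users` and treat `null` fields as a normal state ("not yet known") rather than as an error. A real implementation would likely want backoff and a shared notion of connectivity instead of a fixed delay.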

A challenge with this approach arises when we need to implement an operation that fetches some information and then operates on it. Of course this is exactly where a pseudo-sequential approach using async/await seems attractive. Suppose I want to implement an operation to ‘poke’ all my friends. This needs to first get the list of friends and then perform a ‘poke’ operation on each one.

A common state-based approach is to model this as long-lived state. I will give one trivial example here. To poke all the friends, I might set a ‘pokeID’ on a user. If a friend on the user’s friend list does not have this ‘pokeID’ set, the user manager launches a poke request for that friend. The quiescent state that the user manager is trying to reach is one where the user’s ‘pokeID’ and every friend’s ‘pokeID’ match.
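One way this ‘pokeID’ idea might be sketched is shown below. The data shape (a `user` with a `pokeID` and a `friends` array whose entries carry their own `pokeID`) and the `sendPoke(friendId, pokeID)` function are assumptions made for illustration. Tracking of in-flight requests is left out to keep it short.

```javascript
// Friends whose recorded pokeID differs from the user's have not yet
// received the current poke.
function friendsNeedingPoke(user) {
  return user.friends.filter((friend) => friend.pokeID !== user.pokeID);
}

// One reconciliation pass. `sendPoke` is an assumed promise-returning API.
// Returns the number of pokes that failed in this pass.
async function reconcilePokes(user, sendPoke) {
  const target = user.pokeID; // capture: the user may request a new poke meanwhile
  const results = await Promise.allSettled(
    friendsNeedingPoke(user).map(async (friend) => {
      await sendPoke(friend.id, target);
      friend.pokeID = target; // progress is recorded in observable state
    }));
  return results.filter((r) => r.status === 'rejected').length;
}
```

Requesting a poke is then just `user.pokeID = newId`, and the manager calls `reconcilePokes` until `friendsNeedingPoke(user)` is empty. If a pass fails partway, or the user asks for another poke while one is in progress, the remaining work is still visible in the state and the next pass picks it up.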

A significant advantage of this type of approach is that the state of the activity of the application is not buried in inaccessible black box callbacks or continuations; it is available as the clearly modeled and observable state of the application. As applications grow large, this ends up being a major advantage.

A robust way of managing how activity gets triggered in this type of model is to take an approach that looks much like model/view synchronization. Rather than modeling with a Rube-Goldberg-esque set of callbacks and triggers that tightly bind separate components, you have robust independent components that understand how to stabilize their own state. That state might be dependent on other components’ states, so you loosely couple observers on state changes in those dependent components. For example, in the model/view case the simplest “observer” just sets a dirty bit when the model changes. The view is responsible for efficiently figuring out what changed and efficiently updating the display. This is essentially how Facebook’s React (and many older technologies for building graphical applications) works.
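The dirty-bit observer can be sketched in a few lines. The `Model` and `View` classes here are hypothetical and deliberately tiny. This is not how React is implemented, only an illustration of the loose coupling described above.

```javascript
// Hypothetical model that notifies loosely coupled observers on any change.
class Model {
  constructor() {
    this.items = [];
    this.observers = new Set();
  }

  observe(callback) {
    this.observers.add(callback);
  }

  addItem(item) {
    this.items.push(item);
    for (const notify of this.observers) notify();
  }
}

// The view's observer only sets a dirty bit; all real work happens in render().
class View {
  constructor(model) {
    this.model = model;
    this.dirty = true;      // nothing has been rendered yet
    this.renderCount = 0;
    this.output = '';
    model.observe(() => { this.dirty = true; });
  }

  // Called by the host on its own schedule (for example, once per frame).
  render() {
    if (!this.dirty) return;
    this.output = this.model.items.join(', ');
    this.renderCount += 1;
    this.dirty = false;
  }
}
```

Three consecutive `addItem` calls followed by one `render()` produce a single render, and further `render()` calls do nothing until the model changes again. The model never needs to know what the view does with a change, only that someone asked to be told.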

This can appear inefficient, but the key is recognizing where the performance cost and latency really lie. In the display case, costs are dominated by the actual cost of rendering and display, not the cost of figuring out what to render and display. In most of these network-based asynchronous use cases, the costs are dominated by the network operation itself, not by determining which operation is required in response to changes of state in other components.

Local code clarity is important, but over time, understanding how all the different parts of your application interact at run time is both far more complex and far more important. Focusing on state design and how that state transitions in a loosely coupled way is more effective in managing growing complexity than over-focusing on small code snippets.