What to Do When Things Get Complicated

Written by terrycrowley | Published 2017/09/27
Tech Story Tags: software-architecture | immutability | microsoft-office | its-complicated | software-development

I’m a heavy user of the New York Times app and recently noticed that this summer’s upgrade was a complete rewrite in Apple’s Swift programming language. It unified their iPad and iPhone apps, according to this thread on Twitter. It was an ambitious rewrite: more personalization, better offline support, a fancier layout and tighter integration of native iOS sharing controls into the news canvas.

It has also taken quite a beating in the App Store, currently trending at less than two stars. My personal experience has been consistent with that rating; it’s less stable, sometimes unresponsive or glitchy, and at least prior to the most recent 6.1 point release it was a ferocious battery killer. Apparently that was due to aggressive background caching of stories, which seems plausible. The latest upgrade backed off that a bit.

I don’t have the inside story (although perhaps we can leverage six degrees of separation and some reader connection can get one of the developers to add a comment here?). It does have a little odor of the second-system effect.

Why do things break when we try to get complicated? I just want the application to be responsive, to not crash and to not use so much battery. Is that so hard, or too much to ask? I’ll explore this in a bit of depth by looking at the issues this kind of app faces, using one I am quite familiar with: the Mail application that shipped with Windows 10. It was built from the ground up by the Outlook team and was internally called “Modern Outlook”, with aspirations of someday unifying the Outlook code bases across all the platforms. It also takes an interesting approach to controlling the complexity that grows as an app gets more functional.

Let’s step back and look at the basic problem we are trying to solve. Most applications these days (like both Mail and the New York Times app) are the front-end interface to an underlying data model that lives in a cloud service somewhere. Part of that data model is locally cached on the device, and part of that local cache is loaded into the memory of the running application. There is usually a sync (or replication) component that brings the data down from the cloud service and caches it locally in preparation for loading it into memory (or perhaps the data is loaded directly into memory and later pushed into the cache). This might happen prospectively as part of some background sync activity or directly in response to a user action (clicking on a mail message or a news story).

Now, for all parts of the app, we want it to have great performance. We want the code to be tight, the data compact, the algorithms tuned. But ultimately, it is really important to understand what performance you can control as an app writer and what you can’t control and design your app with that in mind.

The most important point is that any network IO takes “for-evvv-ver” (cue “The Sandlot”). Yes, I want to see my new mail milliseconds after launching, yes I want to immediately see the editorial commentary on the latest crazy thing our fearless leader has done. So happiness with your app might depend on great performance between the service and the client. But you better not design your algorithms, data structures and user experience in a way that requires it. Because S*** Happens.

In fact, even local IO might take a really long time. Yes, it can be fast, and in general it is fast. But don’t depend on it when you want to provide guaranteed interactive performance.

If your app is simple, with a relatively small data set to manage, being careful about local and network IO is about all you have to worry about to deliver acceptable interactive performance. There are lots of important and useful apps that do not have very big datasets (TideTrac is one I use often). Even if your dataset is pretty small, you can still write really bad O(N²) algorithms. That is actually surprisingly easy to do (just use an O(N) insertion or append algorithm N times). That can come back to bite you if you need to manage variable-sized user data that grows unexpectedly, or if something accumulates over the lifetime of the app so you don’t catch it in early testing (because N starts small and only grows over time). But it is also relatively easy to fix.
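
To make that concrete, here is a minimal C++ sketch (my own illustration, not code from any of the apps discussed here) of how an innocent-looking loop goes quadratic, and the easy fix:

```cpp
#include <algorithm>
#include <vector>

// Accidentally quadratic: each front insertion is O(N) because it
// shifts every existing element, so N insertions cost O(N^2) total.
std::vector<int> buildFrontInsert(int n) {
    std::vector<int> items;
    for (int i = 0; i < n; ++i) {
        items.insert(items.begin(), i);   // O(N) shift on every call
    }
    return items;
}

// The fix: append (amortized O(1)) and reverse once, O(N) overall.
std::vector<int> buildAppendThenReverse(int n) {
    std::vector<int> items;
    items.reserve(n);
    for (int i = 0; i < n; ++i) {
        items.push_back(i);               // amortized O(1)
    }
    std::reverse(items.begin(), items.end());
    return items;
}
```

At N = 100 in early testing both versions are instant; at N = 100,000 the first one is a visible stall.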

The harder problem as datasets get larger is that the work needed to process the data you are bringing down from the server gets harder to manage. If you are working in a single thread — which is all we had in the early days of Windows or the Mac, and is typically all you have within a JavaScript-based application — you need to start using the approach we have been using forever in graphical applications that need to stay responsive while still doing a variable amount of processing. You break the work up into small chunks that can be processed within a fixed time slot, and then go back and check for any user input before continuing down that chunked work list. That’s the approach Word uses to format a long document into separate pages, or the way Excel recalculates a large, complex spreadsheet model while still being able to process user input.
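
A minimal sketch of that chunking pattern in C++ (hypothetical names; the real message loops in Word and Excel are considerably more involved): process queued work until a small time budget is exhausted, then return to the event loop so pending user input gets a chance to run.

```cpp
#include <chrono>
#include <deque>
#include <functional>

// Hypothetical chunked work loop. The host event loop calls runChunk()
// whenever it is idle; we do a bounded amount of work and return so
// user input can be processed before the next chunk.
class ChunkedWorker {
public:
    void enqueue(std::function<void()> task) {
        pending_.push_back(std::move(task));
    }

    // Returns true if work remains and the host should schedule another chunk.
    bool runChunk(std::chrono::milliseconds budget = std::chrono::milliseconds(10)) {
        const auto deadline = std::chrono::steady_clock::now() + budget;
        while (!pending_.empty() && std::chrono::steady_clock::now() < deadline) {
            auto task = std::move(pending_.front());
            pending_.pop_front();
            task();   // each task must itself be small and bounded
        }
        return !pending_.empty();
    }

private:
    std::deque<std::function<void()>> pending_;
};
```

Note that the time budget only bounds the chunk if each individual task is itself bounded, which is exactly the part that is hard to get right.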

There are quite a few challenges in doing this well. It might not be easy to break the work down into reasonable, semantically useful chunks. Those chunks need to have carefully bounded processing requirements, and your app will be running on different devices with different memory and processing capabilities, so it can be hard to be really clear-cut about this. Breaking the work into chunks can also add non-trivial overhead, so in an effort to make your app more responsive, you have also made its processing requirements larger. If you get the chunking wrong, your app starts getting glitchy.

Even if you have multiple threads, the problem is not trivial. A background thread manipulating the model needs to make sure that the UI thread always sees a consistent view of the model, rather than a partially updated one. It also needs to make sure it does not lock the model for an unpredictable amount of time and cause the UI thread to block or lose access to model data; in that case, having multiple threads would be no help and would just make things more complicated.

So as the amount of data you need to manage goes up, there is really a step function in complexity.

Once you have the data in memory, you still need to build the view. The complexity of constructing the view varies a ton by app, from very straightforward (a scrolling list of fixed-size items) to very complex (the layout computations in Word or a browser). Mail generally has it pretty easy, with a mapping from model to view that is mostly direct (in fact, the client-service protocol and the data it caches were tuned to ensure it is direct). The latest upgrade to the New York Times app got more ambitious with fancier layouts but is still pretty constrained. It seems unlikely this is the particular source of its problems.

The user interacts with the app and typically invokes actions that require either getting more data to display (e.g. switching to a new folder) or modifying the data and then displaying the changes (e.g. flagging a single message or deleting 10,000 messages). So new data is being loaded and existing data is being modified. The loading can take “for-evvv-ver”. The modifications are usually tuned with the data structures to make them local, small and quick, but for productivity applications there are inevitably some modifications that scale as O(dataset) rather than O(fraction of dataset that happens to be in view). Think deleting those 10,000 messages rather than setting the flag on a single message.

The view code needs to be really twitchy. It can’t commit to something that’s going to take a long time. It can’t block waiting for data to be loaded from disk or network. It just wants to ask for it and then find out when it finally arrives. It can’t wait for a potentially long-running action to complete; it just wants to request the action and find out when the data is all pretty and ready to be admired and mapped into view.
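
In code, that “ask and find out later” shape is just a request that takes a completion callback instead of returning data. A sketch (hypothetical names and types; the actual Mail interfaces are not public):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical non-blocking data access: the view asks for a message
// body and returns immediately; when the load finishes (cache, disk or
// network), the completion callback runs back on the UI thread.
struct MessageBody { std::string text; };

class MessageStore {
public:
    using Callback = std::function<void(const MessageBody&)>;

    // Called from the UI thread; never blocks.
    void requestBody(uint64_t uid, Callback onReady) {
        pending_[uid] = std::move(onReady);
        startLoad(uid);   // hand the real I/O off to another thread
    }

    // Posted back to the UI thread's queue when the I/O completes.
    void onLoadComplete(uint64_t uid, const MessageBody& body) {
        auto it = pending_.find(uid);
        if (it != pending_.end()) {
            it->second(body);
            pending_.erase(it);
        }
    }

private:
    void startLoad(uint64_t /*uid*/) { /* schedule disk or network I/O */ }
    std::unordered_map<uint64_t, Callback> pending_;
};
```

The view can paint a placeholder immediately and fill in the real content whenever the callback fires, whether that is in ten milliseconds or ten seconds.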

So the general challenge any app faces is you have the data model that wants to be performant but can’t really make any hard guarantees and then the UI/view code that wants to be super-responsive — think 60 frames-per-second video-game responsive.

For native applications, someone is going to say “no problem; that’s why you run your UI code on a separate thread!” As I mentioned above, the view still needs to query or synchronize with the model in some way. If you do this by providing multi-threaded access to the underlying data model, you need to be really, really careful about locks and about how the view can reason in a consistent way about the state of the model. Because it doesn’t help to have your UI thread separated from your model thread if the UI is sitting there blocked by some lock on the model, or unable to query the model because it is in an inconsistent state, for an arbitrary amount of time. You don’t want to have to write super-careful code all over your app to keep blocking from happening; you want blocking to be impossible by design.

One approach is to set up asynchronous non-blocking message queues between the view and the model. The only data the view has access to from the model then needs to flow through these queues. This approach can work pretty well when the data is clearly defined and constrained, and Office uses it in a small number of places in its apps. The downside is that you’re pushing all this data around, typically through some kind of data transformation process that needs to be designed, maintained and extended over time, rather than just letting the view read directly from the data model (which is generally how things were, and still are, done when there isn’t a separate UI thread).
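
A sketch of such a queue in C++ (an assumed design, not the actual Office implementation): senders post without waiting, the receiver polls without waiting, and the lock is held only for a tiny, bounded push or pop (a production version might use a lock-free queue instead).

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Non-blocking message queue between the view and model threads.
template <typename Msg>
class MessageQueue {
public:
    void post(Msg msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(msg));
    }

    // Returns the next message if one is ready, std::nullopt otherwise;
    // the caller never waits.
    std::optional<Msg> tryPop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return std::nullopt;
        Msg msg = std::move(queue_.front());
        queue_.pop_front();
        return msg;
    }

private:
    std::mutex mutex_;
    std::deque<Msg> queue_;
};
```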

Mail solves this challenge in an interesting way. The underlying data model “ticks” through generational states. Any thread (specifically the UI thread) can be looking at the data model as it existed at one or more of these generations. While it does, it has a completely consistent, immutable snapshot of a subset of the data model that it can traverse to display UI or formulate action requests. In fact, it can have access to multiple such immutable snapshots. That is especially convenient for a common sticky design challenge in incremental view update: figuring out what has changed between different model states. This approach makes it a model responsibility to support that capability efficiently rather than requiring some independent cache in the view.
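
A rough sketch of the generational idea (my reconstruction of the shape, not the Mail code itself): the model builds the next state off to the side and publishes it atomically; the UI thread grabs a reference and can read that generation consistently for as long as it holds it.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// An immutable generation of (a subset of) the data model.
struct Snapshot {
    uint64_t tick = 0;                          // generation number
    std::map<uint64_t, std::string> subjects;   // UID -> subject line
};

class Model {
public:
    // UI thread: never blocks; the returned snapshot stays valid (and
    // unchanging) for as long as the caller holds the shared_ptr.
    std::shared_ptr<const Snapshot> current() const {
        return std::atomic_load(&current_);
    }

    // Model thread: "tick" to the next generation. Readers still holding
    // the old snapshot keep it alive; new readers see the new one.
    void publish(std::shared_ptr<const Snapshot> next) {
        std::atomic_store(&current_, std::move(next));
    }

private:
    std::shared_ptr<const Snapshot> current_ = std::make_shared<Snapshot>();
};
```

Holding two snapshots at once is what makes the incremental-update case pleasant: the view can diff generation N against generation N+1 at its leisure.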

This concept of leveraging immutability is not new. It is a key feature of functional languages, and there is deep research on how to implement these persistent data structures efficiently. Microsoft Word leveraged immutability from its very earliest days in order to make it possible to partially load and edit large documents without having to read and then write the entire document. Immutability has also become a hot concept with JavaScript developers recently through its use in React-Redux and in libraries like immutable.js.

The approach taken in Mail is a neat and elegant solution that greatly simplifies how we think about the interaction of these different components. However, it’s only a “solution” if it can be done with great performance and minimal memory overhead. This is a nice example of a common pattern in good system design: figure out how to solve one really tricky, sticky problem efficiently and then leave the rest of the system straightforward and boring. I like to call this the “rocket science” pattern. Concentrate your complexity in one small, tight place (use “rocket science”) and keep things simple everywhere else.

So how does Mail accomplish this efficiently?

First, most of the Mail data model is a set of objects identified by a unique ID (UID) and represented by a compact property bag. Objects reference other objects by UID rather than direct pointers. This means that 1) there are not swarms of internal pointers in some kind of complex graph data structure to manage as we implement these generational revisions and 2) every reference to an object goes through a consistent UID-to-object lookup throttle point.
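
In code, that shape might look something like this (field and type names are my invention for illustration):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

using Uid = uint64_t;

// A compact property bag; note the folder is referenced by UID,
// not by a Folder* pointer into some object graph.
struct MessageProps {
    std::string subject;
    Uid folderUid = 0;
    bool isFlagged = false;
};

// The single UID-to-object "throttle point" every reference goes through.
class ObjectTable {
public:
    const MessageProps* lookup(Uid uid) const {
        auto it = objects_.find(uid);
        return it == objects_.end() ? nullptr : &it->second;
    }

    void insert(Uid uid, MessageProps props) {
        objects_[uid] = std::move(props);
    }

private:
    std::unordered_map<Uid, MessageProps> objects_;
};
```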

Also, Mail (like most well-designed apps) has a specific point of view about the dataset sizes it is targeting. While the service-side data store can be multiple tens of gigabytes and the client-side cache might be a non-trivial fraction of that, the actual number of message objects that need to be in memory and in view at any one time is relatively small — in the hundreds to thousands. A cluster of objects that will tend to be referenced with temporal locality can be faulted into memory and represented in compact form.

A component of the model is responsible for providing this UID-to-object mapping. It keeps track of which objects it has passed out references to. Internally, these objects actually contain a list of property bags along with the tick at which each property-bag state first became available. When a thread references the object state, it does so in the context of a specific tick and is returned the appropriate state from the object’s property list: the latest property bag with a tick less than or equal to the requested tick. This has the nice property that the data structure can be incrementally updated, but the new state only becomes available when the generational tick is “published”. This makes the system less sensitive to precise chunking boundaries, since the view can still be accessing a previous model generation while a new one is being created.
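
A sketch of that versioned lookup (an assumed layout; the real structure is more compact and, as described below, maintained without locks):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

using Tick = uint64_t;
struct PropertyBag { std::string subject; bool isFlagged = false; };

// An object holding one property bag per generation in which it changed.
class VersionedObject {
public:
    // Ticks are appended in increasing order as new state is faulted in.
    void addVersion(Tick tick, PropertyBag bag) {
        versions_.emplace_back(tick, std::move(bag));
    }

    // A reader at tick T sees the latest bag whose tick is <= T.
    const PropertyBag* readAt(Tick tick) const {
        auto it = std::upper_bound(
            versions_.begin(), versions_.end(), tick,
            [](Tick t, const auto& entry) { return t < entry.first; });
        if (it == versions_.begin()) return nullptr;  // object not visible yet
        return &std::prev(it)->second;
    }

private:
    std::vector<std::pair<Tick, PropertyBag>> versions_;
};
```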

The “tricky bit” is the code that maintains this property list when new object state is faulted into memory and then the code that walks that property list. This is done in a very careful non-blocking, non-locking way and is exactly the kind of rocket science that should be carefully isolated in your app design. The model also has an understanding of which ticks are actually in use by all the other threads in the system so that it can go back and release the previous state of an object at some convenient time, effectively garbage collecting the storage (this is all C++ based). It can also use this fine-grained knowledge of which objects are in actual use to clean up and optimize the underlying collection of sets of compact property bags. Knowing which objects are in use also allows pre-filtering of change notifications, another important optimization.
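
The garbage-collection side might look roughly like this (again an assumed policy, and this sketch ignores the concurrency that makes the real code rocket science): once no thread is reading below some minimum in-use tick, every version strictly older than the newest one visible at that minimum can never be read again.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Drop versions no live reader can still observe. minTickInUse is the
// lowest tick any thread currently holds a snapshot of.
template <typename Bag>
void pruneVersions(std::vector<std::pair<uint64_t, Bag>>& versions,
                   uint64_t minTickInUse) {
    std::size_t keepFrom = 0;
    for (std::size_t i = 0; i < versions.size(); ++i) {
        if (versions[i].first <= minTickInUse) keepFrom = i;  // newest visible at min
        else break;                                           // ticks are sorted
    }
    versions.erase(versions.begin(), versions.begin() + keepFrom);
}
```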

This very careful accounting and management of the memory that is actually in active use keeps overall memory use down and is critical to performance, not just responsiveness. Beyond keeping the UI always reactive, of course great responsiveness requires that the app actually respond!

Our experience with complex applications (and certainly complex applications that are maintained over time by a large and constantly evolving set of developers) is that it is critical to have very robust mechanisms to ensure that key design goals are met — especially a critical goal like never blocking the user interface.

