Maintaining Dynamic Characteristics of Software Architecture

Written by terrycrowley | Published 2018/06/07


Some Microsoft Office Case Studies

A key challenge in evolving a mature, complex product is maintaining core architectural pillars over time as new features are added. These architectural pillars include both static elements of design (the design of key data structures and class hierarchies) and dynamic characteristics of design (how the code actually executes in practice). I have often found that the dynamic characteristics are much harder to maintain over time.

The static components are inherently easier to understand and describe. You can look at the code and understand the data structures and the algorithms for modifying them. For dynamic characteristics, you need to look at and understand runtime behavior. In some cases, teams work hard to design the software in a way that ensures runtime behavioral constraints are not invalidated as new development happens. For example, an approach Microsoft Office started to use much more frequently over time was to rigorously separate components using threading boundaries and asynchronous coding patterns. There are still complex runtime characteristics of these threaded designs that need to be considered, like resource management and congestion control, but these asynchronous techniques can be helpful in making the runtime design more explicit and more maintainable.
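
To make that pattern concrete, here is a minimal C++ sketch of the kind of thread boundary described above: a component that owns its own worker thread and accepts work only through a posted queue of tasks. This is purely illustrative and not Office code; the class and its interface are invented for the example.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative only: a background worker that owns its own thread.
// Other components interact with it solely by posting closures onto its
// queue, so the runtime boundary between components is explicit in the design.
class BackgroundWorker {
public:
    BackgroundWorker() : worker_([this] { Run(); }) {}
    ~BackgroundWorker() {
        Post(nullptr);            // empty task acts as a shutdown sentinel
        worker_.join();
    }
    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return !tasks_.empty(); });
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            if (!task) return;    // sentinel received, exit the worker
            task();               // runs off the UI thread
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::thread worker_;          // declared last so the queue and mutex exist before the thread starts
};
```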

Alternatively, many products have used techniques like chunking of idle-time work, which tend to be harder to maintain. In this approach, the work is broken into small chunks that can be executed when the program is idle, usually on the same thread as all other UI activity. The program executes a small chunk of work and then hopefully returns to processing user input without any detectable delay. This simplifies thread coordination (since all the key work occurs on a single thread) and was in use before threads even existed as an OS concept. The challenge with this approach is that, to be effective, the programmer needs to ensure a number of subtle properties of the chunking: the chunks need to be kept small so the UI is not delayed, the per-chunk bookkeeping needs to be cheap so it does not add unreasonable overhead to the underlying computation, and the cost of restarting a chunk needs to be small so that the program actually makes effective progress through the work. Note that the goal of keeping chunks small and granular often directly conflicts with the goal of keeping the overhead of chunking small. These are dynamic characteristics of the code and need to be maintained across the code base in a more diffuse way.
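
A minimal sketch of the chunking pattern, with hypothetical names rather than Word's actual interfaces: each idle task does one small, bounded slice of work per call, and the idle loop stops as soon as a time budget is spent so pending user input is not noticeably delayed.

```cpp
#include <chrono>
#include <vector>

// Hypothetical idle-task interface: DoChunk() performs one small, cheap
// slice of work and returns true if more work remains.
class IdleTask {
public:
    virtual ~IdleTask() = default;
    virtual bool DoChunk() = 0;
};

// Called by the main (UI) thread whenever the input queue is empty.
// `budget` is the largest delay we are willing to add before the next
// input event is processed.
void RunIdleWork(std::vector<IdleTask*>& tasks,
                 std::chrono::milliseconds budget = std::chrono::milliseconds(5)) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    for (auto it = tasks.begin(); it != tasks.end();) {
        bool moreWork = (*it)->DoChunk();     // one bounded chunk of work
        it = moreWork ? it + 1 : tasks.erase(it);
        if (std::chrono::steady_clock::now() >= deadline)
            break;                            // yield back to input processing
    }
}
```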

As an example of one complex code base, Word 2016 had over 70 idle-time activities to manage — all without introducing any detectable UI delay.

Most teams obviously also use engineering practices like extensive performance benchmarking, which explicitly measures one aspect of runtime behavior. This is important but inherently downstream in the design and implementation process. Modern applications, both client and service, also make much heavier use of runtime telemetry than was common for earlier desktop applications.

I have lots of examples where problems arise when these implicit dynamic assumptions are broken. An interesting case in Word required a change in a core data structure that had been in place for over 20 years. One of the core internal data structures in Word is a “plex”, which is basically an extensible array of structures (that is, the array grows or shrinks as you insert or delete elements). The implementation of this data structure used a common approach where there is an allocated size that can be larger than the actual used size. Effectively, empty space is maintained at the end of the array so that new elements can be inserted (or deleted) at the end very cheaply, even factoring in the amortized cost of reallocating a new array and copying the existing elements when the allocated size is exceeded. Deleting or inserting randomly into the array is a relatively expensive operation, O(N), since elements need to be copied around to close the open slot or make room for a new one. The core behavior of Word was such that this operation didn’t have to happen very frequently. In particular, loading a document filled the various plex data structures in either a bulk or a linear way, and editing with high locality in one spot in the document did not require significant readjustment.
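
A rough sketch of that “spare space at the end” layout (illustrative, not the actual plex code): appending is amortized O(1) because the slack sits at the end of the array, while inserting into the middle must shift the tail and is O(N).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Extensible array with spare capacity kept at the end.
struct IntPlex {
    int* items = nullptr;
    std::size_t used = 0;       // elements in use
    std::size_t allocated = 0;  // elements allocated; allocated >= used

    void Append(int value) {
        if (used == allocated) Grow();
        items[used++] = value;               // amortized O(1)
    }
    void InsertAt(std::size_t index, int value) {
        assert(index <= used);
        if (used == allocated) Grow();
        // O(N): shift the tail right to open a slot at `index`.
        std::memmove(items + index + 1, items + index,
                     (used - index) * sizeof(int));
        items[index] = value;
        ++used;
    }
    void Grow() {
        allocated = allocated ? allocated * 2 : 8;
        items = static_cast<int*>(std::realloc(items, allocated * sizeof(int)));
        assert(items);                       // error handling omitted for brevity
    }
};
```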

When loading of HTML was added, the dynamic usage changed. The document was initially loaded and then further passes were made over the document that resulted in many insertions and deletions happening in various plex structures that described the formatting. Effectively you had an O(N) algorithm being executed O(N) times — or O(N²) running behavior. This issue was initially missed because for relatively simple HTML, N was small. As Word started to deal with more complex HTML documents (especially in the context of Outlook’s HTML mail component), N grew large and performance suffered.

I had an intuition that a gapped buffer would address the dynamic usage pattern better. I’ve written on gapped buffers before, but briefly, it is an alternate implementation of an extensible array that allows the empty space to occur anywhere in the array and supports inexpensive insertions and deletions anywhere in the array as long as the access pattern exhibits good locality. Before doing the work (someone else actually did the final implementation), I added telemetry to the plex data structure and verified that the access pattern did in fact exhibit high locality. We converted the plex to be a gapped data structure. It still has very implicit constraints on its dynamic usage, but it handles a wider range of dynamic patterns, including the pattern used by Word’s HTML load code.
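
For contrast, here is a minimal gap-buffer sketch (again illustrative, not Word's implementation): the free space can sit anywhere in the array, and moving it costs only the distance moved, so edits with good locality stay cheap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

class GapBuffer {
public:
    explicit GapBuffer(std::size_t capacity)
        : buf_(capacity), gapStart_(0), gapEnd_(capacity) {}

    void Insert(std::size_t pos, int value) {
        assert(gapStart_ < gapEnd_);         // growing the buffer is omitted for brevity
        MoveGapTo(pos);
        buf_[gapStart_++] = value;           // O(1) once the gap is in place
    }
    void DeleteAt(std::size_t pos) {
        assert(pos < Size());
        MoveGapTo(pos);
        ++gapEnd_;                           // absorb the element into the gap
    }
    std::size_t Size() const { return buf_.size() - (gapEnd_ - gapStart_); }

private:
    // Moving the gap costs O(distance); with good locality this is tiny.
    void MoveGapTo(std::size_t pos) {        // pos is a logical index
        if (pos < gapStart_) {               // slide elements right, gap moves left
            std::size_t n = gapStart_ - pos;
            std::memmove(&buf_[gapEnd_ - n], &buf_[pos], n * sizeof(int));
            gapStart_ = pos;
            gapEnd_ -= n;
        } else if (pos > gapStart_) {        // slide elements left, gap moves right
            std::size_t n = pos - gapStart_;
            std::memmove(&buf_[gapStart_], &buf_[gapEnd_], n * sizeof(int));
            gapStart_ += n;
            gapEnd_ += n;
        }
    }
    std::vector<int> buf_;
    std::size_t gapStart_, gapEnd_;          // [gapStart_, gapEnd_) is the free space
};
```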

Another common architectural pattern in document apps is designing algorithms that are incremental, scaling with the cost of the user's edit action rather than with the size of the overall document. Application designers work hard to limit processing so that as the user makes small edits, only a small amount of work is done. This is key to maintaining good responsiveness and is also important for overall power usage and battery life.

One example in Excel is how it determines what to paint on the screen after some arbitrary edit. It starts with a known top-left cell and then can incrementally walk through the column widths and row heights until the screen is filled. This algorithm scales with the size of the screen (which is well-bounded) rather than scaling with the size of the workbook (which may be very large).
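
A sketch of that kind of viewport walk, with hypothetical structures standing in for Excel's: the loop is bounded by how many rows fit on the screen, not by how many rows the workbook contains.

```cpp
#include <cstddef>
#include <vector>

struct Viewport {
    std::size_t firstVisibleRow;  // known anchor (the top-left cell's row)
    int heightInPixels;           // well-bounded by the physical screen
};

std::vector<std::size_t> VisibleRows(const Viewport& view,
                                     const std::vector<int>& rowHeights) {
    std::vector<std::size_t> visible;
    int filled = 0;
    // Walk forward from the anchor, accumulating row heights until the
    // screen is full. The loop runs O(rows on screen) times, regardless
    // of how many rows the workbook contains.
    for (std::size_t row = view.firstVisibleRow;
         row < rowHeights.size() && filled < view.heightInPixels; ++row) {
        visible.push_back(row);
        filled += rowHeights[row];
    }
    return visible;
}
```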

The addition of Office Art shapes to Excel added a twist. Shapes are anchored to a particular cell but can apply an arbitrary offset from that cell position. A consequence of this is that after an arbitrary user edit, Excel really needed to examine the location of every single shape anchored anywhere in the worksheet to determine whether it overlapped with the current viewport. This violated the fundamental incrementality of the view construction process, since the work scales with the size of the user content. In actual usage, the number of shapes tends to be small, so this can usually be computed quickly. But the dynamic architectural design had nevertheless been violated. Over time, use cases arose that involved documents with a large number of Office Art shapes. Editing these documents would get extremely sluggish, even when operating outside of the region with shapes. Eventually Excel needed to expand the design of how shapes were integrated to ensure there was a faster, incremental way of determining viewport overlap.
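
One way such an overlap test could be made incremental (an illustration of the idea, not Excel's actual fix) is to index shapes by their resolved bounding boxes, so that a viewport query only inspects shapes near the viewport instead of scanning every shape in the worksheet.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

struct Rect { int top, bottom, left, right; };

// Hypothetical index of shapes keyed by the top of their bounding box.
class ShapeIndex {
public:
    void Add(int shapeId, const Rect& bounds) {
        byTop_.emplace(bounds.top, std::make_pair(shapeId, bounds));
        maxHeight_ = std::max(maxHeight_, bounds.bottom - bounds.top);
    }
    // Shapes whose boxes vertically overlap [viewTop, viewBottom).
    std::vector<int> OverlappingShapes(int viewTop, int viewBottom) const {
        std::vector<int> hits;
        // Any overlapping shape has top >= viewTop - maxHeight_, so the scan
        // starts there and stops once tops pass the bottom of the viewport.
        for (auto it = byTop_.lower_bound(viewTop - maxHeight_);
             it != byTop_.end() && it->first < viewBottom; ++it) {
            const Rect& r = it->second.second;
            if (r.bottom > viewTop) hits.push_back(it->second.first);
        }
        return hits;
    }
private:
    std::multimap<int, std::pair<int, Rect>> byTop_;
    int maxHeight_ = 0;   // coarse bound; a real index would be finer-grained
};
```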

This is also an interesting case study since Office Art really arose with PowerPoint as the core design point/model and PowerPoint can leverage its strong slide boundary to provide a degree of incremental update using an approach that was not available to Excel. We often find that some of the most difficult challenges in maintaining dynamic constraints are when integrating large components that originally evolved separately with different constraints.

The constraint of only allowing an amount of processing that is incremental to the user's input seems especially prone to violation. If a new feature does not just fit into an existing design pattern, it often requires non-trivial extension of core data structures and algorithms to support it. A common engineering pattern that teams need to guard against is one where an initial non-incremental version of the feature is designed and implemented with the intention that “we’ll go back and make it incremental later”. Since the incremental design may be the hardest part of the overall feature, this is rarely a good approach. I have explicitly cut features that were “working” when we could not develop a cost-effective way of implementing incremental update.

Within a team, an important part of the training and learning that needs to happen is making sure dynamic constraints are as well understood as the static design of the core data structures. Dynamic characteristics are fundamentally harder to understand and maintain, so extra effort needs to be made to ensure the whole team has a deep understanding of them. In the best cases, there are rigorous tests in place (e.g. performance tests) that specifically verify that dynamic constraints continue to be met. That is especially challenging (as in the examples above) when the completeness of validation is strongly dependent on characteristics of the data set being operated on.

