Taking Office Agile

Written by terrycrowley | Published 2017/07/17
Tech Story Tags: agile | software-engineering | microsoft-office | taking-office-agile | agile-methodology

OK, that title is click bait. I always found the term “agile” a little cultish. What I really want to talk about are the engineering and organizational changes the Microsoft Office team went through as we changed our core product development cadence from shipping a new major version of Office every 2–3 years to shipping new features monthly or faster.

This post will focus on the engineering changes, but they were part of a tightly integrated set of issues and changes that revolved around product, business model and engineering. I'll touch on the product and business model changes in order to give the context and motivation for the engineering changes.

For most of Office history our strategy was driven by the large cost of deploying new versions of Office to customers' machines. The term "cost" here covers everything that made this an expensive proposition for our customers: distributing new updates and versions, the planning and labor costs of actually doing upgrades, and any challenges in business process disruption, application compatibility, device compatibility and capability, training or support required to roll out new versions.

An additional large factor driving the business strategy was customer acquisition cost. The primary way that new versions of Office were acquired and deployed was as part of the process of acquiring a new PC or, less frequently, deploying a new version of Windows to an existing PC. All these factors drove Office, like other major PC applications, to a model of large releases that shipped every 2–3 years. Additionally, the focus on shipping "Office" rather than a set of independent applications drove towards a process where the entire suite, including shared code and a consistent set of shared features, was shipped together as a single unit.

The engineering system was tuned to deliver on this. A months-long planning process determined a consistent vision and shared big bets that were implemented across the suite. Features were implemented over a series of coding milestones and then integrated and stabilized across a series of wider and wider beta releases. A long bug burn-down process finally culminated in "RTM" — Release to Manufacturing — the point where disks were stamped, boxed and shipped to stores (although for later releases our biggest enterprise or OEM customers only received a small number of disks or disk images distributed over the Internet that they duplicated and deployed).

The shared schedule and planning process was critical in allowing us to optimize development of a consistent set of code and features across the suite. That consistency and the engineering system that delivered it was a key part of the integrated Office suite strategy.

At the same time as we were optimizing and executing on this strategy, there were a lot of underlying changes happening both at Microsoft and across the wider industry. From very early days, the Internet had been used for application delivery. As Internet use expanded into the consumer world, browsers trail-blazed rapid updating of new releases. When I joined Microsoft as part of the FrontPage acquisition in 1996, we released a Beta of the first Microsoft branded FrontPage release on the Internet and saw 500K downloads in two weeks. That was definitely a “we’re not in Kansas any more” moment for the team. The automatic delivery of updates over the Internet became even more critical as security issues exploded over the next few years. We gained lots of experience with the challenges of automatically updating software with minimal disruption (or not so minimal during that early learning process).

Over this period, more major products were being delivered as services through the browser. Search engines, home pages and other services had a different business model (mostly advertising of one form or another) that did not require large intermittent releases to drive new purchases. Updating a large Internet-scale service is in no way trivial, but it is vastly simpler than trying to update 100’s of millions of client machines. This enabled a much more rapid updating process for services. The rapid updating process in turn enabled new engineering strategies where teams could run real experiments on the live system and move to a rapid data-driven design process. Test-in-production enabled testing of both the success of new designs and the stability of new code using real customer activity. While test-in-production has been criticized as turning the customer into your test team, in reality the incredible variability of customer scenarios and environments means that it is virtually impossible to fully stabilize a complex product prior to release. We have effectively always been doing test-in-production, we just were not optimizing the engineering system to enable it. Embracing test-in-production makes it possible to minimize the customer impact of failures and maximize what is learned from real usage.

From a product perspective, a huge benefit of services is having all customers on the latest release. From Office 97 on, the most significant competitor to new versions of Office was prior versions of Office. Especially for the most valuable features that exhibit network effects (like sharing and collaborative authoring), having all your users on the same version is a huge benefit. What is critical is having both a product strategy and a business model (typically subscription or advertising) aligned to deliver that.

The agile software movement (the Agile Manifesto was published in 2001) has a long history, with many roots and many different methodologies in current use that share characteristics. The key consistent strand is a focus on getting code into customers' hands quickly. This enforces engineering rigor and enables rapid learning and faster iteration toward a better product. This focus naturally aligns with a service product model, so agile and its various incarnations have been tightly linked with service engineering.

There were a number of changes happening both in the Office product and engineering system that were driving more of a focus on a service mentality during this period. More of "Office" now consisted of servers like Exchange, SharePoint and Lync (later Skype). Important features like user assistance, clip art and templates were developed as services accessed by the rich Office clients rather than shipped as part of the packaged product. This enabled a rapid content update process. In 2006 we started development of the Office web applications, which were delivered both with our servers and, integrated with the OneDrive cloud storage service, as a free consumer service. Beyond the integration with content services like clip art and templates, it was clear that key value in the rich clients could be delivered because they were tightly integrated with services. We had been living this and delivering it in the product for mostly enterprise customers since the early 2000's with the Office client integration with FrontPage and then SharePoint servers. The competition with Google Apps was also a significant influence during this later period. While our work to integrate the clients with SharePoint and Exchange had made it clear to us how much value a server infrastructure could provide to the client experience, Google Apps really demonstrated how effective it was to be able to guarantee a service infrastructure was always available to every customer as you were designing features. Key competitive features like multi-device access, sharing and collaborative editing are only possible with an underlying service infrastructure.

The transition from Office clients and servers to the actual Office 365 service product has many threads through the years. I am not going to give a full history here. There had been hosting providers that would run Exchange or SharePoint servers for companies since the early days of those products. This provided customers with some of the benefits of outsourcing the IT responsibilities for managing their own servers but had a number of shortcomings compared to a real cloud-based product design and service infrastructure. The biggest long-term advantage of a real service model is creating a full feedback loop from the customer to the engineering team ("DevOps" writ large). Ultimately this has consequences for everything about the product, product architecture, engineering infrastructure and business model. In fact, most of the challenge in managing a transition like this is creating and managing the right feedback loops so that feedback arrives in a form that is actionable by the right set of people in the right timeframe to drive the changes you need to make.

The final large external trend driving change during this period was the rise of mobile and the importance of cross-platform development. With a service model, the client application becomes a projection of the service onto the device. In contrast to older cross-platform dynamics where an additional platform was about expanding your addressable market of customers, the multi-device world is often about a single user accessing your service from multiple devices, essentially simultaneously. The lack of a client projection on a device (or a poor projection) impacts a much larger slice of your customer base. Our long product cycles had been optimized for a time when Windows represented over 90% of the client install base. I will talk more about the consequences of this below.

Let me move away from all this background and get into more specifics of what was happening on the ground in Office.

I took over as head of Office development (one of a triad of leaders respectively for dev, test and program management) halfway through the Office 2007 product cycle in November of 2005. Office 2007 was a major release with the introduction of the Office ribbon and the new Office Open XML file formats. After every product cycle, we would do a "post-mortem" to figure out what went well and what needed to be improved in the engineering process. There were a couple of key engineering changes that I wanted to implement as we transitioned into the Office 2010 cycle in early 2007.

The first was getting much more consistent about the cadence and quality of the central Office build process. Best practice for virtually every software project is to build a clean validated build with all recent changes from scratch on a regular basis. Where possible, this happens on virtually every submitted change, but depending on the complexity and size of the overall system, how well dependencies are isolated and identified, and how complex and time-consuming the build and validation process is, some amount of batching of sets of changes is necessary. The more batching that happens, the harder it is to isolate any problem to a specific engineer or change and the longer it takes to recover — which then leads to more changes getting batched together for the next build and an even longer time to recover. It is a vicious spiral. For Office 2007, it had sometimes taken weeks to get a clean build released. This was not completely catastrophic because individual engineers and teams could isolate themselves within branches, but it exacted an ongoing tax in labor, delay and uncertainty when trying to integrate changes across teams.
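
To make that cost concrete, here is a minimal sketch (in Python, and emphatically not the actual Office tooling; build_and_validate is a hypothetical stand-in for a full build plus validation pass) of what recovering from a broken batched build involves. Each probe in the loop is another multi-hour build, which is why smaller, more frequent batches recover so much faster.

```python
# A minimal sketch of why batching hurts: when a batched build breaks, the
# offending change has to be hunted down by bisection, and every probe is
# another full build-and-validate pass.

def find_breaking_change(changes, build_and_validate):
    """Return the first change in `changes` that breaks the build.

    Assumes the base (no changes) is known-good and the full batch fails.
    `build_and_validate(prefix)` stands in for a complete build plus
    validation run, so the O(log n) probes here translate into hours of
    wall-clock recovery time.
    """
    lo, hi = 0, len(changes)      # invariant: prefix `lo` passes, prefix `hi` fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_and_validate(changes[:mid]):
            lo = mid              # breakage is later in the batch
        else:
            hi = mid              # breakage is at or before mid
    return changes[hi - 1]

# Toy usage: change "c" is the one that breaks the build.
batch = ["a", "b", "c", "d", "e"]
culprit = find_breaking_change(batch, lambda prefix: "c" not in prefix)
assert culprit == "c"
```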

For Office 2010, I wanted to get to a much more predictable cadence, initially at least twice a week. (If you’re wondering “what dinosaur did I ride in on?”, I’ll just ask you to give me the benefit of the doubt that building a system with 100’s of millions of lines of code, terabytes of build output, thousands of output product images in 100’s of languages that supports thousands of engineers making thousands of changes a day and that builds on thousands of build and validation machines is a non-trivial task.) I told the manager of the team responsible for the build infrastructure that I wanted him to tape a report that tracked our build performance to my office door each day. It took us two months of iteration to actually make sure that what we were tracking in that report actually reflected the performance of the build system as experienced by an individual engineer. I taped it to my office door for visibility — the truth was the value wasn’t in my seeing a report, the value was getting to a state where I knew the team was focused and looking at the problem in the right way. For any kind of complex system, just making sure you are measuring and iterating on the right things is a huge part of the problem.
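
As a rough illustration of what "measuring the right thing" meant in practice (the names and data shapes below are invented for the example, not the actual report), the number that matters is the latency an individual engineer experiences between submitting a change and being able to sync a validated build that contains it:

```python
# Hedged sketch: measure build performance as experienced by an engineer,
# i.e. hours from check-in until the first validated build containing it.

from datetime import datetime
from statistics import median

def checkin_to_build_latencies(checkins, validated_builds):
    """checkins: list of (change_id, submit_time).
    validated_builds: list of (build_time, set_of_change_ids), sorted by time.
    Returns hours from submission to the first validated build containing
    each change."""
    latencies = []
    for change_id, submitted in checkins:
        for built, included in validated_builds:
            if change_id in included and built >= submitted:
                latencies.append((built - submitted).total_seconds() / 3600)
                break
    return latencies

# Toy usage with two changes and one validated build.
t = datetime
checkins = [("c1", t(2009, 3, 2, 9, 0)), ("c2", t(2009, 3, 2, 13, 0))]
builds = [(t(2009, 3, 3, 6, 0), {"c1", "c2"})]
lat = checkin_to_build_latencies(checkins, builds)
print(f"median latency: {median(lat):.1f}h")   # 19.0h for this toy data
```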

The other key change was to make sure we were dogfooding on a regular basis. Office 2007 had a large set of disruptive changes (especially the new file formats) that pushed out dogfooding to very late in the release. For Office 2010 we targeted producing a dogfood build at the first official build of the release, at the end of each milestone and then regularly during the final beta periods. Producing a dogfood build as part of that very first official build might seem trivial — after all, no significant development is supposed to be in progress at that point — but in fact we had a history of introducing all kinds of destabilizing changes in the period immediately after the previous project completed ("M0" or "Milestone Zero" in internal parlance). The long stabilization period leading up to RTM with only a small number of changes getting accepted into the product would often provide opportunity for teams to get a "head start" on disruptive architectural changes as well as being the period where major changes to our engineering systems were implemented. By producing a dogfood build right at the start, as well as requiring teams to match or beat the performance metrics of the shipping product, we ensured that we did not start the release in a hole of instability.

I also switched our strategy for changes to the engineering systems. We had previously tried to front-load our engineering changes in M0 so the engineering system was stable and ready for development at the start of the coding milestones. The actual result was pretty much the opposite, since it resulted in all the big changes getting jammed into this period and inevitably leaking, disruptively, into the regular coding period. Instead, we started transitioning our build engineering team to a service mentality. They could continue to make changes through the release, but they needed to maintain a continuously stable system and roll the changes out in a non-disruptive way.

The benefits of these changes paid off significantly during the 2010 cycle and for the Office 2013 cycle we decided to go with a “simple message” — a build every day that was delivered as a dogfood build every day. The ambitious target for the build team meant that they needed to think disruptively about our whole engineering process top to bottom. This was a great example for me of how you unlock innovative thinking and empower your team by framing an ambitious goal. We did do significant modeling and analysis that convinced us the goal was feasible before we committed the team to delivering on it.

The dogfood goal was also one that combined an ambitious target with sufficient analysis of feasibility to give us confidence we could make it work. In previous releases, simply the time required to uninstall and reinstall a new build would have made broad daily dogfooding onerous. For Office 2013 we were able to leverage (and significantly enhance) the "Click-To-Run" technology that we had integrated for delivering and installing new Office updates. This technology (originally App-V) now underlies the isolation mechanisms that support Win32 applications in the Windows 10 store.

The daily build and daily dogfooding message also reinforced a message of continuous stability that we had been pushing for a while. One thing this experience reinforced was that you can jawbone all you want about "proper" engineering strategies, but if you don't have some way of putting real teeth in it, you won't get the behavior you are looking for. As with most agile methodologies, the teeth came through real deployment and usage. We had a sufficient number of internal users to force a focus on maintaining a much higher degree of stability than we had ever achieved in prior releases. It didn't hurt that the head of Office at the time, Kurt DelBene, arrived into the office early and was always one of the first users of a new daily build. Getting a bug report from Kurt tended to focus a team's attention on both the immediate problem and how to change development and validation strategies so impactful bugs didn't slip through in the future. The simple-to-state target of "dogfood build every day" drove deep changes all the way back through to developer designs and implementation plans.

Maintaining ongoing stability is a good example of how externalities can arise in big organizations. It can be easy for an individual engineer or team to optimize their own development timeline in a way that leaves the build in an unstable state and exacts real costs on the rest of the team. Those costs are hard to measure but significant. Forcing every engineer to maintain stability (and putting teeth in it by rapid dogfooding) drives the right way of thinking about these costs. How to internalize these types of externalities was another big theme of how we organized our processes.

A big potential issue that we worried about in moving to daily dogfood was how to deal with file format compatibility on a build-to-build basis. The file formats are designed with general mechanisms to implement forward compatibility — the ability to enable new features in new versions that can be safely ignored by older versions of the applications. Forward compatibility is always surprisingly tricky and sometimes expensive to implement properly, so in prior releases we only enforced it at major milestones. Allowing incompatibility build-to-build could destroy people's confidence in using the daily build for their real work, so an important requirement was to maintain build-to-build compatibility throughout the entire release. We had never done this before and I spent time with the application development managers to make sure they were fully signed up. They signed off on the plan and despite our worries, it never became an issue through the release. It was a good example of something that seems "impossible" until you just make it the normal operating process of the team.
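
For readers unfamiliar with the idea, here is a deliberately simplified sketch of forward compatibility (illustrative Python, not the actual Office Open XML mechanics): a reader built for an older version walks the document and ignores elements it does not recognize, so a file written by a newer build still opens cleanly in the older one.

```python
# Simplified sketch of forward compatibility: keep the elements this
# (older) version understands, silently skip anything newer.

import xml.etree.ElementTree as ET

KNOWN_ELEMENTS = {"paragraph", "run", "table"}   # what this reader understands

def load_document(xml_text):
    """Parse a document, keeping known elements and skipping unknown ones
    instead of failing. Real formats also need rules for when ignoring is
    *not* safe (e.g. required extensions), which is part of why forward
    compatibility is trickier than it looks."""
    root = ET.fromstring(xml_text)
    body = []
    for element in root:
        if element.tag in KNOWN_ELEMENTS:
            body.append((element.tag, element.text))
        # else: a newer feature this version doesn't know about -- ignore it
    return body

doc = """<document>
  <paragraph>Hello</paragraph>
  <inkAnnotation>newer-feature data</inkAnnotation>
  <paragraph>World</paragraph>
</document>"""
print(load_document(doc))   # [('paragraph', 'Hello'), ('paragraph', 'World')]
```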

Another outcome we noticed due to the faster build cadence was that a whole set of coordination issues just melted away. In previous releases, teams integrating changes would spend lots of time coordinating when changes were checked in and when they would appear in new builds so the next team could stage their dependent changes. If a team missed a build it could push out a deadline by weeks. The faster build process served as a rapid "carrier frequency" that made many of these coordination issues disappear. "It will be in the next build" — which was soon enough that no special coordination was necessary. When you can get away from "optimizing" coordination and just make it disappear you have made a key breakthrough. Trunk-based development (where a development team avoids long-lived branches and forces all changes into a single main trunk) is a related example of a simply-stated approach that ends up having deep consequences in upstream behavior and large impacts in simplifying coordination.

The Office 2013 release marked a big milestone as we introduced the consumer subscription and embraced a cross-platform approach with Office for iPad, followed by releases of Office for iPhone, Office for Android and the versions built on the new Windows Runtime API (WinRT). The subscription offering meant that we were moving to a business model that aligned with continuous delivery of new value.

Our cross-platform strategy is worth a separate post on its own. The key engineering decision was that we were not going to continually "port" changes between these different versions. We believed that going forward, the ability to rapidly deliver features across all endpoints was going to be important so we wanted to have a single stable trunk of shared code that could be built for each platform rather than using a branching and porting approach that would require complex coordination and force feature skew or hard deadline tradeoffs between platforms. Although Office has a long history of cross-platform development across Mac and Windows (essentially since the origin of the main applications), code was usually branched from the version built for the main target platform (Win32 for most of its history) and then ported to secondary platforms. In this new approach, there would still be platform-specific code, but the large bulk of the code base would be isolated as platform-independent components that could be mostly validated on one platform but then built and shipped for other platforms immediately.
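
A toy sketch of that layering (hypothetical names, not the actual Office code base): the bulk of the logic lives in platform-independent components that talk to a small, explicitly platform-specific surface, so the same shared code can be built and shipped for every endpoint.

```python
# Illustrative layering: a thin platform-specific interface plus
# platform-independent components that never touch OS APIs directly.

from abc import ABC, abstractmethod

class PlatformServices(ABC):
    """The thin platform-specific layer each endpoint implements."""
    @abstractmethod
    def show_alert(self, message: str) -> None: ...
    @abstractmethod
    def storage_path(self) -> str: ...

class Win32Platform(PlatformServices):
    def show_alert(self, message): print(f"[Win32 dialog] {message}")
    def storage_path(self): return r"C:\Users\me\AppData\Local\App"

class IOSPlatform(PlatformServices):
    def show_alert(self, message): print(f"[UIAlertController] {message}")
    def storage_path(self): return "/var/mobile/Containers/Data/App"

class DocumentEditor:
    """Platform-independent component: identical source on every platform."""
    def __init__(self, platform: PlatformServices):
        self.platform = platform
    def save(self, name: str):
        self.platform.show_alert(f"Saved {name} to {self.platform.storage_path()}")

DocumentEditor(Win32Platform()).save("report.docx")
DocumentEditor(IOSPlatform()).save("report.docx")
```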

As we started the Office 2016 cycle in early 2013, we made an important, but still staged, transition. The main trunk needed to stay stable at all times. We would move away from a schedule based on distinct coding milestones and stabilization periods and instead do short-term planning based on monthly sprints where the results of development needed to be integrated into a build and reach ship-level quality each month. We would still maintain a separate branch for what would become “Office 2016”, but key components of code from this branch would ship to customers regularly as part of the web app releases as well as eventually key core components for our cross-platform products (especially Office for Android in January 2015).

We were still not at the stage of releasing the full development output as a public release every month, but we expanded our internal test loop from the Office team (1000's of users) to a broader slice of Microsoft product teams (10's of thousands of users). This increase in usage also meant we could leverage automated telemetry to a much higher degree.

We discussed whether to directly leverage some specific agile methodology but decided that our scale and requirements meant that it would make more sense to talk about the process changes we were making in our own terms (although we did leverage common industry terms like “sprint” and “backlog”). We described the process as a cycle of “the 5 C’s” — Continuous Planning, Continuous Integration, Continuous Stability, Continuous Deployment, Continuous Insight. (And yes, it was hokey to get to 5 C’s by sticking the word “continuous” in front of each stage.)

These describe a cycle where each stage interfaces with and feeds into the next. A key part of planning was how to break down the development and release of major new features into small, testable chunks that could be developed, validated and deployed incrementally. Each of these changes needed to leave the product in a fully stable state. Continuous deployment, as I mentioned above, is really the stage that powers the overall cycle — without deployment and usage the motivation to really embrace continuous integration and stability is missing. Deployment, with the proper underlying mechanisms for rich telemetry, enables you to gain insight from real usage and then use those insights to drive the next stage of planning for the next turn of the cycle. We tracked key metrics for each of these stages but the real enforcement came from maintaining a continuous cadence of deployment.

As part of the overall process changes, we also rolled out a big change in our organizational strategy. For several decades, most Microsoft engineering teams were organized into three coordinated discipline-based teams — dev, test and program management in a rough ratio of 2–2–1 (that is, 1–1 for the core dev/test ratio). Over the two decades I was at Microsoft, the relationship between dev and test had grown closer. In earlier releases, testing of some piece of code might happen months after core development had occurred. Over the years, this grew tighter and tighter with more and more validation and writing of automation happening before a feature was considered done. Despite this, we still depended on a relatively long stabilization period. There was also a long-simmering dissatisfaction with the general approach of depending on a large suite of end-to-end automation tests that were extremely hard to make as reliable as we needed. It was clear that we could not simply compress our previous processes and the changes we were trying to make would impact the type of work and skills we would need going forward.

We began de-emphasizing test-written end-to-end automation and pushed out in both directions: on one end, a much greater emphasis on dev-written unit and component tests that could be made fast and reliable; on the other, a greater emphasis on telemetry and other automated mechanisms to control which features were enabled, what code was executed and what we learned from actual usage. We experimented in different teams with different ways of organizing the existing dev and test teams, including in some teams moving to a single combined engineering team. Part way through the release there was enough evidence to make it clear that we could make the best local prioritization and resource decisions by having a single combined engineering leader on each team.
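
A hedged sketch of what that telemetry-and-gating end of the spectrum can look like (the flag table and event sink are hypothetical stand-ins, not the actual Office infrastructure): a new code path is enabled for a slice of users and instrumented so real usage tells you whether to widen the rollout.

```python
# Illustrative feature gating plus telemetry: gate a new code path for a
# percentage of users and emit events so real usage drives the decision
# to widen (or pull back) the rollout.

import hashlib, json, time

FLAG_ROLLOUT = {"new_paste_engine": 0.10}   # enable for roughly 10% of users

def is_enabled(feature: str, user_id: str) -> bool:
    """Deterministic per-user gate so a given user always sees the same behavior."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    return digest[0] / 256 < FLAG_ROLLOUT.get(feature, 0.0)

def emit(event: str, **props):
    """Stand-in for the telemetry pipeline; in practice these events feed
    dashboards and alerts that inform whether to widen the rollout."""
    print(json.dumps({"event": event, "time": time.time(), **props}))

def paste(user_id: str, clipboard: str) -> str:
    if is_enabled("new_paste_engine", user_id):
        emit("paste", user=user_id, engine="new")
        return clipboard.strip()            # new code path under test
    emit("paste", user=user_id, engine="old")
    return clipboard                        # old, known-good path

print(paste("user-42", "  hello  "))
```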

I came to this final decision when I sat down with the dev manager of our Office web apps team and asked a simple question — if she was responsible for both the test and dev teams, would she have the same priorities for the test team as what they were working on now? The answer was no, backed up by thoughtful reasoning and a clear set of alternative priorities. The key purpose of organizational structure is to align resources and priorities — it was time to make a change.

This was not an easy transition. Our testing strategy had been focused heavily on automation for some time so recent hires out of college were relatively easy to transition into development roles. Leads and managers in the test organization often did not have as deep a programming background. As we looked to combine organizations, most of the leadership positions went to engineers from the development organization, who also tended to have longer tenure in role. Many former test leads and managers who ended up in smaller roles left over time. There were arguments for doing this reorganization in a more "ruthless" — and perhaps quicker — manner, but ultimately I think we saw results that were the right outcome for the team while still ensuring that each engineer had an honest opportunity to find an appropriate role. The underlying driving force was well understood — the deeper integration of initial development and validation changed the type of expertise we needed in the organization. It is clear that the systems and strategies needed for validation are no less — and in many ways more — complex than the product itself. One can argue it was past time for this change.

The release of Office 2016 represented a “final” watershed where we moved to a model where the complete output of the engineering team was targeted for a monthly release cycle. Those are pretty big quotes around “final”, since we spent the next year tuning tools and processes so that we could execute on that process with super high reliability and consistent timing. It continues to be an ongoing process, especially gaining the confidence of enterprise customers who had grown comfortable with a process where they would freeze the Office version they deployed for anywhere from 2 to 6 years. This was terrible for our ability to provide new value quickly and inevitably we would get discordant feedback from the very same customers about the need to not change things and the need to innovate more quickly. Another big investment has just been getting to a faster and faster build process — that is such a fundamental constraint in moving quickly.

Looking back, an obvious question is whether we could have made this journey more quickly. Whenever you navigate such an organizational change, if you avoid catastrophe or undue drama it can feel inevitable in retrospect. Certainly at the time, I was very aware that we were undertaking a fundamental change in a business that generated 10's of billions of dollars of revenue — and profit — for the company and that was in daily use by 100's of millions of people for core parts of their lives and businesses. Care seemed wise. There were cautionary tales playing out within the company at the same time that had not gone as well. One incident that drove home for me that what we were doing might actually be hard came when my peer in program management spoke at an internal technical conference. He was describing the upcoming changes in Office 2013 and almost in passing noted the investments in engineering we were making that allowed an average of 1000 checkins a day into the main branch, which was then built and deployed to thousands of desktops within a day. Senior engineers from other parts of the company were convinced that he had gotten the details wrong — that simply wasn't possible.

It was possible. Being able to articulate a critical business imperative that was driving the changes was key. Establishing feasibility before pointing the team down a path was also critical. "Establishing feasibility" sounds bloodless but often the process was actually some key passionate engineer completing a deep analysis or building a model or prototype that created trust and belief. The Click-To-Run installation tooling was one example, where one of our senior engineers, known for an ability to bull through complexity, was convinced he could take on the notorious complexity of the install process and build something that would change the way we worked. My style is usually point-by-point deep analysis but sometimes you need someone who will bring some key clarifying insight and then just keep on digging and chipping away at a nasty problem until it is solved.

We were continually trying to create a structure and processes where individuals and teams got clear feedback about how their local investments and process changes were improving their own world. I mentioned in another post that a big part of my job was designing — and debugging — feedback loops. That was a huge part of the changes here because we were trying to alter the behavior of the entire organization.

If your engineering system isn’t improving, it’s degrading. Developers are writing more code. They are changing the engineering system underneath. They make some change to use a new language feature and you suddenly find that your build has slowed down 50%. The system you are building is getting more complex — which also means it is getting exponentially harder to validate each change. Having hard criteria you use to measure your engineering performance — like a build a day, deployed each day — helps establish clear goals to drive investment levels in the underlying engineering system.
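
As a small illustration of putting hard criteria on the engineering system itself (the thresholds here are invented for the example, not actual Office gates), a check like the following keeps a slow build from quietly becoming the new normal:

```python
# Illustrative build-time regression gate: compare the latest build
# duration against a rolling baseline and flag regressions early.

from statistics import median

def check_build_time(history_minutes, latest_minutes, allowed_regression=0.15):
    """Return (ok, message). `history_minutes` is recent build durations;
    fail the check if the latest build is more than `allowed_regression`
    slower than the median of the recent history."""
    baseline = median(history_minutes)
    limit = baseline * (1 + allowed_regression)
    if latest_minutes > limit:
        return False, (f"build took {latest_minutes:.0f} min, baseline "
                       f"{baseline:.0f} min (+{allowed_regression:.0%} limit exceeded)")
    return True, f"build time {latest_minutes:.0f} min within limits"

ok, msg = check_build_time([62, 60, 65, 61, 63], latest_minutes=95)
print(ok, msg)   # False: 95 min against a 62 min baseline
```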

