Windows as a Service?

The recent issues with the quality of Windows releases have reopened discussions of what it means to deliver “Windows as a Service”. When Windows 10 was released, Microsoft made a big deal that this was “the last version of Windows”. From now on, Windows would be delivered as a series of continuous, incremental feature releases rather than as a big bang release every 3 years or so.

Peter Bright’s article on Ars Technica that I linked to above tries to debug what is going on with Windows. The point he makes in the article is that the problem is not the pace of Windows updates but rather the process used to develop those updates. There was a lot of speculation and naiveté (about software development in general) in his discussion, but I think he probably got the basic diagnosis right.

I will say that trying to diagnose a vast, complex process by making a big deal about a few bugs is very problematic. It is reminiscent of all the huffing and puffing that went on last year about the quality of Apple releases based on some high-profile bugs. It is like saying we have a global warming problem and using the temperature of Death Valley as evidence. It may very well be the case that we have a global warming problem (we do) but you need better evidence and a better understanding of the overall model and system to actually make a compelling argument.

This summer I posted an article that described the process Microsoft Office went through to make a similar switch from large multi-year releases to regular ongoing monthly releases across all platforms. That involved changes to the business model, organizational structure, engineering roles, and engineering infrastructure and processes, as well as large changes to the product and code itself. It was a decade-long process that is still ongoing. It was probably the hardest thing I did while at Microsoft.

When you do hard things as part of a large organizational effort, it can sometimes be difficult to identify what difference you actually made. For this transition, I had the interesting comparison of looking at another large organization going through the same transition at the same time. It served as a model for an “Alternative History” of the kind you don’t typically get the opportunity to compare against.

We made the decision in Office at the start of the process that we could not simply shrink our existing processes down. Everyone, from the individual engineer through leadership, needed to embrace that this was a different process and was going to require major structural changes in the organization and major day-to-day changes in the way individual engineers worked.

We encapsulated and communicated this as the “5 Cs”: Continuous Planning, Continuous Integration, Continuous Stability, Continuous Deployment and Continuous Insight. These elements form a cycle. Planning breaks features down into smaller component changes that can be integrated safely while maintaining full stability. Those changes are deployed quickly (daily or faster, to thousands of desktops, devices or services), generating telemetry and insight that then feed back into the planning process.

These elements will not be unfamiliar to anyone working on an agile process — we were just trying to do it on a scale and level of complexity that was unprecedented.

At the same time, Windows was also embracing the “Windows as a Service” strategy. They were making fundamental changes in organization and process (notoriously eliminating a large part of the testing organization before actually establishing replacement validation processes). However, fundamentally they took an approach that looked much more like compressing and shortening the existing processes rather than overhauling them.

The biggest difference was that Windows would continue to allow partial and incomplete features with known defects to be checked into the build and then have a separate process and period to finish and stabilize them.

My point of view is that in any large process, there are only a few key points of intense leverage you have that can generate the right feedback signals. If you get it right, these points of leverage provide a clear signal both to individual teams and engineers in the organization and up the chain to leadership. In our case, it was driving this model of continuous deployment of stable, usable builds. That requirement generated lots of creative strategies on individual teams and a large amount of engineering tooling, but it was very easy to communicate and relatively easy to measure as well. It was essentially a leap of faith (well, heavily vetted and argued) that generated an incredible burst of creativity across the organization.

If you don’t enforce this, it is very easy for a “tragedy of the commons” to develop where individual engineers and teams optimize locally and don’t understand how their instability and incompleteness incur cost and overhead on other teams. In fact, we used to pay heavily for these costs even in our previous “big bang” engineering strategy but did not have the business model forcing a major change in strategy. The switch to a multi-platform, service- and subscription-based business served as the motivating change.

The Windows team has made lots of arguments about why they are different, but I think they are going to continue to suffer from these problems until they make the leap to continuous stability (and develop the tools and infrastructure necessary to deliver it). It has to start with the individual engineer. If the engineer does not understand it as part of their day-to-day job responsibility, you have lost your leverage. The alternative then involves imposing painful heavyweight processes that crush the engineer’s spirit rather than harnessing their creativity.