Software Ecology

Written by terrycrowley | Published 2018/09/25

I was forwarded this recent blog post by Nikita Prokopov and felt like it deserved a response. The post falls into the “we programmers suck” class of rants. It stands out by providing a rich set of examples and even some data about how exactly we suck. It also ends with a “Better World Manifesto” as he tries to start a movement to improve software quality.

The parts of the rant that are least contestable are related to software bloat, both on disk and in memory. The on-disk cost is mostly code; it translates pretty directly into longer program launch times as the code is loaded, and then into the ongoing cost of holding that code in memory. Data for iOS, Android, and Windows all reliably show massive growth in both OS and application image sizes. It is a bit more subjective to argue that this size growth is not balanced by an equivalent growth in features, but I don't actually have much argument with that conclusion.

The classic response to accusations of bloat is that this growth is an efficient response to the additional resources made available by improved hardware. That is, programmers are investing in adding features rather than in improving performance, reducing disk footprint, or otherwise fighting bloat, because the added hardware resources make the effective cost of that bloat minimal. The argument runs that this is in direct response to customer demand.

When you see widespread, consistent behavior across all of these different computing ecosystems, it is almost certainly the case that the behavior is a response to direct signals and feedback rather than a moral failure of the participants. The question is whether it is an efficient response.

The more likely underlying process is that we are seeing a system with significant externalities. That is, the cost of bloat is not directly borne by those introducing it. Individual efforts to reduce bloat have little effect, since there is always another bad actor out there to use up the resource, and the improvements do not accrue to those making the investments. The final result is sub-optimal, but there is no obvious path to improving things.

My experiences in Microsoft Office and with Windows certainly reinforce this model of what is going on. Keeping sizes down is hard. It was easiest when there was a clear resource cost or limit, such as the number of floppy disks needed to install the entire suite. Moving to CD did not radically change the overall constraints, and we still required a "sweatbox" process of trying to drive the image size down at the tail end of the release. Moving to DVD relaxed that forced constraint and quickly resulted in increased disk image sizes. Moving to Internet delivery re-imposed constraints and required significant work to drive sizes back down. Sharing code across mobile platforms with tighter resource limits imposed additional constraints and forcing functions.

Enforcing these constraints was a top-down process — individual teams would tend to grow their disk image and a central performance team worked to push back. One notorious example was a 100MB PowerPoint template that the PowerPoint team wanted to include in the disk image by default. (We pulled that one for the final release.)

The Windows 10 team invested significant resources just to get down to a single version of the C runtime across the entire product. I don’t remember the exact number of distinct versions that had been included previously but it was in the high single digits.

One of the arguments Nikita makes in the post referenced above is for being careful about dependencies. Getting this right is pretty hard, as the C runtime case above shows. Often the approach to reducing dependencies is to include a slimmed-down custom version of some larger component. But if you are part of a larger ecosystem of components and products, soon you have multiple divergent copies of that component and you are worse off than before. It is way easier to diverge than it is to converge. In organizations producing large software systems, significant effort has to go into not diverging and into ensuring everyone uses as many common components as possible.
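
To make that concrete, here is a minimal sketch, in Python, of the kind of audit an organization might run to spot divergence. Nothing in it reflects actual Office or Windows tooling; the file extensions and the find_divergent_copies helper are my own invention. It simply walks a tree and flags shared libraries that ship under the same name with different content hashes.

```python
# Minimal sketch (not any real Office/Windows tooling): walk an install or
# build tree and flag shared components that ship as multiple divergent
# copies, i.e. the same file name appearing with different content hashes.
import hashlib
import os
import sys
from collections import defaultdict

def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_divergent_copies(root, extensions=(".dll", ".so", ".dylib")):
    """Map each shared-library name to the distinct content hashes found."""
    copies = defaultdict(dict)  # name -> {hash: [paths]}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                copies[name].setdefault(hash_file(path), []).append(path)
    # Keep only names that appear with more than one distinct content hash.
    return {name: hashes for name, hashes in copies.items() if len(hashes) > 1}

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, hashes in sorted(find_divergent_copies(root).items()):
        print(f"{name}: {len(hashes)} divergent copies")
        for digest, paths in hashes.items():
            for path in paths:
                print(f"  {digest[:12]}  {path}")
```

In a real ecosystem the comparison would be by version and provenance rather than raw hash, but the shape of the problem is the same: convergence has to be checked continuously, because it does not happen on its own.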

Runtime memory use has similar dynamics. A little extra memory use spread around lots of components results in significant growth. Caching information makes one code path faster at the expense of overall increased resource use. Pre-caching is especially pernicious when you have lots of components doing premature initialization and computation with no validation that the user will ever actually get around to using that functionality before exiting the application. Much better to only initialize on use.
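
As a toy illustration of that last point, here is a minimal sketch contrasting eager pre-caching with initialize-on-use. Everything in it, including the expensive_load stand-in, is hypothetical; the point is only that the lazy version defers, and possibly avoids entirely, the cost the eager version pays at startup.

```python
# Minimal sketch (illustrative only) contrasting eager pre-caching with
# initialize-on-use. The expensive_load function and its cost are made up.
import functools
import time

def expensive_load():
    """Stand-in for loading a big template, dictionary, or index."""
    time.sleep(0.5)               # pretend this costs real time and memory
    return {"lots": "of data"}

class EagerFeature:
    """Pays the cost at startup, whether or not the feature is ever used."""
    def __init__(self):
        self.data = expensive_load()

class LazyFeature:
    """Pays the cost only on first use; later accesses hit the cached result."""
    @functools.cached_property
    def data(self):
        return expensive_load()

if __name__ == "__main__":
    start = time.perf_counter()
    feature = LazyFeature()       # cheap: nothing loaded yet
    print(f"startup took {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    _ = feature.data              # first use pays the load cost
    print(f"first use took {time.perf_counter() - start:.3f}s")
```

If the user never touches the feature, the lazy version never pays at all, which is exactly the validation the pre-caching components described above skip.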

The key overall point is that when you see some widespread misbehavior, it is likely that there is something systematic going on rather than simply lots of bad actors. Fixing those problems can only come from the equivalent of either top-down regulation, expensively enforced, or some more distributed solution (carbon tax, anyone?) that attempts to internalize the externalities. In a single large organization top-down regulation can work if sufficient resources are allocated. In wider ecosystems, there needs to be some kind of distributed feedback and enforcement. Just urging everyone to try harder or be better is unlikely to work.

