Tensorflow & PyTorch Design Tradeoffs

Written by kovasb | Published 2017/10/09

ML meets general purpose programming

The recent Tensorflow Sucks post is not a new sentiment, but it struck a nerve with me, and this is my reaction.

Knowing what sucks is cheap. Knowing what’s hard, on the other hand, is valuable.

Knowing what's hard keeps you from buying snake oil sold as solving all things for all people. It focuses the mind on choosing the right problems and saying no to others. Finally, it has the benefit of providing actual insight, and an opportunity for genuine progress.

Machine Learning has a unique design challenge, because the domain of ML itself is trending towards universal computation as a subject, not as a mere implementation detail. This means it will have greater and greater footprint in what we think of as “traditional programming.”

Together with this come traditional engineering challenges: things like reliability, reuse, and architectural flexibility. Solutions for these don't come for free or by default, because this is a new game. They need to be designed for.

This post is about some of the hard problems that flow from becoming a general-purpose language, and what solutions Tensorflow in particular offers in comparison to some other choices. Not because Tensorflow is a silver bullet, but because there seems to be a general confusion, verging on FUD, as to why Tensorflow works the way it does.

What’s hard about ML frameworks?

While the past of ML is as a limited, domain-specific language, the future of ML is as a universal, general-purpose language.

The current state of affairs is in the middle. Like many DSLs, ML has grown to need general-purpose programming primitives: conditionals, loops, (gasp) recursion. It has also grown to subsume a bigger set of supporting infrastructure, or "runtime," for things like IO, distributed computing, serialization, etc.

Compared to many DSLs, ML is further distinguished by the fact that many of these are non-trivial extensions. What recursion means in an ML program is of a different character than in C, because it must be used in a specific way to allow learning to happen.

How to address this when building an ML framework? You could invent a whole new language from scratch, such as Stan. You could extend an existing language by adding new primitives, such as PyTorch. Or you could do something in the middle, by building a new language, but embedding it in an existing one to get the bells and whistles. Tensorflow is an example of this last category.

Each choice has problems.

Inventing a language from scratch means going it alone, but that might be literally the only choice if your language semantics cannot be forced into an existing option.

Adopting an existing Language X sounds great to users of X. But since Language X isn't used by everyone, presumably it's not great at everything and has some significant shortcomings of its own. And if you want your ML framework to be used in places Language X can't go, then that's not an option.

Embedding a DSL inside another language is a third option. You use Language X as the fabrication machine to build your component, a sort of metaprogramming. Metaprogramming creates a mental burden: you have to track the distinction between compile time and run time, and debugging gets harder.
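
To make the distinction concrete, here is a minimal sketch, assuming the Tensorflow 1.x graph/session API and PyTorch's imperative tensors: the first half only builds a graph that a session runs later, while the second half computes values immediately as ordinary Python.

```python
# Tensorflow 1.x style: Python is the metaprogramming layer. These lines only
# build a graph; nothing is computed until a session runs it.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3])   # symbolic input, no data yet
y = tf.reduce_sum(x * 2.0)                        # adds ops to the default graph

with tf.Session() as sess:                        # run time is a separate phase
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))

# PyTorch style: ordinary imperative Python. Values exist immediately and can
# be printed and debugged like any other Python object.
import torch

a = torch.ones(1, 3)
b = (a * 2.0).sum()
print(b)
```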

From this we can already see there are no silver bullets. But to go deeper we need to get more specific about the problems we face.

99 Problems, Modelling Ain’t One

“I came. I saw. I modelled” seems to be the credo of budding AI enthusiasts, waiting to be etched onto the monuments embodying their prowess. Visions of singlehandedly revolutionizing the field, or at least getting a nice CTR bump, waft through the mind. Get out of my way, I’m using MAGIC!

Unfortunately, the cognitive act of modelling is rounding error even for those who do it for a living. After you spend 90% of your time cleaning the data, you’ll spend another 90% of your time integrating with other systems and simply getting things to run. (Yes this adds up to at least 180%, which is consistent with experience.)

Bitrot, backwards-incompatible breakage, shabby error-swallowing integrations, undocumented assumptions baked in the heat of the moment, and bus-factored legacy codebases continue to be the scourge of even mature software engineering disciplines, and they run amok in ML.

ML is not in the vanguard of engineering best practices. Part of this is its academic lineage and associated culture of DIY coding. Another is that traditional software engineering has largely ignored the fundamentals necessary for science: experiments. Only in the last decade has “data” even become a thing. Merely being able to plot points in an IDE is today a progressive innovation. Version control for ML is TBD.

This is all to say, there are massive software engineering frictions to using and deploying this new computational paradigm at a scale that matters.

Let's start with the basics. Does your model run, and will it continue to run over time (reliability)? Can I compose it with other models (reuse)? And how can I connect it with other components to solve the overall problem (architectural flexibility)?

Reliability

Tensorflow's graph abstraction is a self-contained model definition. A model means the same thing today as it will 5 years from now. One key aspect of this graph abstraction is that Tensorflow models contain no external dependencies.
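
As a rough sketch of what that self-containment looks like (assuming the Tensorflow 1.x API; the path and tensor names here are only illustrative), the graph can be dumped to a GraphDef protobuf that describes the whole computation with no reference back to the Python that built it:

```python
import tensorflow as tf

# Build a tiny graph; the names matter because consumers address tensors by name.
x = tf.placeholder(tf.float32, shape=[None, 4], name="input")
w = tf.constant([[0.1], [0.2], [0.3], [0.4]])
y = tf.matmul(x, w, name="output")

# Serialize the graph definition itself; the resulting file carries the model
# without any reference to the Python process that created it.
tf.train.write_graph(tf.get_default_graph().as_graph_def(),
                     "/tmp/my_model", "model.pbtxt", as_text=True)
```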

If running your model requires Python, Lua, or any other language runtime, it's very difficult to make the same guarantee. To exactly reproduce the model, you must exactly reproduce the entire environment, from the OS level, through 3rd-party libs, to user code.

Today, this means freezing it in a Docker container. This is a heavy dependency, both in size and in the requirements placed on downstream consumers. Every model consumer now must contend with the container boundary.

The biggest problem with containerization though is the burden it places on users. Reliability is now the user’s problem to solve. If everyone does the right thing, it will more or less work. But there is no guarantee, and there are complications to rolling your own solution.

“Reliable by design” is a valuable attribute that Tensorflow offers.

Composition & Reuse

If you're in the modelling business, pretty soon you'll have lots of models. And eventually, for some reason or another, you'll want to use 2 models at the same time. Maybe for serving, for composing into a bigger model, or simply for running a comparison.

In Tensorflow, this is trivial and a non-issue, because as described above, the models are self-contained descriptions with no extra dependencies. This is reliable enough to form the basis of further automation.
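
For illustration, a sketch of that composition under the Tensorflow 1.x API, with model_a.pb and model_b.pb standing in for two previously exported graphs:

```python
import tensorflow as tf

def load_graph_def(path):
    # Read a serialized GraphDef protobuf from disk.
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    return graph_def

g = tf.Graph()
with g.as_default():
    # Import both models side by side; the name scopes keep them from colliding.
    tf.import_graph_def(load_graph_def("model_a.pb"), name="model_a")
    tf.import_graph_def(load_graph_def("model_b.pb"), name="model_b")
    # Their tensors are now addressable as e.g. "model_a/output:0" and
    # "model_b/output:0" within a single graph.
```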

The "adopt Python" approach has a different story. What if Model A depends on one whole environment, and Model B depends on another, separate environment?

If Model A requires Python 3.3, and Model B requires Python 3.4, maybe it will work, but maybe it won’t. You might not even be able to deserialize the model if it was pickled. Or, in some ways worse, you might have conflicting dependencies somewhere in your unbounded dependency tree.

The traditional software engineering approach would be to merge the codebases, resolve the conflicts, and re-run the models. But retraining models can be extremely expensive and slow. And the data may not even still be available. So there are very solid reasons why you’d want to be able to reliably reuse an existing model with no modification.

A disciplined, skilled team can mitigate this risk, by minimizing dependencies in the first place, by testing, and by placing emphasis on backwards compatibility. But how does this scale org-wide, across many teams? To the intern in Finance who wants to play with ML and imports the universe? To the under-pressure senior engineer who is just 1 import away from solving their immediate problem?

“Composable & reusable by design” is a valuable attribute that Tensorflow offers.

Architectural Flexibility

No model is an island. They must connect to sources of data for training, and to deployment infrastructure, whether as services or on edge devices. On top of that, there is an unbounded set of development and ops tooling that may bear on the model.

What other software can your model integrate with?

Tensorflow can integrate with anything that can call a simple C API. It deals in only a very few concepts (a DAG, a session, Tensors), and its fundamental structures are serialized in a documented, commonplace format. Essentially any significant system will eventually have a Tensorflow integration, so you won't even have to do it yourself. This allows for tremendous architectural flexibility, as you can choose which point to integrate with.
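
As an illustrative sketch (Python here, but the C API exposes the same handful of concepts; the file and tensor names are hypothetical), consuming a serialized model touches nothing beyond the graph, a session, and tensors:

```python
import tensorflow as tf

# A serialized model is a plain protobuf message ...
graph_def = tf.GraphDef()
with tf.gfile.GFile("model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# ... that any protobuf-aware tool can inspect; here we list the op types in the DAG.
print(sorted({node.op for node in graph_def.node}))

# Running it needs only a graph, a session, and named tensors.
with tf.Graph().as_default() as g:
    tf.import_graph_def(graph_def, name="")
    with tf.Session(graph=g) as sess:
        out = sess.run("output:0", feed_dict={"input:0": [[1.0, 2.0, 3.0, 4.0]]})
```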

If running your model requires Python or any other specific language, the picture is much different. You are now in the business of dragging Python/Language X everywhere, and either using a bridge or some sort of local messaging bus.

If you are a Python shop, that might sound great, but for people invested in other tech, it's a complexity and a source of issues. This is not a "run anywhere and don't worry much about it" situation; you're now in the "let's make sure the GCs interact properly" branch of the design space. When you deserialize the model in the wrong version of Python, hopefully all that bespoke machinery will fail out gracefully and informatively.

“Plays well with others” is a valuable property Tensorflow brings to the table.

But Onnx!

Onnx is an effort to recover the positive qualities of Tensorflow, essentially by providing the same kind of independent, self-contained declarative graph. By tracing Python execution, this static graph can be recovered from an imperative model. It's a very cool and promising project.
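
For example, a tracing-based export looks roughly like this, assuming the torch.onnx export interface; the model and filename are placeholders:

```python
import torch
import torch.onnx

model = torch.nn.Linear(4, 2)        # stand-in for a real PyTorch model
dummy_input = torch.randn(1, 4)      # tracing runs the model on example data

# The exported graph records the ops executed on the dummy input; input-dependent
# control flow is baked in as whatever path that particular input took.
torch.onnx.export(model, dummy_input, "model.onnx")
```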

However, it is subject to the same limitations as Tensorflow. It cannot easily represent complex, recursive logic. It is not meant to call out to arbitrary Python functions. Your graph is now static and declarative, just as in TF.

The benefit is, you can still use the imperative model while developing, and export it once it's been debugged. You can export it to frameworks like Caffe2 that can integrate in more places. Essentially, more mileage out of the existing tool.

The cost is, well, there is no guarantee your model will export until you try. Onnx can only represent a subset of PyTorch models. And because PyTorch models are just arbitrary code with arbitrary dependencies, you cannot in general know ahead of time if the export will work correctly. How do you know the export is correct? That's on you.

Even if your models satisfy the right properties, there is still the complexity of multiple DL frameworks, a chance for differences in behavior, and more versions to align. Not a gimme, especially at this stage of the game. Tradeoffs.

Tradeoffs are forever?

In well-designed systems, tradeoffs can be impermanent, as missing pieces are stacked on top of the foundation. Onnx is a good example of how tradeoffs can continue to be negotiated. Tensorflow's upcoming eager mode is another. Yet another is XLA, where Tensorflow is plugging a gap in the "no dependencies" story, to allow for efficient custom ops without the environmental baggage of lugging user-created C++ code around.

This, however, requires knowing the problems, and not coupling to a hopeless direction. Most of all, it requires not allowing fanboyism to make one blind to what problems are even being addressed. I hope this post is a contribution in that direction.

