The Cacophonies of Distributed Systems

Written by geekonrecord | Published 2019/11/28


If you have never heard of Deutsch + Gosling’s fallacies of distributed computing, you are missing out big time! I encourage you to check them out here. Those delusions are widely regarded in the distributed systems field as some of the most painful assumptions any junior systems designer or architect can make. I like to call them “career-limiting choices”.
When I first learned about these eight fallacies a few years ago, I decided I needed to keep them around. I wrote them on a piece of paper and taped it outside of my office. The piece of paper is still there, except that it now has eleven statements – I guess some random folks added a few more fallacies.
I have named my list the “cacophonies of distributed systems”, given that they certainly do not rhyme well with sound distributed engineering practices. It includes Deutsch + Gosling’s eight original misbeliefs plus three freebies that I have learned the hard way over time. Enjoy!
1. The network is reliable. It’s not a good idea to expect services to reply 100% of the time. Hardware fails, software gets updated, and it happens precisely when that mission-critical “oh-god-please-don’t-fail-right-now” call is invoked. Retries and exponential back-off are usually good safeguards (see the retry sketch after this list), although fail-fast can be better for latency-sensitive services.
2. Latency is zero. Code that makes a remote call while holding a lock or processing inside a transaction is perhaps not a brilliant idea (see the lock sketch after this list). Things take time with TCP; remember that even a few milliseconds represent an eternity in CPU time.
3. Bandwidth is infinite. Ever tried to download your 30-minute high-resolution middle school dance video from the Internet? The network is shared among all of us, so it typically takes time to transmit heavy files around. This is the reason why peer-to-peer services like BitTorrent exist – they send little pieces of information from multiple places. Nowadays, the standard techniques for passing big masses of information around include compression (see the compression sketch after this list) and, more generally, data summarization.
4. The network is secure. This is probably the most obvious one now that we are in the 21st century. Trust-no-one techniques include TLS/SSL, two-factor authentication, or any other secure transport mechanism that is adequate depending on the privacy requirements and the sensitivity of the data being transported.
5. Topology doesn’t change. I had some joy the other night when I had to stay late and log into a dozen hosts to manually fix the hardcoded firewall rules that someone nicely set up thinking some well-known clients would never change their IP addresses. DNS names would have helped on that one.
6. There is one administrator. A single person cannot manage a fully fledged distributed system; it doesn’t scale. Having multiple administrators comes with a tradeoff: it implies having the three ‘A’s in place – authentication, authorization and auditing.
7. Transport cost is zero. Odds are that half the development time of a mid-sized organization will go into getting the network set up right, unless someone already did that for you (ahem… a cloud provider maybe?).
8. The network is homogeneous. Interoperability can define the success or failure of a distributed system. In a microservices world where highly scalable systems have dozens of small services, it is very likely that multiple technologies are at play, and they all must talk to each other in a common language. That’s why standards like REST and Web API come in handy, at the expense of other “hyper-fast-but-not-well-known” protocols.
9. Updates happen at once. When done safely, updating a distributed system takes time. It is common to update one component at a time and validate that the system is fine at each step. This means there is some likelihood that the upgrade will be “halfway done” for a while, and that components of a newer version will be talking to components of an older version at the same time, which leads to the need for backwards compatibility (see the tolerant-reader sketch after this list). Most of the distributed systems that I know about are architected to have multiple versions of the system talking to each other.
10. Certificates are easy. Certificate expiration dates are hard to remember because they are months or years away. Keeping one in mind is easy, but can you handle a couple hundred? How do you keep all of them safe and ensure that access is granted only when strictly needed? Certificate management can be a headache if not properly addressed with centralized stores and expiration alerting services (see the expiry-check sketch after this list).
11. Clocks are in sync. This is probably my favorite fallacy, because I really, really didn’t think that the clocks of my system would go out of sync in any perceivable way. It turns out that one of the instances of the first version of a system I naively implemented a while ago was about 5.5 minutes out of sync! The night I learned that Kerberos allows a clock skew of up to 5 minutes is one I won’t forget (see the skew-check sketch after this list).
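A few sketches to make the cacophonies above concrete. For fallacy 1, here is a minimal Python sketch of retries with exponential back-off and jitter; remote_call, the delay values and catching only ConnectionError are illustrative assumptions, not a prescription.

```python
import random
import time

def call_with_retries(remote_call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Invoke remote_call, retrying transient failures with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return remote_call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries; let the caller deal with the failure
            # Wait base_delay * 2^attempt (capped), with jitter so that many
            # clients don't retry in lockstep and stampede the service.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.5))
```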
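For fallacy 2, a sketch of the lock problem; fetch_remote is a hypothetical stand-in for any RPC:

```python
import threading

lock = threading.Lock()
cache = {}

# Risky: the lock is held for the entire network round trip, so every other
# thread stalls for milliseconds (an eternity in CPU time) on each call.
def refresh_risky(key, fetch_remote):
    with lock:
        cache[key] = fetch_remote(key)

# Better: do the slow remote call outside the lock, then take the lock
# only for the cheap in-memory update.
def refresh_better(key, fetch_remote):
    value = fetch_remote(key)  # network I/O happens without the lock held
    with lock:
        cache[key] = value
```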
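For fallacy 3, compressing before transmitting is often the cheapest win; a tiny sketch with Python’s standard gzip module (the repetitive JSON payload is made up):

```python
import gzip

payload = b'{"user": "geekonrecord", "action": "view"}' * 1000
compressed = gzip.compress(payload)
# Repetitive data compresses dramatically: far fewer bytes on the wire.
print(f"{len(payload)} bytes -> {len(compressed)} bytes")
```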
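For fallacy 9, one common backwards-compatibility tactic is the tolerant reader: parse only the fields this version understands and ignore the rest, so a newer producer can talk to an older consumer mid-rollout. A sketch with hypothetical field names:

```python
import json

def parse_order(payload: bytes) -> dict:
    """Extract known fields; silently drop anything a newer version added."""
    message = json.loads(payload)
    return {
        "order_id": message["order_id"],         # required in every version
        "quantity": message.get("quantity", 1),  # default when older producers omit it
        # Unknown fields from newer versions are deliberately ignored.
    }

# A v2 producer sends an extra field; this v1-style reader keeps working.
print(parse_order(b'{"order_id": "42", "quantity": 3, "gift_wrap": true}'))
```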
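For fallacy 10, a sketch of a minimal expiration check using only Python’s standard library; a real alerting service would loop over an inventory of endpoints and page someone instead of printing:

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Connect over TLS and report how many days remain on the server cert."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

if days_until_cert_expiry("example.com") < 30:
    print("Renew soon!")
```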
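For fallacy 11, a rough skew check that compares the local clock against a server’s HTTP Date header; network latency pollutes the estimate, so treat it as a smoke test (NTP is the real fix), and the URL is just a placeholder:

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def clock_skew_seconds(url: str = "https://example.com") -> float:
    """Estimate local-clock skew from a server's HTTP Date header."""
    with urllib.request.urlopen(url, timeout=5) as response:
        server_time = parsedate_to_datetime(response.headers["Date"])
    return abs((datetime.now(timezone.utc) - server_time).total_seconds())

skew = clock_skew_seconds()
if skew > 300:  # Kerberos' default tolerance is 5 minutes
    print(f"Clock skew is {skew:.0f}s -- Kerberos auth will start failing")
```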
Do you have any suggestions for the twelfth cacophony? Let me know in the comments!
Did you like this article? Subscribe to get new posts by email.
Photo by Sarah Kilian on Unsplash
