10 Lessons from 10 Years of AWS (part 1)

Written by adhorn | Published 2017/12/02

I recently presented a talk at the AWS Community Day in Bangalore. The tweet following the talk became my most popular tweet ever and I received quite a few requests for more details.

For the last 10 years, I have had the chance to work in companies that embraced the cloud, and in particular AWS. This two-part blog post is an attempt to share that experience with you. Hope you enjoy! Please do not hesitate to give feedback, share your own stories or simply like :)

EMBRACE FAILURE

“It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy

Let me start by saying that scared developers:

  • won't try things out
  • won't innovate as fast as your business needs them to
  • won't dare to jump in and fix things when (pardon my French) shit hits the fan
  • won't do more than what is asked of them
  • and won't stay long in the job

Failure should not be seen as "you are a failure" but simply as moving along the path of experimentation. If you don't fail, you are probably not trying hard enough or pushing the limits of the "adjacent possible". Innovations flourish in an environment where ideas are exchanged, discussed, tried and improved over time. But most of all, innovations flourish in an environment that embraces failure. Remember, Thomas Edison tested more than 6,000 different materials before settling on carbonised bamboo for his lightbulb.

Failure has a tendency to teach you lessons that reading books or blog posts can’t teach you.

Now, failing without a plan is idiotic. You must have a plan (e.g. an easy way to roll back, or a controlled service degradation) so you can learn and grow from failure without putting your business on the line or losing your customers.

The most successful teams I have worked with are those I failed the hardest with, but we were prepared to fail and, above all, we were not afraid of losing our jobs. After every major failure, we grew wiser and eventually, we did succeed. Me

The best way to learn from failure is to practice failure. And the best way to practice failure is to embrace chaos engineering. Chaos engineering is "a discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production". One of the most popular ambassadors of chaos engineering is Netflix, and learning from them is one of my top 10 best life decisions.
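To make that concrete, here is a minimal sketch, in Python with boto3, of the kind of experiment chaos engineering usually starts with: terminating a random instance in an Auto Scaling group and checking that the service stays healthy while the group self-heals. The group name and health-check URL are placeholders, and this is an illustration, not a production chaos tool.

```python
import random
import urllib.request

import boto3

# Hypothetical names: replace with your own Auto Scaling group and health check.
ASG_NAME = "my-service-asg"
HEALTH_CHECK_URL = "https://my-service.example.com/health"

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

def pick_random_instance(asg_name):
    """Return the ID of a random in-service instance in the Auto Scaling group."""
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    healthy = [i["InstanceId"] for i in instances if i["LifecycleState"] == "InService"]
    return random.choice(healthy)

def run_experiment():
    victim = pick_random_instance(ASG_NAME)
    print(f"Terminating {victim} to verify the group self-heals...")
    ec2.terminate_instances(InstanceIds=[victim])

    # The hypothesis: the service keeps answering while the ASG replaces the instance.
    status = urllib.request.urlopen(HEALTH_CHECK_URL, timeout=5).status
    assert status == 200, "Service degraded: stop the experiment and learn from it"

if __name__ == "__main__":
    run_experiment()
```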

LOCAL STATE IS A CLOUD ANTI-PATTERN

Two of the core promises of the cloud are elasticity and availability. Elasticity is the ability to grow or shrink infrastructure as the demand for your service changes. Availability, on the other hand, is the ability to sustain a failure of, for example, an Availability Zone (AZ) or a region without impacting the user experience. AZs on the AWS cloud are in fact the unit of availability, and most of our managed services like S3, Kinesis, DynamoDB or Lambda take full advantage of this, transparently for the customer. If you aim to be Well-Architected, your services should be deployed across several AZs.

Everything fails, all the time! Werner Vogels

Having an elastic and available application work across several AZs comes with one fundamental consequence: it should be able to scale up and down in any AZ (or Region, in the extreme case) at any time. When you scale up, for example, a Load Balancer registers the new capacity in its routing table and starts sending traffic to that new instance, container or function. And for that to happen without failure, your code cannot make assumptions about local state.

If you keep user-specific local state in your application, well, when that instance goes away, the state is gone. And if that local state was the content of a shopping cart or a login session, you will make your customer really unhappy, since that data won't propagate to other instances and the user session will be lost.

Luckily the solution is simple: share state across instances, containers or functions by using an in-memory object caching system like Memcached or Redis, depending on the structure of your objects and your performance requirements (in some cases, you can use a database like DynamoDB coupled with DAX).

Doing so, any new instance of the backend application that starts will be able to get the previous state of the application from the cache and get on with the task at hand (again, think shopping cart). Unless you use session stickiness, that is (note: please don't use session stickiness for everything, that's another cloud anti-pattern).
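As an illustration, here is a minimal sketch of externalising a shopping cart to Redis using the redis-py client. The cache endpoint and key layout are assumptions, not a prescription:

```python
import json

import redis

# Hypothetical endpoint: point this at your ElastiCache / Redis cluster.
cache = redis.Redis(host="my-cache.example.com", port=6379)

CART_TTL_SECONDS = 3600  # carts expire after an hour of inactivity

def save_cart(session_id, cart):
    """Store the cart in the shared cache so any instance can serve the next request."""
    cache.setex(f"cart:{session_id}", CART_TTL_SECONDS, json.dumps(cart))

def load_cart(session_id):
    """Fetch the cart from the cache, no matter which instance handles the request."""
    data = cache.get(f"cart:{session_id}")
    return json.loads(data) if data else {"items": []}

# Any instance behind the load balancer can now do:
save_cart("session-123", {"items": [{"sku": "book-42", "qty": 1}]})
print(load_cart("session-123"))
```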

Local state is the reason fork-and-lift migrations to the cloud are often sub-optimal or simply fail. Many of the "old" applications that used to run on-prem, in one datacenter, won't get the full cloud benefits of elasticity and availability unless they are state-free.

So my advice: if you need to fork-and-lift, fine, you already get some benefits by simply moving to the cloud, but task number one is taking care of the state. Only then will you unleash the full power of the cloud.

About transient state:

First, let's define transient state. From Wikipedia: a system is said to be in a transient state when a process variable or variables have been changed and the system has not yet reached a steady state. Many applications use transient state to operate: think counters, job progress states, lists, sets or any data that, even if useful now, is irrelevant, deprecated or simply mutated shortly after. I have worked on many applications and systems that stored that transient state in the database (most of the time SQL), which is supposed to be persistent storage. Well, every single time such a system had to scale up, the database requests to write, update (often locking) or delete that transient state piled up; the number of queries waiting to be executed grew in proportion to the number of users and eventually took the application down.

Do not store transient state in (SQL) databases; use a specialised datastore for that. Got some lists, sets, sorted sets, hashes or keys that hold transient state? Use Redis (for example), since it was built for that very purpose. If you are going to store something in your (SQL) database, make sure it is data that won't be mutated shortly after; otherwise, it is a waste of database scalability credits.

Redis is just one example. DynamoDB, Memcached or Elasticsearch are also great tools that can help take the load off your SQL databases, especially when you start scaling.
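To give an idea of what that looks like in practice, here is a small, hedged sketch of keeping counters, job progress and a leaderboard in Redis instead of a SQL table. The endpoint and key names are made up:

```python
import redis

# Hypothetical Redis endpoint; the key names are just for illustration.
r = redis.Redis(host="my-cache.example.com", port=6379)

# A counter: one atomic INCR instead of an UPDATE that locks a row.
r.incr("pageviews:homepage")

# Job progress: a hash that is cheap to mutate and expires on its own.
r.hset("job:1234", mapping={"status": "running", "progress": "42"})
r.expire("job:1234", 86400)  # transient data disappears by itself

# A leaderboard: a sorted set instead of an ORDER BY on every request.
r.zincrby("leaderboard", 10, "player:alice")
print(r.zrevrange("leaderboard", 0, 9, withscores=True))
```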

IMMUTABLE INFRASTRUCTURE

The principle of immutable infrastructure is simple: Immutable components are replaced for every deployment, rather than being updated in-place.

  • No updates on live systems
  • Always start from a new instance being provisioned

This deployment strategy is based on the Immutable Server pattern, which I love since it reduces configuration drift and ensures deployments are repeatable from source, anywhere.

Note: Docker supports this pattern really well too, as long as you use Docker for what it is great at: containing an application. In our case, you want to have one container per instance, two at most (an nginx reverse proxy plus a web server).

A typical immutable infrastructure update goes as follows:

START

  • Create a base AMI and harden it if you have to.

IF DOCKER:

  • Create a base container with libraries and dependencies and store it in your container hub (you don't want to recreate the whole container every time)
  • Create an as-light-as-possible Dockerfile (use the base container)
  • Copy the code into the Docker container at build time (run the Dockerfile)
  • Deploy the container to the base AMI

ELSE:

  • Bake the AMI (keep this AMI in your account; export it to other regions if you have to deploy multi-region)
  • Create a new configuration with an Auto Scaling group (with or without an ELB) and use the previously baked AMI ID as reference
  • Test in different environments (dev, staging)
  • Deploy to prod (inactive)
  • Add the new reference (DNS or Load Balancer)
  • Allow traffic to flow slowly to the new version (start with 5% and ramp up)
  • Keep the old version around until the new version handles 100% of traffic and you are fully satisfied with its behaviour
  • Roll back fast if things go wrong

END

Note: If you need new code libraries, rebuild the base Docker container. If you need a new Linux kernel version or new security patches, rebuild the base AMI.
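As a rough sketch of the non-Docker branch above (the instance, launch template and Auto Scaling group identifiers are placeholders, and the gradual traffic ramp-up and rollback steps are left out), baking an AMI and rolling it out by replacing instances rather than patching them could look like this with boto3:

```python
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# Placeholders: substitute your own build instance, launch template and ASG.
BUILD_INSTANCE_ID = "i-0123456789abcdef0"
LAUNCH_TEMPLATE_ID = "lt-0123456789abcdef0"
ASG_NAME = "my-service-asg"

def bake_ami(version):
    """Snapshot the fully provisioned build instance into an immutable AMI."""
    image = ec2.create_image(
        InstanceId=BUILD_INSTANCE_ID,
        Name=f"my-service-{version}",
        Description="Immutable image, never patched in place",
    )
    return image["ImageId"]

def roll_out(ami_id):
    """Point the Auto Scaling group at the new AMI; instances are replaced, not updated."""
    ec2.create_launch_template_version(
        LaunchTemplateId=LAUNCH_TEMPLATE_ID,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": ami_id},
    )
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME,
        LaunchTemplate={"LaunchTemplateId": LAUNCH_TEMPLATE_ID, "Version": "$Latest"},
    )

if __name__ == "__main__":
    roll_out(bake_ami("1.2.3"))
```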

INFRASTRUCTURE AS CODE

Remember the days when provisioning infrastructure to support your project, application or experiment took weeks, more often months? I certainly do, and not in a "the good old days" kind of way.

Today, in a few minutes, anyone can deploy a datacenter pretty much anywhere in the world. And not just a simple datacenter, but a full-blown, state-of-the-art one, with an advanced networking layer, tight security controls and sophisticated applications running in it.

The most beautiful thing about that is that the configuration of all those components is done using code (or templates). On the AWS Cloud, AWS CloudFormation provides the common language (JSON or YAML) for you to describe and provision all the infrastructure resources you need.
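For example, a deliberately tiny template (a single versioned S3 bucket, embedded here in a Python script so the snippet stays self-contained; the stack name is made up) can be provisioned with one call and re-run identically in every region and environment:

```python
import boto3

cloudformation = boto3.client("cloudformation")

# A deliberately tiny template: one versioned S3 bucket, described as code.
TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  ArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
"""

# The same template can be created in every region and environment, identically.
cloudformation.create_stack(
    StackName="my-artifact-bucket",  # hypothetical stack name
    TemplateBody=TEMPLATE,
)
```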

One of the immediate benefits of using code is repeatability. Let's take the task of configuring an entire datacenter, adjusting configurations and deploying the applications running in it. Imagine if you had to do that manually, across several regions, each with multiple environments. Not only would it be a tedious task, it would most likely introduce configuration differences and drift over time. Humans are not great at doing repetitive, manual tasks with 100% accuracy, but machines are. Give the same template to a computer and it will execute it 10,000 times in exactly the same way.

Other great benefits of infrastructure as code, and of version controlling it, are knowledge sharing, an archive of its evolution, and also security verification.

Indeed, when you version control your infrastructure, you can treat that code the same way you treat application code. You can have teams committing code to it and asking for improvements or changes in configuration. If that process goes through a pull request, then the rest of the team can verify, challenge and comment on that request, often promoting better practices.

Another good advantage is preserving the history of the infrastructure's evolution, and being able to answer the "why was that changed?" question two months later.

It is also great for new hires to run through the history of architecture changes, since it gives an immediate view of the culture of the team and the company, promoting best practices and avoiding manual changes.

Of course, Infrastructure-as-Code is useless if you allow everyone in the team to log into servers or the console and make changes manually, since that practice introduces configuration drift and security threats (leaving security groups open to the world because you had to do a hot fix on Friday evening and you forgot about it on Monday).

A few options can help here. Either you do allow manual changes, but then you need automated practices for verifying them ("trust but verify"; scheduled Lambdas to the rescue), or you replay and re-run the templates every day in order to restore the state of the environment to what is in the current master version.
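As a hedged sketch of the "trust but verify" option, a scheduled Lambda could kick off CloudFormation drift detection on the stacks you care about. The stack names are placeholders and the alerting part is left out:

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Hypothetical list of stacks this account cares about.
STACKS_TO_VERIFY = ["network-stack", "app-stack"]

def handler(event, context):
    """Scheduled (e.g. daily) Lambda: ask CloudFormation to detect drift on each stack."""
    detection_ids = {}
    for stack in STACKS_TO_VERIFY:
        response = cloudformation.detect_stack_drift(StackName=stack)
        detection_ids[stack] = response["StackDriftDetectionId"]

    # Drift detection is asynchronous; results can be polled later (for example by a
    # second scheduled run) with describe_stack_drift_detection_status and alerted on.
    return detection_ids
```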

Another service that helps tremendously in minimising that configuration drift is AWS Config, since it records any changes made to the infrastructure in a timely fashion. Plugging controls and alerts into AWS Config is a very good way to make sure your Infrastructure-as-Code practice is optimal.

ASYNCHRONOUS AND EVENT-DRIVEN PATTERNS WILL HELP YOU SCALE

Few patterns have totally changed the way I see the world. One of them is asynchronous message-passing and the other is event-driven architecture.

Both patterns are extremely natural (we use them all the time in our daily lives), but they are also extremely useful in scaling applications. The asynchronous pattern, using message passing, allows applications to outsource a particular task requested by a client to workers. See the diagram below.

What are the benefits?

  • saves resources on the API backend and gives faster responses to the client
  • optimises workers for different tasks (CPU-bound, memory-bound, ...)
  • decoupling
  • retries
  • and more!

How does it work?

Say you want to allow your users to upload images. From each image you want to extract metadata, do face analysis and thumbnailing. If your client, after uploading a picture, had to wait for all those tasks to be successfully executed before its API call returned, it would most likely lead to timeouts and errors. But the biggest issue is that if any of the tasks failed to execute, the entire request would fail. If you use an asynchronous pattern, your client simply uploads the picture and then forgets about it until it receives a notification that the tasks have completed successfully. Each task can be done independently, repeated if needed and, above all, executed with resources optimised for that purpose.

Note: I wrote an example in python of that pattern, available on my GitHub account.
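For readers who want something inline as well, here is a minimal, hedged sketch of the same idea using Amazon SQS (this is not that GitHub example; the queue URL and task names are made up): the API enqueues the work and returns immediately, and workers pick it up at their own pace.

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical queue created beforehand (for example with CloudFormation).
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/image-tasks"

def enqueue_image_tasks(image_key):
    """Called by the API: hand the heavy work to workers and return right away."""
    for task in ("extract-metadata", "face-analysis", "thumbnail"):
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"task": task, "image": image_key}),
        )

def worker_loop():
    """Runs on worker instances sized for the job (CPU-bound, memory-bound, ...)."""
    while True:
        messages = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in messages.get("Messages", []):
            job = json.loads(message["Body"])
            print(f"Processing {job['task']} for {job['image']}")
            # ... do the actual work here, then acknowledge:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```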

Event-driven with AWS Lambda

Event-driven architectures are very straightforward to understand. Given a particular event (a state change, most of the time) happening on A, B is triggered. The simplest example of this pattern is the following: if you have an S3 bucket and you upload a file into it, its state changes. That change of state can trigger an event which can call a Lambda function to execute, giving you the ability to do some computing in that context, for example resizing the newly uploaded image, converting it, etc.

The beauty of this pattern is that unless the state changes, no infrastructure is required and no machine is left idle. Why is this pattern really useful? Well, it makes it easy to offload intensive tasks to AWS and relieves your API backend. It also inherits the scalability, durability and availability offered by managed services like Amazon S3 and AWS Lambda.
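As a sketch of what sits behind that trigger (the actual image processing is omitted and process_image is a hypothetical placeholder), the Lambda handler only needs to read the bucket and key from the S3 event payload:

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Invoked by S3 when an object is created; one record per uploaded object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object s3://{bucket}/{key}")

        # Fetch the object and hand it to your image-processing code
        # (resizing, converting, ...), which is omitted here.
        obj = s3.get_object(Bucket=bucket, Key=key)
        process_image(obj["Body"].read())  # hypothetical helper

def process_image(data):
    """Placeholder for the real work (resize, convert, extract metadata...)."""
    print(f"Received {len(data)} bytes")
```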

That's it for part 1. I hope you have enjoyed it. Please do not hesitate to give feedback, share your own lessons or simply like it :) The next part will be published next week. Stay tuned!

Note: Part 2 is now published.

-Adrian

