Using Lambdas in Production

Written by cazzer | Published 2020/07/04
Tech Story Tags: serverless | aws-lambda | aws | devops | cicd | amazon | coding | software-development

TLDR This is a cheat sheet of things you might want to remember when shipping something new to ensure it runs successfully. At Volta, we exclusively use serverless services because they are the smartest option for our workloads if we remember to support them correctly. The most important thing is to ensure consistency: writing your infrastructure as code gives you a way to document and deploy your Lambdas, while also enabling infrastructure conversations through code reviews.

I have been using Lambda in production for about four years now personally, and for three years professionally at Volta. Initially, I shipped Lambdas because it was easier than managing servers. At Volta, we now exclusively use serverless services because they are the smartest option for our workloads if we remember to support them correctly. This is a cheat sheet, a checklist of all the things you might want to remember when shipping something new to ensure it runs successfully.

Infrastructure as Code

Regardless of what support you like to build into your Lambdas, the most important thing to do is to ensure consistency. If you’ve deployed CloudWatch alarms for one of them, it can be quite a surprise to see older functions fail silently because they predate your alarm strategy. Writing your infrastructure as code gives you a way to document and deploy your Lambdas, while also enabling infrastructure conversations through code reviews.
A standard CloudFormation file.
I personally use CloudFormation because I love writing hundreds of lines of YAML, and because I got comfortable with it back when the Serverless Framework didn’t quite have feature parity and Terraform didn’t have remote state. If I could do it again, I would spend more time exploring Terraform or Terragrunt.
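For reference, here is a minimal sketch of such a template. The function name, bucket, and key are placeholders for illustration, not anything Volta actually ships:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  MyFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # CloudWatch Logs permissions only; add more as your function needs them
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  MyFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: my-function  # placeholder name
      Runtime: nodejs12.x
      Handler: index.handler
      Role: !GetAtt MyFunctionRole.Arn
      Code:
        S3Bucket: my-artifact-bucket  # placeholder bucket holding the zipped code
        S3Key: my-function.zip
      MemorySize: 256
      Timeout: 30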

CloudWatch Alarms

If you’re mesmerized by graphs like I am, you probably already spend time looking at CloudWatch Metrics. Let’s take it one step further and turn them into something actionable; something which can wake you up if absolutely necessary. There are four basic measurable properties of a Lambda:
  • Invocation rate
  • Invocation duration
  • Error rate
  • Throttle rate
If you’re really ambitious, you can also add CPU and RAM usage to the list. These are important because they all characterize the Lambda’s workload. By setting expectations for each of these properties as CloudWatch Alarms, you essentially get an abstract test, and we all want more test coverage, right?
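As a sketch, an alarm on the error count might look like this in CloudFormation. The function name and the AlertTopic SNS topic are placeholders carried over from the earlier sketch:

FunctionErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: my-function reported one or more errors
    Namespace: AWS/Lambda
    MetricName: Errors
    Dimensions:
      - Name: FunctionName
        Value: my-function  # placeholder function name
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 5
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching  # no invocations means no errors
    AlarmActions:
      - !Ref AlertTopic  # placeholder SNS topic that notifies (or wakes) you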
To make things more interesting, you can even use a few math expressions to compare metrics. Another recent addition is anomaly detection, which is very valuable if you expect your Lambda’s performance to vary over time.
Look! A wild anomaly!

Continuous Deployment

Traffic Shifting
For me, developer experience generally falls into two buckets: writing less code and deploying more confidently. Nothing has made a more significant impact on my deployment confidence than traffic shifting; it is that magical. This feature basically gives Lambda the ability to slowly move invocations from the old version of your Lambda to a new version, while monitoring some CloudWatch Alarms along the way to see if it should roll back.
Have I gotten too cocky and deployed issues my alarms didn’t catch? Yes. Should they support more than just time-based deployment options? Yes! Is it annoying they call it by three different names throughout their documentation? Absolutely! But all that aside, Traffic Shifting gives you the power of blue/green testing in just a few lines of CloudFormation and makes it easier to test in production and release more confidently on a Friday night (if you’re into that sort of thing).
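Here is roughly what those few lines look like with AWS SAM, which sets up the CodeDeploy plumbing for you. The alias name and the alarm reference are assumptions carried over from the earlier sketches:

MyFunction:
  # assumes Transform: AWS::Serverless-2016-10-31 at the top of the template
  Type: AWS::Serverless::Function
  Properties:
    Handler: index.handler
    Runtime: nodejs12.x
    CodeUri: ./src
    AutoPublishAlias: live  # publishes a new version and points this alias at it
    DeploymentPreference:
      Type: Linear10PercentEvery10Minutes  # one of the time-based options mentioned above
      Alarms:
        - !Ref FunctionErrorAlarm  # roll back automatically if this alarm fires mid-deploy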
Automated Build Pipeline
I’ve spent a lot of time thinking about how to make CircleCI work well for mono-repositories, and I think I’ve settled on a pretty good configuration for my needs. That said, there are countless services built for this purpose that may fit your needs better. The most important feature of a build pipeline is that it enables you to quickly release new versions of your Lambdas in both hotfix and feature release scenarios.
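A minimal CircleCI sketch of that idea follows; the deploy script is hypothetical, and in practice it would wrap aws cloudformation deploy or sam deploy:

version: 2.1
workflows:
  build-and-deploy:
    jobs:
      - test
      - deploy:
          requires: [test]
          filters:
            branches:
              only: master  # only deploy from the main branch
jobs:
  test:
    docker:
      - image: circleci/node:12
    steps:
      - checkout
      - run: npm ci
      - run: npm test
  deploy:
    docker:
      - image: circleci/python:3.8
    steps:
      - checkout
      - run: pip install --user awscli
      - run: ./scripts/deploy.sh  # hypothetical wrapper around `aws cloudformation deploy`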

Distributed Tracing

The first thing you’ll notice after deploying and running a Lambda in AWS is that CloudWatch Logs was not designed for Lambda. Log groups contain multiple invocations and make no effort to visually separate one invocation from another, making it incredibly hard to parse what is actually going on. Come on AWS, just group by invocation ID already!
The other problem is that Lambdas are quite often invoked by things, and occasionally emit something as well. This concept of tracing across resources is present in AWS X-Ray, but is perfected by third-party services such as Epsagon. With a simple instrumentation call, they capture the event which invoked the Lambda and can visualize each invocation separately. If one Lambda emits an SNS message which invokes another Lambda, you can even see both invocations in one trace. Problem solved.
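If you want to start with X-Ray before reaching for a third party, enabling it is a single property on the function. A sketch, continuing the hypothetical function from earlier:

MyFunction:
  Type: AWS::Lambda::Function
  Properties:
    # ...handler, role, and code as before...
    TracingConfig:
      Mode: Active  # samples incoming requests and sends traces to X-Ray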

Dead-Letter Queues

Lest we forget the oft-forgotten invocations that errored out so many paragraphs before. A significant concern with using Lambda is responding slowly, or, worse yet, not at all. A dead-letter queue to catch your unprocessable events is a good way to ensure you at least have a record of what you could not handle, and it also makes it easy to reprocess the events after you’ve improved your function.
The practice behind this is just as important: let your Lambdas error out. Exiting doesn’t hurt the next invocation as it would in a conventional server-full environment since it’s just an execution failing, not the entire service. In fact, Lambda does a lot to account for these failures, such as retrying the invocation for you. This also makes tracking issues easier in CloudWatch, or your other favorite monitoring tool. In short: when in doubt, throw it out.
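Wiring one up is a queue plus one property on the function, as in this sketch. Note that the function’s execution role also needs sqs:SendMessage on the queue, and the dead-letter config only applies to asynchronous invocations:

DeadLetterQueue:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 1209600  # 14 days, the maximum SQS allows
MyFunction:
  Type: AWS::Lambda::Function
  Properties:
    # ...handler, role, and code as before...
    DeadLetterConfig:
      TargetArn: !GetAtt DeadLetterQueue.Arn  # failed async events land here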

Strict IAM Policies

The Principle of Least Privilege is another guardrail that frequently goes underutilized. IAM policies let you specify exactly which resources and actions a role may access, but they also let you grant unnecessarily wide access with a little *. This is generally a bad idea because if someone compromises your service, they can use that role to impact other services as well. For example, if your Lambda should be able to read from a Dynamo table, use this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "QueryMyTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:Query",
        "dynamodb:Scan"
      ],
      "Resource": "arn:aws:dynamodb:region:account-id:table/MyTable"
    }
  ]
}
Alternatively, if you had used "Action": "*" and "Resource": "arn:aws:dynamodb:*", you would be allowing that Lambda to call DeleteTable on every table in your account.

More to Come

I’m sure this list will continue to grow as AWS adds new supporting features, and as folks like you point out things I’ve missed. Until then, I’ll see you in production.
