Auto-create CloudWatch Alarms for APIs with Lambda

Written by theburningmonk | Published 2018/05/13

TLDR: Yan Cui is an AWS Serverless Hero and the author of Production-Ready Serverless. He explains how to use CloudTrail and CloudWatch Events to automate many day-to-day operational steps with Lambda. For API Gateway, these are manual steps that often get missed: enabling detailed metrics, and creating CloudWatch Alarms for p99 latencies and error counts for each endpoint. The function uses the serverless-iam-roles-per-function plugin to get a tailored IAM role, and needs the apigateway:PATCH permission to enable detailed metrics.

In a previous post we discussed how to auto-subscribe a CloudWatch Log Group to a Lambda function using CloudWatch Events, so that we don’t need a manual process to ensure all Lambda logs go to our log aggregation service.
Whilst this is useful in its own right, it only scratches the surface of what we can do. CloudTrail and CloudWatch Events make it easy to automate many day-to-day operational steps. With the help of Lambda of course ;-)
I work with API Gateway and Lambda heavily. Whenever you create a new API, or make changes, there are several things you need to do:
  • enable Detailed Metrics for the deployment stage
  • set up a dashboard in CloudWatch, showing request count, latencies and error counts
  • set up CloudWatch Alarms for p99 latencies and error counts
Because these are manual steps, they often get missed.
Have you ever forgotten to update the dashboard after adding a new endpoint to your API? And did you also remember to set up a p99 latency alarm on this new endpoint? How about alarms on the number of 4XX or 5XX errors?
Most teams I have dealt with have some conventions around these, but no way to enforce them. The result is that the conventions are applied patchily and cannot be relied upon. I find this approach doesn’t scale with the size of the team.
It works when you’re a small team. Everyone has a shared understanding, and the necessary discipline to follow the convention. When the team gets bigger, you need automation to help enforce these conventions.
Fortunately, we can automate away these manual steps using the same pattern. In the Monitoring unit of my course Production-Ready Serverless, I demonstrated how you can do this in 3 simple steps:
  • CloudTrail captures the CreateDeployment request to API Gateway.
  • A CloudWatch Events pattern matches against this captured request.
  • A Lambda function a) enables detailed metrics, and b) creates alarms for each endpoint.
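To make step 2 a bit more concrete, here is a toy model in Python of how an event pattern selects events. It is only an illustration of exact-value matching, not how the service is implemented, and it ignores the richer operators (prefix, numeric and so on). The `matches` function and the trimmed `event` are made up for this sketch.

```python
def matches(pattern, event):
    """Return True if `event` satisfies `pattern` (exact-value matching only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the pattern lists the allowed values for this field.
            return False
    return True


# The pattern: only CreateDeployment calls to API Gateway recorded by CloudTrail.
pattern = {
    "source": ["aws.apigateway"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["apigateway.amazonaws.com"],
        "eventName": ["CreateDeployment"],
    },
}

# A heavily trimmed event; fields not named in the pattern are ignored.
event = {
    "source": "aws.apigateway",
    "detail-type": "AWS API Call via CloudTrail",
    "region": "us-east-1",
    "detail": {
        "eventSource": "apigateway.amazonaws.com",
        "eventName": "CreateDeployment",
    },
}

other = dict(event, detail=dict(event["detail"], eventName="DeleteDeployment"))

is_match = matches(pattern, event)
is_other_match = matches(pattern, other)
```

The point to take away is that every field named in the pattern has to be present and equal to one of the listed values, while any extra fields in the event are simply ignored.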
If you use the Serverless framework, then you might have a function that looks like this:
auto-create-api-alarms:
  handler: functions/create-alarms.handler  
  events:
    - cloudwatchEvent:
        event:
          source:
            - aws.apigateway
          detail-type:
            - AWS API Call via CloudTrail
          detail:
            eventSource:
              - apigateway.amazonaws.com
            eventName:
              - CreateDeployment
  environment:
    alarm_actions: arn:aws:sns:#{AWS::Region}:#{AWS::AccountId}:NotifyMe
    ok_actions: arn:aws:sns:#{AWS::Region}:#{AWS::AccountId}:NotifyMe
  iamRoleStatements:
    - Effect: Allow
      Action: apigateway:GET
      Resource: 
        - arn:aws:apigateway:#{AWS::Region}::/restapis/*
        - arn:aws:apigateway:#{AWS::Region}::/restapis/*/stages/${self:custom.stage}
    - Effect: Allow
      Action: apigateway:PATCH
      Resource: arn:aws:apigateway:#{AWS::Region}::/restapis/*/stages/${self:custom.stage}
    - Effect: Allow
      Action: cloudwatch:PutMetricAlarm
      Resource: "*"
A couple of things to note from the code above:
  • I’m using the serverless-iam-roles-per-function plugin to give the function a tailored IAM role
  • The function needs the apigateway:PATCH permission to enable detailed metrics
  • The function needs the apigateway:GET permission to get the API name and REST endpoints
  • The function needs the cloudwatch:PutMetricAlarm permission to create the alarms
  • The environment variables specify SNS topics for the CloudWatch Alarms
The captured event looks like this:
{
  "version": "0",
  "id": "dee9a69c-8166-1ad7-41d4-1dad201e29f6",
  "detail-type": "AWS API Call via CloudTrail",
  "source": "aws.apigateway",
  "account": "374852340821",
  "time": "2018-04-09T00:17:47Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
      "eventVersion": "1.05",
      "userIdentity": {
          "type": "IAMUser",
          "principalId": "AIDAIRMUZZEGPO27IPFYW",
          "arn": "arn:aws:iam::374852340821:user/yan.cui",
          "accountId": "374852340821",
          "accessKeyId": "ASIAJNZDKN26DXPZFYQE",
          "userName": "yan.cui",
          "sessionContext": {
              "attributes": {
                  "mfaAuthenticated": "false",
                  "creationDate": "2018-04-09T00:17:30Z"
              }
          },
          "invokedBy": "cloudformation.amazonaws.com"
      },
      "eventTime": "2018-04-09T00:17:47Z",
      "eventSource": "apigateway.amazonaws.com",
      "eventName": "CreateDeployment",
      "awsRegion": "us-east-1",
      "sourceIPAddress": "cloudformation.amazonaws.com",
      "userAgent": "cloudformation.amazonaws.com",
      "requestParameters": {
          "restApiId": "8kbasri6v7",
          "createDeploymentInput": {
              "stageName": "dev"
          },
          "template": false
      },
      "responseElements": {
          "id": "cj2y0f",
          "createdDate": "Apr 9, 2018 12:17:47 AM",
          "deploymentUpdate": {
              "restApiId": "8kbasri6v7",
              "deploymentId": "cj2y0f",
              "template": false
          },
          "deploymentStages": {
              "deploymentId": "cj2y0f",
              "restApiId": "8kbasri6v7",
              "template": false,
              "templateSkipList": [
                  "position"
              ]
          },
          "deploymentDelete": {
              "deploymentId": "cj2y0f",
              "restApiId": "8kbasri6v7",
              "template": false
          },
          "self": {
              "deploymentId": "cj2y0f",
              "restApiId": "8kbasri6v7",
              "template": false
          }
      },
      "requestID": "6e25bd56-3b8b-11e8-a351-e5e3d3161fe7",
      "eventID": "a150d941-7a54-4572-97b2-0614a81fd25b",
      "readOnly": false,
      "eventType": "AwsApiCall"
  }
}
We can find the restApiId and stageName (the latter nested under createDeploymentInput) inside the detail.requestParameters attribute. That’s all we need to figure out what endpoints are there, and so what alarms we need to create.
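As a small sketch of that lookup, assuming a Python handler (the actual handler source isn't reproduced here), the two values could be pulled out like this. `deployment_target` is a made-up helper name, and the sample event is the one above trimmed down to the relevant fields.

```python
def deployment_target(event):
    """Pull the REST API id and stage name out of a CreateDeployment event."""
    params = event["detail"]["requestParameters"]
    rest_api_id = params["restApiId"]
    # stageName sits one level down, under createDeploymentInput.
    # It comes back as None if the request did not carry a stage name.
    stage_name = params.get("createDeploymentInput", {}).get("stageName")
    return rest_api_id, stage_name


sample_event = {
    "detail": {
        "eventName": "CreateDeployment",
        "requestParameters": {
            "restApiId": "8kbasri6v7",
            "createDeploymentInput": {"stageName": "dev"},
            "template": False,
        },
    }
}

rest_api_id, stage_name = deployment_target(sample_event)
```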
Inside the handler function, which you can find here, we perform a few steps:
  • enable detailed metrics with an updateStage call to API Gateway
  • get the list of REST endpoints with a getResources call to API Gateway
  • get the REST API name with a getRestApi call to API Gateway
  • for each of the REST endpoints, create a p99 latency alarm in the AWS/ApiGateway namespace
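The first three of these steps might look roughly like the Python sketch below. This is not the linked handler: it assumes boto3, where `update_stage`, `get_resources` and `get_rest_api` are the snake_case counterparts of the calls listed above. The helper names and `StubApiGateway` are made up, and the stub stands in for `boto3.client("apigateway")` so the example runs without AWS credentials.

```python
def enable_detailed_metrics(apigw, rest_api_id, stage_name):
    """Turn on detailed CloudWatch metrics for every method in the stage."""
    apigw.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=[
            # "/*/*" applies the setting to all resources and all methods.
            {"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"}
        ],
    )


def list_endpoints(apigw, rest_api_id):
    """Return (path, HTTP method) pairs, following pagination via `position`."""
    endpoints = []
    kwargs = {"restApiId": rest_api_id, "limit": 500}
    while True:
        page = apigw.get_resources(**kwargs)
        for resource in page.get("items", []):
            # Resources without any methods have no resourceMethods key.
            for method in resource.get("resourceMethods", {}):
                endpoints.append((resource["path"], method))
        if "position" not in page:
            return endpoints
        kwargs["position"] = page["position"]


def get_api_name(apigw, rest_api_id):
    """Look up the API name, which the alarm dimensions need."""
    return apigw.get_rest_api(restApiId=rest_api_id)["name"]


class StubApiGateway:
    """Hypothetical offline stand-in for boto3.client("apigateway")."""

    def __init__(self):
        self.stage_updates = []

    def update_stage(self, **kwargs):
        self.stage_updates.append(kwargs)

    def get_resources(self, **kwargs):
        if "position" not in kwargs:  # first page
            return {
                "items": [
                    {"path": "/"},
                    {"path": "/users", "resourceMethods": {"GET": {}, "POST": {}}},
                ],
                "position": "page-2",
            }
        return {"items": [{"path": "/users/{id}", "resourceMethods": {"GET": {}}}]}

    def get_rest_api(self, **kwargs):
        return {"id": kwargs["restApiId"], "name": "dev-my-api"}


apigw = StubApiGateway()
enable_detailed_metrics(apigw, "8kbasri6v7", "dev")
endpoints = list_endpoints(apigw, "8kbasri6v7")
api_name = get_api_name(apigw, "8kbasri6v7")
```

Passing the client in as an argument is just a convenience for this sketch. It keeps the functions easy to exercise against a stub before pointing them at a real account.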
Now, every time I create a new API, I will have CloudWatch Alarms to alert me when the 99th percentile latency for an endpoint goes over 1 second for 5 minutes in a row.
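For illustration, the parameters for such an alarm could be built as in the Python sketch below, ready to pass to boto3's `cloudwatch.put_metric_alarm`. Several details are assumptions on my part rather than taken from the handler: "5 minutes in a row" is modelled as five consecutive 1-minute periods, and the alarm name format, the API name and the SNS topic ARN with its account id are placeholders.

```python
def p99_latency_alarm(api_name, stage, path, method, alarm_actions, ok_actions):
    """Build keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        # Hypothetical naming scheme; anything unique per endpoint works.
        "AlarmName": f"API [{api_name}] stage [{stage}] {method} {path} : p99 > 1s",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",  # reported by API Gateway in milliseconds
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Resource", "Value": path},
            {"Name": "Method", "Value": method},
            {"Name": "Stage", "Value": stage},
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,             # seconds per datapoint (assumed)
        "EvaluationPeriods": 5,   # 5 x 60s = 5 minutes in a row (assumed)
        "Threshold": 1000,        # 1 second, expressed in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": alarm_actions,
        "OKActions": ok_actions,
    }


# Placeholder topic ARN; the function in this post reads it from its environment.
topic = "arn:aws:sns:us-east-1:123456789012:NotifyMe"

params = p99_latency_alarm(
    api_name="dev-my-api",
    stage="dev",
    path="/users",
    method="GET",
    alarm_actions=[topic],
    ok_actions=[topic],
)
# With boto3 this would be: boto3.client("cloudwatch").put_metric_alarm(**params)
```

The per-endpoint dimensions (Resource and Method) only receive data once detailed metrics are enabled on the stage, which is why that step comes first.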
All this, with just a few lines of code :-)
You can take this further, and have other Lambda functions to:
  • create CloudWatch Alarms for 5xx errors for each endpoint
  • create a CloudWatch Dashboard for the API
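As a starting point for the first idea, a 5xx alarm differs from the latency alarm mainly in the metric and the statistic. The Python sketch below builds parameters for boto3's `cloudwatch.put_metric_alarm`. The threshold, period, missing-data handling, alarm name format and SNS topic ARN are all assumptions to tune, not recommendations from this post.

```python
def five_xx_alarm(api_name, stage, path, method, alarm_actions, threshold=1):
    """Build put_metric_alarm arguments for 5xx errors on one endpoint."""
    return {
        # Hypothetical naming scheme.
        "AlarmName": f"API [{api_name}] stage [{stage}] {method} {path} : 5xx errors",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Resource", "Value": path},
            {"Name": "Method", "Value": method},
            {"Name": "Stage", "Value": stage},
        ],
        "Statistic": "Sum",      # count of 5xx responses in the period
        "Period": 60,            # assumed: evaluate minute by minute
        "EvaluationPeriods": 1,  # assumed: alert on the first bad minute
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Assumed choice: minutes with no traffic should not trip the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": alarm_actions,
    }


error_alarm = five_xx_alarm(
    api_name="dev-my-api",
    stage="dev",
    path="/users",
    method="POST",
    alarm_actions=["arn:aws:sns:us-east-1:123456789012:NotifyMe"],  # placeholder
)
```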
So there you have it, a useful pattern for automating away manual ops tasks!
And before you even have to ask, yes I’m aware of this serverless plugin by the ACloudGuru folks. It looks neat, but it’s ultimately still something the developer has to remember to do.
That requires discipline.
My experience tells me that you cannot rely on discipline, ever. Which is why I prefer to have a platform in place that will generate these alarms instead.
Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale in AWS for nearly 10 years and I have been an architect or principal engineer in a variety of industries, ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
You can contact me via Email, Twitter and LinkedIn.

Written by theburningmonk | AWS Serverless Hero. Independent Consultant. Developer Advocate at Lumigo.
Published by HackerNoon on 2018/05/13