You need to sample debug logs in production

Written by theburningmonk | Published 2018/04/28
Tech Story Tags: aws | sample-debug-logs | debug-logs-in-production | bugs | production


It’s common practice to set the log level to WARNING in production due to traffic volume. This is because we have to consider various cost factors:

  • cost of logging: CloudWatch Logs charges $0.50 per GB ingested. In my experience, this is often much higher than the Lambda invocation costs.
  • cost of storage: CloudWatch Logs charges $0.03 per GB per month, and its default retention policy is Never Expire! A common practice is to ship your logs to another log aggregation service and to set the retention policy to X days. See this post for more details.
  • cost of processing: if you’re processing the logs with Lambda, then you also have to factor in the cost of Lambda invocations.
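
To make the first two numbers concrete, here is a back-of-the-envelope sketch. It uses only the two prices quoted above; the function name, the 100 GB per month volume and the 12-month horizon are made-up assumptions, and it ignores compression, free tiers and processing costs.

```javascript
// Prices as quoted in this post (USD). Check current CloudWatch Logs pricing
// before relying on them.
const INGESTION_PER_GB = 0.5;
const STORAGE_PER_GB_MONTH = 0.03;

// Hypothetical helper: estimates the bill for one month, assuming a constant
// ingestion volume and a retention policy of Never Expire.
function estimateMonthlyLogCost(gbPerMonth, monthsRunning) {
  const ingestion = gbPerMonth * INGESTION_PER_GB;
  // Nothing expires, so the stored volume keeps growing every month.
  const storage = gbPerMonth * monthsRunning * STORAGE_PER_GB_MONTH;
  return { ingestion, storage, total: ingestion + storage };
}

// 100 GB/month after a year: about $50 for ingestion plus about $36 for storage.
console.log(estimateMonthlyLogCost(100, 12));
```

Under these assumptions, ingestion dominates at first, but the storage line keeps growing for as long as nothing expires.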

But doing so leaves us without ANY debug logs in production. When a problem happens in production, you won’t have the debug logs to help identify the root cause.

Instead, you have to waste precious time deploying a new version of your code to enable debug logging. Not to mention that you have to remember to disable debug logging again when you deploy the fix.

With microservices, you often have to do this for more than one service to get all the debug messages you need.

All of this increases the mean time to recovery (MTTR) during an incident. That’s not what we want.

It doesn’t have to be like that.

There is a happy middle ground between having no debug logs and having all the debug logs: we should sample debug logs from a small percentage of invocations.

I demoed how to do this in the Logging chapter of my video course Production-Ready Serverless. You need two basic things:

  • a logger that lets you change the logging level dynamically, e.g. via environment variables
  • a middleware engine such as middy

With Lambda, I don’t need most of the features from a fully-fledged logger such as pino. Instead, I prefer to use a simple logger module like this one. It’s written in a handful of lines and gives me the basics:

  • structured logging with JSON
  • ability to log at different levels
  • ability to control the log level dynamically via environment variables
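
A logger with those three properties might look roughly like the sketch below. This is not the module linked above: the `LOG_LEVEL` variable name, the level names, and the choice to default to DEBUG when the variable is unset are all assumptions made for illustration.

```javascript
// Numeric ranks so that levels can be compared (level names are assumptions).
const LEVELS = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

// The environment variable is read on every call, so changing
// process.env.LOG_LEVEL at runtime takes effect on the very next log line.
function isEnabled(level) {
  const configured = LEVELS[process.env.LOG_LEVEL];
  const threshold = configured === undefined ? LEVELS.DEBUG : configured;
  return LEVELS[level] >= threshold;
}

// Writes one JSON object per line, which log aggregators can parse easily.
function log(level, message, params) {
  if (!isEnabled(level)) return;
  console.log(JSON.stringify({ ...params, level, message }));
}

const logger = {
  debug: (message, params) => log('DEBUG', message, params),
  info: (message, params) => log('INFO', message, params),
  warn: (message, params) => log('WARN', message, params),
  error: (message, params) => log('ERROR', message, params),
};

// Printed only when LOG_LEVEL is DEBUG (or unset, in this sketch).
logger.debug('fetching order', { orderId: 42 });
```

Because the level is looked up on each call rather than cached at start-up, anything that changes `process.env.LOG_LEVEL` during an invocation can switch debug logging on and off.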

Using middy, I can create a middleware that dynamically updates the log level to DEBUG for a configurable percentage of invocations. At the end of the invocation, the middleware restores the previous log level.
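
As a minimal sketch of that idea (not the exact middleware from the course), the factory below returns a middleware object with `before`, `after` and `onError` hooks. It assumes the async hook style of middy v2 and later, where each hook receives a `request` object; older middy versions use `(handler, next)` callbacks instead. The names `sampleLogging`, `sampleRate`, `random` and the `LOG_LEVEL` variable are hypothetical.

```javascript
// Hypothetical factory for a middy-style middleware (assumes middy v2+ hooks).
// `random` is injectable only to make the sampling decision testable.
function sampleLogging({ sampleRate = 0.01, random = Math.random } = {}) {
  let previousLevel;
  let sampled = false;

  // Put LOG_LEVEL back the way it was before this invocation was sampled.
  const restore = () => {
    if (!sampled) return;
    if (previousLevel === undefined) delete process.env.LOG_LEVEL;
    else process.env.LOG_LEVEL = previousLevel;
    sampled = false;
  };

  return {
    before: async () => {
      if (random() < sampleRate) {
        previousLevel = process.env.LOG_LEVEL;
        process.env.LOG_LEVEL = 'DEBUG';
        sampled = true;
      }
    },
    after: async () => restore(),
    onError: async (request) => {
      // Log the failure with its context whether or not this invocation was
      // sampled. The field names here are illustrative.
      console.error(JSON.stringify({
        level: 'ERROR',
        message: 'invocation failed',
        awsRequestId: request.context && request.context.awsRequestId,
        event: request.event,
        errorName: request.error && request.error.name,
        errorMessage: request.error && request.error.message,
        stackTrace: request.error && request.error.stack,
      }));
      restore();
    },
  };
}

// Assumed usage with middy installed:
//   const middy = require('@middy/core');
//   module.exports.handler = middy(baseHandler).use(sampleLogging({ sampleRate: 0.01 }));
```

Keeping the state in a closure works here because a Lambda container handles one invocation at a time; a runtime that serves requests concurrently in one process would need a different approach.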

The middleware also has some special handling for when the invocation errs. This is to ensure we capture the error with as much context as possible.

Having debug logs for a small percentage of invocations is great. But when you’re dealing with microservices, you need to make sure that your debug logs cover an entire call chain.

That is the only way to put together a complete picture of everything that happened on that call chain. Otherwise, you will end up with fragments of debug logs from many call chains but never the complete picture of one.

You can do this by forwarding the decision to turn on debug logging as a correlation ID. The next function in the chain would respect this decision, and pass it on. See this post for more detail.
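
One way to sketch this for HTTP calls is shown below. The header names `debug-log-enabled` and `x-correlation-id` and both function names are assumptions for illustration; the same decision could travel in message attributes or the event payload for non-HTTP invocations.

```javascript
// Hypothetical header that carries the sampling decision downstream.
const DEBUG_HEADER = 'debug-log-enabled';

// Respect the caller's decision if one was forwarded; otherwise this function
// is the start of the call chain and makes a fresh sampling decision.
function shouldEnableDebug(incomingHeaders = {}, sampleRate = 0.01, random = Math.random) {
  // Header names are case-insensitive, so normalise the keys before the lookup.
  const headers = Object.fromEntries(
    Object.entries(incomingHeaders).map(([key, value]) => [key.toLowerCase(), value])
  );
  const forwarded = headers[DEBUG_HEADER];
  if (forwarded !== undefined) return String(forwarded) === 'true';
  return random() < sampleRate;
}

// Build the headers for outgoing calls so the next function in the chain
// sees the same correlation ID and the same debug decision.
function outgoingHeaders(correlationId, debugEnabled) {
  return {
    'x-correlation-id': correlationId,
    [DEBUG_HEADER]: String(debugEnabled),
  };
}
```

With this shape, only the first function in a chain rolls the dice; every function after it either logs at DEBUG for the whole chain or not at all.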

So that’s it, another pro tip on how to build observability into your serverless application. If you want to learn more about how to go all in with serverless, check out my 10-step guide here.

Until next time!

Like what you’re reading but want more help? I’m happy to offer my services as an independent consultant and help you with your serverless project — architecture reviews, code reviews, building proof-of-concepts, or offer advice on leading practices and tools.

I’m based in London, UK and currently the only UK-based AWS Serverless Hero. I have nearly 10 years of experience with running production workloads in AWS at scale. I operate predominantly in the UK but I’m open to travelling for engagements that are longer than a week. To see how we might be able to work together, tell me more about the problems you are trying to solve here.

I can also run in-house workshops to help you get production-ready with your serverless architecture. You can find out more about the two-day workshop here, which takes you from the basics of AWS Lambda all the way through to common operational patterns for log aggregation, distributed tracing and security best practices.

If you prefer to study at your own pace, then you can also find all the same content of the workshop as a video course I have produced for Manning. We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).


Written by theburningmonk | AWS Serverless Hero. Independent Consultant. Developer Advocate at Lumigo.
Published by HackerNoon on 2018/04/28