Why Pingdom wasn’t enough for me today

Disclaimer: I am the co-founder of Assertible. This is a success story about building and dogfooding a product that solves my own problems.

Today there was a brief outage in one of my APIs. The series of events that led me to identify the issue made me realize just how important effective notifications are in an API monitoring tool. I wanted to outline what happened, and how Assertible helped me identify the problem more quickly than Pingdom.

This particular web service is one that I would consider critical; users depend on its availability to run their own services. Uptime monitoring is set up with Pingdom, and more in-depth validations are running on Assertible. Both services continuously check the API every few minutes, and both are set up to alert the team if any of those checks fail.

So here’s what happened

A pull request was merged, and CircleCI started building the app and preparing it for deployment. This is a routine process that happens several times every day. I stepped out of the office, but (fortunately) had my phone with me to receive alerts.

At 1:22 PM, I received a notification from Pingdom that the service was DOWN. The default alerts sent by Pingdom don’t provide any meaningful information about the outage — definitely not enough to act on:

Thanks, Pingdom.

What the heck? This is when I first knew something was wrong with my API. In a panic, I pulled up the AWS console app on my phone, but before it even loaded I received a second downtime alert from Assertible:

Assertible’s failure alert — within 1 minute of Pingdom’s

Bingo! From the Assertible failure alert, I immediately knew the issue was somewhere in the AWS deployment. We’ve seen 503 status codes on numerous occasions during deployments. Although extremely inconvenient, this wasn't a rare occurrence.

I watched the AWS events as it repaired and redeployed the failing instances on its own. After just a few minutes, the API was back up and everything was healthy. I could breathe again.

When I got back to the office, I was able to corroborate what I had inferred from the Assertible alert by looking at the AWS event log. AWS had failed to deploy the new application.

By this time everything was operating normally, so I didn’t have to take any action.

The moral of the story here is that, sometimes, a simple ping is not enough. Web services are complicated beasts, and each one has its own unique way of behaving. They should be continuously validated for the business logic they’re built to provide.

Three things about my Assertible setup were the key factors in finding the root cause of this issue in under 3 minutes:

It was running health checks on a schedule.
It was set up to send API failure alerts.
It had HTTP assertions to validate expected status codes.

Context is key, and the default Pingdom alerts do not provide that.
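To make the difference between a ping and an assertion concrete, here is a minimal Python sketch of a status-code check that returns an alert message with some context in it. This is only an illustration and not how Assertible or Pingdom is implemented; the function names `fetch_status` and `run_check` and the example URL are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, but the status
        # code is exactly the context we want in the alert, so return it.
        return err.code


def run_check(url, expected_status=200, fetch=fetch_status):
    """Run one status-code assertion and return (passed, message)."""
    try:
        actual = fetch(url)
    except OSError as err:
        # URLError, timeouts, and refused connections are all OSError subclasses.
        return False, f"FAIL {url}: no response ({err})"
    if actual != expected_status:
        return False, f"FAIL {url}: expected status {expected_status}, got {actual}"
    return True, f"PASS {url}: status {actual}"


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real health-check URL.
    passed, message = run_check("https://api.example.com/health")
    print(message)
```

Run from a scheduler every few minutes, a failing check would produce a message such as "expected status 200, got 503", which is the kind of detail that pointed me at the deployment rather than leaving me guessing.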

Don’t get me wrong: I will continue using Pingdom. But Assertible will always be running alongside it, doing more in-depth checks and validation on my web services.

I’m happy that I built a tool that solves my own problems, and I hope other people will find this useful in determining what’s important in web service monitoring.

:: @CodyReichert