How to Monitor a Service: The Fundamental Framework

Written by MeTroFuN | Published 2018/02/26
Tech Story Tags: aws | cloud-computing | service-design | google-cloud-platform | monitoring


Icons made by Freepik from www.flaticon.com, used under CC BY

The place to start when planning monitoring for a new service, and a checklist when revisiting an existing one.

Motivation

While well-architected systems are fault-tolerant and can continue operating correctly when some of their components fail, persistent failures are still highly undesirable and can lead to degraded performance or even system collapse. With well-planned monitoring, however, you should be able to:

  • Anticipate disruptions
  • Quickly identify the source of problems
  • Trigger automated recovery processes
  • Trigger alarms

Method

Service monitoring is a broad topic with numerous sub-topics to choose from. It is also a core piece of system design that must be in place before we can launch a service. Thus, we need a principled, structured way of reasoning about and evaluating our monitoring strategies.

Every service is different: it can be composed of smaller services (e.g., microservices) and requires its own set of metrics. Nevertheless, an atomic service almost always comprises an application that encodes the business logic, a compute infrastructure that runs it, a number of dependencies, and a network infrastructure to exchange data with those dependencies and with users. These basic building blocks provide an excellent top-level structure for monitoring analysis.

In this article I want to offer a conceptual monitoring framework that can be applied to a variety of service architectures, from a startup-style MEAN stack to enterprise microservices in the cloud. It should also serve as a good starting point when planning monitoring for a new service, or as a checklist when revisiting an existing one.

Application

First, we start with application-level monitoring. It is the most difficult pillar to get right and the most critical one: it answers the foremost question, "Is my service running?", whereas all the other monitoring levels only help us pinpoint the cause of a problem.

In practice, "running" translates into many different metrics and dashboards, so the best way to split this complexity further is along the following lines:

  • Business Key Performance Indicators (KPI)
  • End-user Experience (EUE)
  • Service-Level Agreements (SLA)

In practice, however, these subsets are never entirely disjoint.

Business Key Performance Indicators (KPI)

Your business metrics are the best proxy for determining whether your service runs as expected. For example, for an eCommerce website, your KPIs would include metrics such as "cart abandonment rate", "average order value", "products per order" and so forth. However, this set of metrics varies the most from service to service, so you should collaborate early on with your product team to agree on a minimum set of metrics. A small sketch of exposing such a metric follows below.
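As an illustration, here is a minimal sketch of exposing business KPIs as application metrics, assuming a Node.js service instrumented with the prom-client Prometheus library. The metric names and the checkout flow are hypothetical and only stand in for whatever KPIs your product team agrees on.

```typescript
// A minimal sketch of exposing business KPIs as Prometheus metrics
// via prom-client. Metric names and the checkout flow are illustrative.
import client from "prom-client";

// Count started checkouts and placed orders; the ratio approximates
// the cart abandonment rate once graphed in your dashboarding tool.
const checkoutsStarted = new client.Counter({
  name: "checkouts_started_total",
  help: "Number of checkout flows started",
});
const ordersPlaced = new client.Counter({
  name: "orders_placed_total",
  help: "Number of orders successfully placed",
  labelNames: ["payment_method"],
});

export function onCheckoutStarted(): void {
  checkoutsStarted.inc();
}

export function onOrderPlaced(paymentMethod: string): void {
  ordersPlaced.inc({ payment_method: paymentMethod });
}

// Expose all registered metrics, e.g., from an HTTP handler at /metrics.
export async function metricsText(): Promise<string> {
  return client.register.metrics();
}
```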

End-user Experience (EUE)

When applicable, the user experience is best tracked at the browser level using Real User Monitoring (RUM). RUM typically relies on JavaScript injected into the page to report measurements back from real users' browsers. At a minimum, you should monitor:

  • First Paint or First Contentful Paint
  • First Meaningful Paint
  • Time to Interactive

Moreover, group these data by browser, platform, and region. See the "User-centric Performance Metrics" talk from Google I/O 2017 for details on how to implement this; a minimal sketch follows below.

“User-centric Performance Metrics” by Philip Walton, used under CC BY
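For the paint timings, a minimal browser-side sketch might look like the following. It assumes a hypothetical /rum collector endpoint on your backend; First Meaningful Paint and Time to Interactive have no standard browser API, so they would require the web-vitals library or a custom heuristic on top of this.

```typescript
// Report First Paint and First Contentful Paint from real users' browsers.
// The /rum collector endpoint is an assumption for this sketch.
function reportPaintTimings(endpoint: string = "/rum"): void {
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      // entry.name is "first-paint" or "first-contentful-paint"
      const payload = JSON.stringify({
        metric: entry.name,
        value: entry.startTime,
        userAgent: navigator.userAgent,
        page: location.pathname,
      });
      // sendBeacon survives page unloads better than fetch or XHR
      navigator.sendBeacon(endpoint, payload);
    }
  });
  observer.observe({ type: "paint", buffered: true });
}

reportPaintTimings();
```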

Server-side performance monitoring also provides insight into end-user performance; however, we discuss it as part of the service SLA.

One more technique you should employ for collecting both client-side and server-side end-user experience data is "Synthetic Transaction Monitoring", which involves running an external agent that executes pre-recorded use cases at regular intervals and mimics real user behavior.
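As a sketch of what such an agent might look like, the following Node.js (18+) script replays a simple two-step flow against an assumed https://shop.example.com and records success and latency. A real setup would push the result to a metrics backend rather than log it.

```typescript
// A minimal synthetic-transaction agent: replay a user flow on a schedule
// and record success and latency. URLs and the flow are assumptions.
const BASE_URL = "https://shop.example.com";

async function runScenario(): Promise<void> {
  const start = Date.now();
  let ok = false;
  try {
    // Step 1: load the landing page
    const home = await fetch(`${BASE_URL}/`);
    // Step 2: search for a product, mimicking a real user
    const search = await fetch(`${BASE_URL}/search?q=monitor`);
    ok = home.ok && search.ok;
  } catch {
    ok = false;
  }
  const latencyMs = Date.now() - start;
  // In practice, push these to your metrics backend instead of logging
  console.log(JSON.stringify({ scenario: "browse-and-search", ok, latencyMs }));
}

// Run the scenario every minute
setInterval(runScenario, 60_000);
runScenario();
```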

Service-Level Agreements (SLA)

The SLA (Service Level Agreement) is a promise or contract from the service provider to the customer and often includes:

  • Availability. Monitor request rates: total, by API, and optionally by client. Also, measure the ratio of failures to total requests; e.g., if you rely on HTTP(S), monitor 5xx error codes and keep an eye on 4xx errors (see the middleware sketch after this list).

  • Latency. Measure both client-side and server-side latencies per API method. If your clients come from different regions, make sure to group client-side latencies by region. If your end users access your service via a browser, you can obtain client-side latency using the Resource Timing API. Otherwise, rely on latencies reported by your Synthetic Transaction Monitoring or canary tests. You may read more on canaries here.

  • Durability. Durability is a tricky measure of data persistence that answers the question "Will my data still be there in the future?". In practice, such SLAs are based on historical statistics of data loss. A good proxy to start with is monitoring internal application errors: have log watchers report on "ERROR" or "EXCEPTION" entries in your logs, and, when applicable, monitor browser-side JavaScript errors as well.

  • Consistency. In the database and distributed-systems communities, "consistency" has different definitions; here we refer to the latter. There are two dimensions of consistency: staleness and ordering. Monitoring consistency in a cost-effective way is hard, but starting with staleness monitoring is easier. For example, you may have a separate scenario as part of Synthetic Transaction Monitoring that creates and removes objects and checks how soon the effect becomes observable (see the staleness probe sketch after this list).
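For availability and server-side latency, here is a minimal sketch of an Express middleware instrumented with prom-client. The error ratio (5xx over total) and per-route p99 latency can then be derived in your dashboarding tool; the route labels, buckets, and port are illustrative.

```typescript
// Count requests by status class and record per-route latency so that
// error rates and p99 latency can be computed downstream.
import express from "express";
import client from "prom-client";

const requestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["route", "status_class"],
});
const requestLatency = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Server-side request latency",
  labelNames: ["route"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const app = express();

app.use((req, res, next) => {
  const end = requestLatency.startTimer({ route: req.path });
  res.on("finish", () => {
    end();
    const statusClass = `${Math.floor(res.statusCode / 100)}xx`;
    requestsTotal.inc({ route: req.path, status_class: statusClass });
  });
  next();
});

app.get("/metrics", async (_req, res) => {
  res.type(client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(8080);
```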
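And for staleness, a probe along the following lines could run inside the synthetic agent shown earlier: write a marker object through the public API, poll the read path until it becomes visible, and report how long that took. The /items endpoints are hypothetical; adapt them to your own API.

```typescript
// Measure write-to-read staleness through the public API (hypothetical
// /items endpoints); returns milliseconds, or -1 if never observed.
const API = "https://api.example.com";

async function measureStalenessMs(): Promise<number> {
  const id = `probe-${Date.now()}`;
  const written = Date.now();
  await fetch(`${API}/items/${id}`, {
    method: "PUT",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ probe: true }),
  });

  try {
    // Poll the read path until the write is observable (or we give up)
    while (Date.now() - written < 30_000) {
      const res = await fetch(`${API}/items/${id}`);
      if (res.ok) return Date.now() - written;
      await new Promise((r) => setTimeout(r, 250));
    }
    return -1; // never became visible within the timeout
  } finally {
    // Clean up the probe object so probes do not accumulate
    await fetch(`${API}/items/${id}`, { method: "DELETE" });
  }
}
```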

Compute Infrastructure

It does not matter whether you rely on serverless computing such as AWS Lambda and Google Cloud Functions, or rent dedicated servers at Hetzner: at some point it will fail, either due to physical malfunction, a data center outage, or resource exhaustion. Resource exhaustion, in turn, can be caused by application memory leaks, broken log rotation, fleet capacity misconfiguration, or a DoS attack.

To better differentiate between these failure modes and assist in identifying the cause, we further split compute infrastructure monitoring into three tiers: CPU, memory, and disk usage, plus fleet provisioning. Additionally, for each metric, monitor aggregated statistics (mean, p99) per host class, fleet, and region where applicable. A minimal host-level collection sketch follows after the lists below.

Here are some metrics you should consider when monitoring your compute infrastructure.

CPU Usage

  • CPU utilization and CPU load
  • Workload versus CPU utilization ratio (cost-effectiveness)
  • Process and threads count

Memory Usage

  • System memory used (total and percentage)
  • Swap space
  • Application heap used (total and percentage)
  • Garbage collection count and time spent (when applicable)

Disk Usage

  • Disk space used (total and percentage per partition, e.g., /local, /tmp)
  • Number of open file descriptors
  • Inode usage percentage

Provisioning

  • Active hosts versus total hosts in host class/fleet
  • Total hosts versus available (e.g., AWS EC2 has limits per account)
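For illustration, a basic host-level sample can be collected with nothing but Node.js built-ins, although in production you would typically run an agent such as node_exporter or the CloudWatch agent instead. The file-descriptor figure below is a Linux-specific assumption (read from /proc).

```typescript
// Sample CPU load, memory, application heap, and open file descriptors
// using Node.js built-ins. The /proc read is Linux-only.
import os from "node:os";
import { readFileSync } from "node:fs";

function sampleHostMetrics() {
  const [load1] = os.loadavg();                   // CPU load (1-minute average)
  const cpuUtilProxy = load1 / os.cpus().length;  // rough utilization proxy
  const memUsedPct = 100 * (1 - os.freemem() / os.totalmem());
  const heap = process.memoryUsage();             // application heap usage

  // Linux only: first field of /proc/sys/fs/file-nr is allocated FDs
  const [fdsOpen] = readFileSync("/proc/sys/fs/file-nr", "utf8").trim().split(/\s+/);

  return {
    load1,
    cpuUtilProxy,
    memUsedPct,
    heapUsedBytes: heap.heapUsed,
    openFileDescriptors: Number(fdsOpen),
  };
}

console.log(sampleHostMetrics());
```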

Dependencies

Modern server-side applications depend heavily on external services. Think of your payment processing system, Single Sign-On (SSO) authentication, or advertisement APIs. Even old-fashioned monolithic services usually rely on a separate database.

While a specific dependency may have its own set of domain-specific metrics, make sure to start by monitoring the least common denominator (see the wrapper sketch after this list):

  • Availability (e.g., errors, timeouts)
  • Latency (mean, p99)
  • Throughput for Reads and Writes (mean, p99)
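One common way to obtain these metrics uniformly is to route every outbound call through a small wrapper. The sketch below assumes prom-client and Node 18+ fetch; the dependency name, timeout, and outcome labels are illustrative.

```typescript
// Wrap calls to external dependencies so latency, errors, and timeouts
// are measured the same way for every dependency.
import client from "prom-client";

const depLatency = new client.Histogram({
  name: "dependency_request_duration_seconds",
  help: "Latency of calls to external dependencies",
  labelNames: ["dependency", "outcome"],
});

export async function callDependency<T>(
  dependency: string,
  url: string,
  timeoutMs = 2000,
): Promise<T> {
  const end = depLatency.startTimer({ dependency });
  try {
    // AbortSignal.timeout turns a hung dependency into an explicit error
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    if (!res.ok) throw new Error(`${dependency} responded with ${res.status}`);
    const body = (await res.json()) as T;
    end({ outcome: "success" });
    return body;
  } catch (err) {
    const timedOut = err instanceof Error && err.name === "TimeoutError";
    end({ outcome: timedOut ? "timeout" : "error" });
    throw err;
  }
}
```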

The vast majority of services depend on external data stores, either for persistence or for caching. Thus, depending on the kind of data store (e.g., managed NoSQL, self-hosted SQL database), consider the following set of metrics when implementing monitoring; a sketch of polling one such metric from AWS CloudWatch follows after the list:

  • Provisioned and used capacity
  • Throttling rate
  • Input/Output Operations per Second (IOPs)
  • CPU Utilization
  • Used memory and storage (total and percentage)
  • Number of DB connections
  • Replication Lag time or size
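For managed data stores, many of these metrics are already published by the provider. As one example, the sketch below polls DynamoDB throttling from AWS CloudWatch using the SDK v3; the table name and region are assumptions, and the same pattern applies to replication lag, IOPS, or CPU metrics.

```typescript
// Poll a managed data store's health metric (DynamoDB ThrottledRequests)
// from AWS CloudWatch. Region and table name are illustrative.
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "eu-west-1" });

async function throttledRequestsLastHour(tableName: string): Promise<number> {
  const now = new Date();
  const res = await cloudwatch.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/DynamoDB",
      MetricName: "ThrottledRequests",
      Dimensions: [{ Name: "TableName", Value: tableName }],
      StartTime: new Date(now.getTime() - 60 * 60 * 1000),
      EndTime: now,
      Period: 300, // one datapoint per 5 minutes
      Statistics: ["Sum"],
    }),
  );
  // Sum the 5-minute datapoints over the whole hour
  return (res.Datapoints ?? []).reduce((total, dp) => total + (dp.Sum ?? 0), 0);
}

throttledRequestsLastHour("orders").then((n) => console.log(`throttled: ${n}`));
```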

Additionally, for cloud services, your cloud platform provider itself is a critical dependency. Make sure you keep a close eye on its health dashboards.

Network Infrastructure

Finally, we come to the last pillar of service monitoring: the network.

From a service monitoring perspective, we are primarily interested in whether we hit a bandwidth limit or the maximum number of open connections. Both bottlenecks come in different flavours and can originate from different parts of your network infrastructure: the host, the load balancer, or the NAT gateway. Thus, make sure you know the limits of your hardware or IaaS provider and, where applicable, consider the following metrics for each networking device (a small throughput-sampling sketch follows after the list):

  • Open-File Descriptors in OS
  • In/Out Bits Per Sec
  • Active Connection Count
  • Load Balancer Spillover
  • Load Balancer Surge Queues
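Host-level throughput can be sampled without any agent on Linux, as in the sketch below; load balancer spillover and surge-queue metrics, by contrast, come from your provider (e.g., CloudWatch for AWS ELB). The /proc/net/dev parsing is a Linux-specific assumption.

```typescript
// Sample per-interface in/out bytes from /proc/net/dev twice and derive
// bits per second over the interval. Linux-only.
import { readFileSync } from "node:fs";

function readInterfaceBytes(): Map<string, { rx: number; tx: number }> {
  const result = new Map<string, { rx: number; tx: number }>();
  const lines = readFileSync("/proc/net/dev", "utf8").split("\n").slice(2);
  for (const line of lines) {
    const [name, data] = line.split(":");
    if (!data) continue;
    const fields = data.trim().split(/\s+/).map(Number);
    // field 0 is received bytes, field 8 is transmitted bytes
    result.set(name.trim(), { rx: fields[0], tx: fields[8] });
  }
  return result;
}

const before = readInterfaceBytes();
setTimeout(() => {
  const after = readInterfaceBytes();
  for (const [iface, b] of after) {
    const prev = before.get(iface);
    if (!prev) continue;
    const inBps = ((b.rx - prev.rx) * 8) / 10;
    const outBps = ((b.tx - prev.tx) * 8) / 10;
    console.log(`${iface}: in=${inBps.toFixed(0)} bit/s out=${outBps.toFixed(0)} bit/s`);
  }
}, 10_000);
```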

Additionally, consider integrating on-premises or cloud DDoS detection and mitigation services, such as AWS Shield or Azure DDoS Protection, which monitor and protect your network at multiple OSI layers against floods, reflective attacks, and resource exhaustion.

Conclusions

We have now covered the four pillars of successful service monitoring. The framework provided here is a minimum set of recommendations: it is a good foundation, but by no means exhaustive. For example, it does not address advanced topics such as real-time security policy monitoring or distributed application tracing and analysis (e.g., AWS X-Ray). Your next steps should be to build real-time dashboards around these metrics and to automate alarming based on thresholds and anomaly detection.

Don’t forget to clap and share the article if you found it interesting. You may also reach me on LinkedIn and Twitter.

