My Prometheus is Overwhelmed! Help!

Written by ryandawsonuk | Published 2021/07/24


Prometheus is an incredibly popular option for monitoring and time series data on kubernetes. Many developers simply install it and let it do its thing. So it can come as a shock when it starts getting overwhelmed.

Don’t panic! We’ll help you understand some of the cases you might be encountering and what your options are.

The ‘Setup and Forget Strategy’

The setup and forget strategy can take you a long way. But you might need to know a bit more if:

  • You have a high-volume use case (which you may not even realise).
  • You need the data to be there over long time periods (years or at least months).
  • You rely heavily on consuming the data in your apps.
  • Something breaks.

Prometheus and most other time-series databases work very differently from SQL databases. Let’s understand this better.

Arrgh! It’s Broken!

Let’s say you’ve figured out how to expose a metrics endpoint in your app running on kubernetes. You’ve built a grafana dashboard to monitor your app’s health or some other data and it looks pretty nice. Or maybe you’ve built your own UI that queries prometheus directly. All is good with the world… until it breaks.

Queries Become Slow

There can be various causes for slowness. First try to rule out the least interesting ones:

  1. If you are calling prometheus in your own code, are you closing and timing out your http connections to prometheus? (There’s a sketch of this after the list.)
  2. Have you allocated enough resources to prometheus? Try increasing them and replicating the load as a test (preferably in an environment with similar underlying hardware).
  3. Once you’ve checked CPU and RAM, also look at disk. It could be that the disk is getting full and prometheus is having to clean up (more on this in the ‘Data Disappearing’ section below).
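
On the first point, below is a minimal sketch of what ‘closing and timing out’ can look like if you call the prometheus HTTP API directly from Go. The address and query are placeholders; substitute your own.

  package main

  import (
      "fmt"
      "io"
      "net/http"
      "net/url"
      "time"
  )

  func main() {
      // A client-wide timeout stops a slow prometheus from tying up connections indefinitely.
      client := &http.Client{Timeout: 10 * time.Second}

      // Placeholder address and query.
      q := url.Values{"query": {"up"}}
      resp, err := client.Get("http://prometheus:9090/api/v1/query?" + q.Encode())
      if err != nil {
          fmt.Println("query failed:", err)
          return
      }
      // Always close the body, otherwise the underlying connection cannot be reused.
      defer resp.Body.Close()

      body, _ := io.ReadAll(resp.Body)
      fmt.Println(string(body))
  }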

The more interesting possibility is that growth in the data volume is causing the queries to slow down. This can happen surprisingly quickly. There could be a change to the systems producing the data, but apparently small changes can also have big effects.

The reason we often fail to anticipate how an increase in data will affect prometheus is that we forget to look at cardinality. Here’s an important but slightly complicated part of the prometheus documentation:

“Labels enable Prometheus's dimensional data model: any given combination of labels for the same metric name identifies a particular dimensional instantiation of that metric (for example: all HTTP requests that used the method POST to the /api/tracks handler). The query language allows filtering and aggregation based on these dimensions. Changing any label value, including adding or removing a label, will create a new time series.”

So the upshot is that every distinct value of every label significantly increases the work prometheus is doing. You have to consider not just every metric value at every scrape, but every combination of label values that is being applied. For this reason, it’s dangerous to autogenerate label names in your code. It’s also dangerous to use labels like userid, which can have many distinct values.
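
To make that concrete, here is a small sketch using the official Go client library (the metric and label names are invented). Each distinct combination of label values below becomes its own time series, so a label like userid would multiply the series count by the number of users.

  package main

  import "github.com/prometheus/client_golang/prometheus"

  var httpRequests = prometheus.NewCounterVec(
      prometheus.CounterOpts{
          Name: "myapp_http_requests_total",
          Help: "HTTP requests handled, by method and handler.",
      },
      // Two low-cardinality labels: a handful of methods x a handful of handlers.
      []string{"method", "handler"},
  )

  func main() {
      prometheus.MustRegister(httpRequests)

      // One series: {method="POST", handler="/api/tracks"}
      httpRequests.WithLabelValues("POST", "/api/tracks").Inc()

      // A different label value means a brand new series for prometheus to track.
      httpRequests.WithLabelValues("GET", "/api/tracks").Inc()

      // A userid label here would create one series per user - avoid it.
  }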

We should think of prometheus as a monitoring system rather than a database. If there’s a lot of variation in the labels, then the suggestion is to look at a database instead (more on this below).

If you hit this problem, there are ways to check the cardinality of your time series. You can run a sum(scrape_series_added) by (job) query. See the presentation slides ‘Containing Your Cardinality’ for more on this and how to reduce cardinality.
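
As a starting point, here are a couple of queries you can paste into the prometheus UI. The first is the one mentioned above; the second is a rough cardinality check I find useful (verify both against the docs for your version).

  # Series added per scrape job in recent scrapes:
  sum(scrape_series_added) by (job)

  # Top 10 metric names by number of series:
  topk(10, count by (__name__)({__name__=~".+"}))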

Prometheus Actually Crashes

Prometheus crashing could be an effect of one of the problems discussed above. Maybe it is out of memory from working too hard. Unless you’ve got better information (e.g. from the logs), I’d start by looking into the causes above.
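
If prometheus is running on kubernetes, a quick way to confirm an out-of-memory kill is to look at the pod’s last state and its previous logs. The namespace and pod name below are assumptions; substitute your own.

  # Find the pod (assuming a 'monitoring' namespace and a pod called prometheus-0):
  kubectl -n monitoring get pods | grep prometheus

  # 'Reason: OOMKilled' under Last State indicates an out-of-memory kill:
  kubectl -n monitoring describe pod prometheus-0 | grep -A 5 "Last State"

  # Logs from the previous (crashed) container instance:
  kubectl -n monitoring logs prometheus-0 --previous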

There’s other related failure behaviour that could happen for much the same reasons, like scrapes becoming slow or memory spikes.

Data Disappearing

Let’s say you have some query you run to show how many times something has happened. It keeps going up every time the event happens and all looks good… then the value mysteriously goes down. How can that happen?

Well, prometheus does not normally keep data forever. By default it has a retention period of 15 days (which you can configure). Note that’s global, not filtered by any particular type of data.

There’s also an option to tell prometheus how much disk space it should use. If the data starts taking up too much disk space, it will start deleting the oldest data.

This can be counter-intuitive if you approach prometheus like a traditional SQL database. It isn’t designed for long-term storage (although you can set it to retain data for very long periods if you have the space). It’s designed for monitoring, so it deals in transient data covering a constrained time window (you can think of it as “what’s going on now” data).

Out of Disk

Now we know that prometheus will delete data beyond the retention period by default. Newer versions of prometheus can also delete data when the allocated disk space is used up, though that’s not enabled by default at the time of writing (you have to configure it). So if you’ve not configured this, it’s possible to run out of disk.
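
For reference, both behaviours are controlled by startup flags on recent prometheus 2.x versions. The values below are only examples; check the documentation for your version.

  prometheus \
    --storage.tsdb.retention.time=30d \
    --storage.tsdb.retention.size=50GB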

Queries Exceed Maximum Data Points

If a single query would return too many data points, prometheus simply won’t fully execute it. Instead you’ll get a message back saying that the query exceeds the data points limit, which by default is 11,000.

Typically you’d need to be running a query over a pretty long time window to hit this problem. I hit it when trying to run a query over a period of months. But it can depend on how much data you are collecting and how dense it is (including how short your scrape interval is).
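
The number of data points is roughly the time range divided by the query step. For example, three months at a 10-minute step is about 13,000 points per series, just over the limit, while a 30-minute step brings it down to around 4,300. If you call the HTTP API yourself, the step is explicit (the host, query and dates below are placeholders).

  curl 'http://prometheus:9090/api/v1/query_range?query=up&start=2021-04-01T00:00:00Z&end=2021-07-01T00:00:00Z&step=30m'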

When I first hit this, my response was overkill. I wrote an exporter project that read metrics from prometheus, summed up data over intervals to make the gaps bigger (thus reducing the number of data points) and put the result back in as a new time series. (This is an unusual use of an exporter, which is normally used for scraping data on behalf of another service rather than pulling from prometheus and putting it back again.)

What I was doing is called downsampling: taking the data and restructuring it to increase the gaps between data points so that there are fewer data points overall. The easiest way to do this is usually a recording rule (which is what I realised afterwards). These are basically queries that you write into your prometheus config that create new time series from existing ones.
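
Here is a minimal sketch of a recording rule used for downsampling (the metric names are invented). A rule group with a long evaluation interval writes a pre-aggregated, sparser series that long-range queries can use instead of the raw one.

  groups:
    - name: downsample
      interval: 5m   # evaluate every 5 minutes, so the new series has 5-minute gaps
      rules:
        - record: job:myapp_events_total:sum
          expr: sum by (job) (myapp_events_total)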

Do I Really Have to Reduce My Data? Isn’t There An Easier Way?

What we’ve said so far basically amounts to:

  • Don’t have lots of dense data over a long period.

  • Especially don’t try to query lots of dense data over a long period.

  • If you must query a lot of data, look at restructuring your data to make it less dense.

So at this point you’re probably wondering ‘do I really have to reduce my data?’ It really depends on your situation. For anyone wondering whether there’s some tool out there that can make this easier, let’s take a look at some tools that either complement or replace prometheus and how they compare.

Extending Prometheus

High Availability

You might be thinking, “can’t I handle more data by running more instances of prometheus?” The answer is both yes and no.

With HA Prometheus, each prometheus instance handles some of the data. The recommended way to do this is ‘functional sharding’. This means that for each service being scraped, all of its data is handled by just one prometheus. Functional sharding is not the only way to shard data, but it is the simplest. It means each prometheus has a clear, dedicated remit.

Functional sharding is more about scaling the number of services. If you’ve got a single service producing a lot of data (e.g. too much data for your queries), then functional sharding in itself isn’t going to help you.

Prometheus data can also be sharded differently for high availability. When you split the data up, you need a way to put it back together for querying. This is achieved with federation. Basically, certain prometheus instances collect data from the other ones, so that the data is sufficiently consolidated that you’ll know which prometheus to query for it.

The prometheus documentation on federation suggests having some prometheus instances that just hold aggregated global data. Aggregation would be a way to reduce cardinality and make it possible to run queries that would otherwise hit limits. But setting this up still requires recording rules. So for reducing cardinality, it’s the recording rule that is doing the work, not the federation. However, this does not mean that recording rules are the only way.
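
For completeness, federation itself is just a scrape job against the /federate endpoint of the other prometheus instances. A minimal sketch (the hostnames and the match[] selector are placeholders):

  scrape_configs:
    - job_name: 'federate'
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job=~".+"}'   # which series to pull from the federated instances
      static_configs:
        - targets:
            - 'prometheus-shard-1:9090'
            - 'prometheus-shard-2:9090'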

Thanos

Thanos is another open source project in the CNCF. It complements prometheus and can be used to better scale prometheus. Its key points:

  • Adds a layer on top of prometheus for scaling.

  • Supports downsampling, long-term storage and aggregation of data from multiple prometheus instances.

  • Has several components, including prometheus sidecars, a compactor and a query module; it is quite heavyweight if all are used.

  • Can be tricky to install and properly test due to the number of components.

  • The query module supports PromQL.

For details on all these components, there’s a good overview article from AWS. The main idea is that thanos can help with long-term storage, query data point limits and federation. But it’s not a one-click solution. It has different components targeted at each concern.

If your key concern is individual queries hitting limits (which was the main issue I was facing), then the component of particular interest is the compactor. It can automatically downsample data so that queries can run over larger time horizons without hitting the max data points limit.
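
As a rough illustration (the flag names are from recent Thanos versions; check the docs for yours), the compactor runs against the object-store bucket and creates 5-minute and 1-hour downsampled versions of the raw data, with independent retention for each resolution.

  thanos compact \
    --data-dir=/var/thanos/compact \
    --objstore.config-file=bucket.yaml \
    --retention.resolution-raw=30d \
    --retention.resolution-5m=180d \
    --retention.resolution-1h=1y \
    --wait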

Thanos is not the only tool that can work with prometheus to help it scale. There are a number of tools that can take prometheus data and store it for the long-term and there’s a listing of them in the official prometheus docs.

There are also alternatives to prometheus out there. In fact, some tools can be used either with prometheus or instead of it. This can make the options confusing, so let’s try to clarify a bit.

The Time Series Databases Scene

This is a selective look at some time series databases. It is not comprehensive. My aim is to cover a selection that gives a good picture of the range of options and how to understand their approach and purpose.

One thing that confuses people about time series databases is that they’re not based around a standard like SQL or a single design philosophy like relational databases. Any database that works well for storing timestamp-value pairs, and for the associated uses of that data (e.g. monitoring), can count as a time series database.

InfluxDB

InfluxDB is designed as a time series database suitable for metrics. It can be an alternative to prometheus or it can be a backend for prometheus as long-term storage. If run on its own then it collects the data. If run with prometheus then prometheus collects the data and InfluxDB gets it from prometheus.
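
The ‘backend for prometheus’ mode typically works via prometheus’s remote write and remote read support. A minimal sketch against InfluxDB 1.8’s built-in prometheus endpoints (the hostname and database name are placeholders):

  remote_write:
    - url: "http://influxdb:8086/api/v1/prom/write?db=prometheus"
  remote_read:
    - url: "http://influxdb:8086/api/v1/prom/read?db=prometheus"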

Some key points on InfluxDB:

Elasticsearch

Elasticsearch is of course a document-based database and search engine so this one could be a surprise. But elasticsearch can also be used for time series data. And elastic can be used as long-term storage for prometheus.
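
One route I’m aware of for the long-term storage case is Metricbeat’s prometheus module, which can receive prometheus remote write data and index it into elasticsearch. A rough sketch, assuming Metricbeat is the intermediary (check the Metricbeat docs for the exact settings):

  # metricbeat.yml - listen for prometheus remote write on port 9201:
  - module: prometheus
    metricsets: ["remote_write"]
    host: "localhost"
    port: "9201"

  # prometheus.yml - forward samples to Metricbeat:
  remote_write:
    - url: "http://localhost:9201/write"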

Elastic can be used for time series natively, without prometheus. They have an example on CPU metrics that suggests it can do downsampling at query time. It can also ingest kube-state-metrics.

The main challenge for prometheus users interested in elastic is that elastic is not so well established for these use cases, so detailed examples can be tricky to find (at least at the time of writing, but if anyone has some then feel free to contact me, e.g. on twitter).

A comparison of elastic with influx suggests both can be used for time series, and that influx cannot be used for text (e.g. nlp use-cases, EFK log collection). This makes sense: elastic is a document database that can also do time series, whereas influx is specifically for time series.

TimescaleDB

TimescaleDB is relational and based upon postgres. Some key points about TimescaleDB:
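
To give a flavour of what the relational approach looks like (the table and column names below are invented), you store samples in a hypertable and downsample with ordinary SQL using time_bucket:

  -- One-off setup: turn a normal table into a time-partitioned hypertable.
  CREATE TABLE metrics (time TIMESTAMPTZ NOT NULL, name TEXT, value DOUBLE PRECISION);
  SELECT create_hypertable('metrics', 'time');

  -- Downsampling is just a GROUP BY on time buckets.
  SELECT time_bucket('5 minutes', time) AS bucket, name, avg(value)
  FROM metrics
  GROUP BY bucket, name
  ORDER BY bucket;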

Others

There are a lot more offerings in this space that we’ve not covered here such as Cortex, VictoriaMetrics, M3db, Graphite, Datadog and more. The above selection is intended to give a flavour of the variety of the space and help readers explore for themselves.

You Are Not Alone

If your prometheus gets overwhelmed, remember you are not alone. It is quite normal to hit limitations with prometheus, and there’s a whole space of tools to address this. There are even newly-emerging approaches that we’ve not touched on here (such as detecting how and when cardinality explosion happens).

There’s no single easy solution that works for all cases. You need to think about your situation and what matters most for you. My top tips to leave you with are:

  • Really explore the prometheus UI and PromQL. All the recording rules, what is being scraped and so on are there in the prometheus UI if you know where to find them.
  • Use slack groups to ask what others did. This article has benefited greatly from conversations on the kubernetes and data on kubernetes (DOK) slack groups.

