Alligator Is a Prometheus Monitoring Agent: Everything You Need to Know About It

Written by amoshi | Published 2024/03/25
Tech Story Tags: linux | prometheus | monitoring | devops | alligator | prometheus-monitoring-agent | application-server-metrics | what-is-statsd

TL;DR: Alligator is a distributed monitoring agent that reports on FreeBSD and Linux operating systems and ships metrics to ClickHouse, InfluxDB, and Prometheus, with OpenMetrics as its main interface. The typical Prometheus setup needs a specialized exporter for every service on a host (node_exporter, redis_exporter, sentinel_exporter, and so on); Alligator takes a different approach and bundles system, application, and push-model exporters into one smart agent.

Alligator is a distributed unit of infrastructure: a monitoring agent that reports on FreeBSD and Linux operating systems. It also supports shipping metrics to ClickHouse, InfluxDB, and Prometheus.

The main interface of Alligator is OpenMetrics, which ties in naturally with the spread of Prometheus as the monitoring system in many infrastructures. However, the Prometheus ecosystem has some problems.

First of all, specialized exporters. Every service on a host needs its own exporter, which is why you end up with many exporters per host (node_exporter, redis_exporter, sentinel_exporter, and so on).

Secondly, simplified agents. Many single-purpose agents (SNMP, blackbox, etc.) are deployed for different tasks, and the infrastructure needed just to run these collection daemons grows accordingly.

Finally, the pull model sometimes struggles with multi-worker applications and cron jobs: either nothing is there to answer when Prometheus scrapes the host, or results have to be synchronized between workers. That forces us to use statsd or Pushgateway.

Then the questions come up: how do you remove outdated metrics? How do you make such a service highly available?

Alligator's approach is different: it is a smart agent for Prometheus. It combines a system exporter and an application exporter, and it also provides a push model with Pushgateway, statsd, and Graphite protocol emulation.

Metrics Export

Alligator supports the following sets of metrics:

  • FreeBSD and GNU/Linux system metrics.
  • Filesystem metrics (file checksums, sizes, and access permissions).
  • Certificate metrics (PEM and P12).
  • Some popular applications (Elasticsearch, uwsgi, Druid, RabbitMQ, and others).
  • Ephemeral metrics: Alligator supports the Pushgateway, statsd, and Graphite protocols and can collect metrics from files on the filesystem. It also has a cluster mode for a centralized Pushgateway installation.
  • A simple JSON-to-metrics parser.
  • A simple HTTP/TCP checker.

Use Cases

Some of the supported technologies for collecting metrics are reviewed in the official documentation; the rest I describe here.

Application Server Metrics

First of all, Alligator can collect application server metrics. For instance, it can gather metrics from uwsgi instances with this configuration:

aggregate {
    uwsgi tcp://localhost:1717;
}

and restart the uwsgi server with the stats port exposed:

--stats 127.0.0.1:1717
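
If uwsgi is started from an ini file rather than the command line, the equivalent option should look like this (a minimal sketch, assuming an ini-based setup):

[uwsgi]
# expose the stats socket on the address Alligator polls above
stats = 127.0.0.1:1717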

Another example is a FastCGI application:

aggregate {
    php-fpm fastcgi://127.0.0.1:9000/stats?json&full;
}

and add this option to the php-fpm config:

pm.status_path = /stats

After this, you will have common metrics from the application server.
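
For Prometheus to actually scrape these metrics, Alligator also needs an entrypoint. The block below mirrors the entrypoint examples later in the article; port 1111 is simply the value reused throughout this text:

entrypoint {
    handler prometheus;
    tcp 1111;
}

Prometheus can then pull from this port directly.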

Application Metrics

I recommend using synchronization structures to aggregate metrics inside the application itself. Alternatively, you can run only one worker per instance and rely on an orchestration layer (Kubernetes, OpenNebula, and others).

Then use Alligator as a push server for those metrics; it has many collectors for this. Applications push their metrics to Alligator, and to keep different workers from interfering with the same metric, you can use one of two approaches (both ensure that metrics are collected from all workers of the application):

  • mark the metric with the worker ID:
    import uwsgi  # provided by the uWSGI runtime inside a worker
    from prometheus_client import Counter, push_to_gateway, CollectorRegistry

    registry = CollectorRegistry()
    c = Counter('my_requests_total', 'HTTP Failures', ['method', 'worker_id'], registry=registry)
    c.labels(method='get', worker_id=uwsgi.worker_id()).inc()
    c.labels(method='post', worker_id=uwsgi.worker_id()).inc()
    push_to_gateway('localhost:1111', job='myapp', registry=registry)

  • use the Counter metric type instead of Gauge when you can. The file collector cannot calculate increments, but the other collectors can when the metric type is a counter: you send only the increment value to Alligator, and Alligator calculates the sum, as in the sketch below.
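
For example, each worker could push only its delta to the pushgateway-style endpoint described later in the article (with metric_aggregation count enabled); the port and metric name here are placeholders:

# each worker sends only "+1"; Alligator sums the increments into one counter
echo 'my_requests_total{method="get"} 1' | curl --data-binary @- http://localhost:1111/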

More information about the differences between metric types can be found in the official Prometheus documentation.

What Metrics Do I Recommend to Measure in Applications?

First, measure any outgoing requests to external systems. In worker applications, these are often the reason workers overflow or query processing slows down. My recommended list of metrics:

  • count of requests to external services, grouped by request type
  • request time to external services (average, or better, quantiles/buckets), grouped by request type
  • status of requests (success or error), grouped by request type

It is even better if you can group not only by request type but also by the network layer of the request:

  • DNS resolution count, time, and status
  • connect count, time, and status
  • write/read count, time, and status

It is also important to measure internal application metrics:

  • incoming requests grouped by type
  • incoming request processing time, average (quantiles or buckets if possible), by type
  • status of incoming requests by type
  • cache status (hit/miss, if a cache exists)

You can certainly come up with a different list, but these are the most important metrics for revealing problems, and they give the core observability into application health.
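
As a rough illustration of this list, here is a minimal sketch using prometheus_client; the metric and label names, the target service, and the Alligator address are assumptions, not part of Alligator itself:

import time
import requests
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway

registry = CollectorRegistry()
ext_requests = Counter('external_requests_total', 'Outgoing requests',
                       ['target', 'status'], registry=registry)
ext_latency = Histogram('external_request_seconds', 'Outgoing request time',
                        ['target'], registry=registry)

def call_billing_api():
    # one outgoing request, measured by count, time, and status
    start = time.time()
    try:
        resp = requests.get('https://billing.example.com/api/health', timeout=5)
        status = 'success' if resp.ok else 'error'
    except requests.RequestException:
        status = 'error'
    ext_requests.labels(target='billing', status=status).inc()
    ext_latency.labels(target='billing').observe(time.time() - start)

call_billing_api()
push_to_gateway('localhost:1111', job='myapp', registry=registry)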

Container checks. If your application runs in a Podman or Docker environment, you can expose container statistics. Alligator's mechanics for collecting them are similar to cAdvisor's. To enable this, use the cadvisor option in the system context:

system {
    cadvisor;
}

Blackbox checks. This is a good way to monitor applications or external sources for reliability. Let’s check the resources on two ports every minute:

aggregate_period 60;
aggregate {
    blackbox http://127.0.0.1:80/check;
    blackbox https://127.0.0.1:443/check;
}

Alligator produces network and application response metrics for these resources (response code, response time, and certificate expiration date).

If the application has a page with JSON, alligator can also convert JSON to metrics:

aggregate {
    jsonparse https://jsonplaceholder.typicode.com/todos;
}

The next level of blackbox metrics is checking the process, the service, and the listening port:

system {
	services uwsgi.service;
	process uwsgi;
}


query {
	expr 'count by (src_port, process) (socket_stat{process="uwsgi", src_port="80"})';
	make socket_match;
	datasource internal;
}

And finally, launching external programs and proactive actions (like a watchdog):

aggregate {
	process 'exec:///usr/local/bin/mycheckscript.sh';
}

If the script can produce a metric that precisely indicates application downtime, this metric can be used for a proactive restart:

query {
    expr 'myapplication_status < 1';
    make app_is_down;
    datasource internal;
    action app_is_down;
}
action {
    name app_is_down;
    type shell;
    expr 'exec://systemctl restart uwsgi';
    datasource internal;
}
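
For this to work, /usr/local/bin/mycheckscript.sh has to print the metric in Prometheus text format. A minimal sketch (the health-check URL is an assumption, not part of Alligator):

#!/bin/sh
# print 1 if the application answers, 0 otherwise
if curl -sf -o /dev/null http://127.0.0.1:8080/health; then
    echo "myapplication_status 1"
else
    echo "myapplication_status 0"
fi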

System Metrics

Application metrics are important, but without the system metrics of your servers you won't be able to clearly understand what is happening in your system.

System metrics are a set of metrics that help us gain insight into the health of our systems. This is the basis for making decisions if something goes wrong.

Alligator with an empty configuration doesn't collect any metrics at all. System metrics are classified into the following categories:

  • base - CPU, memory, time, logical resources, and other common metrics about the server
  • disk - disk usage and I/O
  • network - interface statistics, network counters, socket statistics
  • smart - S.M.A.R.T. statistics
  • cpuavg - an analog of load average that Alligator computes from CPU load only
  • firewall - counters from the system firewall
  • process - process checks (uptime, ulimit usage, status, CPU, memory, and disk consumption)
  • packages - metrics listing the packages installed in the system
  • services - systemd service stats


This is configured in a block called “system” in alligator.conf:

system {
    base;
    disk;
    network;
    process;
    cpuavg period=5;
}

What about cron jobs? How can they be connected to Prometheus via Alligator?

There are several ways to do this. The simplest is to output metrics to a file:

aggregate {
    prometheus_metrics file:///var/lib/stats/metrics.txt state=save;
}

The 'state' parameter makes Alligator save its position in the file when it is stopped. This ensures that we can stop Alligator and start it again, and it will reread only the part of the file that has not been read yet.

If we want to reread the file from the start (every 10 seconds), we use state=begin. The third mode is to read only real-time data; to enable it, remove the state parameter from the configuration.
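
For example, a cron job could append its results to that file in Prometheus text format (the backup command and metric names here are hypothetical):

#!/bin/sh
# nightly job: run a backup and record its duration and exit code
START=$(date +%s)
/usr/local/bin/run_backup.sh
STATUS=$?
END=$(date +%s)
{
  echo "backup_duration_seconds $((END - START))"
  echo "backup_exit_code $STATUS"
} >> /var/lib/stats/metrics.txt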

Another way is to use statsd or the extended Pushgateway protocol (known as prom-aggregation-gateway). This only requires the following section in the configuration:

entrypoint {
    tcp 1111;
    metric_aggregation count;
}

For instance, take this script:

#!/bin/sh
TASKS=0
do_something() {
        echo 0          # "result" printed to stdout
        TASKS=$1
}

# command substitution runs do_something in a subshell, so TASKS stays 0 here
res=`do_something 1`
echo "amount_of_done_tasks {method=\"do_something\", status=\"done\"} $TASKS
task_exit_code {method=\"do_something\"} $res" | curl --data-binary @- http://localhost:1111/

This script will create two metrics in Alligator:

amount_of_done_tasks {status="done", method="do_something"} 0.000000
task_exit_code {method="do_something"} 0.000000

Sometimes we end up in a situation where the script no longer runs, but its metric still exists (because Alligator doesn't know that the application has stopped). To solve this, Alligator has a TTL mechanism. You can specify a lifetime in seconds per request:

curl -H "X-Expire-Time: 30" --data "living_thirty_seconds_metric 4" http://localhost:1111/

Or in the configuration file:

entrypoint {
    ttl 30;
    tcp 1111;
    metric_aggregation count;
}

The next option is the statsd protocol (UDP or TCP). The TCP statsd server is enabled by default; if you want UDP, specify the UDP port in the config:

entrypoint {
        ttl 30;
        tcp 1111;
        udp 1111;
}

And your job script might have a section like this:

from statsd import StatsClient  # use TCPStatsClient from the same package for TCP

# UDP statsd client pointed at Alligator's udp 1111 entrypoint
statsd = StatsClient(host='localhost',
                     port=1111,
                     prefix=None,
                     ipv6=False)

statsd.incr('baz')
statsd.incr('baz')

Alligator also supports the Graphite protocol and behaves like graphite_exporter.

Internet Web Server

If I have an internet web server, what options do I have?


First of all, HTTPS is the most popular protocol for web servers today, so we have to check certificate expiration times.

To check directories with certificates:

x509
{
	name nginx;
	path /etc/nginx/ssl/;
	match .crt;
}
x509
{
	name letsencrypt;
	path /etc/letsencrypt/live;
	match .pem;
}

Secondly, to check the nginx listening ports:

query {
	expr 'count by (src_port, process) (socket_stat{process="nginx", src_port="80"})';
	make socket_match;
	datasource internal;
}
query {
	expr 'count by (src_port, process) (socket_stat{process="nginx", src_port="443"})';
	make socket_match;
	datasource internal;
}

Check nginx -t to rule out the situation where nginx cannot start after a server reboot:

aggregate {
	process 'exec:///sbin/nginx -t';
}

To check the system, the web server process, and the firewall counters:

system {
    base;
    disk;
    network;
    cpuavg period=5;
    services nginx.service;
    process nginx;
    firewall ipset=on;
}

We also make the Alligator port accessible only from local networks:

entrypoint {
    handler prometheus;
    allow 127.0.0.1;
    allow 10.0.0.0/8;
    allow 172.16.0.0/12;
    allow 192.168.0.0/16;
    tcp 1111;
}

Alternatively, you can use the tcp <localIP>:<port> notation in the tcp directive.
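
For example, a minimal sketch that binds the scrape endpoint to the loopback interface only (reusing the directives shown above):

entrypoint {
    handler prometheus;
    tcp 127.0.0.1:1111;
}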

Statsd Metric Mapping

Plain statsd doesn't work with labels. To create labels from a statsd metric name, you must use a mapping with metric name templates that renames the metric and creates the labels. Because of this, I recommend using an extended statsd format such as DogStatsD (Alligator supports this extension too), which carries labels in the statsd protocol.
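
For illustration, a DogStatsD-style packet carries its labels after a # separator. A minimal sketch sending one by hand over UDP (the metric name, tags, and port 1111 are assumptions):

import socket

# DogStatsD wire format: <name>:<value>|<type>|#<tag>:<value>,...
packet = b"http_requests:1|c|#method:get,worker:3"
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ("localhost", 1111))
sock.close()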

However, some programs already emit pure statsd without labels. One such example is Airflow. You can catch metrics from Airflow like this:

airflow.ti.finish.upload_job.load_data.success:1|c
airflow.ti.finish.upload_job.other_data.success:1|c

where the name follows this template:

airflow.ti.finish.<job name>.<process name>:<increment by>|type

The mapping for such a metric name can be:

entrypoint {
    tcp 1111;
    udp 1111;
    mapping {
        template   airflow.ti.finish.*.*.*;
        label dag_id "$1";
        label task_id "$2";
        label state "$3";
        name airflow_ti_finish;
        match glob;
    }
}

This mapping will create the following metrics:

airflow_ti_finish {state="success", dag_id="upload_job", task_id="load_data"} 1.000000
airflow_ti_finish {state="success", dag_id="upload_job", task_id="other_data"} 1.000000

Statsd or Pushgateway Cluster

First, you could simply run a number of statsd or Pushgateway servers and proxy requests to them.

The first problem with this: if you want to use gauges, the calculations break. If a distributed application sends a metric with the same name and labels from several senders, each sender overwrites the gauge value of the others.

The first way to fix this is to attach distinguishing labels to each sender (such as hostname and/or worker_id/process_id).

But that is not all. The same gauge metric has another problem: a load balancer can route it to different machines. What do you do with samples that were sent at different times but now interfere with each other across servers?

10:12 http_request_time_avg {appid="43"} 12
10:14 http_request_time_avg {appid="43"} 22
10:55 http_request_time_avg {appid="43"} 122
11:58 http_request_time_avg {appid="43"} 1232

You can use nginx upstreams with a balancing method called hash:

upstream alligators {
    server srv1.example.com:1112;
    server srv2.example.com:1112;
    server srv3.example.com:1112;
    server srv4.example.com:1112;
    server srv5.example.com:1112;
    hash $remote_addr consistent;
}
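
The upstream by itself does nothing until something proxies to it; a minimal accompanying server block might look like this (the listen port is an assumption):

server {
    listen 1111;
    location / {
        proxy_pass http://alligators;
    }
}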

Then each backend will always push all of its metrics to the same server.

The other option is Alligator's built-in cluster mode.

Let's set it up!

First, we allocate three ports:

  • 1111 for Prometheus to scrape metrics
  • 1112 for replication
  • 80 for receiving metrics pushed by other applications

Imagine that our server is exposed to the internet. Then we need to secure our push ports. We can use the “return” directive with the “empty” argument so that the port never returns any answers to the outside.

We also know that browsers don't send POST requests to other domains unless CORS allows it.

Third, we add authentication for POST requests.

entrypoint {
   allow 127.0.0.1;
   allow 10.0.0.0/8;
   tcp 1111;
}
entrypoint {
   allow 10.0.0.0/8;
   tcp 1112;
   cluster replication;
   instance srv1.example.com:1112;
}
entrypoint {
   return empty;
   allow 0.0.0.0/0;
   ttl 300;
   tcp 80;
   metric_aggregation count;
   auth basic user:setme;
   auth_header Authorization;
   header access-control-allow-headers 'Authorization';
   header access-control-allow-methods POST;
   header access-control-allow-origin *;
}

That's good. We declare that we have a cluster called “replication” and set the current instance name. From this moment on, Alligator will serve its collected oplog to anyone from the 10.0.0.0/8 network on port 1112.

How do we tell Alligator to join the other instances of the cluster? For this, we use the “cluster” directive. We can specify the oplog size (the limit of in-memory metrics waiting to be synced to other instances) and distribute metrics by metric name (the sharding key in our case, though you can use any other label). The replication factor hides replicated metrics while their primary server is still running in the cluster.

If the primary server goes down, the hidden metrics are exposed and become available to Prometheus from the other instances.

cluster {
   name replication;
   size 10000;
   sharding_key __name__;
   replica_factor 2;
   type oplog;
   servers srv1.example.com:1112 srv2.example.com:1112 srv3.example.com:1112 srv4.example.com:1112 srv5.example.com:1112;
}
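
Prometheus then has to scrape every instance so that whichever node currently exposes a given metric is picked up. A hypothetical scrape job (assuming the metrics entrypoint from the earlier example on port 1111) could look like this:

scrape_configs:
  - job_name: alligator-cluster
    static_configs:
      - targets:
          - srv1.example.com:1111
          - srv2.example.com:1111
          - srv3.example.com:1111
          - srv4.example.com:1111
          - srv5.example.com:1111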

Check the External Sites Status

Blackbox mechanics for collecting information about network services have already been mentioned. Nevertheless, that is not all that alligator can be used for.

Alligator also has a puppeteer collector that loads external pages with all of their resources (that is, it can render the page DOM).

puppeteer {
    https://site1.example.com;
    https://site2.example.com;
    https://example.com;
}

These checks can measure the domain name lookup time as a difference of metrics:

puppeteer_domainLookupEnd - puppeteer_domainLookupStart

Connect time:

puppeteer_connectEnd - puppeteer_connectStart

Time to load DOM:

puppeteer_DomContentLoaded

Status of loading:

puppeteer_eventSourceResponseStatus

 And even errors in the browser console:

puppeteer_eventConsole

This is not all the puppeteer collector can do. However, this engine needs Node.js installed on the machine along with the puppeteer dependencies. Install them:

cd /var/lib/alligator/
npm i puppeteer
npm i ps-node-promise-es6

Conclusion

Monitoring systems is a really big challenge despite the number of tools available for it. You can choose between different monitoring systems, or between tools within one monitoring stack (such as VictoriaMetrics, Thanos, a set of exporters, and so on).

I believe the approaches outlined in this article are applicable to other monitoring tools as well; Alligator is simply a tool that makes them easier.

The main goal of any monitoring is to make your system observable. You need all of the measurable information from your systems to ensure they always operate correctly.


Written by amoshi | I am an open-source developer. I have been an SRE for the last 10 years, 5 of them as a team leader.
Published by HackerNoon on 2024/03/25