Backup & recovery of infrastructure services

Infrastructure should be boring, right? That doesn’t mean that developing tools for said infra can’t be exciting. One area where I believe there’s room for improvement is backing up (and restoring) critical infra services.

To provide some context: I’m talking about cloud native infrastructure, that is, distributed systems that typically manage containers of some sort. What all of these systems have in common is a distributed infrastructure component that stores state critical to their operation, such as configuration and metadata about leaders or workers:

DC/OS uses ZooKeeper, supervised by Exhibitor, for both its distributed kernel (Apache Mesos) and its services (Marathon, Jobs, Spark, Kafka, Cassandra, and so on).
Docker SwarmKit uses an internal Raft-based State Store.
Kubernetes uses etcd for persistent storage of all of its REST API objects.
Nomad uses an internal Raft-based consensus protocol (as well as a gossip protocol to manage cluster membership).

In any case, at some point you might want to take a snapshot of the content of such an infra service, be it to debug it or to keep a backup of a healthy state. This was my motivation to start work on a tool that I called burry, for _B_ack_U_p & _R_ecove_RY_ tool:

http://burry.sh

In a nutshell, at the time of writing burry lets you take a snapshot of the content of ZooKeeper or etcd and then do one of the following:

dump it to the screen, for example: burry --endpoint localhost:2181
store it to the local filesystem, for example: burry --endpoint etcd.mesos:1026 --isvc etcd --target local
store it in a remote storage system, for example:

burry --endpoint leader.mesos:2181 --target s3 --credentials play.minio.io:9000,AWS_ACCESS_KEY_ID=Q3AM3UQ867SPQQA43P2F,AWS_SECRET_ACCESS_KEY=zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG

Note: currently, you can use Amazon S3 and Minio as remote storage systems.
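
Conceptually, taking a snapshot of a hierarchical store such as ZooKeeper boils down to walking the tree and recording every path together with its value. The following is a minimal Python sketch of that idea only. It is not burry's actual code, it does not talk to a real ZooKeeper or etcd, and the names `TREE`, `snapshot()` and `dump()` are made up for illustration:

```python
import json

# Hypothetical in-memory stand-in for a hierarchical store such as ZooKeeper:
# every node carries a value and a dict of named children.
TREE = {
    "value": "",
    "children": {
        "marathon": {
            "value": "",
            "children": {
                "leader": {"value": "10.0.0.5:8080", "children": {}},
            },
        },
        "config": {"value": "v1", "children": {}},
    },
}


def snapshot(node, path="/"):
    """Walk the tree depth-first and return a flat {path: value} mapping."""
    result = {path: node["value"]}
    for name, child in sorted(node["children"].items()):
        # Avoid a double slash when the parent is the root path "/".
        child_path = path.rstrip("/") + "/" + name
        result.update(snapshot(child, child_path))
    return result


def dump(snap):
    """Serialize a snapshot to JSON, for printing or for writing to storage."""
    return json.dumps(snap, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(dump(snapshot(TREE)))
```

Once the content is flattened into a path-to-value mapping like this, the choice of target (screen, local filesystem, or remote storage) only decides where the serialized result ends up.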

I’m currently working on support for Azure and Google storage, as well as on restoring the state (that’s the recovery part ;).

What would you like to see next? Please let me know, either here or by raising an issue on GitHub.