Why the AWS, Azure, and GCP CLIs Need to Die

An argument for replacing data queries made through the AWS, Azure, and GCP CLIs with CloudGraph, a single GraphQL-powered CLI tool that co-locates resource data with insights and relationships.

Note: For the purpose of this article, I’ll focus mostly on AWS, but the same ideas apply to Azure and GCP.

AWS has done a wonderful job of building solutions that let engineers like you and me create systems to power our increasingly interconnected world. Over the last 15 years, products such as EC2, S3, RDS, and Lambda have fundamentally changed how we think about computing, storage, and databases.

With the proliferation of Kubernetes and Serverless in the last 5 or so years, AWS services have become increasingly abstracted from the racks of physical servers they run on. To end-users, everything on AWS is just an API, so we don’t necessarily need to know how Lambda Functions or EKS work under the hood to be able to use them for building applications. With a little documentation, API or console access, and a tutorial, anyone can create pretty much anything they need.

These abstractions have led to massive improvements in the overall convenience and breadth of AWS’ offerings. What was once a painstaking, time-consuming, and error-prone process of provisioning new servers, databases, or filesystems can now be done in seconds with just the click of a button or a deployment of IaC. Since everything is just an API abstraction, when AWS is ready to introduce a new “product,” they simply need to expose a new API – yes, I’m of course simplifying slightly :)

Anyone familiar with AWS knows that service APIs are almost always split into modular namespaces that contain dozens, if not hundreds, of separate API methods for single resources. For example, the EC2 service contains over 500 different API methods, with new ones added occasionally. Any company building substantial systems on AWS is likely using many, many different services.

While a masterpiece of datacenter architecture, this choice of hundreds of services and configuration options puts the burden of knowing how to properly use these services squarely on us engineers. As a result, we find ourselves having to constantly stay up to date and learn about all the service offerings and new changes. This takes a significant amount of time and mental energy. As developers, it can be difficult, time-consuming, and frustrating to use the AWS CLI to make 5 different API calls to describe, as an example, an ECS cluster, its services, task definitions, tasks, container definitions, etc. I often find myself lost in the documentation and having to use half a dozen APIs to get answers to questions like “What exactly is running in this VPC?”

This means that AWS can quickly feel overwhelming, even to seasoned cloud architects. While AWS is fantastic at building the actual services that power our businesses, not a lot of headway has been made in simplifying the day-to-day UX of querying these hundreds of services in a sane manner.

New solutions like the Cloud Control API have attempted to create a standardized interface for querying many different types of AWS resources. Unfortunately, the Cloud Control API’s usage is severely limited, and users still need to know how to correctly query their data. This means more time spent reading documentation and understanding how services work and are related to one another.

While the modularity of AWS APIs is a great logical organization system and does make sense, it’s a burden on end-users in terms of the cognitive overhead and learning curve. Having to remember how hundreds of constantly changing services work and are connected leads to caffeine addiction and time wasted playing detective.

Wouldn’t it be great if we, as DevOps/Cloud engineers, had a simpler way to get our data out of AWS? One that reflected our need to easily query any data about any service in any account without having to spend hours on docs or Stack Overflow?

About a year ago, my team and I got to thinking about what such a solution would look like and how we could build a single universal interface that would work across all your AWS accounts (and, for that matter, GCP, Azure, and K8s as well). We thought about how such an optimal solution would function. With this, three important but not currently supported capabilities came to mind.

The first idea is that data should include built-in relationships between related services. When you query an EC2 Instance, you should be able to easily understand information about its EBS volumes, the ASG to which it belongs, target groups, as well as its subnet and VPC. These “connections” between resources often tell a story of how a component is used as part of a larger system, which can be very useful in solving all manner of day-to-day operational issues.

The second key idea is that data should be co-located with insights. When you run aws ec2 describe-instances, it would be incredibly useful to also see information about billing, security, compliance, CloudWatch, etc. After all, the most fundamental unit of organization for any AWS service is the resource itself. For example, all the data that you could ever want to know about EC2, such as cost, compliance, and utilization metrics, belongs to one or more EC2 instances, so why not associate the resource with its full insight data? If we treat the resource as the primary “owner” of the data and begin to supplement that resource with additional data, we can vastly improve the cloud insight retrieval process.

The third and final capability is that there should be a single, open-source, standardized API and query interface that allows developers to query any kind of AWS data no matter the service. This API should extend across AWS accounts and even across cloud providers like Azure and GCP. This would allow for a single unified source of data. Just imagine how useful it would be to not have to switch profiles or log out of and then into different AWS/cloud accounts to query your data.

With these three concepts in mind, my team and I created CloudGraph, the free and open-source GraphQL API for AWS, Azure, GCP, and K8s that co-locates insights with data. Using the three guiding principles of connections, data-colocation/supplementation, and multi-environment access, engineers can use CloudGraph to query all their AWS Accounts from a single place. This simplifies things tremendously and will allow you to be significantly more productive.

Let’s take a look at three scenarios, one for each of the above points, where it is currently very difficult to get answers using the AWS CLI, but simple to do with CloudGraph.

Note: For the sake of this article, I’ll omit the technical details of how CloudGraph works under the hood, but feel free to check out https://github.com/cloudgraphdev/cli to learn more about CloudGraph and the powerful graph database that backs it.

Scenario #1) What assets are running in a VPC?

Today, if you were going to query all of the things that were running in a VPC, there would be dozens if not hundreds of API calls that you would need to make manually. This makes asset inventory an incredibly painful and time-consuming process. Say that you only knew the name but didn’t know the ARN or ID of a VPC. To start, you would probably need to run aws ec2 describe-vpcs to obtain a list of all the VPCs just to find the one that you want.

Now that you know the VPC ID, you would need to start querying all the things that could potentially be running inside that VPC. You might start to run queries like aws ec2 describe-instances and then for each instance aws ec2 describe-volumes. You might also start querying security groups aws ec2 describe-security-groups, NACLs aws ec2 describe-network-acls, and route tables aws ec2 describe-route-tables and so on (for brevity, I’ve simplified the actual calls here). Oh, and good luck with all the additional calls needed to understand relationships that are not explicit.
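
To make the fan-out concrete, here is a minimal Python sketch that only builds (and does not run) the per-resource-type CLI calls for a single VPC. The function names and the sample IDs are hypothetical, and the list of resource types is deliberately incomplete; the vpc-id and attachment.instance-id filters are the standard EC2 filter names.

```python
from typing import List


def vpc_inventory_commands(vpc_id: str) -> List[List[str]]:
    """Build (but do not run) one AWS CLI call per resource type scoped to a VPC."""
    vpc_filter = f"Name=vpc-id,Values={vpc_id}"
    resource_calls = [
        "describe-instances",
        "describe-subnets",
        "describe-security-groups",
        "describe-network-acls",
        "describe-route-tables",
    ]
    return [["aws", "ec2", call, "--filters", vpc_filter] for call in resource_calls]


def volume_commands(instance_ids: List[str]) -> List[List[str]]:
    """EBS volumes are looked up per instance via the attachment filter, not by VPC."""
    return [
        ["aws", "ec2", "describe-volumes", "--filters",
         f"Name=attachment.instance-id,Values={instance_id}"]
        for instance_id in instance_ids
    ]


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    commands = vpc_inventory_commands("vpc-0abc1234")
    commands += volume_commands(["i-0aaa", "i-0bbb"])
    for command in commands:
        print(" ".join(command))
```

Even this toy version needs one call per resource type plus one call per instance, and it still covers only EC2-namespace resources.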

At this point, 20 minutes have elapsed googling, reading documentation, and running queries, and we still only have a sliver of the services we need. The format of the data returned contains few relationships, so you’re either holding all that context in your head or, more likely, copying and pasting the output to create a literal map of resources that works for your purposes. If you’re like me, then chances are you don’t know all of the services that could even be running in a VPC, so you would need to first figure out how to figure that out!

As is true in most software engineering scenarios, whenever you find yourself banging your head on the wall, you’re probably not thinking about solving the problem in the best way. Rather than having an asset inventory of what is running in your VPC become a herculean task, let’s take a step back and think about how we can reconceptualize the approach to solving this problem.

The good news is that if we start thinking in terms of a graph and GraphQL, things become a whole lot easier. It turns out that GraphQL backed by a graph database is a beautiful and expressive way to understand relationships between entities. Because every node in a query can know about any number of other nodes, we can write queries like this one that traverse up or down between relationships at will to figure out what assets are running in a VPC:

query {
  queryawsVpc {
    id
    arn
    ipV4Cidr
    # Other attributes and connections here...
    alb {
      arn
      # Other attributes and connections here...
      vpc {
        id
        arn
        accountId
        ipV4Cidr
        state
        # Other attributes and connections here...
      }
      ec2Instance {
        arn
        # Other attributes and connections here...
      }
    }
  }
}

I've skipped the full example of how to get every resource here, but those who are curious can find it here: https://docs.cloudgraph.dev/vpc#re-kitchen-sink.

As you can see, CloudGraph makes it simple to “join” across service relationships: you can query a VPC, any ALBs that are deployed in it, the EC2 Instances targeted by those ALBs, and so on. You can even go from ALB back up to VPC in the same query!
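
The reason a query can walk both down and back up is that connections in a graph are not one-directional. The following is a toy in-memory sketch of that idea in Python; it is not how CloudGraph stores data, and the class, node types, and IDs are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (resource type, resource id)


class ResourceGraph:
    """Toy undirected graph: every connection can be walked in both directions."""

    def __init__(self) -> None:
        self._edges: Dict[Node, Set[Node]] = defaultdict(set)

    def connect(self, a: Node, b: Node) -> None:
        # Store the edge on both endpoints so either side can reach the other.
        self._edges[a].add(b)
        self._edges[b].add(a)

    def related(self, node: Node, resource_type: str) -> List[Node]:
        """Return the direct neighbours of `node` that have the given type."""
        return sorted(n for n in self._edges[node] if n[0] == resource_type)


graph = ResourceGraph()
vpc = ("vpc", "vpc-1")
alb = ("alb", "alb-1")
graph.connect(vpc, alb)
graph.connect(alb, ("ec2Instance", "i-1"))
graph.connect(alb, ("ec2Instance", "i-2"))

# Walk down: VPC -> ALB -> EC2 instances
albs = graph.related(vpc, "alb")
instances = graph.related(albs[0], "ec2Instance")
# Walk back up: ALB -> VPC
parents = graph.related(alb, "vpc")
```

Each nested selection in the GraphQL query above corresponds to one `related` hop in this sketch.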

It is important to note here that because we are using GraphQL to do this, everything is type-safe. This means that if you use one of the built-in query tools that CloudGraph ships with, like Altair or GraphQL Playground, all of your queries will not only be autocompleted as you type them, but there is also automatically generated documentation. Not only will you be able to quickly see all the possible things that could be in or related to a VPC, but you will also know if your query is valid before you even trigger the HTTP request to get your data!

Scenario #2) How is my EC2 Instance configured?

Note: To keep things simple, I'll leave third-party cloud insight tools out of this conversation.

While admittedly a contrived question, if I wanted to know how an EC2 Instance was configured, I could simply run aws ec2 describe-instances and find the instance that I’m looking for. Perhaps the raw EC2 metadata that’s returned is good enough for what I’m doing, but things get tricky when we start talking about needing additional data and insights related to security, compliance, billing, utilization, etc.

All those additional insights require additional CLI queries that you likely have to google the documentation for and then write/aggregate by hand. For billing, this would likely include queries to the Cost Explorer API. For compliance/security, it might mean queries to AWS Config (note that you still must set up Config/Security Hub in the first place), and for utilization, you will likely be using the aws cloudwatch get-metric-data query for CloudWatch.

As we have seen with relationships between resources, there is tremendous difficulty associated with gathering and aggregating all this supplemental data. This suggests that it might not be the optimal way to query our data. While I understand the importance of AWS itself having data modularity, I’d go so far as to say that we currently conceptualize insight data access backward, similar to how we used to believe the sun revolved around the earth.

While there is no arguing that EC2 resource metadata can be less important than, say, EC2 security insights, insights really revolve around a specific resource or resource configuration, not vice versa. In the case of potential secrets in an EC2 Instance’s user data, it is the offending EC2 instance and its user data configuration that is the source from which the potential security issue originates. Accordingly, from a hierarchical perspective, it makes sense to use the EC2 Instance as the container for all the insights that relate to it.

Using this paradigm, any time you access your EC2 Instance data, there is a wealth of information co-located with the resource metadata just waiting to be discovered. This is information that you get for free without needing to configure anything. Because CloudGraph is FOSS, this information could be extended and supplemented with any relevant data you have, including third-party tools or proprietary internal data specific to your company:

query {
  queryawsEc2 {
    id
    arn
    # Billing data
    dailyCost {
      cost
      currency
      formattedCost
    }
    # Utilization 
    cloudWatchMetricData {
      lastMonth {
        networkInAverage
        # Other attributes
      }
      lastWeek {
        networkInAverage
        # Other attributes
      }
      last24Hours {
        networkInAverage
        # Other attributes
      }
      last6Hours {
        networkInAverage
        # Other attributes
      }
    }
    # Security and compliance
    CISFindings {
      severity
      description
      ruleId
      result
    }
  }
}

Note: You can find more EC2 examples here: https://docs.cloudgraph.dev/ec2

This is much, much simpler than manually looking up how to query all of the different AWS services listed above to return EC2 Insight data. Remember, just as was the case with relationships in a VPC, thanks to the type-safe nature of GraphQL, every AWS service in CloudGraph has automatically generated documentation and autocompletion when it comes to writing queries. This means that you can always know everything you can access for a given resource, including what metadata attributes and supplemental data are available.
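
For comparison, here is roughly the kind of glue code you would otherwise write by hand to get the same resource-centric shape: a small Python sketch that attaches cost and findings to each instance record. The input structures and field names (resourceId, dailyCost, findings) are assumptions made for this example, not the output format of any AWS API or of CloudGraph.

```python
from typing import Any, Dict, List


def colocate_insights(
    instances: List[Dict[str, Any]],
    daily_costs: Dict[str, float],
    findings: List[Dict[str, str]],
) -> List[Dict[str, Any]]:
    """Attach cost and findings to each instance so the resource owns its insights."""
    # Group findings by the resource they refer to.
    findings_by_instance: Dict[str, List[Dict[str, str]]] = {}
    for finding in findings:
        findings_by_instance.setdefault(finding["resourceId"], []).append(finding)

    enriched = []
    for instance in instances:
        instance_id = instance["id"]
        enriched.append({
            **instance,
            "dailyCost": daily_costs.get(instance_id),  # None when no cost data exists
            "findings": findings_by_instance.get(instance_id, []),
        })
    return enriched


# Hypothetical sample data standing in for three separate service responses.
sample_instances = [{"id": "i-1"}, {"id": "i-2"}]
sample_costs = {"i-1": 1.25}
sample_findings = [{"resourceId": "i-2", "ruleId": "rule-a", "result": "FAIL"}]
merged = colocate_insights(sample_instances, sample_costs, sample_findings)
```

The point of the sketch is the shape of the result: one record per instance, with the supplemental data hanging off it.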

Oh, and by the way, if you want to just query the insights directly, that’s not a problem: you can always query all of the CIS compliance/security findings for your AWS accounts like this:

query {
  queryawsFindings {
    ruleId
    description
    result
    iamUser {
      name
    }
  }
}

And then you can get back all your billing data for all of your AWS accounts like this:

query {
  queryawsBilling {
    totalCostLast30Days {
      cost
      currency
      formattedCost
    }
    totalCostMonthToDate {
      cost
      currency
      formattedCost
    }
    monthToDate {
      name
      cost
      currency
      formattedCost
    }
    last30Days {
      name
      cost
      currency
      formattedCost
    }
    monthToDateDailyAverage {
      name
      cost
      currency
      formattedCost
    }
    last30DaysDailyAverage {
      name
      cost
      currency
      formattedCost
    }
  }
}

Scenario #3) How Does Staging Differ from Production?

Note: I won’t get into the configuration and setup details in this article, but feel free to read more about setting up CloudGraph for multiple accounts here: https://github.com/cloudgraphdev/cli#install

For many organizations, it’s important to have a staging environment that closely mirrors production to allow for everything from sales demos to full QA regression tests. Whatever the need, if you were to try to compare two environments, chances are you’re not only going to be writing lots of queries to compare the required resources, but you’re also going to have to change AWS credentials before you run your queries against different accounts.

Whether you’re doing this via aws sso, manually changing the profile being used, entering different roles/keys, or via some other mechanism entirely, the fact remains that it is annoying to essentially log in and out of different accounts to query your data. This annoyance increases linearly with the number of accounts you have (exponential annoyance in my case) and is amplified even further by the fact that most of us have at least some resources also running in GCP, Azure, or elsewhere.
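
A small Python sketch of that growth, assuming named CLI profiles (the profile names here are hypothetical): every call has to be repeated once per profile using the CLI’s --profile option.

```python
from itertools import product
from typing import List


def commands_per_profile(profiles: List[str], calls: List[str]) -> List[List[str]]:
    """Repeat every EC2 CLI call once per named profile (commands are built, not run)."""
    return [
        ["aws", "ec2", call, "--profile", profile]
        for profile, call in product(profiles, calls)
    ]


# Hypothetical profile names.
all_commands = commands_per_profile(
    ["staging", "production", "shared-networking"],
    ["describe-instances", "describe-vpcs"],
)
```

Three profiles and two calls already mean six invocations, each of which returns its own separate output to stitch together.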

There is, of course, the logical and obvious necessity for AWS, Azure, and GCP to have accounts/subscriptions/projects so users have, well, different places to store different resources owned by different people. But when multiple accounts owned by the same company are part of an OU that has a shared function (e.g., shared networking), it can make sense to query data from multiple accounts at once.

Rather than making you log in and out of your accounts, CloudGraph supports as many AWS accounts as you can throw at it, allowing you to avoid the headache of switching environments and running the same queries repeatedly. Let’s say you have 5 AWS accounts and want to query the EC2 Instances in all of them at once. With CloudGraph, here is what that query looks like:

query {
  queryawsEc2 {
    arn
  }
}

Want to query a single account?

query {
  queryawsEc2(filter: { accountId: { eq: "123456" } }) {
    id
    arn
  }
}
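
If you want to run these two queries from a script rather than a query tool, a minimal Python sketch might look like the following. It assumes the usual GraphQL-over-HTTP convention of POSTing a JSON body with a "query" key; the helper names are hypothetical, and the endpoint is left as a parameter because the URL of your local CloudGraph instance depends on your setup.

```python
import json
import urllib.request
from typing import Optional


def build_ec2_query(account_id: Optional[str] = None) -> str:
    """Return the EC2 query from the article, optionally limited to one account."""
    # json.dumps produces a double-quoted, escaped string literal for the account ID.
    args = (
        f"(filter: {{ accountId: {{ eq: {json.dumps(account_id)} }} }})"
        if account_id
        else ""
    )
    return f"query {{ queryawsEc2{args} {{ id arn }} }}"


def run_query(endpoint: str, query: str) -> dict:
    """POST the query as JSON to the given endpoint (the URL is an assumption you supply)."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The only difference between the all-accounts and single-account case is the optional filter argument; there are no credentials to swap in the calling code.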

Want to access your AWS EC2 Instances, Azure Virtual Machines, and GCP Virtual Machines in the same query for all your accounts, subscriptions, and projects?

query {
  queryawsEc2 {
    id
  }
  queryAzureVirtualMachine {
    id
  }
  queryGcpVirtualMachine {
    id
  }
}

Bonus: you can query everything in AWS account 123456 by tag:

query {
  queryawsTag(
    filter: { key: { eq: "Environment" }, value: { eq: "Production" } }
  ) {
    key
    value
    ec2Instance(filter: { accountId: { eq: "123456" } }) {
      id
      arn
    }
  }
}

Bonus #2: query everything in all your AWS accounts by tag:

query {
  queryawsTag(
    filter: { key: { eq: "Environment" }, value: { eq: "Production" } }
  ) {
    key
    value
    ec2Instance {
      id
      arn
    }
  }
}

As with the two previous scenarios, what was once a painful process for obtaining multi-account and even multi-cloud data now requires just a few lines of code using CloudGraph.

Conclusion

It’s interesting to think about what the future holds for multi-account and multi-environment cloud data queries. If we shift our thinking away from the current paradigm of API and data modularity to one where we can query any data about any service on any cloud, our lives as DevOps engineers will become much easier! We would spend less time looking at the documentation, holding burdensome context in our heads, and having to constantly stay up to date with new changes.

Just as Lambda made it easy for engineers to run code without worrying about managing servers, the DevOps/Cloud ecosystem is long overdue for a powerful API query abstraction like CloudGraph. After all, why shouldn’t we be able to query our cloud resources and solve the host of compliance, asset inventory, and billing issues we face every day in a saner fashion?

Check out CloudGraph to simplify your AWS, Azure, and GCP data access needs, and please let me know what you think of our solution. We’re committed to building the open-source GraphQL API for everything related to the cloud, and we are just getting started! For more information, please check out the CloudGraph site: https://www.cloudgraph.dev/ and the CloudGraph docs: https://docs.cloudgraph.dev/
