Exporting Your GKE Cluster to Terraform Cloud: A Guide with Challenges and Solutions

Written by arslanbekov | Published 2023/02/09
Tech Story Tags: terraform | using-terraform | google-cloud-platform | kubernetes | gke | infrastructure-as-code | devops-infrastructure | import-infrastucture

TL;DR: We moved from our own Kubernetes installation in GCP to Google's managed GKE and wanted the cluster described as code in Terraform Cloud. Terraformer is a CLI tool that exports existing GCP configurations into Terraform files. The problems we needed to solve were: ensuring the existing GKE cluster was not broken or taken down during the process, and creating an exact copy of the cluster in a way that could be easily repeated.

Hello everyone. In this article, I'll share our journey at ANNA Money from our own Kubernetes installation in GCP to Google's managed Kubernetes service, GKE. Initially, we migrated our test environments through the GCP web console, which was easy but not reproducible. Before migrating our production environment, we realized the importance of following the Infrastructure-as-Code (IaC) principle and documenting our setup as code.

The problems we needed to solve were:

  1. Ensuring that the existing GKE cluster was not broken or taken down during the process;
  2. Creating an exact copy of the cluster in a way that was idempotent and could be easily repeated.

The options available were to describe the cluster with either an Ansible module or Terraform.

The choice was based on personal preference, and Terraform was selected for two reasons:

  1. The use of Terraform Cloud to manage different environments;
  2. The Terraformer CLI tool, which allows exporting existing GCP configurations into Terraform files (including not just GKE but also networks, DNS and others).

Initially, a test GKE cluster configuration was exported using Terraformer with the following simple command:

terraformer import google --resources=gke --connect=true --regions=${REGION_GKE_CLUSTER} --projects=${PROJECT_ID}

By default, the Terraformer CLI creates folders for each exported service at the path /generated/{provider}/{service}. However, you have the option to modify the file structure through CLI options or move the files to another location.

If you are new to Terraform, you will need to download the corresponding provider plugin and place it in the proper location (the path below is for macOS):

$HOME/.terraform.d/plugins/darwin_amd64

After the import, we get the following file and folder structure:

terraform
└── test
    ├── backend.tf
    ├── container_cluster.tf
    ├── container_node_pool.tf
    ├── output.tf
    └── provider.tf

After the export process, a local backend was utilized, and a terraform.tfstate file was generated to store the current state of the GKE cluster. The cluster specifications were stored in the container_cluster.tf file, and node pool information in the container_node_pool.tf file.
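
For reference, the exported files are ordinary Terraform resources. Below is a trimmed sketch of the shape Terraformer typically produces; the resource names and values are illustrative, not our actual export:

# Illustrative only: a real export includes many more attributes.
resource "google_container_cluster" "test" {
  name               = "test-cluster"
  location           = "us-central1"
  initial_node_count = 1
}

resource "google_container_node_pool" "test_default_pool" {
  name       = "default-pool"
  cluster    = google_container_cluster.test.name
  location   = google_container_cluster.test.location
  node_count = 3

  node_config {
    machine_type = "e2-standard-4"
  }
}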

We used terraform plan to verify the exported configuration locally, made a few small modifications, and applied them successfully with terraform apply. As there were no issues, we then moved the configuration to the cloud through the Terraform Cloud web console: we created a new workspace and linked the exported configuration to a GitHub repository.

Workspaces are useful for organizing infrastructure, similar to build configurations in CI. They contain Terraform configuration, variable values, secrets, run history, and state information.

We updated the provider.tf file to pass secrets variables through the Terraform Cloud engine, and added the secrets to the workspace:

variable "google_service_account" {}

terraform {
  required_providers {
    google = {
      source      = "hashicorp/google"
      version     = ">=4.51.0"
    }
  }
}

provider "google" {
  project     = "{{YOUR GCP PROJECT}}"
  credentials = "${var.google_service_account}"
  region      = "us-central1"
  zone        = "us-central1-c"
}
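
Since this variable carries a service-account key, it is also worth declaring it as sensitive so that Terraform hides the value in plan and apply output. A small addition on top of the snippet above (requires Terraform 0.14 or newer):

variable "google_service_account" {
  description = "GCP service account key (JSON), supplied via Terraform Cloud workspace variables"
  type        = string
  sensitive   = true # hides the value in plan/apply output
}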

Unfortunately, sensitive data was accidentally leaked when the code was pushed to the target repository in the ./gke/test folder. The leak was in the Terraform state file, where the master_auth.0.cluster_ca_certificate property contained an encoded certificate. To resolve this, we deleted the state file from the repository, force-pushed, and became much more careful about what gets committed.

A new issue arose after the state file was deleted: Terraform Cloud produced a plan to recreate the existing cluster instead of recognizing that it already existed. This highlighted that simply moving the existing configuration to the cloud is not enough; the state must be migrated to the new backend first.

To perform the migration, the following steps were taken:

  1. Create an API user token;

  2. Create a .terraformrc file in the user's home directory and populate it as follows

    credentials "app.terraform.io" {
      token = "{{ token }}"
    }
    
  3. Change the backend configuration from the local backend to the remote backend

    terraform {
      backend "remote" {
        hostname     = "app.terraform.io"
        organization = "{{ ORG_NAME_IN_TERRAFORM_CLOUD }}"
        workspaces {
          name = "gke-test"
        }
      }
    }
    
  4. And run the terraform init command to transfer the state to the cloud.
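
As a side note, on Terraform 1.1 and later the same remote backend can also be expressed with the cloud block; a minimal sketch equivalent to the configuration above:

terraform {
  cloud {
    organization = "{{ ORG_NAME_IN_TERRAFORM_CLOUD }}"
    workspaces {
      name = "gke-test"
    }
  }
}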

Caution: if you have already run a plan in the remote workspace, it is recommended to re-create the workspace before running terraform init, because an improperly saved state can cause errors.

Once the steps were followed correctly, the result was a plan with no changes needed:

Plan: 0 to add, 0 to change, 0 to destroy.

Alternatively, the plan may show a few minor changes, but they should not affect the existing cluster:

+ min_master_version      = "1.24.8-gke.401"

It was crucial for us to avoid disrupting the functioning test environments, so we made sure that the applied modifications would not cause any issues.
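
As an extra safety net on top of careful plan review (this is not part of the exported configuration, just a pattern worth considering), Terraform's lifecycle block can refuse to plan a destroy of the cluster at all:

resource "google_container_cluster" "test" {
  # ... existing cluster arguments from the exported configuration ...

  lifecycle {
    prevent_destroy = true # any plan that would destroy this cluster fails instead of proceeding
  }
}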

In the States tab, two states were present: the original state and the state after the first queued plan.

The Runs tab showed the complete run history.

As a bonus, GitHub checks were integrated with pull requests (PRs).

Additionally, clicking the Details link showed the full expected plan, so we could confirm that everything was as intended.

Subsequently, we created the GKE cluster for the internal infrastructure using Terraform, bypassing the GCP UI. As previously discussed, our goal was to manage our infrastructure programmatically. To achieve this, we created another folder in our GitHub repo, copied and modified the configuration from a test cluster, and established a new workspace in Terraform Cloud. The workspace was properly configured by specifying the associated folder in the workspace settings.

Next, we set the folder up so that changes to it automatically trigger VCS runs.
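
We did this through the web console, but for reference the same workspace settings can also be described with the hashicorp/tfe provider. A rough sketch (the folder, repository, and variable names here are illustrative, not our actual values):

# Requires the hashicorp/tfe provider to be configured with a Terraform Cloud token.
variable "vcs_oauth_token_id" {} # OAuth token ID of the GitHub connection in Terraform Cloud

resource "tfe_workspace" "gke_infra" {
  name              = "gke-infra"
  organization      = "{{ ORG_NAME_IN_TERRAFORM_CLOUD }}"
  working_directory = "gke/infra"   # run Terraform from this folder
  trigger_prefixes  = ["gke/infra"] # only changes under this path trigger runs

  vcs_repo {
    identifier     = "{{ GITHUB_ORG }}/{{ REPO_NAME }}"
    oauth_token_id = var.vcs_oauth_token_id
  }
}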

We attempted to queue a plan and apply it; however, things didn't go as smoothly as we had hoped. We encountered an error:

Error: error creating NodePool: googleapi: Error 409: Already exists

on container_node_pool.tf line 1, in resource "google_container_node_pool" "anna-prod_default-pool":
   1: resource "google_container_node_pool" "anna-prod_default-pool" {

After opening the Google Cloud Platform console, we found that the default pool had already been created along with the cluster. We attempted to resolve the issue by deleting the cluster and recreating it, but that didn't work. We even tried removing the node pool manually, but still no luck. It turned out we had made the mistake of relying on our previous, already-working configuration instead of consulting the official documentation, which explains this situation in an interesting way:

# We can't create a cluster with no node pool defined,
# but we want to only use separately managed node pools.
# So we create the smallest possible default node pool
# and immediately delete it.
remove_default_node_pool = true
initial_node_count       = 1

We tried a different approach and opted not to define the default node pool as a separate resource, instead describing the node pool properties within the cluster configuration itself. This solution worked for us.
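
In other words, instead of a separate google_container_node_pool resource for the default pool, the pool settings live inside the cluster resource itself. A minimal sketch of that shape (names and sizes are illustrative):

resource "google_container_cluster" "infra" {
  name     = "infra-cluster"
  location = "us-central1"

  # The default node pool is described inline here rather than
  # as a separate google_container_node_pool resource.
  initial_node_count = 3

  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
  }
}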

To make things even better, we set up Slack integration through the Terraform Cloud web console and started receiving notifications.
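
This was also configured in the web console; for reference, the same notification can be described with the tfe provider, roughly like this (the webhook variable and the workspace reference are placeholders):

variable "slack_webhook_url" {
  sensitive = true # incoming webhook URL from the Slack app configuration
}

resource "tfe_notification_configuration" "slack" {
  name             = "slack-notifications"
  enabled          = true
  destination_type = "slack"
  url              = var.slack_webhook_url
  triggers         = ["run:completed", "run:errored", "run:needs_attention"]
  workspace_id     = tfe_workspace.gke_infra.id # the workspace from the earlier sketch
}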



What have we done?

We used the Terraformer CLI to export the existing GKE configuration. We then cleaned up and checked the exported configuration, created a workspace in Terraform Cloud, configured it, moved the state to it, and set up integrations with GitHub and Slack. We made sure not to break the existing cluster while testing more significant changes. We defined a new cluster in a different workspace within the same GitHub project and fixed the problems we hit by creating it through Terraform rather than the web console.

What have we finally got?

As a result, we now have infrastructure as code for our GKE clusters and a free, user-friendly cloud solution for our small team. Making infrastructure changes is now safe, secure, and predictable.

What could be better?

There are still limitations, such as notifications at the organization level in Terraform Cloud. We should also consider simplifying things further, for example by moving the GCP network configuration to Terraform as well.


Written by arslanbekov | Head of SRE @ANNA Money
Published by HackerNoon on 2023/02/09