
5 AWS/GCP Terraform Gotchas

A list of 5 tips and tricks to avoid Terraform gotchas with AWS and GCP
February 6, 2024

Programming is full of ups and downs. While the victories always feel great, there are also lots of tricky little things that end up sapping your time and energy. I feel a sense of responsibility to share those experiences in the hopes that it helps even just one fellow programmer.

Here's a list of things that make me go 🤦‍♂️

1. Lock your provider versions!

I know this might seem like an obvious one, but I'll reluctantly admit that I've been bitten by this on multiple occasions. The biggest lesson I've learned is to be fairly aggressive with version locking. I initially thought that preventing major version upgrades would be enough, but that isn't always the case.

The biggest issue here is that you cannot always downgrade a Terraform provider, since it's possible that your Terraform state is no longer compatible with the previous version. (This should only be possible with major version upgrades, but... 🤦‍♂️) I'd recommend going as far as locking the minor version as well. (The '~>' notation allows only the rightmost version component to increment.)

e.g.


terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.23.0"
    }
  }
}
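
For contrast, the looser constraint below would still allow any 5.x release to be installed (5.24, 5.30, ...), while "~> 5.23.0" above only permits patch releases within 5.23:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x, so minor upgrades can still sneak in
    }
  }
}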

2. AWS Dynamic listener rules

If you've ever tried to implement dynamic listener rules for an AWS Application Load Balancer, you may have run into an error like "Error creating LB Listener Rule: PriorityInUse: Priority X is currently in use".

This would typically happen when reordering rules. e.g. going from:


[ruleA(p0), ruleB(p1), ruleC(p2)]

to


[ruleA(p0), ruleAA(p1), ruleB(p2), ruleC(p3)]

I've seen all kinds of "clever" solutions that involve tracking priorities and incrementing them to avoid the issue, but they're usually pretty complicated and error-prone.

The easiest way to handle this is to name your listener rule resources based on their priority. e.g.


resource "aws_lb_listener_rule" "rule_p0" {

This way the rules are updated in place (e.g. in the example above, p1 goes from ruleB to ruleAA), and the priorities never change, so there's no priority conflict.

N.B. It's probably a good idea to remove any lifecycle arguments like `create_before_destroy` from these resources as well, since creating the replacement rule before destroying the old one is impossible without changing the priority.
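
For concreteness, here's a minimal sketch of that pattern using for_each, with the rules keyed by priority so each state address stays tied to its priority slot. The listener, target groups, and path patterns are hypothetical placeholders:

locals {
  # Hypothetical rule map, keyed by priority. Adding a rule means picking an
  # unused priority; existing entries are never renumbered. (ALB priorities start at 1.)
  listener_rules = {
    "1" = { path = "/api/*",    target_group_arn = aws_lb_target_group.api.arn }
    "2" = { path = "/assets/*", target_group_arn = aws_lb_target_group.assets.arn }
  }
}

resource "aws_lb_listener_rule" "by_priority" {
  for_each     = local.listener_rules
  listener_arn = aws_lb_listener.http.arn
  priority     = tonumber(each.key)

  action {
    type             = "forward"
    target_group_arn = each.value.target_group_arn
  }

  condition {
    path_pattern {
      values = [each.value.path]
    }
  }
}

Because the state key for each rule is its priority, moving a path to a different slot is an in-place update of that priority rather than a create/destroy cycle that trips PriorityInUse.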

3. AWS ECS speedy healthchecks for non-prod envs

This one is less of a gotcha and more of a quality-of-life tip. ECS deployments (with default health check settings) typically take a long time, and if you're deploying a preview or staging environment, that wait adds up quickly. This is especially painful if you're deploying many times a day.

You can make some minor changes to your health check configuration and enjoy faster deploys, without completely forfeiting stability, while still allowing for slow startups:


resource "aws_alb_target_group" "example_target_group" {
    name = "example-target-group"
    port = 80
    protocol = "HTTP"
    vpc_id = "${aws_vpc.my_vpc.id}"
    target_type = "ip"
    health_check {
        path = "/health"
        matcher = "200-399"
        interval = 10
        healthy_threshold = 2
        unhealthy_threshold = 9
    }
    lifecycle {
        create_before_destroy = true
    }
}

With the above configuration, health checks can pass in as little as 20 seconds (2 checks at a 10-second interval). If your application has a tendency to occasionally start slowly, you'd still get up to 90 seconds before health checks are considered failed.

4. Reducing GCP URL map size

This one is probably not super common but you may have run into this if you have a lot of microservices or many applications/environments under the same domain.


Error creating UrlMap: googleapi: Error 409: 'URL_MAP' error Resource error code: 'SIZE_LIMIT_EXCEEDED' Resource error message: 'Size of the url map exceeds the maximum permitted size'.

Using a second URL map might be an option, but I'm not sure it's possible without using a second domain, and if there is a way it's beyond me. I did find that I had a lot of rules that were redundant, though, and it wasn't as obvious as I'd have thought.

Here's an example url map configuration:


# This is not a complete example; it's just meant to show the path matchers/rules
resource "google_compute_url_map" "example-map" {
  name        = "exampleapp-urlmap"

  default_service = google_compute_backend_service.example_service.id

  host_rule {
    hosts        = ["public-assets.my-app-domain.com"]
    path_matcher = "public-assets"
  }

  path_matcher {
    name            = "public-assets"
    default_service = google_compute_backend_bucket.example_bucket.id

    path_rule {
      paths   = ["/*"]
      service = google_compute_backend_bucket.example_bucket.id
    }
  }

  path_matcher {
    name            = "example-backend-service"
    default_service = google_compute_backend_service.ex_backend_svc.id

    path_rule {
      paths   = ["/api/", "/api/*"]
      service = google_compute_backend_service.ex_backend_svc.id
    }
  }
}

First, the path_rule for the "public-assets" path_matcher is unnecessary and creates a duplicate rule. When you set a default_service for a path_matcher, the result is a rule with paths = ["/*"] anyway. So that path matcher could have just been:


  path_matcher {
    name            = "public-assets"
    default_service = google_compute_backend_bucket.example_bucket.id
  }

Next is the path_rule for "example-backend-service". That rule ultimately generates two path rules in GCP. This is unnecessary because "*" matches zero or more characters, so one of the rules is redundant and the path matcher would accomplish the same if it were:


  path_matcher {
    name            = "example-backend-service"
    default_service = google_compute_backend_service.ex_backend_svc.id

    path_rule {
      paths   = ["/api/*"]
      service = google_compute_backend_service.ex_backend_svc.id
    }
  }

5. GKE NEGs (Network endpoint groups)

To help handle routing with GKE services, Google has an awesome way of setting up standalone zonal NEGs alongside Kubernetes services.

Basically, you set some annotations on your Kubernetes service, and GCP takes care of setting up and managing the NEGs for you.

e.g.


resource "kubernetes_service" "myexampleservice" {
    metadata {
        name = "myexampleservice-svc"
        namespace = "myexampleservice-deploy"
        annotations = {
            "cloud.google.com/neg": jsonencode({
                exposed_ports = {
                    80 = {
                        name = "myexampleservice-neg"
                    }
                }
            })
        }
    }
}

Then, GCP automatically adds additional annotations to the k8s service that contain the names of the NEGs that are created as well as the zones that they were created in.
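
For reference, the value of the "cloud.google.com/neg-status" annotation that GCP adds (and that the data sources below decode) looks roughly like this; the zone names here are just illustrative:

{
  "network_endpoint_groups": { "80": "myexampleservice-neg" },
  "zones": ["us-central1-a", "us-central1-b", "us-central1-c"]
}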

The issue here is that we're no longer managing the NEGs via Terraform, yet there are resources we do manage that depend on the NEGs. (In my case it was a "google_compute_backend_service".)

Accessing the annotations via the "kubernetes_service" resource doesn't work because we're not able to see the additional annotations that were added by GCP.

The solution is to add a data source for the kubernetes_service we just created. Surprisingly, this lets us see the added annotations:

e.g.


data "kubernetes_service" "myexampleservice" {
    metadata {
        name = "myexampleservice-svc"
        namespace = "myexampleservice-deploy"
    }

    depends_on = [kubernetes_service.myexampleservice]
}

data "google_compute_network_endpoint_group" "myexampleservice-neg" {
    count = data.kubernetes_service.myexampleservice.metadata != null && data.kubernetes_service.myexampleservice.metadata[0].annotations != null ? length(jsondecode(data.kubernetes_service.myexampleservice.metadata[0].annotations["cloud.google.com/neg-status"])["zones"]) : 0
    name = "myexampleservice-neg"
    zone = jsondecode(data.kubernetes_service.myexampleservice.metadata[0].annotations["cloud.google.com/neg-status"])["zones"][count.index]

    depends_on = [
        kubernetes_service.myexampleservice,
        data.kubernetes_service.myexampleservice
    ]
}

resource "google_compute_backend_service" "myexampleservice" {
    name      = "myexampleservice"
    protocol  = "HTTP"
    port_name = "http"
    timeout_sec = 30
    log_config {
      enable = true
      sample_rate = "1.0"
    }

    dynamic "backend" {
      for_each = data.kubernetes_service.myexampleservice.metadata[0].annotations != null ? range(length(jsondecode(data.kubernetes_service.myexampleservice.metadata[0].annotations["cloud.google.com/neg-status"])["zones"])) : []
      content {
        group = data.google_compute_network_endpoint_group.myexampleservice-neg[backend.value].id
        balancing_mode = "RATE"
        max_rate = 1000
      }
    }
    health_checks = [google_compute_health_check.myexampleservice.id]
}

Recap

Thanks for reading! I hope this helps save you some wasted effort, coffee, and aspirin. And please excuse my shameless plug for what we're building over at Coherence: we're doing this kind of work across the development lifecycle so you can focus on what really matters, your actual application and business logic.