
Terraform on AWS: Infrastructure as Code That Actually Scales

Practical Terraform patterns for managing AWS infrastructure at scale — state management, module design, workspace strategies, and the mistakes I made so you don't have to.

#terraform #aws #devops #infrastructure #cloud

Introduction

I’ve managed AWS infrastructure through the AWS Console, then through CloudFormation, and eventually through Terraform. Each step was an improvement, but Terraform with proper practices is in a different league — especially when multiple engineers are working on the same infrastructure.

This post covers the patterns I’ve settled on after running Terraform in production across several projects. Not the toy examples — the stuff that matters when infrastructure is large, shared, and business-critical.

Why Terraform Over CloudFormation?

CloudFormation is fine for AWS-only shops with simple needs. But Terraform has a few meaningful advantages:

  • Provider ecosystem. One tool manages AWS, Cloudflare, GitHub, PagerDuty, and more. Infrastructure really is just code.
  • Plan before apply. terraform plan shows exactly what will change before anything happens. CloudFormation’s change sets are slower and less readable.
  • State is explicit. You know what Terraform manages. CloudFormation drift detection is a separate, slower step.
  • HCL is more readable. Opinions vary, but I find HCL easier to review than YAML/JSON CloudFormation templates, especially for complex resources.
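To make the provider-ecosystem point concrete, here is a sketch of one configuration driving both AWS and Cloudflare. The resource names, variables, and zone are hypothetical; check the Cloudflare provider docs for the exact record attributes in your pinned version.

```hcl
terraform {
  required_providers {
    aws        = { source = "hashicorp/aws" }
    cloudflare = { source = "cloudflare/cloudflare" }
  }
}

# An AWS load balancer and the Cloudflare DNS record that points at it,
# managed side by side in one plan (names are illustrative)
resource "cloudflare_record" "api" {
  zone_id = var.cloudflare_zone_id
  name    = "api"
  type    = "CNAME"
  value   = aws_lb.api.dns_name
}
```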

State Management: Get This Right First

Terraform state is the source of truth about what infrastructure exists. Storing it locally is fine for experiments — terrible for anything shared.

Remote state in S3 with DynamoDB locking:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "production/api/terraform.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

The DynamoDB table prevents two engineers from running terraform apply simultaneously and corrupting state. The S3 bucket should have versioning enabled — if state gets corrupted, you can roll back.

Create the backend resources before using them. Bootstrap the S3 bucket and DynamoDB table with a small separate Terraform configuration (or via the AWS CLI). You can’t use Terraform to create its own backend on the first run.
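A minimal bootstrap configuration might look like the following, run once with local state. Bucket and table names are illustrative; the lock table's hash key must be named `LockID`, which is what the S3 backend expects.

```hcl
# bootstrap/main.tf — applied once with local state, before any
# other configuration points its backend at these resources
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-company-terraform-state"
}

# Versioning lets you roll state back if it gets corrupted
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend requires this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```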

Module Structure

Flat Terraform files work for small projects. For anything larger, modules are essential for reuse and separation of concerns.

My typical structure:

infrastructure/
├── modules/
│   ├── ecs-service/       # Reusable ECS service module
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── rds-postgres/
│   └── lambda-function/
├── environments/
│   ├── production/
│   │   ├── main.tf        # Composes modules
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── staging/
│       ├── main.tf
│       └── terraform.tfvars
└── global/
    ├── ecr.tf             # Resources shared across environments
    └── iam-roles.tf

A reusable ECS service module:

# modules/ecs-service/variables.tf
variable "service_name" { type = string }
variable "image_uri"    { type = string }
variable "cpu" {
  type    = number
  default = 256
}
variable "memory" {
  type    = number
  default = 512
}
variable "desired_count" {
  type    = number
  default = 2
}
variable "environment_variables" {
  type    = map(string)
  default = {}
}
# modules/ecs-service/main.tf
resource "aws_ecs_task_definition" "this" {
  family                   = var.service_name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn

  container_definitions = jsonencode([{
    name  = var.service_name
    image = var.image_uri
    environment = [
      for k, v in var.environment_variables : { name = k, value = v }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/${var.service_name}"
        "awslogs-region"        = data.aws_region.current.name
        "awslogs-stream-prefix" = "ecs"
      }
    }
  }])
}
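The excerpt above shows only the task definition; the module also needs an `aws_ecs_service` to actually run it. A sketch, where `cluster_arn`, `subnet_ids`, and `security_group_ids` are assumed module variables not shown above:

```hcl
# modules/ecs-service/main.tf (continued) — runs the task definition
# as a Fargate service; networking inputs are assumed variables
resource "aws_ecs_service" "this" {
  name            = var.service_name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }
}
```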

Consuming it from an environment:

# environments/production/main.tf
module "api_service" {
  source = "../../modules/ecs-service"

  service_name  = "api"
  image_uri     = "${var.ecr_repo_url}:${var.image_tag}"
  cpu           = 512
  memory        = 1024
  desired_count = 3

  environment_variables = {
    ENVIRONMENT = "production"
    LOG_LEVEL   = "INFO"
  }
}

This pattern means changes to the ECS service definition propagate to all environments that use the module — update once, test in staging, promote to production.

Handling Secrets

Never put secrets in .tfvars files or hardcode them in Terraform. The pattern I use:

# Create the secret shell in Terraform (the value is managed separately)
resource "aws_secretsmanager_secret" "db_password" {
  name = "production/api/db-password"
}

# Reference the secret ARN as an ECS task environment variable
resource "aws_ecs_task_definition" "api" {
  # ...
  container_definitions = jsonencode([{
    # ...
    secrets = [{
      name      = "DATABASE_PASSWORD"
      valueFrom = aws_secretsmanager_secret.db_password.arn
    }]
  }])
}

The actual secret value is set via the AWS CLI or Console — not Terraform. Terraform manages the resource; the sensitive value lives only in Secrets Manager and never touches version control or Terraform state.
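One detail worth making explicit: ECS pulls container secrets using the task execution role, so that role needs permission to read the secret. A sketch, assuming the execution role is named `execution` as in the earlier module excerpt:

```hcl
# Allow the ECS task execution role to fetch the secret at task start.
# The role name "execution" is an assumption from the module above.
resource "aws_iam_role_policy" "read_db_password" {
  name = "read-db-password"
  role = aws_iam_role.execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = [aws_secretsmanager_secret.db_password.arn]
    }]
  })
}
```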

Workspaces vs Separate State Files

Terraform workspaces let you manage multiple environments from one configuration by running terraform workspace select staging before a plan or apply. I’ve used them, and I don’t recommend them for production.

The problem: all workspaces share the same backend configuration but use different state files under the hood. When something goes wrong, it’s easy to accidentally apply staging changes to production because the configuration is identical and you forgot to switch workspaces.

Separate directories with separate state files (as shown in my structure above) are more explicit and harder to get wrong by accident. The extra boilerplate is worth it.

Plan Reviews in CI

terraform apply should never run from an engineer’s laptop in production. Our GitLab CI pipeline:

terraform-plan:
  stage: plan
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan
    expire_in: 1 hour

terraform-apply:
  stage: apply
  script:
    - terraform apply tfplan
  when: manual  # Requires a human to click "Run" in CI
  only:
    - main

The manual gate means a planned change sits in CI until an engineer reviews the plan output and clicks apply. This is the closest thing to a safety net you get with infrastructure changes.

Importing Existing Resources

If you’re adopting Terraform on an existing AWS account (most real-world scenarios), you’ll need to import resources Terraform didn’t create:

terraform import aws_s3_bucket.assets my-company-assets-bucket

After importing, Terraform knows about the resource and will manage it going forward. Before running apply, always run plan to confirm Terraform isn’t trying to change anything unexpected on the imported resource.
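If you are on Terraform 1.5 or later, an alternative to the CLI command is a declarative import block, which runs through the normal plan/apply flow and shows the import in the plan output:

```hcl
# Terraform 1.5+ import block — reviewed and applied like any
# other change instead of mutating state from the CLI
import {
  to = aws_s3_bucket.assets
  id = "my-company-assets-bucket"
}

resource "aws_s3_bucket" "assets" {
  bucket = "my-company-assets-bucket"
}
```

Once applied, the import block can be deleted; the resource is in state like any other.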

The Mistakes I Made

Mistake 1: Not pinning provider versions. Terraform providers release breaking changes. Without version pinning, terraform init can pull a new provider version and your plan suddenly shows unintended diffs or errors.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Minor version updates only
    }
  }
}

Mistake 2: Putting too much in one state file. When all infrastructure is in one Terraform state, every plan loads the entire state, every apply can potentially touch anything, and state locking blocks all other engineers while one apply is running. Split state by service boundary.
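Split states still need to share a few values (VPC IDs, subnet IDs). One way is the terraform_remote_state data source, which reads another state's outputs; bucket and key here are illustrative, and the output name assumes the network state exports it:

```hcl
# Read outputs from the (hypothetical) network state
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "my-company-terraform-state"
    key    = "production/network/terraform.tfstate"
    region = "ap-southeast-1"
  }
}

# Used elsewhere, e.g.:
# subnets = data.terraform_remote_state.network.outputs.private_subnet_ids
```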

Mistake 3: Not using lifecycle blocks for critical resources. S3 buckets and RDS instances should never be accidentally destroyed. prevent_destroy is a guardrail:

resource "aws_db_instance" "main" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}

terraform destroy will error on this resource. A deliberate deletion requires removing the lifecycle block first — giving you a moment to reconsider.
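A related lifecycle guardrail worth knowing: ignore_changes stops Terraform from reverting fields that are legitimately managed outside Terraform. A common case (sketch, resource name illustrative) is an ECS service whose desired_count is driven by autoscaling:

```hcl
resource "aws_ecs_service" "api" {
  # ...
  lifecycle {
    # Autoscaling adjusts desired_count at runtime; without this,
    # every apply would reset it to the configured value
    ignore_changes = [desired_count]
  }
}
```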

Conclusion

Terraform done well is almost invisible — infrastructure changes are predictable, reviewable, and repeatable. Getting there requires discipline around state management, module design, and CI/CD integration.

Start with remote state. Structure your modules around reusable primitives. Run plans in CI with a manual approval gate for applies. Pin your provider versions. And add prevent_destroy to anything you’d lose sleep over.

Infrastructure as code is only as good as the practices around it.

Kaikobud Sarkar

Software engineer passionate about backend technologies and continuous learning. I write about Python frameworks, cloud architecture, engineering growth, and staying current in tech.