Terraform on AWS: Infrastructure as Code That Actually Scales
Practical Terraform patterns for managing AWS infrastructure at scale — state management, module design, workspace strategies, and the mistakes I made so you don't have to.
Introduction
I’ve managed AWS infrastructure through the AWS Console, then through CloudFormation, and eventually through Terraform. Each step was an improvement, but Terraform with proper practices is in a different league — especially when multiple engineers are working on the same infrastructure.
This post covers the patterns I’ve settled on after running Terraform in production across several projects. Not the toy examples — the stuff that matters when infrastructure is large, shared, and business-critical.
Why Terraform Over CloudFormation?
CloudFormation is fine for AWS-only shops with simple needs. But Terraform has a few meaningful advantages:
- Provider ecosystem. One tool manages AWS, Cloudflare, GitHub, PagerDuty, and more. Infrastructure really is just code.
- Plan before apply. terraform plan shows exactly what will change before anything happens. CloudFormation’s change sets are slower and less readable.
- State is explicit. You know what Terraform manages. CloudFormation drift detection is a separate, slower step.
- HCL is more readable. Opinions vary, but I find HCL easier to review than YAML/JSON CloudFormation templates, especially for complex resources.
State Management: Get This Right First
Terraform state is the source of truth about what infrastructure exists. Storing it locally is fine for experiments — terrible for anything shared.
Remote state in S3 with DynamoDB locking:
# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "production/api/terraform.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
The DynamoDB table prevents two engineers from running terraform apply simultaneously and corrupting state. The S3 bucket should have versioning enabled — if state gets corrupted, you can roll back.
Create the backend resources before using them. Bootstrap the S3 bucket and DynamoDB table with a small separate Terraform configuration (or via the AWS CLI). You can’t use Terraform to create its own backend on the first run.
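A minimal bootstrap sketch for those backend resources, using the same bucket and table names as the backend block above (adjust names and region to your own setup):
# bootstrap/main.tf — run once with local state before any other configuration
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state"
}

# Versioning lets you roll back if state ever gets corrupted
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# The lock table only needs a LockID hash key
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}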
Module Structure
Flat Terraform files work for small projects. For anything larger, modules are essential for reuse and separation of concerns.
My typical structure:
infrastructure/
├── modules/
│   ├── ecs-service/          # Reusable ECS service module
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── rds-postgres/
│   └── lambda-function/
├── environments/
│   ├── production/
│   │   ├── main.tf           # Composes modules
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── staging/
│       ├── main.tf
│       └── terraform.tfvars
└── global/
    ├── ecr.tf                # Resources shared across environments
    └── iam-roles.tf
A reusable ECS service module:
# modules/ecs-service/variables.tf
variable "service_name" { type = string }
variable "image_uri" { type = string }
variable "cpu" {
  type    = number
  default = 256
}
variable "memory" {
  type    = number
  default = 512
}
variable "desired_count" {
  type    = number
  default = 2
}
variable "environment_variables" {
  type    = map(string)
  default = {}
}
# modules/ecs-service/main.tf
# Region lookup used by the log configuration below
data "aws_region" "current" {}

resource "aws_ecs_task_definition" "this" {
  family                   = var.service_name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn # defined elsewhere in the module

  container_definitions = jsonencode([{
    name  = var.service_name
    image = var.image_uri
    environment = [
      for k, v in var.environment_variables : { name = k, value = v }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/${var.service_name}"
        "awslogs-region"        = data.aws_region.current.name
        "awslogs-stream-prefix" = "ecs"
      }
    }
  }])
}
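The module's outputs.tf exposes whatever the environment layer needs to reference. A minimal sketch (the exact outputs depend on what else the module creates):
# modules/ecs-service/outputs.tf
output "task_definition_arn" {
  description = "ARN of the task definition, useful for deployment tooling"
  value       = aws_ecs_task_definition.this.arn
}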
Consuming it from an environment:
# environments/production/main.tf
module "api_service" {
  source = "../../modules/ecs-service"

  service_name  = "api"
  image_uri     = "${var.ecr_repo_url}:${var.image_tag}"
  cpu           = 512
  memory        = 1024
  desired_count = 3

  environment_variables = {
    ENVIRONMENT = "production"
    LOG_LEVEL   = "INFO"
  }
}
This pattern means changes to the ECS service definition propagate to all environments that use the module — update once, test in staging, promote to production.
Handling Secrets
Never put secrets in .tfvars files or hardcode them in Terraform. The pattern I use:
# Create the secret shell in Terraform (the value is managed separately)
resource "aws_secretsmanager_secret" "db_password" {
  name = "production/api/db-password"
}

# ECS resolves the value from Secrets Manager at task start and injects it as an environment variable
resource "aws_ecs_task_definition" "api" {
  # ...
  container_definitions = jsonencode([{
    # ...
    secrets = [{
      name      = "DATABASE_PASSWORD"
      valueFrom = aws_secretsmanager_secret.db_password.arn
    }]
  }])
}
The actual secret value is set via the AWS CLI or Console — not Terraform. Terraform manages the resource; the sensitive value lives only in Secrets Manager and never touches version control or Terraform state.
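For example, with the AWS CLI (the secret name matches the resource above; the value here is a placeholder):
# Set the value out-of-band; it never enters Terraform configuration or state
aws secretsmanager put-secret-value \
  --secret-id production/api/db-password \
  --secret-string 'the-actual-password'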
Workspaces vs Separate State Files
Terraform workspaces let you manage multiple environments from one configuration by running terraform workspace select staging to switch between them. I’ve used them, and I don’t recommend them for production.
The problem: all workspaces share the same backend configuration but use different state files under the hood. When something goes wrong, it’s easy to accidentally apply staging changes to production because the configuration is identical and you forgot to switch workspaces.
Separate directories with separate state files (as shown in my structure above) is more explicit and harder to accidentally get wrong. The extra boilerplate is worth it.
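Concretely, each environment directory carries its own backend block pointing at a distinct state key. Something like this for staging, reusing the same bucket and lock table as before:
# environments/staging/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "staging/api/terraform.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}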
Plan Reviews in CI
terraform apply should never run from an engineer’s laptop in production. Our GitLab CI pipeline:
terraform-plan:
  stage: plan
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan
    expire_in: 1 hour

terraform-apply:
  stage: apply
  script:
    - terraform apply tfplan
  when: manual # Requires a human to click "Run" in CI
  only:
    - main
The manual gate means a planned change sits in CI until an engineer reviews the plan output and clicks apply. This is the closest thing to a safety net you get with infrastructure changes.
Importing Existing Resources
If you’re adopting Terraform on an existing AWS account (most real-world scenarios), you’ll need to import resources Terraform didn’t create:
terraform import aws_s3_bucket.assets my-company-assets-bucket
After importing, Terraform knows about the resource and will manage it going forward. Before running apply, always run plan to confirm Terraform isn’t trying to change anything unexpected on the imported resource.
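Note that import only attaches state; the matching resource block has to exist in the configuration first, even if it starts out minimal:
# A minimal block to import into; add arguments until plan shows no unexpected changes
resource "aws_s3_bucket" "assets" {
  bucket = "my-company-assets-bucket"
}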
The Mistakes I Made
Mistake 1: Not pinning provider versions. Terraform providers release breaking changes. Without version pinning, terraform init can pull a new provider version and your plan suddenly shows unintended diffs or errors.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Minor version updates only
    }
  }
}
Mistake 2: Putting too much in one state file. When all infrastructure is in one Terraform state, every plan loads the entire state, every apply can potentially touch anything, and state locking blocks all other engineers while one apply is running. Split state by service boundary.
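When state is split this way, one configuration can still read another's outputs through the terraform_remote_state data source. A sketch, assuming a separate network stack that exports subnet IDs as outputs:
# Read outputs from the network stack's state (bucket and key are illustrative)
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "my-company-terraform-state"
    key    = "production/network/terraform.tfstate"
    region = "ap-southeast-1"
  }
}

# e.g. data.terraform_remote_state.network.outputs.private_subnet_ids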
Mistake 3: Not using lifecycle blocks for critical resources. S3 buckets and RDS instances should never be accidentally destroyed. prevent_destroy is a guardrail:
resource "aws_db_instance" "main" {
# ...
lifecycle {
prevent_destroy = true
}
}
terraform destroy will error on this resource. A deliberate deletion requires removing the lifecycle block first — giving you a moment to reconsider.
Conclusion
Terraform done well is almost invisible — infrastructure changes are predictable, reviewable, and repeatable. Getting there requires discipline around state management, module design, and CI/CD integration.
Start with remote state. Structure your modules around reusable primitives. Run plans in CI with a manual approval gate for applies. Pin your provider versions. And add prevent_destroy to anything you’d lose sleep over.
Infrastructure as code is only as good as the practices around it.