Terraform Stacks: Managing Multi-Environment Infrastructure at Scale

Terraform Stacks for Multi-Environment Infrastructure

Terraform Stacks change how teams organize and deploy infrastructure across many environments. Introduced by HashiCorp as a native orchestration layer, a Stack coordinates multiple Terraform configurations, resolves the dependencies between them, and enables consistent deployments from development through production. Rather than gluing configurations together with shell scripts, you describe the relationships between components declaratively and let the platform sequence the work.

This guide walks you through adopting Stacks to replace fragile workspace-based or directory-based multi-environment patterns. You will learn how to design reusable components, orchestrate deployments, and keep drift detection running across all your environments without bolting on external tooling.

The Problem with Traditional Multi-Environment Patterns

Most teams manage multiple environments using one of three approaches, and each carries real drawbacks. Workspaces share a single configuration and state backend, which makes it dangerously easy to run an apply against the wrong environment, because the only thing separating prod from dev is whichever workspace is currently selected in the CLI. Directory duplication (copying the whole config into environments/dev, environments/prod, and so on) drifts apart over time as fixes land in one folder but not the others. Wrapper scripts built on tools like Make or Bash add orchestration, but they reinvent dependency management poorly and become their own maintenance burden.

Stacks address all three failure modes by introducing a declarative orchestration layer above individual Terraform configurations. Specifically, each stack defines which components to deploy, in what order, and with which environment-specific variables — so the difference between environments becomes a small, reviewable set of inputs rather than a divergent copy of the entire codebase.

Figure: Orchestrating infrastructure deployments across multiple environments

Terraform Stacks Multi-Environment Architecture

A Stack is composed of three concepts: components, deployments, and orchestration rules. Components are reusable Terraform configurations, such as your VPC module, your EKS cluster module, and your RDS module. Deployments define where and how those components get applied, supplying the per-environment inputs. Orchestration rules specify ordering, dependencies, and approval gates between deployments. Dependencies between components are expressed by referencing one component’s outputs as another’s inputs, and the engine derives the correct apply order from that graph automatically.

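# stacks/platform/components.tfstack.hcl
# Define reusable components.
# Note: variable and provider declarations are omitted here. A complete stack
# also declares each var.* input and passes provider configurations to every
# component through a providers argument.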

component "networking" {
  source = "./modules/networking"

  inputs = {
    vpc_cidr       = var.vpc_cidr
    environment    = var.environment
    azs            = var.availability_zones
    enable_nat     = var.environment != "dev"
    enable_vpn     = var.environment == "production"
  }
}

component "kubernetes" {
  source = "./modules/eks-cluster"

  inputs = {
    cluster_name    = "${var.project}-${var.environment}"
    vpc_id          = component.networking.vpc_id
    subnet_ids      = component.networking.private_subnet_ids
    node_count      = var.node_count
    instance_types  = var.instance_types
    k8s_version     = var.kubernetes_version
  }
}

component "database" {
  source = "./modules/rds-aurora"

  inputs = {
    cluster_name     = "${var.project}-${var.environment}-db"
    vpc_id           = component.networking.vpc_id
    subnet_ids       = component.networking.database_subnet_ids
    instance_class   = var.db_instance_class
    multi_az         = var.environment == "production"
    backup_retention = var.environment == "production" ? 30 : 7
  }
}

component "monitoring" {
  source = "./modules/observability"

  inputs = {
    cluster_endpoint = component.kubernetes.cluster_endpoint
    db_endpoint      = component.database.cluster_endpoint
    environment      = var.environment
    alert_endpoints  = var.alert_endpoints
  }
}

Look closely at how the environment shapes behavior here. Expressions like enable_nat = var.environment != "dev" and backup_retention = var.environment == "production" ? 30 : 7 encode policy directly in the component definition. Dev skips the cost of NAT gateways, while production gets multi-AZ and 30-day rather than 7-day backup retention, all from one source of truth, with no duplicated files to keep in sync.
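On the module side, a flag like enable_nat usually becomes a count on the resources it controls. The following is a minimal sketch of what ./modules/networking might contain, assuming the AWS provider at version 5 or later. The file name, the resource names, and the aws_subnet.public reference are hypothetical, since the original module is not shown.

```hcl
# modules/networking/nat.tf (hypothetical file; the real module is not shown)

variable "enable_nat" {
  type        = bool
  description = "Create NAT gateways when true; dev passes false to save cost."
}

variable "azs" {
  type        = list(string)
  description = "Availability zones the subnets are spread across."
}

# One Elastic IP per AZ, created only when NAT is enabled.
resource "aws_eip" "nat" {
  count  = var.enable_nat ? length(var.azs) : 0
  domain = "vpc"
}

# One NAT gateway per AZ. aws_subnet.public is assumed to be defined
# elsewhere in the module with one public subnet per AZ.
resource "aws_nat_gateway" "this" {
  count         = var.enable_nat ? length(var.azs) : 0
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}
```

When enable_nat is false, both counts evaluate to zero and the plan contains no NAT resources at all, which is how the dev deployment avoids the cost.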

Defining Deployments per Environment

# stacks/platform/deployments.tfdeploy.hcl

deployment "dev" {
  inputs = {
    environment        = "dev"
    project            = "platform"
    vpc_cidr           = "10.0.0.0/16"
    availability_zones = ["us-east-1a", "us-east-1b"]
    node_count         = 2
    instance_types     = ["t3.medium"]
    kubernetes_version = "1.29"
    db_instance_class  = "db.t3.medium"
    alert_endpoints    = ["dev-alerts@company.com"]
  }
}

deployment "staging" {
  inputs = {
    environment        = "staging"
    project            = "platform"
    vpc_cidr           = "10.1.0.0/16"
    availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
    node_count         = 3
    instance_types     = ["t3.large"]
    kubernetes_version = "1.29"
    db_instance_class  = "db.r6g.large"
    alert_endpoints    = ["staging-alerts@company.com"]
  }
}

deployment "production" {
  inputs = {
    environment        = "production"
    project            = "platform"
    vpc_cidr           = "10.2.0.0/16"
    availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
    node_count         = 6
    instance_types     = ["m6i.xlarge", "m6i.2xlarge"]
    kubernetes_version = "1.28"
    db_instance_class  = "db.r6g.2xlarge"
    alert_endpoints    = ["prod-alerts@company.com", "oncall@company.com"]
  }
}

This file is the entire surface area that differs between environments, and that is the point. Non-overlapping CIDR ranges (10.0.0.0/16, 10.1.0.0/16, 10.2.0.0/16) keep the VPCs ready to peer later without renumbering. Note too that production deliberately pins Kubernetes 1.28 while the lower environments run 1.29, a common pattern that lets teams validate a version in staging for a release cycle before promoting it. Because the component graph is identical across all three, what you test in staging is structurally the same thing you ship to production.
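Each key in a deployment's inputs has to correspond to a variable declared at the stack level, which is where var.environment and the other references in the component file come from. Below is a sketch of those declarations for a few of the inputs, assuming the stack language's rule that every variable carries an explicit type. The file name is hypothetical.

```hcl
# stacks/platform/variables.tfstack.hcl (hypothetical file name)

variable "environment" {
  type        = string
  description = "Deployment name: dev, staging, or production."
}

variable "project" {
  type        = string
  description = "Prefix used when naming clusters and databases."
}

variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC; must not overlap with other environments."
}

variable "availability_zones" {
  type        = list(string)
  description = "Availability zones to deploy into."
}

variable "node_count" {
  type        = number
  description = "Number of Kubernetes worker nodes."
}

variable "instance_types" {
  type        = list(string)
  description = "EC2 instance types for the worker nodes."
}
```

Typing the inputs this way means a deployment that passes, say, a string for node_count should fail early rather than partway through an apply.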

Orchestrated Deployment Pipelines

Stacks enable progressive deployments where changes flow from dev to staging to production with automated validation gates between each environment. Consequently, this pattern catches regressions in a low-stakes environment before they ever reach customers.

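# stacks/platform/orchestration.tfdeploy.hcl
# Illustrative policy sketch: attribute names such as depends_on, wait_for,
# action, and notify are not verified against the Stacks language reference.
# Check the current documentation for the supported orchestration syntax.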

orchestrate "progressive_rollout" {
  check {
    # Deploy to dev first
    condition = context.deployment == "dev"
    action    = "auto_approve"
  }

  check {
    # Staging requires dev to be healthy
    condition = context.deployment == "staging"
    depends_on = [deployment.dev]
    wait_for {
      health_check = "https://api.dev.company.com/health"
      timeout      = "5m"
    }
    action = "auto_approve"
  }

  check {
    # Production requires manual approval
    condition = context.deployment == "production"
    depends_on = [deployment.staging]
    wait_for {
      health_check = "https://api.staging.company.com/health"
      timeout      = "10m"
    }
    action = "manual_approve"
    notify = ["platform-team@company.com"]
  }
}

The asymmetry between environments is intentional. Dev auto-approves so engineers iterate quickly, staging gates on a real health check so a broken dev never silently promotes, and production demands a human in the loop. As a result, the orchestration encodes your release policy as code that the platform enforces, rather than as tribal knowledge in a runbook that someone forgets at 2 a.m.
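For comparison, the orchestration rules documented during the Stacks public beta were narrower than the policy described above: an orchestrate block took a rule type and a name, and each check carried only a condition and a reason. The sketch below assumes that beta syntax and the context.plan fields as documented at the time. The rule name is hypothetical, and the syntax has changed between releases, so verify it against the current language reference before relying on it.

```hcl
# Assumes the public-beta orchestrate syntax; verify before use.
# The rule name "dev_without_removals" is hypothetical.

orchestrate "auto_approve" "dev_without_removals" {
  # Only plans for the dev deployment qualify for automatic approval.
  check {
    condition = context.plan.deployment == deployment.dev
    reason    = "Only the dev deployment is approved automatically."
  }

  # Even in dev, a plan that removes resources waits for a human.
  check {
    condition = context.plan.changes.remove == 0
    reason    = "Plan removes ${context.plan.changes.remove} resources; review manually."
  }
}
```

Under this model, any plan that fails a check simply falls back to manual approval, so staging and production stay gated by default without needing rules of their own.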

Figure: Progressive deployment pipeline from dev through production

Drift Detection and Reconciliation

One of the most valuable features of Stacks is built-in drift detection across all deployments. Instead of manually running terraform plan against each environment, you schedule Stacks to compare actual infrastructure against desired state and alert you when the two diverge. A typical case: someone fixes a security group by hand in the console during an incident and forgets to bring the change back into code.

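# Enable drift detection for all deployments.
# Illustrative sketch: confirm the block and attribute names against the
# current Stacks documentation before use.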
orchestrate "drift_detection" {
  schedule = "0 */6 * * *"  # Every 6 hours

  on_drift {
    severity = "high"
    action   = "notify"
    notify   = ["infrastructure-team@company.com"]
  }

  on_drift {
    severity = "critical"
    action   = "auto_reconcile"
    notify   = ["infrastructure-team@company.com", "oncall@company.com"]
  }
}

A word of restraint on auto_reconcile: automatically reverting drift is powerful but blunt. If an on-call engineer made an emergency change for a good reason, a reconcile job that silently undoes it can turn a contained incident into a worse one. Therefore, many teams start with notify for every severity, build trust in the signal, and only graduate the truly safe, well-understood resources to automatic reconciliation later.
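Part of building trust in the signal is keeping expected, harmless changes out of it. In plain Terraform, lifecycle.ignore_changes tells the planner to disregard attributes that another system legitimately manages, such as a node group's desired size under an autoscaler. Here is a minimal sketch for the EKS module, assuming the AWS provider; the variable names are hypothetical. This keeps the attribute out of regular plans, but how a given drift report treats ignored attributes should be checked against the platform documentation.

```hcl
# Hypothetical excerpt from ./modules/eks-cluster

variable "cluster_name" { type = string }
variable "node_role_arn" { type = string }
variable "subnet_ids" { type = list(string) }
variable "node_count" { type = number }

resource "aws_eks_node_group" "default" {
  cluster_name    = var.cluster_name
  node_group_name = "default"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = var.node_count
    min_size     = 1
    max_size     = var.node_count * 2
  }

  # The autoscaler adjusts desired_size at runtime, so Terraform
  # should not treat that change as something to revert.
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }
}
```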

Migrating an Existing Codebase into Stacks

You rarely adopt Stacks on a greenfield project; usually there is a pile of existing Terraform to bring across. The pragmatic path is incremental. First, refactor your most-copied configuration into a clean, input-driven module so it can serve as a component. Next, model a single non-production environment as a deployment and import its existing resources into the Stack’s state so nothing is recreated. Then run the Stack alongside the legacy setup, comparing plans until the Stack proposes no changes to the resources it has adopted. Finally, repeat for the remaining environments and retire the old workspace or directory once production has been cut over and observed for a full release cycle. This staged approach keeps a working rollback available at every step.
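The import step can be expressed declaratively. In plain Terraform 1.5 or later, an import block adopts an existing resource into state on the next apply. The sketch below shows that mechanism for a VPC; the resource address and the ID are placeholders, and how import blocks should be placed inside a Stack component is an assumption to confirm against the Stacks documentation.

```hcl
# Adopt an existing VPC instead of recreating it (Terraform 1.5 or later).
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # placeholder ID, not a real resource
}

# The resource block must describe the VPC as it exists today;
# otherwise the plan will propose changes after the import.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}
```

A plan that shows the import and no further changes is the signal that the new configuration matches the existing infrastructure.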

When NOT to Use Terraform Stacks

If you manage a single environment, or your infrastructure genuinely fits in one modest configuration, Stacks add ceremony without a payoff — a couple of workspaces will serve you better. Additionally, the full orchestration, drift scheduling, and approval features depend on HCP Terraform or Terraform Enterprise; the open-source CLI alone provides only limited Stack support, so factor the platform cost into your decision.

Teams already invested in Terragrunt or Pulumi should think twice as well, because those tools may already solve their multi-environment needs adequately. Migration from a mature Terragrunt setup is non-trivial and rarely justifies itself unless you specifically want the native drift detection and orchestration. In short, reach for Stacks when coordinating several interdependent environments has become the actual bottleneck — not merely because the feature is new.

Figure: Evaluating whether Terraform Stacks fit your infrastructure needs

Key Takeaways

By defining components, deployments, and orchestration rules in a single Stack, you eliminate the fragile scripts and manual steps that traditionally coordinate multi-environment infrastructure. Furthermore, expressing per-environment differences as a small block of inputs — rather than diverging copies of the whole codebase — is what keeps staging an honest rehearsal for production. Built-in drift detection then closes the loop, ensuring environments stay consistent long after the initial rollout.

Start by identifying your most painful multi-environment workflow and modeling just that one as a Stack. For additional resources, consult the Terraform Stacks documentation and the HashiCorp blog on Stacks. You might also find our posts on Karpenter autoscaling and Kubernetes network policies helpful for managing the workloads running on your infrastructure.

Whichever workflow you begin with, keep the per-environment inputs small, iterate on the component boundaries as you learn, and measure whether deployments actually become safer and faster before extending Stacks to more of your infrastructure.