Infrastructure as code: the state file is the source of truth, not the repo
Infrastructure as code is the practice of describing the cloud resources a system needs — networks, virtual machines, databases, load balancers, DNS records, IAM roles — as code in a repository, and applying that code with a tool that makes the real cloud match what the code says. The pitch, circa 2014: stop clicking in consoles, stop sending runbooks to ops, stop losing the knowledge of “how this environment was built” to whichever engineer left the company. Put it in Git, review it like code, deploy it like code.
A decade in, the pitch has mostly won. Production cloud environments built by clicking are a minority. The tools — Terraform, Pulumi, AWS CDK, CloudFormation, the cloud-native equivalents — converged on a small set of ideas: declare the desired state, diff it against reality, apply the diff. The interesting arguments are no longer whether to do IaC; they are about which tool, and how to avoid the set of problems that show up in every IaC codebase once it is past toy size.
This post is about what IaC actually is, the tools worth knowing, the state file (which is the thing that most newcomers underestimate and most outages come from), and the patterns for structuring a large codebase so that it does not collapse under its own weight.
The declarative model
All of the tools in this space share the same basic model:
- You write code (HCL, YAML, TypeScript, Python, Go, C#, depending on the tool) that describes the resources you want.
- The tool reads a state file that records what it thinks already exists.
- It asks the cloud provider what actually exists.
- It computes a diff between your code, the state, and reality, and shows you a plan.
- If you approve the plan, it calls the provider’s APIs to create, update, or delete resources until reality matches the code.
The declarative model is the key abstraction. You do not write “create this VPC, then create this subnet inside it, then create this route table, then attach it to the subnet” as a sequence of steps. You write “the VPC exists; the subnet exists inside it; the route table exists and is attached to the subnet” as a set of facts that must be true. The tool figures out the order, the dependencies, the create-vs-update-vs-delete decisions.
This shifts the discipline. You are not writing a script; you are writing a specification. The things you care about — “what should be true about this environment?” — are in the code. The things you do not care about — exact API call order, retries, idempotency — are the tool’s problem.
When it works, it is much better than procedural scripting. When it breaks, it breaks in the specific ways the declarative model makes possible: state drift, dependency cycles, and resources the tool wants to destroy that it should not.
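The “set of facts” style from the VPC example above can be sketched in Terraform HCL (a sketch only; the names and CIDR ranges are illustrative, and the AWS provider configuration is omitted):

```hcl
# Fact: the VPC exists.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Fact: the subnet exists inside it. The reference to
# aws_vpc.main.id is how the tool infers the dependency;
# no explicit ordering is written anywhere.
resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Fact: the route table exists and is attached to the subnet.
resource "aws_route_table" "app" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table_association" "app" {
  subnet_id      = aws_subnet.app.id
  route_table_id = aws_route_table.app.id
}
```

Note that nothing here says “create the VPC first”: the order falls out of the references between resources.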
The tools, briefly
Terraform (HashiCorp, 2014). The default. HCL as the configuration language, a huge ecosystem of providers covering everything from AWS to Cloudflare to Kubernetes to Datadog. The state file is explicit and lives somewhere (local file, S3, Terraform Cloud). License changed to BUSL in 2023, which forked the community into OpenTofu — an open-source fork that is drop-in compatible and gaining momentum. For most new projects in 2026, either is a reasonable default.
AWS CloudFormation (Amazon, 2011). Older than Terraform. Native to AWS only. YAML or JSON templates, state managed entirely by AWS in the form of stacks. No state file to lose; no drift to reconcile manually. In exchange, you are tied to AWS, the error messages are often opaque, rollbacks on failed updates can be slow and destructive, and the YAML is verbose. CDK (Cloud Development Kit) generates CloudFormation from TypeScript/Python/Java code and has become the preferred way to write CloudFormation at most organizations that chose it.
Pulumi (Pulumi, 2018). Like Terraform but the language is a general-purpose one — TypeScript, Python, Go, C#, Java. Same declarative model, same state file, similar provider coverage. Appeals to teams that find HCL limiting. Skeptics argue that letting engineers use a real programming language to build infrastructure invites the wrong kind of cleverness; proponents point out that HCL eventually grows into an inferior programming language anyway (loops, conditionals, functions, modules) and Pulumi just lets you use one that already exists.
Cloud-native equivalents. GCP has Deployment Manager (deprecated) and now leans on Terraform. Azure has ARM templates and the newer Bicep (a DSL that compiles to ARM templates, similar in spirit to CDK). Kubernetes has Helm and Kustomize, which are IaC scoped to in-cluster resources rather than cloud resources.
The meaningful dividing line is not between the tools in this list. It is between tools that manage state inside the tool as a service (CloudFormation, ARM) and tools that manage state as a file (Terraform, Pulumi, OpenTofu). The file-based model is more portable and more flexible; the service-based model leaves less to go wrong.
The state file
The state file is the thing most new users underestimate and the thing most teams have had at least one bad incident with.
When Terraform (or Pulumi, or OpenTofu) applies a change, it records what it created in the state file: resource types, ids, attributes, dependencies. The next time it runs, it reads the state file to know what it already manages. Without the state file, it has no memory — a fresh clone of your repo would think nothing existed and try to create everything from scratch.
The state file is authoritative. If the state file says the bucket my-bucket is managed by this Terraform, and you delete the bucket by hand, the next terraform plan will offer to recreate it. If you create a bucket by hand that matches the name in your code and the state file has no record of it, the next plan will offer to create it again and fail because the name is taken. The code is not the source of truth; neither is the cloud. The state file is the source of truth for “what this IaC configuration believes it owns.”
Three practical consequences:
- Store it remotely. A state file on someone’s laptop is a single point of failure and a single point of conflict. Store it in S3, GCS, Terraform Cloud, or equivalent, with locking (DynamoDB, a backend lock feature) so two people can’t apply at once.
- Treat it as sensitive. The state file contains resource attributes, sometimes including secrets — RDS master passwords, API keys passed as resource arguments. Encrypt at rest; restrict access.
- Back it up. A corrupted or lost state file means your code no longer tracks the resources it created. Recovering requires importing resources one by one. Backup the state bucket; enable versioning; keep a history.
The service-based tools (CloudFormation, ARM) handle this transparently, which is a real advantage. The file-based tools expose it, which gives you control and also more rope.
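The first two consequences, in configuration form: a minimal remote-backend block for the classic S3-plus-DynamoDB setup (bucket, key, table, and region names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # versioned, encrypted bucket
    key            = "network/terraform.tfstate" # one key per configuration
    region         = "us-east-1"
    encrypt        = true                        # state encrypted at rest
    dynamodb_table = "terraform-locks"           # lock so two applies can't race
  }
}
```

With this in place, every clone of the repo reads and writes the same state, and concurrent applies block on the lock instead of corrupting it.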
Drift
Drift is what happens when the real state of the cloud diverges from the IaC code’s understanding of it. Someone logs in and changes a security group rule by hand. An auto-scaling event creates a new instance. A compliance tool rotates a password. The code says one thing; reality says another.
The tools handle this with varying degrees of grace:
- Terraform/OpenTofu: the next plan shows the difference. terraform refresh updates the state to match reality without changing the code; terraform apply changes reality to match the code. Either choice is a decision, and skipping the decision is how drift accumulates silently.
- CloudFormation: supports drift detection as a separate API call, but doesn’t surface drift automatically. Out-of-band changes often persist quietly until they conflict with a stack update.
- Pulumi: similar to Terraform, with pulumi refresh vs pulumi up.
The cultural answer to drift is to make out-of-band changes infrequent, auditable, and reconciled quickly. The technical answers are weaker: console access can be restricted by IAM, changes can be logged (CloudTrail), drift can be periodically detected. None of these stops drift, only surfaces it.
A team that treats IaC as “the way we make changes most of the time” and permits ad-hoc hand-edits under emergency will have drift. A team that treats IaC as the only path and enforces it with IAM policies (read-only console, break-glass accounts for emergencies) will have much less. The second posture is rare in practice; the first is common.
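For drift that is expected rather than accidental, such as an autoscaler adjusting capacity out-of-band, Terraform can be told to stop fighting it with a lifecycle block (a sketch; the resource and its arguments are illustrative):

```hcl
resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = "example"   # illustrative cluster name
  task_definition = "web:1"     # illustrative task definition
  desired_count   = 2           # initial value only

  lifecycle {
    # An autoscaler changes desired_count at runtime; ignoring it
    # keeps that expected drift out of every future plan.
    ignore_changes = [desired_count]
  }
}
```

This draws an explicit boundary: the attribute is provisioned by IaC once, then owned by the runtime system.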
Modules and reuse
Any IaC codebase beyond toy size needs reusable components. Every cloud has “the set of things you do a hundred times” — a VPC with public and private subnets and NAT, an EKS cluster with a node group and the right IAM, a Postgres database with snapshots and monitoring. Copy-pasting these across environments is how IaC codebases become unmaintainable.
Terraform’s answer is modules: a folder of Terraform code with inputs (variables) and outputs, referenced from elsewhere with a module block. Modules are first-class; the registry has thousands of them, and most organizations build their own for their specific needs.
CDK’s answer is constructs: classes in a real language, with inheritance, composition, parameters. Because CDK is a programming language, constructs compose like normal code does.
The design question, same in either tool: at what granularity do I make modules?
Too fine-grained — one module per resource — adds indirection without hiding complexity. Too coarse — one module per environment — prevents reuse and makes every small change a big deal. The useful heuristic is Skelton’s: a module should encode a pattern that is used in more than one place, with the variability captured as inputs. A “web service” module that provisions a load balancer, target group, ECS service, and log group in one call, parameterized by name, image, and port, is at the right grain if “web service” is a thing your organization deploys more than once.
A module registry — internal or public — is what turns modules into a platform. The thinnest viable platform’s paved path for “I need a new database” is often just “here is the Postgres module, here are the two inputs you need to set.”
The workspaces / environments problem
Every IaC codebase has to answer: how do I manage prod, staging, and dev separately?
Three models, each with tradeoffs:
- Workspaces. Terraform’s built-in feature; each workspace gets its own state file, same code. Simple. The danger: it encourages running the same code against environments that should be genuinely different (prod should have more replicas, larger instances, tighter monitoring). When every difference is a count = var.environment == "prod" ? 3 : 1 inside the shared code, the code gets hard to read.
- Separate directories, shared modules. One directory per environment, each calling into shared modules with different inputs. More boilerplate, but each environment’s code is explicit about what is in it. Easier to reason about “what runs in prod” by reading envs/prod. This is the approach most mature Terraform codebases converge on.
- Separate repos. One repo per environment. Maximum isolation, maximum drift risk between environments. Sometimes justified for regulatory separation; usually overkill.
The second model, with modules that are semantically-versioned and pinned per environment, gives you the right combination of reuse and explicitness. It also gives you a natural path for progressive rollout: bump the module version in dev, test, bump it in staging, test, bump it in prod.
Imperative escape hatches
Every declarative IaC tool has an imperative escape hatch, because some things genuinely resist the declarative model. A partial list:
- Ordering that the tool can’t infer. Rare in modern Terraform; more common in CloudFormation. Solved by explicit depends_on declarations.
- One-time operations. Seed data in a database, an initial admin user, a migration that has to run once. Terraform’s null_resource with a local-exec or remote-exec provisioner is the standard hack. It works and it’s ugly; it lives outside the state model and produces operations the tool cannot undo.
- Things the provider doesn’t support. The cloud moves faster than the providers. Features released last month may not have Terraform coverage yet. CLI calls via local-exec are the fallback.
- Conditional logic. Some resources should exist in prod but not in dev, or should be configured differently based on environment. HCL has count and for_each for this; it gets ugly at the edges. Pulumi and CDK, because they are real languages, handle it more cleanly.
The escape hatches are honest about the limits of the model. The temptation is to use them everywhere, which undoes the value of declarative IaC. A rule of thumb: if you find yourself writing procedural glue, consider whether the problem belongs in IaC at all or whether it is really a one-shot migration script.
Secrets and sensitive data
IaC code lives in a repo. The repo is read by many people. The same IaC code needs to know database passwords, API keys, and TLS private keys. These cannot be in the repo.
The standard patterns:
- External secret managers. AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager. IaC resources declare “this database has a password” and reference the secret by name; the secret itself is rotated and managed outside IaC.
- Sealed secrets / SOPS. Encrypted secret values committed to the repo, decryptable only by holders of the right key (often a KMS key). The repo stays clone-able without exposing the secret.
- Runtime injection. IaC creates the infrastructure but not the secret; a separate system (a CI/CD job with access to a vault, an init container, a platform primitive) injects secrets at deploy or run time.
The wrong pattern, common enough to name: plaintext in the repo, with a .gitignore rule that sometimes catches it and sometimes doesn’t. Every git leak incident has this somewhere in its history.
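The first pattern, sketched in Terraform (the secret name and database arguments are illustrative; note that the resolved value still passes through state, so the state-file sensitivity caveats above apply):

```hcl
# The secret is created and rotated outside IaC; the code only
# references it by name.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/app/db-password"   # illustrative secret name
}

resource "aws_db_instance" "main" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  username          = "app"
  password          = data.aws_secretsmanager_secret_version.db.secret_string
}
```

The repo never contains the password; it contains a pointer to where the password lives.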
Testing IaC
Testing infrastructure is harder than testing application code, and the industry has spent a decade not quite agreeing on how to do it. The techniques in use:
- Static analysis. terraform validate, tflint, checkov, tfsec. Catches syntax errors, common misconfigurations, known anti-patterns (public buckets, unencrypted volumes). Cheap, fast, high-signal. Run in CI on every PR.
- Plan review. terraform plan in CI on every PR, with the plan posted as a comment. Humans read it, catch the “oh that’s going to delete the prod database” change before it merges. Atlantis, Spacelift, Terraform Cloud all do this. The single highest-leverage safety feature available.
- Policy as code. Sentinel, OPA/Conftest, cloud-native tools (AWS Config rules). Write policies — “no public buckets,” “all RDS instances must have encryption enabled,” “no 0.0.0.0/0 in security groups except the load balancer’s” — and check plans against them. Automated enforcement of organizational rules.
- Actual deployments into test environments. Terratest, Kitchen-Terraform. Run the code in a throwaway environment, verify the resources came up right, tear it down. Slow and expensive, but the only way to catch some classes of bug.
Most teams get away with the first two and only the disciplined ones do the others. The first two catch most of what matters.
The limits of the model
Declarative IaC is great at things that should exist. It is bad at things that should happen.
“The bucket should exist with these properties” — declarative model fits cleanly. “Run this migration when the schema version changes” — the tool does not know when that is. “Rotate this password every 90 days” — the rotation is an event, not a state. “Scale this cluster up during Black Friday” — that is a runtime decision, not a desired state.
The pattern that scales: IaC for the stable layer, operators and automation for the event-driven layer. Terraform provisions the EKS cluster and the base resources; a Kubernetes operator handles day-to-day things inside it. Terraform provisions the database; the application handles migrations. Terraform provisions the autoscaling group; the cloud’s autoscaler reacts to load.
Trying to push event-driven behavior into declarative IaC produces the null_resource + local-exec patterns that never feel clean, because they are not clean. Accept the boundary.
The rule
IaC is declarative state management for cloud resources. The code is the spec; the state file is the memory; the cloud is the target. Get the first two right and most operational problems shrink.
Choose a tool that fits the organization’s constraints — Terraform or OpenTofu for portability, CDK for AWS-heavy teams that want real languages, Pulumi for any-cloud teams that want real languages, CloudFormation/ARM when managed state matters more than portability. The choice matters less than the discipline: remote state, locking, modules at the right grain, separate environments, secrets out of the repo, plans reviewed before apply.
The rest is bookkeeping. Most IaC incidents are not bugs in the tool; they are drift the team ignored, a state file someone lost, a plan nobody read carefully, or a module parameterized to hide what it was doing. The tools have gotten good enough that the remaining failures are operational. The defense against them is a small number of habits, applied consistently.