Testing strategies: pyramids, trophies, and honeycombs
“How much of what kind of testing” is the oldest unsettled argument in software engineering. The testing pyramid, popularized by Mike Cohn and taught as the canonical answer for fifteen years, has accumulated enough exceptions that several prominent engineers have published alternative shapes — the trophy (Kent C. Dodds), the honeycomb (Spotify), the ice cream cone (a warning, not a recommendation). Each is right about a different kind of system. None is universally correct.
The interesting question is not which shape wins. It is which shape fits which kind of code, and what each shape is optimizing for. The pyramid optimizes for fast, isolated feedback on pure logic. The trophy optimizes for integration-level confidence in systems where the integration is the logic. The honeycomb optimizes for the microservice case where most bugs live at the seams. Picking the right shape for the code you actually have is more useful than picking a shape because it is the current fashion.
The pyramid, and what it is for
Cohn’s pyramid: many unit tests at the base, fewer integration tests in the middle, a small number of end-to-end tests at the top. The premise is that unit tests are cheap to write, fast to run, and isolate failures precisely; integration tests are slower and harder to debug; end-to-end tests are slowest and flakiest. Build the foundation wide, narrow toward the top.
The pyramid fits a specific kind of codebase: one where most of the complexity lives in pure logic that can be tested in isolation. Algorithms. Business rules. Calculations. State machines. Libraries. If your domain layer is rich — aggregates with real invariants, value objects with real behavior, policies with real complexity — unit tests at the bottom of a pyramid will catch a huge percentage of bugs cheaply, and the rest of the pyramid exists to verify the pieces are wired together.
The pyramid’s failure mode is applying it to code where the logic is mostly glue — calling databases, calling other services, transforming request payloads. Unit testing glue by mocking everything it touches produces tests that verify the glue matches the mocks, not that it works. These tests are fragile, expensive to maintain, and catch almost no real bugs. The pyramid for a glue codebase is broad at the base with little to show for it.
Diagnostic: how much of your “business logic” can be exercised by calling a function with arguments and checking the return value, with no I/O? If the answer is “most of it,” the pyramid fits. If the answer is “almost none of it,” the pyramid does not.
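Code that passes this diagnostic looks something like the following sketch (the `bulk_discount` rule is hypothetical, invented for illustration): a pure function with real behavior, tested by calling it and checking the return value.

```python
# Hypothetical business rule: pure logic, no I/O -- the pyramid's sweet spot.
def bulk_discount(quantity: int, unit_price: float) -> float:
    """Return the order total, applying a 10% discount at 100+ units."""
    if quantity < 0 or unit_price < 0:
        raise ValueError("quantity and unit_price must be non-negative")
    total = quantity * unit_price
    return total * 0.9 if quantity >= 100 else total

# Unit tests: call the function, check the result. No mocks, no setup, no I/O.
assert bulk_discount(10, 2.0) == 20.0    # below threshold: no discount
assert bulk_discount(100, 1.0) == 90.0   # at threshold: 10% off
assert bulk_discount(0, 5.0) == 0.0      # edge case: empty order
```

Tests like these run in microseconds and pinpoint failures exactly — which is the whole economic argument for a wide pyramid base when the code permits it.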
The trophy, and what it fixes
Kent C. Dodds popularized the trophy: a small base of static checks (types, linters), a modest layer of unit tests, a thick layer of integration tests, and a small cap of end-to-end tests. Shaped, approximately, like a trophy with a flared top.
The trophy’s bet is that for most modern application code — a React frontend, a Node or Rails backend mostly made of handlers calling services calling databases — the most valuable tests are the ones that exercise multiple layers together at something close to a real boundary. A test that renders a component, clicks a button, and verifies the result appeared tells you something meaningful. A test that mocks the fetch API to return a fake response and then verifies the component rendered that fake response tells you the component can render mocked data. These are different amounts of signal.
For a UI-heavy codebase, the trophy often catches more real bugs per hour of test-writing effort than the pyramid. @testing-library/react exists specifically to make trophy-shaped testing easy: test what the user sees, not what the component’s internal state is.
The trophy’s failure mode is the opposite of the pyramid’s: if the logic is rich, the integration layer becomes a poor place to test it, because reproducing all the cases of a complex rule through integration tests is slower and more expensive than unit-testing the rule directly. Teams that adopt the trophy for logic-heavy codebases end up with long, slow test suites that still miss edge cases.
The honeycomb
Spotify’s honeycomb is a variant aimed at microservices: a small layer of implementation-detail tests, a thick middle of integrated-service tests (the service tested through its real API, against a real database, with external dependencies mocked only at the network boundary), and a small outer layer of integration tests between services.
The argument is that in a microservice, the unit of meaning is the service — not the function inside it. A test that validates the service’s behavior through its API, against its real database, tells you the service works. A unit test of a function inside the service tells you the function works. The first is much more valuable; the second matters only for the handful of functions that contain non-trivial logic.
This is a strong fit for services that are mostly coordination (request → persist → respond), a weaker fit for services that contain significant domain logic, and a mismatch for monoliths (which are not the shape it was designed for).
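A minimal sketch of the honeycomb’s middle layer, using an in-memory SQLite database as the “real” database (the `NoteService` class and its methods are hypothetical): the test exercises the service through its public API rather than unit-testing its internals.

```python
import sqlite3
from typing import Optional

# Hypothetical coordination-style service: its "logic" is persist-and-retrieve.
class NoteService:
    def __init__(self, db: sqlite3.Connection):
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"
        )

    def create(self, body: str) -> int:
        cur = self.db.execute("INSERT INTO notes (body) VALUES (?)", (body,))
        self.db.commit()
        return cur.lastrowid

    def get(self, note_id: int) -> Optional[str]:
        row = self.db.execute(
            "SELECT body FROM notes WHERE id = ?", (note_id,)
        ).fetchone()
        return row[0] if row else None

# Honeycomb-style test: drive the service through its API against a real
# (in-memory) database, instead of mocking the SQL layer away.
service = NoteService(sqlite3.connect(":memory:"))
note_id = service.create("ship it")
assert service.get(note_id) == "ship it"
assert service.get(9999) is None
```

The test proves the service works — including the SQL it actually runs — which is exactly the signal a mocked-database unit test cannot provide.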
The ice cream cone
The ice cream cone is the anti-pattern: a thin layer of unit tests, a moderate layer of integration tests, and a huge cap of manual or end-to-end tests on top. It is what teams produce when they start with no tests, add a few unit tests reluctantly, and rely on QA or end-to-end suites to catch regressions.
The cone’s problem is economic: the tests that cost the most per run and per failure are the ones doing the most work. A bug discovered in e2e is discovered hours after the commit that caused it; a bug discovered in unit tests is discovered seconds after. Every shape except the cone is a deliberate inversion of the cost curve — pushing work toward faster, cheaper tests so that slower, expensive tests only have to catch the residue.
An ice cream cone is usually a symptom, not a strategy. It reflects a codebase that is hard to unit-test (too much glue, too tightly coupled to infrastructure), a team that has no time for testing infrastructure, or both. Fixing the cone requires fixing the underlying testability, not just writing more unit tests on the same hostile code.
What each level actually tests
Setting the shapes aside for a moment: the levels themselves are worth defining precisely, because “unit test” and “integration test” mean different things to different teams.
Static analysis. TypeScript, mypy, clippy, the compiler. Catches type errors, unused variables, some null dereferences. The cheapest, fastest, narrowest feedback loop. Underrated because it works automatically.
Unit tests. Functions or classes in isolation, no I/O, no network, no database, no file system. Runtime in milliseconds. Scope in bytes of code. Ideal for pure logic. Unhelpful for code whose behavior is defined by its interactions with external systems.
Integration tests. The unit plus some of its real collaborators — often the database, sometimes external services, sometimes the whole service. Runtime in seconds. Scope is a slice of the system that produces meaningful behavior together.
Contract tests. Pact and similar. Verify that the expectations a consumer has of a producer match what the producer actually provides. Run without the producer and consumer being up at the same time — the contract file is the medium. The category most directly aimed at microservices’ specific testing problem, and the one most under-used relative to its value.
End-to-end tests. The full system, from user input to user-observable output, usually through the real UI and against a real database. Runtime in minutes. Scope is the whole product flow. Catch real bugs nothing else catches. Flaky, expensive, hard to maintain. A small number of them, focused on the critical user journeys, earns its keep. A large number drowns.
Exploratory / manual testing. Humans trying things. Irreplaceable for usability and for unexpected combinations. Not automated. Should catch what the automation cannot, not what the automation has neglected.
Each level tests a different thing. The question of “how many of each” is the question of how the system’s complexity is distributed — where the bugs will live, and therefore where the tests should.
Fakes, stubs, mocks, and spies
The naming is muddled and worth getting right, because the choices affect what the tests actually prove.
A fake is a working implementation that is simpler than production — an in-memory database, a fake mail sender that records what was sent. Fakes behave like the real thing within a bounded scope. Tests against fakes run fast and exercise real logic.
A stub is a hard-coded response. “When called with this input, return this output.” Does not verify how it was called. Cheap, useful for isolating code from its dependencies during unit tests.
A mock is a stub that additionally verifies how it was called. “Was this method invoked exactly once, with these arguments?” Mocks turn a test into an assertion about implementation details — which is sometimes what you want (ensuring an email gets sent) and sometimes a liability (tests break when the implementation changes even though the behavior is correct).
A spy is like a mock but passes through to a real implementation — it records calls without replacing behavior. Useful for assertions that a real function was called, without changing what the real function does.
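The four kinds of double can be sketched with Python’s `unittest.mock` (the `FakeMailer` class and `notify` function are hypothetical, invented for illustration):

```python
from unittest.mock import Mock

# Fake: a working, simplified implementation -- exercises real logic.
class FakeMailer:
    def __init__(self):
        self.sent = []
    def send(self, to: str, body: str):
        self.sent.append((to, body))

def notify(mailer, user: str):
    mailer.send(user, "your order shipped")

fake = FakeMailer()
notify(fake, "a@example.com")
assert fake.sent == [("a@example.com", "your order shipped")]

# Stub: hard-coded response; does not care how it was called.
stub_rates = Mock()
stub_rates.lookup.return_value = 1.25
assert stub_rates.lookup("USD", "EUR") == 1.25

# Mock: a stub that also asserts on the interaction itself.
mock_mailer = Mock()
notify(mock_mailer, "b@example.com")
mock_mailer.send.assert_called_once_with("b@example.com", "your order shipped")

# Spy: records calls while passing through to a real implementation.
real = FakeMailer()
spy = Mock(wraps=real)
notify(spy, "c@example.com")
spy.send.assert_called_once()               # the interaction was recorded...
assert real.sent[0][0] == "c@example.com"   # ...and the real behavior still ran
```

Note how only the fake and the spy exercise real behavior; the stub and the mock prove only that the code matches what the test told them to say.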
The common failure: using mocks everywhere because the testing framework makes it easy. A codebase heavily mocked becomes test-brittle — the tests fail on refactors that do not change behavior, and do not fail on bugs that do. The rule of thumb: prefer fakes over stubs, prefer stubs over mocks, and use mocks only when the interaction itself is what you are testing.
Contract testing and the microservice problem
The single hardest testing problem in a microservice architecture: you cannot spin up the whole system for each test. Twenty services, each with dependencies, running against real databases, would take minutes to start and would be flaky. So most teams end up with either a staging environment that every team’s tests run against (slow, shared, always broken) or fully mocked tests between services (fast, isolated, and wrong about whether the services actually agree).
Contract testing is the way out. The consumer writes a test that exercises its client code against a mock of the producer — but the mock’s behavior is specified in a contract file that is shared with the producer’s test suite. The producer’s CI runs the contract file against the real producer, verifying that the producer actually behaves as the consumer expects.
The result: the consumer’s tests run fast (against a mock) and are correct (because the contract is verified on the producer side). The producer’s tests catch breaking changes to its API before they ship, because the contract is in its CI. Neither needs the other to be running.
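The mechanics can be sketched without any tool at all (real frameworks like Pact generate the contract from consumer tests and publish it to a broker; here it is an inline dict, and every name is hypothetical):

```python
# A contract is a shared, machine-checkable description of one interaction.
CONTRACT = {
    "request": {"method": "GET", "path": "/users/42"},
    "response": {"status": 200, "body": {"id": 42, "name": "Ada"}},
}

# Consumer side: run the client against a mock whose behavior IS the contract.
def consumer_client(transport, user_id: int) -> str:
    status, body = transport("GET", f"/users/{user_id}")
    assert status == 200
    return body["name"]

def mock_transport(method, path):
    r = CONTRACT["request"]
    assert (method, path) == (r["method"], r["path"]), "consumer drifted from contract"
    return CONTRACT["response"]["status"], CONTRACT["response"]["body"]

assert consumer_client(mock_transport, 42) == "Ada"

# Producer side (in the producer's CI): replay the contract's request against
# the real handler and check the real response matches the contract.
def producer_handler(method, path):
    if method == "GET" and path.startswith("/users/"):
        return 200, {"id": int(path.rsplit("/", 1)[1]), "name": "Ada"}
    return 404, {}

status, body = producer_handler(
    CONTRACT["request"]["method"], CONTRACT["request"]["path"]
)
assert (status, body) == (CONTRACT["response"]["status"], CONTRACT["response"]["body"])
```

If the producer changes its response shape, the producer-side replay fails in the producer’s CI — before the change ships — without either service ever calling the other.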
Pact is the usual tool. Spring Cloud Contract in the Java ecosystem. Specmatic, which works from OpenAPI. The choice of tool matters less than adopting the discipline: every consumer-producer pair has a contract, and both sides verify it automatically.
The adoption cost is real. Teams must learn the workflow. The contract broker (where the files live) is a new piece of infrastructure. Some producer teams resent being held to consumer expectations. All of that is worth it. Contract testing is the piece that turns microservices from a perpetual integration headache into a system with actual deployability boundaries.
Property-based testing
Example-based tests — “call the function with this input, expect that output” — verify specific cases. They miss the cases you did not think of. Property-based tests — QuickCheck, Hypothesis, fast-check — verify properties that should hold for all inputs, by generating many random inputs and checking the property on each.
Examples: “sorting a list twice gives the same result as sorting once.” “Encoding then decoding a value gives the original value.” “Adding $100 to an account and then withdrawing $100 leaves the balance unchanged.” These properties, when violated, usually indicate a bug. The random generator explores corner cases the developer forgot to write: empty inputs, very large inputs, Unicode, negatives, zero, NaN.
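Two of those properties, in a hand-rolled sketch of the technique (real tools like Hypothesis add smarter generation and shrink failing inputs to minimal cases; this uses plain `random` to show only the idea):

```python
import random

def random_list(rng):
    """Generate a random list of ints, including the empty list."""
    return [rng.randint(-10**6, 10**6) for _ in range(rng.randint(0, 50))]

rng = random.Random(0)  # seeded, so any failure is reproducible
for _ in range(200):
    xs = random_list(rng)

    # Property: sorting twice gives the same result as sorting once.
    assert sorted(sorted(xs)) == sorted(xs)

    # Property: encoding then decoding returns the original value.
    encoded = ",".join(map(str, xs))
    decoded = [int(s) for s in encoded.split(",")] if encoded else []
    assert decoded == xs
```

Two hundred generated cases routinely include the inputs a hand-written example suite forgets — the empty list, single elements, duplicates, large negatives.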
Property-based tests have a sweet spot: mathematical functions, serializers, parsers, algorithms with clear invariants. They are less useful for glue code, UI, and integrations. In their sweet spot, a handful of property tests can find bugs that thousands of example tests miss.
The setup cost is real. Writing a property is harder than writing an example. The payoff is that the property tests keep finding bugs across refactors, while examples need to be manually extended each time.
The shape that fits your system
There is no universal answer, but the decision process is tractable if you ask it in pieces.
- Is the code mostly pure logic? The pyramid.
- Is the code mostly glue? The trophy.
- Is the code a microservice whose value is at its API? The honeycomb, with contract tests.
- Does the code have mathematical structure (parsers, serializers, algorithms)? Add property-based tests.
- Does the code have critical user journeys? Add a small, focused, maintained layer of end-to-end tests.
- Does the code have API contracts with other teams or external clients? Add contract tests.
Most real codebases have several kinds of code and deserve several kinds of tests. A “strategy” that picks one shape and applies it uniformly is leaving signal on the table. A test suite that reflects the actual complexity of the code — rich in the logic-heavy parts, thick at the integration seams in the glue parts, contract-tested at the service boundaries — is harder to describe in a diagram and more effective in practice.
What tests are actually for
A final framing. Tests exist for three purposes: to catch bugs before they ship, to document behavior for future readers, and to enable refactoring without fear. Every test should be earning its keep on at least two of those three.
A test that catches a bug is valuable. A test that documents a non-obvious behavior is valuable even if it has never caught a bug — because it stops someone, months later, from changing the behavior without meaning to. A test that lets you refactor aggressively without breaking things is what the whole discipline is for, ultimately — code changes constantly, and a test suite that approves the change is what makes change safe.
A test that does none of these — a test that is slow, tests an implementation detail, documents nothing, and has never caught a bug — is technical debt, not testing. Delete it. The metric that matters is not how many tests you have but how much trust they produce in the change you are about to make. Aim for trust. The shape of the test pyramid, trophy, or honeycomb is whatever shape produces trust for your specific code.
Test what matters. Delete what does not. Run them fast. Trust them when they pass. That is the whole job.