Event-driven architecture: the log is the contract
The move from request/response to event-driven is one of those architectural shifts that looks like a technology choice and is actually a modeling choice. Two services that talk over HTTP encode a specific claim about the world: service A needs an answer from service B right now, and is willing to be unavailable if B is. Two services that talk over an event log encode a different claim: A produces facts, and whoever cares can consume them on their own schedule. Those are not interchangeable architectures with different wire formats. They are different answers to who is coupled to whom, and how tightly.
Event-driven architecture is the umbrella term for the second answer. It covers a surprisingly wide spectrum — from a service that emits a webhook when an order ships, to a Kafka topic that is the system of record for every state change that has ever happened. The pieces look similar on a diagram. They are not the same pattern, and confusing them is how teams end up with the costs of an event-driven system and the coupling of a synchronous one.
Events vs. commands
The distinction is older than the pattern language and is still the place where most teams go wrong. A command is an instruction: PlaceOrder, ChargeCard, CancelShipment. It is directed at a specific recipient. It can be rejected. It is written in the imperative.
An event is a statement of fact about something that has already happened: OrderPlaced, CardCharged, ShipmentCancelled. It has no recipient; anyone can listen or nobody can. It cannot be rejected, because the thing is already done. It is written in the past tense.
The subtle failure is a message called SendEmail published to a topic and consumed by an email service. That is a command pretending to be an event. The publisher is coupled to a specific behavior at a specific recipient; the only thing the topic has bought is asynchrony. Rename the message to OrderConfirmed, let the email service decide to react to it by sending an email, and the coupling inverts: the order service no longer knows or cares who sends emails. That inversion is the whole point of events. If it has not happened, you have queueing, not event-driven architecture.
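The inversion can be sketched in a few lines. An in-memory subscriber list stands in for a real broker, and all type and handler names are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class OrderConfirmed:  # past tense: a fact, not an instruction
    order_id: str
    customer_email: str

subscribers: list[Callable[[OrderConfirmed], None]] = []

def publish(event: OrderConfirmed) -> None:
    # The publisher does not know or care who, if anyone, is listening.
    for handler in subscribers:
        handler(event)

# The email service decides, on its own, to react to the fact.
sent: list[str] = []

def email_service(event: OrderConfirmed) -> None:
    sent.append(f"confirmation to {event.customer_email}")

subscribers.append(email_service)
publish(OrderConfirmed(order_id="o-1", customer_email="a@example.com"))
```

The order service's code contains no mention of email. Delete the email service and the publisher does not change; add a loyalty-points service and the publisher still does not change.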
Three flavors
Martin Fowler’s taxonomy holds up well. Event-driven systems tend to be one of three kinds, and most of the confusion comes from treating them as interchangeable.
Event notification. A service emits an event to tell the world something happened, carrying minimal data — usually just an identifier and a timestamp. Consumers that want more context call back to the producer’s API. This is cheap to adopt, couples consumers to the producer’s API, and is the least ambitious variant. It is request/response with a trigger in front of it.
Event-carried state transfer. The event carries enough state that consumers can act without calling back. OrderPlaced includes the line items, the customer id, the shipping address, the totals. Consumers can maintain their own projections of the data they care about and answer queries locally. The producer is now decoupled from the consumers’ availability, but the event schema is a published contract and evolves under real constraints.
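A sketch of what this looks like on the consumer side, with an illustrative OrderPlaced payload (field names are assumptions, not any real schema):

```python
# A hypothetical OrderPlaced event carrying enough state that a consumer
# never needs to call back to the order service.
order_placed = {
    "type": "OrderPlaced",
    "order_id": "o-42",
    "customer_id": "c-7",
    "shipping_address": {"line1": "1 Main St", "city": "Springfield"},
    "line_items": [{"sku": "widget", "qty": 2, "unit_price_cents": 499}],
    "total_cents": 998,
}

# The consumer maintains its own projection of the data it cares about
# and answers queries locally, without the producer being available.
revenue_by_customer: dict[str, int] = {}

def apply(event: dict) -> None:
    if event["type"] == "OrderPlaced":
        cid = event["customer_id"]
        revenue_by_customer[cid] = revenue_by_customer.get(cid, 0) + event["total_cents"]

apply(order_placed)
```

Note that the payload is now a contract: remove `total_cents` and this consumer breaks, which is exactly the "evolves under real constraints" cost.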
Event sourcing. The events are the system of record. Current state is derived by replaying them. This is the most ambitious flavor — it changes how the write model is built, how history is queried, how schemas evolve, and how bugs are fixed. Covered in the DDD post; worth reading Greg Young before committing.
A team that picks “event-driven” without picking one of these three ends up building all three accidentally, in different parts of the system, with no shared story about what events mean or how much they carry. Pick deliberately. Write it down.
The log is the architecture
The most consequential shift in event-driven architecture over the last decade is the move from transient messaging (RabbitMQ, SQS — “deliver this, then forget it”) to durable, ordered, replayable logs (Kafka, Kinesis, Pulsar, Redpanda — “append this, keep it, anyone can replay it from any offset”).
The log changes what events are. In a transient broker, an event is a notification you hope someone received; if a consumer was down, it lost the message. In a log, an event is a position in a sequence. A consumer that was down replays from where it left off. A new consumer added a year later can replay from the beginning and catch up. The log becomes the system’s shared source of truth about what has happened, and consumers become projections over it.
This has consequences that go beyond the wire format:
- Event schemas become long-lived. Old events stay in the log. Consumers written today must still be able to read events written two years ago. Schema evolution — additive changes, optional fields, explicit versioning — stops being optional.
- Replay is a first-class operation. Debugging, rebuilding a projection, onboarding a new consumer, recovering from a corrupted database — all become “replay the log from offset X.”
- Ordering is per-partition, not global. Systems that assume total ordering (“every consumer sees every event in the same order”) will break at scale. Systems that encode their ordering requirements in the partition key survive.
- The log is a dependency. Its availability is now part of every producer’s availability. Its schema registry is a piece of infrastructure somebody has to own.
When people say “Kafka is a database,” they usually mean this: the log is persistent, ordered, and replayable enough that it can be treated as the primary record, with other stores as derived projections. That is a strong claim and not every system needs it. It is, however, the claim that makes event-driven architecture qualitatively different from async-messaging-with-extra-steps.
The dual-write problem and the outbox
The most common bug in event-driven systems is subtle and catastrophic. A service handles a command, writes the new state to its database, and publishes an event to the broker. Two writes, two systems, no shared transaction.
- If the database commit succeeds and the publish fails, the rest of the world never learns what happened. The system is silently inconsistent.
- If the publish succeeds and the database commit fails, the world learns about a thing that did not actually happen.
- If the process crashes between the two writes, only whichever write was issued first took effect — either of the failure modes above, depending on the order the code performs them.
No amount of retry logic fixes this. The problem is that there is no atomic “commit these two writes together” across a database and a broker.
The outbox pattern is the standard fix. The service writes the new state and the event to the same database, in the same transaction, to two tables: the business tables and an outbox table. A separate process — often a Change Data Capture (CDC) pipeline watching the database’s replication log — reads the outbox and publishes the events to the broker. If the publish fails, it retries. If the service crashed mid-transaction, neither the state change nor the event exists. The atomicity of the database is used to produce at-least-once delivery of the event.
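A sketch of the mechanics using SQLite for both the business state and the outbox, with a plain list standing in for the broker. The relay would be a CDC pipeline or a polling publisher in a real system; table and column names are illustrative:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
           " payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id: str) -> None:
    # State change and event land in the same local transaction: both or neither.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

broker: list[dict] = []

def relay_once() -> None:
    # Separate process: read unpublished rows, publish, mark done.
    # Retried on failure, so delivery is at-least-once downstream.
    rows = db.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        broker.append(json.loads(payload))
        db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    db.commit()

place_order("o-1")
relay_once()
```

If the process dies inside `place_order`, the transaction rolls back and neither row exists; if the relay dies after publishing but before marking the row, the event is published again on retry — which is why consumers must be idempotent.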
The outbox has a cousin: the inbox pattern on the consumer side, where the event is written to a local table before any work is done, making the consumer’s handler idempotent by construction. Between the two, most of the hard distributed-transaction problems in event-driven systems have a mechanical answer. Use the answer. Hand-rolled “publish after commit” code that does not use the outbox is the bug you will discover in production at 3am.
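The inbox side can be sketched the same way: the consumer records each event id in a local table inside the same transaction as its side effects, so a redelivered event (at-least-once delivery guarantees these) is detected and skipped. Table and field names are assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE emails_sent (order_id TEXT)")

def handle(event: dict) -> bool:
    try:
        with db:
            # A duplicate event_id violates the primary key, which aborts
            # the whole transaction — so the side effect below can never
            # happen twice for the same event.
            db.execute("INSERT INTO inbox VALUES (?)", (event["event_id"],))
            db.execute("INSERT INTO emails_sent VALUES (?)", (event["order_id"],))
        return True
    except sqlite3.IntegrityError:
        return False  # already processed; safe to ack and move on

event = {"event_id": "e-1", "order_id": "o-1"}
handle(event)
handle(event)  # redelivery: detected, no second email
```

The handler is idempotent by construction: correctness does not depend on the broker delivering exactly once, only on the local transaction being atomic.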
Change Data Capture
CDC is the pattern of reading a database’s transaction log (Postgres WAL, MySQL binlog, etc.) and turning each row-level change into an event. It solves the outbox problem at a lower level — the database’s own commit log is the source of events, and the application does not need to maintain a separate outbox table.
CDC is powerful and has a specific risk: the events it produces are shaped like database rows, not like the domain. customers.updated with a before/after row diff is not a domain event. It is a leak of the producer’s schema into every consumer. Consumers written against CDC tend to couple to the producer’s table structure; a schema migration on the producer breaks every downstream.
The healthy pattern is to use CDC as the transport for events that are still modeled as domain events — write domain events to an outbox table, let CDC publish them to the broker. You get CDC’s transactional guarantees without leaking your schema to the world. Reserve raw CDC for cases where the consumer genuinely is a database replica (data warehouse ingestion, search index backfills) and the coupling is the point.
Eventual consistency as a first-class citizen
An event-driven system is, by definition, eventually consistent between producers and consumers. The producer commits a change; the consumers see it some milliseconds-to-minutes later. Most of the hard work of building these systems is making eventual consistency bearable for users and honest for operators.
The places it bites:
- Read-your-writes across services. A user updates their profile in the account service and immediately loads the order page, which reads a profile projection maintained by the order service’s consumer. The projection is still propagating. The user sees stale data on the page they just navigated to. The fix is usually either to route that specific read through the source of truth, or to block the navigation until the projection confirms it has caught up — both are ugly, both are sometimes necessary.
- Cross-aggregate invariants. “A user cannot have more than ten active subscriptions” is trivial inside one service and fragile across services. Enforcement becomes a compensating action: you accept the eleventh subscription and cancel it after the fact, or you introduce a saga that serializes the check. Neither is as clean as a database constraint.
- Debugging. “What did the system think the state was at 14:32:07?” becomes a multi-service question. Good observability (correlated event ids, timestamps, causation chains) is what makes this tractable. Without it, incident response is guesswork.
The point is not that eventual consistency is bad. It is that the cost of it shows up in specific, recurring places, and a team adopting event-driven architecture should budget for them explicitly. A system that pretends it is synchronous and is not is worse than either a synchronous system or an honestly asynchronous one.
Choreography and orchestration, again
The microservices post covered sagas. The event-driven framing sharpens the choice.
A choreographed event-driven system has no central coordinator. Each service reacts to events and emits its own. The workflow is the shape of the event graph. This is maximally decoupled and works well for simple flows. It gets painful when the flow is complex, because no single artifact describes the end-to-end process, and debugging a stuck order means mentally stitching together events from five services.
An orchestrated event-driven system has a workflow service that owns each long-running process. It consumes events, decides the next step, and issues commands. The workflow is explicit, in code or in a workflow engine (Temporal, Step Functions, Camunda). This is easier to reason about and easier to operate, at the cost of a new service that has to be available for the workflow to progress.
The serious mistake is to pick choreography for process reasons (decoupling feels good) and then hide an orchestrator inside one of the services anyway, because without it nobody can figure out why an order is stuck. That is orchestration with worse naming. Pick the pattern explicitly. For workflows with more than three or four steps, orchestration is usually the honest answer.
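A toy sketch of what the explicit version looks like: a single workflow service consumes events, tracks where each order is in the process, and issues the next command. The step table and names are illustrative, not any real workflow engine's API:

```python
# Commands the orchestrator sends to other services: (command, order_id).
commands: list[tuple[str, str]] = []

# Which event advances the workflow to which next command. This table IS
# the end-to-end process, in one artifact, readable in one place.
NEXT_STEP = {
    "OrderPlaced":     "ChargeCard",
    "CardCharged":     "CreateShipment",
    "ShipmentCreated": None,  # workflow complete
}

state: dict[str, str] = {}  # order_id -> last event seen (persisted in real life)

def on_event(event_type: str, order_id: str) -> None:
    state[order_id] = event_type
    command = NEXT_STEP.get(event_type)
    if command is not None:
        commands.append((command, order_id))

for evt in ["OrderPlaced", "CardCharged", "ShipmentCreated"]:
    on_event(evt, "o-1")
```

"Why is order o-1 stuck?" becomes a lookup in `state` rather than a hunt through five services' logs — which is the operational argument for orchestration.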
When events are the wrong shape
Not every interaction wants to be an event. A user asking for their current account balance wants an answer now, from the source of truth, synchronously. Forcing that through an event system (“publish a BalanceRequested event, wait for a BalanceResponded event”) is request/response with extra latency and a correlation-id bug waiting to happen.
A useful test: if the producer genuinely does not care who, if anyone, consumes the message, an event is the right shape. If the producer needs a specific recipient to do a specific thing and return a result, it is a command or a query, and HTTP/gRPC is the honest transport. Mixing these shows up as the same information being expressed three different ways in three different services, with no canonical version.
What the style will and will not buy you
Event-driven architecture, when it fits, buys you loose coupling between producers and consumers, the ability to add new consumers without touching producers, an audit log of what the system has done, and a natural fit with microservice boundaries and DDD’s domain events.
It does not buy you simplicity. It moves complexity from request paths into the interpretation of event streams, the evolution of event schemas, the handling of out-of-order and duplicate messages, and the operability of the log itself. A team that adopts events without an explicit story for idempotency, ordering, schema evolution, and replay ends up with a system that looks loosely coupled on a diagram and behaves like a distributed debugger at 3am.
The rule of thumb after two decades of practice: model in events the things the business genuinely thinks of as facts (OrderPlaced, PaymentCaptured, ShipmentDispatched), and model in commands or queries the things it thinks of as actions (chargeCard, getBalance). Publish the events on a durable log. Use the outbox to get them there honestly. Budget for eventual consistency where it actually appears. Pick choreography or orchestration deliberately, and write it down.
The log is the contract. Everything else is implementation.