Event-Driven Architecture with Kafka: Patterns for Production Systems
Synchronous REST calls between microservices create tight coupling — when the payment service is slow, the order service is slow too. Event-driven architecture with Kafka decouples services by communicating through events instead of direct calls. Therefore, this guide covers the patterns that make event-driven systems reliable in production: event sourcing, CQRS, sagas, dead letter queues, the transactional outbox, and consumer group strategies.
Why Event-Driven? The Coupling Problem
In a synchronous architecture, placing an order might involve five REST calls: validate inventory, charge payment, update loyalty points, send confirmation email, and notify warehouse. If any service is down, the order fails. Moreover, adding a new step (like fraud detection) requires modifying the order service to call yet another endpoint.
With event-driven architecture, the order service publishes an “OrderPlaced” event to Kafka. Every interested service subscribes to this topic and reacts independently. The payment service charges the card, the inventory service reserves stock, the email service sends confirmation — all without the order service knowing they exist. Adding fraud detection means adding a new consumer; the order service never changes.
This decoupling has real operational benefits. Services can be deployed independently, scale independently, and fail independently. When the email service goes down for maintenance, orders keep flowing. The email service catches up when it comes back online because Kafka retains events for days or weeks.
Event Sourcing: Store Events, Not State
Traditional databases store the current state — an order has status “shipped.” Event sourcing stores every state change as an immutable event: OrderPlaced, PaymentReceived, ItemsPacked, OrderShipped. The current state is derived by replaying all events for that entity.
// Event Sourcing with Kafka and Spring Boot
@Entity
@Table(name = "event_store")
public class StoredEvent {
@Id
private UUID eventId;
private String aggregateId;
private String eventType;
private int version;
private Instant timestamp;
@Column(columnDefinition = "jsonb")
private String payload;
}
// Order aggregate rebuilt from events
public class OrderAggregate {
private String orderId;
private OrderStatus status;
private BigDecimal total;
private List items;
public static OrderAggregate rebuild(List events) {
OrderAggregate order = new OrderAggregate();
for (StoredEvent event : events) {
order.apply(event);
}
return order;
}
private void apply(StoredEvent event) {
switch (event.getEventType()) {
case "OrderPlaced":
OrderPlacedPayload placed = deserialize(event);
this.orderId = placed.getOrderId();
this.items = placed.getItems();
this.total = placed.getTotal();
this.status = OrderStatus.PLACED;
break;
case "PaymentReceived":
this.status = OrderStatus.PAID;
break;
case "OrderShipped":
this.status = OrderStatus.SHIPPED;
break;
case "OrderCancelled":
this.status = OrderStatus.CANCELLED;
break;
}
}
}
Event sourcing gives you a complete audit trail, the ability to replay events to rebuild state, and temporal queries (“what was this order’s status last Tuesday at 3 PM?”). Additionally, you can create new read models by replaying events through a new projection — without changing any existing code.
The trade-off is complexity. Querying the current state requires replaying events (or maintaining snapshots). Schema evolution for events requires careful versioning. And the event store grows indefinitely, requiring archiving strategies for old events.
CQRS: Separate Reads from Writes
Command Query Responsibility Segregation splits your data model into a write model (optimized for consistency) and one or more read models (optimized for queries). Events bridge the gap — every write produces an event that updates the read models asynchronously.
// Write side: handles commands, produces events
@Service
public class OrderCommandHandler {
@KafkaListener(topics = "order-commands")
public void handle(OrderCommand command) {
switch (command.getType()) {
case "PlaceOrder":
// Validate business rules
Order order = Order.create(command.getItems());
// Persist to event store
eventStore.save(order.getUncommittedEvents());
// Publish events to Kafka
order.getUncommittedEvents().forEach(event ->
kafkaTemplate.send("order-events", order.getId(), event)
);
break;
}
}
}
// Read side: consumes events, builds query-optimized views
@Service
public class OrderReadModelUpdater {
@KafkaListener(topics = "order-events", groupId = "order-read-model")
public void updateReadModel(OrderEvent event) {
switch (event.getType()) {
case "OrderPlaced":
// Insert into denormalized read table
jdbcTemplate.update(
"INSERT INTO order_summary (id, customer, total, status, placed_at) VALUES (?, ?, ?, ?, ?)",
event.getOrderId(), event.getCustomerName(),
event.getTotal(), "PLACED", event.getTimestamp()
);
break;
case "OrderShipped":
jdbcTemplate.update(
"UPDATE order_summary SET status = 'SHIPPED', shipped_at = ? WHERE id = ?",
event.getTimestamp(), event.getOrderId()
);
break;
}
}
}
CQRS lets you optimize each side independently. The write side uses a normalized schema for consistency. The read side uses denormalized tables, materialized views, or even Elasticsearch for fast queries. Consequently, you can handle complex queries without impacting write performance.
The hidden cost of CQRS is eventual consistency between the two models. After a write commits, there is a window — usually milliseconds, occasionally seconds under lag — where the read model still shows the old value. Your UI must account for this, either by reading the user’s own writes from the command side or by optimistically rendering the change locally until the projection catches up. Teams that ignore this window ship “I placed an order but my dashboard says I have none” bugs.
The Transactional Outbox: Atomic Writes and Publishes
One of the most common production bugs in event-driven systems is the dual-write problem. The command handler above does two things that must both succeed or both fail: it persists to the database and it publishes to Kafka. But these are two separate systems, and there is no distributed transaction spanning them. If the service crashes after the database commit but before the Kafka send, you have a saved order that no downstream service ever heard about.
The transactional outbox pattern fixes this. Instead of writing to the database and publishing to Kafka separately, you write the event into an outbox table in the same local transaction as your business data. A separate relay process — usually Change Data Capture via Debezium — tails the outbox table and publishes those rows to Kafka. Because the business write and the outbox write share one transaction, they are atomic.
-- Outbox table lives in the same database as the business tables
CREATE TABLE outbox (
id UUID PRIMARY KEY,
aggregate_id VARCHAR(64) NOT NULL,
event_type VARCHAR(64) NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
published BOOLEAN DEFAULT FALSE
);
-- Business write and event write commit together
BEGIN;
INSERT INTO orders (id, customer_id, total, status)
VALUES ('o-123', 'c-9', 49.99, 'PLACED');
INSERT INTO outbox (id, aggregate_id, event_type, payload)
VALUES (gen_random_uuid(), 'o-123', 'OrderPlaced',
'{"orderId":"o-123","total":49.99}');
COMMIT;
-- Debezium streams the new outbox row to Kafka, exactly once per row.
This pattern is the single highest-leverage reliability improvement most teams can make. It converts a brittle “hope both writes happen” flow into a guarantee. Combined with idempotent consumers, it gives you effectively exactly-once delivery semantics across the boundary between your database and your event log.
The Saga Pattern for Distributed Transactions
When an order involves payment, inventory, and shipping, you need a way to coordinate across services without distributed transactions (which don’t scale with Kafka). The saga pattern breaks a distributed transaction into a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.
SAGA: Place Order
Step 1: Reserve Inventory → on failure: (nothing to compensate)
Step 2: Charge Payment → on failure: Release Inventory
Step 3: Confirm Order → on failure: Refund Payment, Release Inventory
Step 4: Notify Warehouse → on failure: Cancel Order, Refund Payment, Release Inventory
Each step is a Kafka consumer that:
1. Performs its local transaction
2. Publishes a success/failure event
3. The saga orchestrator (or next choreography step) reacts accordingly
There are two approaches: choreography (each service knows what to do next) and orchestration (a central coordinator manages the workflow). Choreography is simpler for 3-4 steps but becomes spaghetti with more. Orchestration adds a single point of coordination but is easier to understand and debug. For most production systems, orchestration with a saga coordinator service is the practical choice.
The subtle requirement that catches teams off guard is that compensating actions must themselves be idempotent and, ideally, commutative. A refund that arrives twice must not double-refund, and a “release inventory” that races with a retry of “reserve inventory” must converge to the right stock count. Designing the compensation logic is usually harder than designing the happy path, and it deserves the bulk of your test coverage.
Partitioning and Ordering: The Key Decision
Kafka only guarantees ordering within a partition, never across a topic. The partition key you choose therefore dictates both your ordering guarantees and your parallelism, and it is the design choice teams most often get wrong. Key by orderId and every event for a given order lands in the same partition, processed in strict sequence — but a single hot order can never be parallelized. Key by something too coarse, like a country code, and one partition becomes a bottleneck while others sit idle.
A good default is to key by the aggregate identity that defines your ordering boundary: the order, the account, or the user. This guarantees per-entity ordering while still spreading load across partitions, since you have many entities. Crucially, repartitioning later is painful — adding partitions changes the key-to-partition mapping for new messages, which can interleave events for the same key across old and new partitions. Therefore, size your partition count generously up front; over-provisioning partitions is far cheaper than re-keying a live topic.
Dead Letter Queues and Consumer Groups
Events that cannot be processed — malformed data, business rule violations, downstream failures — need a place to go besides crashing your consumer in an infinite retry loop. A dead letter queue (DLQ) captures failed events for investigation and reprocessing.
// Kafka consumer with DLQ and retry
@Configuration
public class KafkaConsumerConfig {
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate template) {
// Retry 3 times with exponential backoff
BackOff backOff = new ExponentialBackOff(1000L, 2.0);
((ExponentialBackOff) backOff).setMaxElapsedTime(30000L);
// After retries exhausted, send to DLQ
DeadLetterPublishingRecoverer recoverer =
new DeadLetterPublishingRecoverer(template,
(record, ex) -> new TopicPartition(
record.topic() + ".DLQ", record.partition()
));
DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
// Don't retry on deserialization errors — they'll never succeed
handler.addNotRetryableExceptions(DeserializationException.class);
return handler;
}
}
// Consumer groups: scale consumers horizontally
// 12 partitions = up to 12 consumers in same group
// Each consumer processes a subset of partitions
@KafkaListener(
topics = "order-events",
groupId = "payment-service",
concurrency = "4" // 4 consumer threads
)
Consumer groups let you scale horizontally. If your topic has 12 partitions, you can run up to 12 consumer instances in the same group, each processing a portion of the events. Adding more consumers than partitions means some sit idle. Therefore, choose your partition count based on your expected throughput — more partitions mean more parallelism but also more overhead. One detail to plan for: distinguish retryable from non-retryable failures. A deserialization error or a validation failure will never succeed on retry, so route it straight to the DLQ; only transient errors like a timed-out downstream call deserve the backoff loop.
Operational Best Practices
Event-driven systems require different operational thinking. Monitor consumer lag — the gap between the latest event and what your consumer has processed. High lag means your consumer cannot keep up. Set alerts at 1,000 events lag for latency-sensitive consumers and 100,000 for batch processors.
Schema evolution is critical. Use a schema registry (Confluent Schema Registry or Apicurio) with Avro or Protobuf schemas. Enforce backward compatibility so old consumers can read new events. Additionally, never delete or rename fields — add new ones and deprecate old ones.
When NOT to Use Event-Driven Architecture
For all its strengths, event-driven architecture is not a default — it is a deliberate trade of synchronous simplicity for asynchronous decoupling, and that trade is not always worth it. A small team running a handful of services that mostly do request/response work will pay a real tax: a Kafka cluster to operate, a schema registry to govern, consumer lag to monitor, and the cognitive overhead of debugging flows that no longer appear in a single stack trace. In those cases, plain REST or gRPC is simpler and easier to reason about.
Event-driven design also fights you when you genuinely need a synchronous answer. If a user clicks “pay” and must see “approved” or “declined” on the same screen, modeling that as fire-and-forget events forces you to bolt on correlation IDs and reply topics, reinventing request/response badly. Strong read-after-write consistency, simple CRUD admin tools, and low-traffic internal services are all places where the event log adds latency and moving parts without buying much. The honest rule is to reach for events when you have multiple independent consumers, high throughput, or a real need to decouple deployment and failure domains — and to resist it when a direct call would do. When it does fit, adopt it incrementally: start with one well-chosen event stream rather than rewriting the whole system. The related patterns in a real-time messaging system show the same persistence-plus-fan-out shape in a different domain.
Related Reading:
- Event Sourcing and CQRS Deep Dive
- Saga Pattern for Distributed Transactions
- Microservices Architecture Patterns
Resources:
In conclusion, event-driven architecture with Kafka transforms tightly coupled services into independently scalable components. Start with simple event publishing, secure the boundary with a transactional outbox, add CQRS when read performance matters, implement sagas for distributed transactions, and always plan for failure with dead letter queues. The patterns are well-established — the key is adopting them incrementally rather than rewriting everything at once.