Apache Kafka has become the backbone of event-driven architecture in modern distributed systems. In production teams handling millions of events daily, the same handful of patterns separate reliable pipelines from fragile ones. This guide focuses on those practical patterns, going well beyond the introductory pub/sub examples that most tutorials stop at.
Why Event-Driven?
Traditional request-response architectures create tight coupling between services. When Service A directly calls Service B, both must be available simultaneously, and changes in B's API break A. As the number of services grows, this synchronous web becomes a distributed monolith where one slow dependency stalls an entire request chain.
An event-driven approach inverts this relationship:
Traditional: Service A → Service B → Service C
(coupled, synchronous, brittle)
Event-Driven: Service A → [Kafka] ← Service B
← Service C
(decoupled, async, resilient)
Services publish events about what happened. Other services subscribe to events they care about. Crucially, the producer needs no knowledge of who consumes the event, so you can add a new analytics consumer or fraud-detection service later without touching the producer at all.
Kafka Fundamentals in Practice
Topic Design
Topic design is the foundation. A common mistake is creating too many topics or too few. A useful default is one topic per aggregate or domain event family, which keeps related events ordered and discoverable:
# One topic per aggregate/domain event type
order-events → OrderCreated, OrderUpdated, OrderCancelled
payment-events → PaymentProcessed, PaymentFailed, RefundIssued
inventory-events → StockReserved, StockReleased, LowStockAlert
Partitioning strategy: Partition by the entity's natural key (e.g., orderId). This guarantees ordering for related events while allowing parallel processing. Remember that Kafka only orders messages within a partition — events for different orders may interleave globally, and that is exactly what you want for throughput.
@Configuration
public class KafkaTopicConfig {
@Bean
public NewTopic orderEvents() {
return TopicBuilder.name("order-events")
.partitions(12)
.replicas(3)
.config(TopicConfig.RETENTION_MS_CONFIG, "604800000") // 7 days
.build();
}
}
Choosing Partition Count
Partition count is one of the few decisions that is painful to change later, because adding partitions reshuffles key-to-partition assignment and breaks per-key ordering for in-flight data. Therefore it pays to size partitions for your projected peak, not today's load. A practical heuristic is to target a throughput ceiling per partition and divide: if a single consumer thread comfortably handles roughly 10 MB/s and you expect 100 MB/s at peak, you need at least ten partitions, plus headroom for future consumer scaling.
However, do not over-partition either. Every partition adds open file handles, replication overhead, and controller metadata; clusters with hundreds of thousands of partitions suffer slow leader elections and longer recovery times. As a rule of thumb, keep total partitions per broker in the low thousands and revisit only when consumer lag proves you are throughput-bound rather than processing-bound.
Producer Patterns
Transactional Outbox Pattern
The biggest challenge with event-driven systems is ensuring the database write and event publish happen atomically. If you save the order and then crash before publishing, downstream services never learn the order exists; if you publish first and the database write rolls back, you have announced an event that never happened. The outbox pattern solves this by writing the event to the same database in the same transaction.
@Service
@Transactional
public class OrderService {
private final OrderRepository orderRepo;
private final OutboxRepository outboxRepo;
public Order createOrder(OrderRequest request) {
// 1. Save the order
Order order = orderRepo.save(new Order(request));
// 2. Write event to outbox table (same transaction)
outboxRepo.save(new OutboxEvent(
"order-events",
order.getId().toString(),
"OrderCreated",
toJson(order)
));
return order;
// Transaction commits: both order AND outbox event are saved atomically
}
}
A separate process (Debezium CDC or a scheduled poller) reads the outbox table and publishes to Kafka. Debezium is the preferred relay in production because it tails the database write-ahead log instead of polling, which avoids the lag and load of a busy SELECT loop:
@Scheduled(fixedDelay = 1000)
public void publishOutboxEvents() {
List<OutboxEvent> events = outboxRepo.findUnpublished();
for (OutboxEvent event : events) {
kafkaTemplate.send(event.getTopic(), event.getKey(), event.getPayload());
event.markPublished();
outboxRepo.save(event);
}
}
Consumer Patterns
Idempotent Consumer
Network issues and rebalances mean consumers may receive the same message twice. Kafka guarantees at-least-once delivery by default, so your consumers must be idempotent rather than assuming exactly-once. The simplest approach is a processed-events table keyed by a unique event ID.
@KafkaListener(topics = "order-events", groupId = "payment-service")
public void handleOrderEvent(ConsumerRecord<String, String> record) {
String eventId = record.headers()
.lastHeader("eventId").value().toString();
// Check if already processed
if (processedEventRepo.existsByEventId(eventId)) {
log.info("Duplicate event ignored: {}", eventId);
return;
}
// Process the event
OrderEvent event = deserialize(record.value());
paymentService.processPayment(event);
// Mark as processed
processedEventRepo.save(new ProcessedEvent(eventId));
}
For correctness under failure, the side effect and the processed-events insert should commit in one local transaction. Otherwise a crash between "process payment" and "mark processed" reintroduces the duplicate you were trying to prevent. A common production pattern is to make the business write and the dedup marker rows in the same relational transaction, then commit the Kafka offset only after that transaction succeeds.
Dead Letter Queue (DLQ)
Messages that fail after retries should go to a dead letter topic for investigation, not be silently dropped. Spring Kafka's DefaultErrorHandler with a DeadLetterPublishingRecoverer routes poison messages to a parallel order-events.DLT topic after exhausting retries.
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerFactory() {
ConcurrentKafkaListenerContainerFactory<String, String> factory =
new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.setCommonErrorHandler(new DefaultErrorHandler(
new DeadLetterPublishingRecoverer(kafkaTemplate),
new FixedBackOff(1000L, 3) // 3 retries, 1 second apart
));
return factory;
}
One subtlety often missed: distinguish transient failures (a timed-out downstream call) from permanent ones (a malformed payload). Retrying a malformed message three times just wastes a few seconds before the inevitable DLQ; retrying a transient network blip often succeeds on the second attempt. Configure the error handler to treat deserialization exceptions as non-retryable so they skip straight to the dead letter topic.
Event Sourcing with Kafka
Instead of storing current state, store the sequence of events that led to the current state. Kafka's immutable, ordered log makes it a natural fit for this pattern.
Event Store (Kafka Topic: order-events)
┌──────────────────────────────────────────────────┐
│ OrderCreated │ ItemAdded │ ItemAdded │ OrderPaid │
│ (t=1) │ (t=2) │ (t=3) │ (t=4) │
└──────────────────────────────────────────────────┘
↓ Replay events
Current State: Order {
items: [item1, item2],
status: PAID,
total: $150.00
}
This approach gives you a complete audit trail, the ability to replay events to rebuild state, and time-travel debugging. The trade-off is real, though: you must handle schema evolution carefully (old events live forever) and build read models or snapshots so you are not replaying millions of events on every read. Many teams pair event sourcing with the CQRS pattern, projecting events into a query-optimized store; our companion piece on event sourcing and CQRS walks through that projection step in detail.
Monitoring Kafka in Production
You cannot operate an event-driven platform you cannot observe. These are the metrics worth alerting on:
| Metric | Alert Threshold | Why |
|---|---|---|
| Consumer lag | > 10,000 messages | Consumer falling behind |
| Under-replicated partitions | > 0 | Data durability at risk |
| Request latency (p99) | > 500ms | Broker performance degradation |
| ISR shrink rate | > 0 | Broker health issues |
# Prometheus + Grafana dashboard queries
kafka_consumer_lag_sum{group="payment-service"} > 10000
kafka_server_under_replicated_partitions > 0
Consumer lag is the single most actionable signal. A steadily rising lag means consumers are not keeping up, and the fix is usually more partitions plus more consumer instances. A lag that spikes and recovers, by contrast, is normal backpressure during traffic bursts and does not warrant paging anyone.
Performance Tuning Tips
Producer side:
-
Use
acks=allfor critical data,acks=1for high-throughput non-critical data -
Batch messages:
batch.size=32768,linger.ms=20 -
Enable compression:
compression.type=lz4
Consumer side:
-
Increase
max.poll.recordsfor batch processing -
Use
fetch.min.bytesto reduce network calls -
Match partition count to consumer instances for parallelism
A small but important note on linger.ms: setting it slightly above zero lets the producer accumulate records into larger batches, which improves throughput and compression ratio at the cost of a few milliseconds of latency. For most pipelines that trade is well worth it; for latency-sensitive command paths it is not, so tune per producer rather than globally.
When NOT to Use Event-Driven Architecture
Event-driven systems are not free. They trade the simplicity of a synchronous call stack for eventual consistency, harder debugging, and real operational overhead in running and securing a Kafka cluster. If your system is a single service with a single database, introducing Kafka adds infrastructure without solving a problem you actually have.
Likewise, workflows that genuinely need an immediate, strongly consistent answer — "did this payment succeed, yes or no, right now" — fit a synchronous request-response or a saga with explicit compensation better than fire-and-forget events. The honest rule is to reach for events when you have multiple independently deployed consumers, need an audit log, or want to decouple producers from a growing set of subscribers. Below that threshold, a direct API call or a job queue is simpler and easier to reason about.
Final Thoughts
Building event-driven systems with Kafka is powerful but introduces operational complexity. Start with simple pub/sub patterns, master idempotency and error handling, and evolve toward event sourcing only when the business requirements justify it.
For further reading, refer to the Martin Fowler architecture guides and the Microservices patterns for comprehensive reference material. For a deeper treatment of the same theme, see our guide on Kafka messaging patterns.
The systems worth admiring are rarely the ones with the most sophisticated event patterns — they are the ones where events flow reliably, failures are handled gracefully, and the team can debug issues by reading the event stream like a story.
In conclusion, event-driven architecture with Kafka is an essential topic for modern software development. By applying the patterns and practices covered in this guide, you can build more robust, scalable, and maintainable systems. Start with the fundamentals, iterate on your implementation, and continuously measure results to ensure you are getting the most value from these approaches.