Pavan Rangani

HomeBlogOutbox Pattern for Reliable Event Publishing in Microservices

Outbox Pattern for Reliable Event Publishing in Microservices

By Pavan Rangani · March 26, 2026 · Architecture

Outbox Pattern for Reliable Event Publishing in Microservices

Outbox Pattern for Reliable Event Publishing

Outbox pattern reliable events publishing solves one of the hardest problems in distributed systems: how do you update a database and publish an event atomically? Without the outbox pattern, you risk either losing events (if the message broker is down after the DB commit) or publishing events for uncommitted changes (if you publish before committing). Both scenarios lead to data inconsistency across services. Moreover, these failures are intermittent and hard to reproduce, which makes them notoriously difficult to debug in production.

This guide explains the transactional outbox pattern in depth, covering both polling-based and log-based (CDC) implementations. You will learn how to implement it in Java with Spring Boot and PostgreSQL, handle failure scenarios, and choose the right approach for your architecture. Additionally, we will cover ordering guarantees, observability, and the operational trade-offs that determine which variant fits your team.

The Dual-Write Problem

Consider a typical microservice that creates an order and publishes an OrderCreated event. The naive approach has a critical flaw:

@Service
public class OrderService {

    @Transactional
    public Order createOrder(OrderRequest request) {
        // Step 1: Save to database
        Order order = orderRepository.save(new Order(request));

        // Step 2: Publish event — PROBLEM!
        // If this fails, the order exists but no event is published
        // If the app crashes between step 1 and 2, same issue
        eventPublisher.publish(new OrderCreatedEvent(order));

        return order;
    }
}

This is called the dual-write problem. You are writing to two systems (database and message broker) without a distributed transaction. The outbox pattern eliminates this by writing the event to the same database as the business data, within the same transaction. Two-phase commit (XA) could theoretically span both systems, but in practice few teams adopt it because it couples availability, hurts throughput, and most modern brokers like Kafka do not participate well in XA transactions.

Outbox pattern reliable events architecture diagram
Transactional outbox pattern eliminating dual-write inconsistency

How the Outbox Pattern Works

The pattern introduces an “outbox” table in the same database as your business data. Instead of publishing events directly, the service writes events to the outbox table within the same transaction as the business operation. A separate process then reads from the outbox table and publishes events to the message broker.

-- Outbox table schema
CREATE TABLE outbox_events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type  VARCHAR(255) NOT NULL,
    aggregate_id    VARCHAR(255) NOT NULL,
    event_type      VARCHAR(255) NOT NULL,
    payload         JSONB NOT NULL,
    created_at      TIMESTAMP NOT NULL DEFAULT NOW(),
    published_at    TIMESTAMP NULL,
    retry_count     INT NOT NULL DEFAULT 0
);

-- Index for the polling publisher
CREATE INDEX idx_outbox_unpublished
    ON outbox_events (created_at)
    WHERE published_at IS NULL;

Implementation with Spring Boot

@Entity
@Table(name = "outbox_events")
public class OutboxEvent {
    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    private String aggregateType;
    private String aggregateId;
    private String eventType;

    @JdbcTypeCode(SqlTypes.JSON)
    private String payload;

    private Instant createdAt;
    private Instant publishedAt;
    private int retryCount;
}

@Service
public class OrderService {

    private final OrderRepository orderRepository;
    private final OutboxRepository outboxRepository;
    private final ObjectMapper objectMapper;

    @Transactional  // Single transaction for both writes
    public Order createOrder(OrderRequest request) {
        // Save business data
        Order order = orderRepository.save(new Order(request));

        // Write event to outbox (same transaction!)
        OutboxEvent event = new OutboxEvent();
        event.setAggregateType("Order");
        event.setAggregateId(order.getId().toString());
        event.setEventType("OrderCreated");
        event.setPayload(objectMapper.writeValueAsString(
            new OrderCreatedPayload(order.getId(), order.getTotal(),
                order.getCustomerId())
        ));
        event.setCreatedAt(Instant.now());
        outboxRepository.save(event);

        return order;
    }
}

Because both the order and the outbox event are written in the same database transaction, they either both succeed or both fail. There is no window for inconsistency. Notice the aggregate_type and aggregate_id fields: they let consumers route, partition, and de-duplicate without parsing the payload, and they double as a natural partition key when you forward events to Kafka.

Polling Publisher Approach

The simplest way to relay outbox events to the message broker is a polling publisher — a scheduled task that periodically queries for unpublished events and sends them.

@Component
public class OutboxPollingPublisher {

    private final OutboxRepository outboxRepository;
    private final KafkaTemplate kafkaTemplate;

    @Scheduled(fixedDelay = 500)  // Poll every 500ms
    @Transactional
    public void publishPendingEvents() {
        List events = outboxRepository
            .findByPublishedAtIsNullOrderByCreatedAtAsc(
                PageRequest.of(0, 100));

        for (OutboxEvent event : events) {
            try {
                String topic = "events." + event.getAggregateType().toLowerCase();
                kafkaTemplate.send(topic, event.getAggregateId(),
                    event.getPayload()).get();  // Sync send

                event.setPublishedAt(Instant.now());
                outboxRepository.save(event);
            } catch (Exception e) {
                event.setRetryCount(event.getRetryCount() + 1);
                outboxRepository.save(event);
                log.warn("Failed to publish event {}: {}",
                    event.getId(), e.getMessage());
            }
        }
    }

    // Cleanup: delete published events older than 7 days
    @Scheduled(cron = "0 0 3 * * *")
    @Transactional
    public void cleanupPublishedEvents() {
        outboxRepository.deleteByPublishedAtBefore(
            Instant.now().minus(7, ChronoUnit.DAYS));
    }
}

The polling approach is simple to implement and understand. However, it introduces latency (up to the polling interval) and creates load on the database with frequent queries. For most applications with moderate event volumes, this trade-off is acceptable.

Concurrency and the Multi-Instance Trap

A subtle bug appears the moment you run more than one instance of your service: two pollers select the same rows and publish each event twice. Therefore, you must prevent concurrent pollers from claiming the same batch. The cleanest fix on PostgreSQL is row-level locking with SELECT ... FOR UPDATE SKIP LOCKED, which lets each instance grab a disjoint slice of pending rows without blocking the others.

-- Each poller instance claims its own rows; SKIP LOCKED avoids contention
SELECT id, aggregate_type, aggregate_id, event_type, payload
FROM outbox_events
WHERE published_at IS NULL
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 100;

Alternatively, you can run a single active publisher elected through a leader-election mechanism such as ShedLock or a Kubernetes lease. Single-publisher designs preserve strict global ordering but cap throughput; SKIP LOCKED scales horizontally but only guarantees ordering per aggregate, which is usually what you actually want.

Log-Based CDC Approach

For higher throughput and lower latency, use log-based Change Data Capture (CDC). Tools like Debezium read the database transaction log (WAL in PostgreSQL) and stream changes to Kafka. This approach has near-zero latency and adds no load to the database because it never runs queries against your tables.

{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "orderservice",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type":
      "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.fields.additional.placement":
      "event_type:header:eventType",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "events.$1"
  }
}

With CDC you no longer need a published_at column or a cleanup job that races the publisher, because Debezium tracks its own position (the log sequence number) in offset storage. Instead, you typically configure the connector to delete the outbox row immediately after the insert is captured, keeping the table small. The cost is operational: you now run Kafka Connect, monitor connector health, manage replication slots, and watch for WAL bloat if a connector falls behind.

Replication Slots and the WAL Bloat Risk

The most common production incident with Debezium on PostgreSQL is a stalled replication slot. When the connector stops consuming — because it crashed, the topic is unavailable, or Connect was scaled to zero — PostgreSQL cannot recycle WAL segments the slot still references. As a result, disk usage climbs until the database runs out of space. Always alert on replication slot lag and configure max_slot_wal_keep_size so a dead slot cannot fill the disk and take the primary down with it.

Handling Duplicate Events

Both polling and CDC can produce duplicate events (e.g., if the publisher crashes after sending but before marking as published). Therefore, consumers must be idempotent. The standard approach is to track processed event IDs.

@Service
public class OrderEventConsumer {

    private final ProcessedEventRepository processedEvents;
    private final InventoryService inventoryService;

    @KafkaListener(topics = "events.order")
    @Transactional
    public void handleOrderCreated(ConsumerRecord record) {
        String eventId = new String(record.headers()
            .lastHeader("eventId").value());

        // Idempotency check
        if (processedEvents.existsById(eventId)) {
            log.info("Duplicate event {}, skipping", eventId);
            return;
        }

        OrderCreatedPayload payload = objectMapper.readValue(
            record.value(), OrderCreatedPayload.class);
        inventoryService.reserveStock(payload.getItems());

        processedEvents.save(new ProcessedEvent(eventId, Instant.now()));
    }
}

For the idempotency check to be airtight, the duplicate-detection write and the business write must share one transaction; otherwise a crash between them reopens the same gap the outbox closed on the producer side. A practical refinement is to make the side effect itself naturally idempotent — for example, an UPSERT keyed by order ID — so that even a missed marker cannot double-apply the change.

Observability: Watching Publish Lag

Whichever variant you choose, the single most useful metric is publish lag — the age of the oldest unpublished event. For the polling design, expose it directly with a Micrometer gauge; for CDC, derive it from connector metrics. A steadily rising lag is the early warning that a poller died, a slot stalled, or the broker is rejecting writes, long before downstream services notice missing data.

@Component
public class OutboxLagMetrics {

    public OutboxLagMetrics(OutboxRepository repo, MeterRegistry registry) {
        Gauge.builder("outbox.oldest.unpublished.seconds", repo,
                r -> r.findOldestUnpublishedAge()
                       .map(Duration::getSeconds).orElse(0L))
             .description("Age of the oldest pending outbox event")
             .register(registry);
    }
}
Microservice event architecture monitoring
Monitoring outbox event publishing lag and throughput

When NOT to Use the Outbox Pattern

The outbox pattern adds complexity — an extra table, a publishing process, and idempotent consumers. If your system tolerates occasional missed events (best-effort delivery), a simple try-catch around the publish call may be sufficient. Additionally, if you use an event-sourced architecture where the event store IS the source of truth, the outbox pattern is redundant because the act of appending an event is already the durable write.

For systems with very low event volumes (fewer than 100 events per hour), the operational overhead of managing a CDC connector or polling publisher may not be justified. A simpler retry mechanism with a dead-letter queue could provide adequate reliability with less infrastructure. Likewise, if your business write and event live in different databases owned by different teams, the single-transaction premise breaks down and you need a saga or a coordinating service instead. Be honest about your delivery requirements before paying this tax.

Key Takeaways

  • The transactional outbox guarantees atomicity between database changes and event publication by writing both in a single transaction
  • Polling publishers are simple to implement but add latency; use SELECT ... FOR UPDATE SKIP LOCKED or leader election to stay correct across multiple instances
  • CDC with tools like Debezium offers near-real-time delivery with zero database query load, at the cost of operating Kafka Connect and watching replication slots
  • Always design consumers to be idempotent since both approaches can produce duplicate events during failure recovery
  • Track publish lag as your primary health signal, and clean up or delete published rows to prevent unbounded growth

Related Reading

External Resources

In conclusion, the Outbox pattern reliable events approach is an essential building block for modern distributed systems. By applying the patterns and practices covered in this guide — single-transaction writes, safe multi-instance publishing, idempotent consumers, and lag-based monitoring — you can build more robust, scalable, and maintainable event-driven services. Start with a polling publisher, prove correctness under failure, and graduate to CDC only when latency and throughput demand it.

← Back to all articles