Saga Pattern: Managing Distributed Transactions in Microservices

Saga Pattern for Distributed Transactions: A Complete Implementation Guide

Distributed systems can’t use traditional database transactions that span multiple services. When an order involves an inventory service, a payment service, and a shipping service, each with its own database, you need a different approach. The saga pattern coordinates multi-service transactions through a sequence of local transactions, with compensation logic for failures. This guide covers both choreography and orchestration with practical implementation examples, along with the operational realities that documentation rarely emphasizes.

Why Traditional Transactions Don’t Work in Microservices

In a monolithic application, a single database transaction can atomically update inventory, charge payment, and create a shipment. If anything fails, everything rolls back. This ACID guarantee breaks down in microservices because each service owns its own database — there’s no shared transaction coordinator that can lock rows across multiple databases simultaneously. As a result, the atomicity you took for granted simply does not exist across a network boundary.

Two-phase commit (2PC) exists but has serious drawbacks for microservices: it requires all services to be available simultaneously, any single service failure blocks the entire transaction, and the coordinator becomes a single point of failure. Moreover, 2PC holds locks across services from the prepare phase until the final commit or abort, which hurts throughput under load. Consequently, many microservice platforms avoid it. The saga alternative replaces 2PC with eventual consistency: each service completes its local transaction and publishes an event, and compensation transactions undo completed work when a later step fails.

The saga approach trades strong consistency for availability and performance. Instead of “everything succeeds or nothing succeeds atomically,” you get “everything eventually succeeds, or the steps that already completed are compensated.” For most business operations (order processing, booking systems, account creation) this trade-off is acceptable and far more practical at scale. Importantly, this means your system passes through visible intermediate states, so the UI and downstream consumers must tolerate an order that is briefly “pending” before it becomes “confirmed” or “cancelled.”
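One way to make those intermediate states explicit is a small transition table that every consumer shares. The sketch below is illustrative only: the status names follow the ones used in this article, while `ORDER_TRANSITIONS`, `transitionOrder`, and `isFinal` are hypothetical helpers, not part of any framework.

```javascript
// Hypothetical order lifecycle for a saga-managed order.
// PENDING is the visible intermediate state; the other two are terminal.
const ORDER_TRANSITIONS = {
  PENDING: ['CONFIRMED', 'CANCELLED'],
  CONFIRMED: [],
  CANCELLED: []
};

// Returns a new order object with the next status, or throws if the
// transition is not allowed (for example CONFIRMED -> CANCELLED).
function transitionOrder(order, nextStatus) {
  const allowed = ORDER_TRANSITIONS[order.status] ?? [];
  if (!allowed.includes(nextStatus)) {
    throw new Error(`Illegal transition ${order.status} -> ${nextStatus}`);
  }
  return { ...order, status: nextStatus };
}

// A UI can use this to decide whether to keep showing a "processing" state.
function isFinal(order) {
  return (ORDER_TRANSITIONS[order.status] ?? []).length === 0;
}
```

A client that polls the order can keep showing a spinner while `isFinal(order)` is false, rather than assuming the order is confirmed as soon as it exists.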

Saga pattern distributed transaction architecture — Sagas coordinate multi-service transactions through local transactions and compensation

Designing Compensations: The Hard Part

The execution path of a saga is usually straightforward; the compensation path is where teams underestimate the work. A compensation is not a database rollback — it is a new business transaction that semantically undoes a previous one. Refunding a payment is not the same as never charging it: the customer sees a charge and then a refund on their statement, and your ledger records both events. Therefore, you must design compensations as first-class operations with their own validation, logging, and idempotency.
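As a rough illustration of a compensation treated as its own business operation, the sketch below models a refund against an in-memory, append-only ledger. `PaymentLedger` and its status strings are assumptions made for this example; a real implementation would sit on a database and a payment provider.

```javascript
// In-memory stand-in for a payments ledger (hypothetical, for illustration).
class PaymentLedger {
  constructor() {
    this.entries = [];                 // append-only: charges AND refunds are recorded
    this.refundedPayments = new Set(); // payment IDs that were already refunded
  }

  charge(paymentId, amount) {
    this.entries.push({ type: 'CHARGE', paymentId, amount });
  }

  // Compensation: a new transaction with its own validation and idempotency.
  refund(paymentId) {
    const charge = this.entries.find(
      (e) => e.type === 'CHARGE' && e.paymentId === paymentId
    );
    if (!charge) {
      return { status: 'NOTHING_TO_REFUND' }; // the charge never completed
    }
    if (this.refundedPayments.has(paymentId)) {
      return { status: 'ALREADY_REFUNDED' };  // duplicate compensation request
    }
    this.entries.push({ type: 'REFUND', paymentId, amount: charge.amount });
    this.refundedPayments.add(paymentId);
    return { status: 'REFUNDED', amount: charge.amount };
  }
}
```

The charge is never deleted: after a refund the ledger holds two entries, which mirrors what the customer sees on their statement.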

Some actions cannot be cleanly undone at all. You cannot un-send an email or un-ship a package that has already left the warehouse. For these irreversible steps, the standard guidance is to order them last in the saga (so earlier, reversible steps fail first) or to convert them into a “pivot” point after which the saga can only roll forward. A common pattern is to split a step into a reversible reservation and an irreversible commit — reserve inventory early, but only decrement stock after payment succeeds. In addition, every compensation should be safe to run even if the original action never completed, because a step can fail midway and leave ambiguous state.
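The reservation/commit split can be sketched with an in-memory store. Everything here (`InventoryStore`, keying reservations by `orderId`, the `sku`/`qty` item shape) is a hypothetical design for illustration, and it assumes each SKU appears at most once per order.

```javascript
// Hypothetical in-memory inventory with reversible reservations.
class InventoryStore {
  constructor(stock) {
    this.stock = new Map(Object.entries(stock)); // sku -> units on hand
    this.reservations = new Map();               // orderId -> [{ sku, qty }]
  }

  // Units on hand minus units held by open reservations.
  available(sku) {
    let held = 0;
    for (const items of this.reservations.values()) {
      for (const item of items) {
        if (item.sku === sku) held += item.qty;
      }
    }
    return (this.stock.get(sku) ?? 0) - held;
  }

  // Reversible step: hold stock without decrementing it.
  reserve(orderId, items) {
    if (this.reservations.has(orderId)) return; // duplicate request: no-op
    for (const { sku, qty } of items) {
      if (this.available(sku) < qty) {
        throw new Error(`Insufficient stock for ${sku}`);
      }
    }
    this.reservations.set(orderId, items);
  }

  // Irreversible step: run only after payment has succeeded.
  commit(orderId) {
    const items = this.reservations.get(orderId);
    if (!items) return false;
    for (const { sku, qty } of items) {
      this.stock.set(sku, (this.stock.get(sku) ?? 0) - qty);
    }
    this.reservations.delete(orderId);
    return true;
  }

  // Compensation: safe to call even if reserve() never completed.
  release(orderId) {
    return this.reservations.delete(orderId); // false when nothing was held
  }
}
```

Because `release` simply reports `false` when no reservation exists, the saga can always run it during compensation without first working out whether the reserve step got far enough to matter.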

Choreography: Event-Driven Sagas

In choreography-based sagas, each service listens for events and decides independently what to do next. There’s no central coordinator — the saga emerges from the interaction of autonomous services. This approach is simpler to implement initially and works well for sagas with three or four steps where the flow rarely changes.

// Order Service — starts the saga
class OrderService {
  async createOrder(orderData) {
    // Step 1: Create order in PENDING state
    const order = await this.orderRepo.create({
      ...orderData,
      status: 'PENDING'
    });

    // Publish event — inventory service will react
    await this.eventBus.publish('order.created', {
      orderId: order.id,
      items: order.items,
      customerId: order.customerId,
      totalAmount: order.totalAmount
    });

    return order;
  }

  // Compensation: payment failed, so cancel the order and tell the customer.
  // Relies on the payment.failed payload carrying customerId.
  async handlePaymentFailed(event) {
    await this.orderRepo.updateStatus(event.orderId, 'CANCELLED');
    // Notify customer
    await this.notificationService.send(event.customerId,
      'Your order has been cancelled due to payment failure.');
  }
}

// Inventory Service — reacts to order.created
class InventoryService {
  async handleOrderCreated(event) {
    try {
      // Step 2: Reserve inventory
      await this.inventoryRepo.reserve(event.items);

      await this.eventBus.publish('inventory.reserved', {
        orderId: event.orderId,
        items: event.items,
        customerId: event.customerId,
        totalAmount: event.totalAmount
      });
    } catch (error) {
      // Compensation: can't reserve inventory
      await this.eventBus.publish('inventory.failed', {
        orderId: event.orderId,
        reason: error.message
      });
    }
  }

  // Compensation: undo reservation if payment fails
  async handlePaymentFailed(event) {
    await this.inventoryRepo.release(event.items);
  }
}

// Payment Service — reacts to inventory.reserved
class PaymentService {
  async handleInventoryReserved(event) {
    try {
      // Step 3: Charge payment
      const payment = await this.paymentGateway.charge({
        customerId: event.customerId,
        amount: event.totalAmount
      });

      await this.eventBus.publish('payment.completed', {
        orderId: event.orderId,
        paymentId: payment.id
      });
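    } catch (error) {
      // Payment failed: triggers the compensation chain.
      // customerId is included because the order service uses it to
      // notify the customer; items lets inventory release the reservation.
      await this.eventBus.publish('payment.failed', {
        orderId: event.orderId,
        customerId: event.customerId,
        items: event.items,
        reason: error.message
      });
    }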
  }
}

The downside of choreography is that the saga logic is scattered across services. As sagas grow beyond four or five steps, it becomes difficult to understand the complete flow, debug failures, and ensure all compensation paths are covered. Additionally, circular event dependencies can create infinite loops if not carefully managed — for example, a “release” event that itself triggers a “reserve” handler. Because no single component knows the overall state, answering a question as simple as “where is order 4711 right now?” requires correlating logs across every participant.

Orchestration: Centralized Saga Coordination

Orchestration-based sagas use a dedicated saga orchestrator that manages the sequence of steps, tracks state, and handles failures. The orchestrator tells each service what to do rather than services reacting to events autonomously. Consequently, the entire saga flow is visible in one place, making it easier to understand, test, and debug.

// Saga Orchestrator: manages the complete flow
const { randomUUID } = require('node:crypto');

class OrderSagaOrchestrator {
  constructor(orderService, inventoryService, paymentService,
              shippingService) {
    // Each execute() result is merged into the saga data. This assumes every
    // step resolves to an object carrying the IDs its compensation needs
    // (orderId, paymentId, shipmentId).
    this.steps = [
      {
        name: 'create_order',
        execute: (data) => orderService.createOrder(data),
        compensate: (data) => orderService.cancelOrder(data.orderId)
      },
      {
        name: 'reserve_inventory',
        execute: (data) => inventoryService.reserve(data.items),
        compensate: (data) => inventoryService.release(data.items)
      },
      {
        name: 'process_payment',
        execute: (data) => paymentService.charge(data),
        compensate: (data) => paymentService.refund(data.paymentId)
      },
      {
        name: 'arrange_shipping',
        execute: (data) => shippingService.createShipment(data),
        compensate: (data) => shippingService.cancelShipment(
          data.shipmentId)
      }
    ];
  }

  // saveSagaState() and alertOps() are assumed to be implemented elsewhere
  // (durable storage and on-call alerting respectively).
  async execute(orderData) {
    const sagaLog = { id: randomUUID(), status: 'RUNNING',
      completedSteps: [], data: orderData };

    for (const step of this.steps) {
      try {
        const result = await step.execute(sagaLog.data);
        sagaLog.data = { ...sagaLog.data, ...result };
        sagaLog.completedSteps.push(step.name);
        await this.saveSagaState(sagaLog);
      } catch (error) {
        sagaLog.status = 'COMPENSATING';
        sagaLog.failedStep = step.name;
        sagaLog.error = error.message;
        // Persist before compensating so a crash here is recoverable
        await this.saveSagaState(sagaLog);
        await this.compensate(sagaLog);
        return { success: false, error: error.message };
      }
    }

    sagaLog.status = 'COMPLETED';
    await this.saveSagaState(sagaLog);
    return { success: true, data: sagaLog.data };
  }

  async compensate(sagaLog) {
    // Execute compensations in reverse order
    const stepsToCompensate = [...sagaLog.completedSteps].reverse();
    for (const stepName of stepsToCompensate) {
      const step = this.steps.find(s => s.name === stepName);
      try {
        await step.compensate(sagaLog.data);
      } catch (compError) {
        // Log compensation failure: needs manual intervention
        console.error(`Compensation failed for ${stepName}:`,
          compError);
        sagaLog.status = 'COMPENSATION_FAILED';
        await this.alertOps(sagaLog);
      }
    }
    // Only report COMPENSATED if every compensation succeeded
    if (sagaLog.status !== 'COMPENSATION_FAILED') {
      sagaLog.status = 'COMPENSATED';
    }
    await this.saveSagaState(sagaLog);
  }
}

Notice that the orchestrator persists state after every step. This is not optional: the orchestrator process can crash between steps, and on restart it must know exactly which steps completed so it can resume or compensate correctly. In production, teams typically back this with a durable workflow engine such as Temporal, AWS Step Functions, or Camunda Zeebe rather than a hand-rolled loop, because those engines handle persistence, retries, and timeouts for you. The hand-written version above is useful for understanding the mechanics, but a battle-tested engine removes an entire class of bugs around durability and crash recovery. A retried step may still execute more than once, so idempotency remains your responsibility.
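To make the restart behaviour concrete, here is a minimal sketch of how a recovering orchestrator might decide what to do with a persisted saga record. `planRecovery` is a hypothetical helper; it assumes the record has the `status` and `completedSteps` fields used in the listing above.

```javascript
// Decide what to do with a persisted saga after an orchestrator restart.
// stepNames is the full, ordered list of step names for this saga type.
function planRecovery(sagaLog, stepNames) {
  switch (sagaLog.status) {
    case 'RUNNING': {
      // Resume with every step that was not recorded as completed. The first
      // of these may have partially run before the crash, which is one more
      // reason steps must be idempotent.
      const done = new Set(sagaLog.completedSteps);
      return {
        action: 'RESUME',
        steps: stepNames.filter((name) => !done.has(name))
      };
    }
    case 'COMPENSATING':
      // Re-run compensations in reverse order; they must tolerate repeats.
      return {
        action: 'COMPENSATE',
        steps: [...sagaLog.completedSteps].reverse()
      };
    default:
      // COMPLETED, COMPENSATED, COMPENSATION_FAILED: nothing to automate.
      return { action: 'NONE', steps: [] };
  }
}
```

On startup, an orchestrator could load every saga that is not in a terminal state and feed each record through a function like this before accepting new work.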

Distributed transaction orchestration monitoring dashboard — Orchestration centralizes saga logic — the complete flow is visible in one place

Implementing with Kafka: Reliable Event Delivery

Both choreography and orchestration sagas need reliable event delivery. Apache Kafka is a common choice because it provides durable event streams that are ordered within each partition and supports at-least-once delivery. The transactional outbox pattern commits a database write and the record of its event in the same local transaction, which solves the classic dual-write problem where a service commits to its database but crashes before publishing the corresponding event.

// Transactional outbox pattern — atomic DB write + event
class OutboxEventPublisher {
  async publishWithOutbox(dbTransaction, event) {
    // Write event to outbox table within the same DB transaction
    await dbTransaction.query(
      'INSERT INTO outbox (event_type, payload, status) VALUES ($1, $2, $3)',
      [event.type, JSON.stringify(event.data), 'PENDING']
    );
    // Separate process polls outbox and publishes to Kafka
    // This ensures atomicity: if the DB write fails,
    // the event is never published
  }
}

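// Outbox poller: runs as a separate process.
// Assumes db.query resolves to an array of rows and that outbox.created_at
// is populated by a column default.
class OutboxPoller {
  async poll() {
    const events = await db.query(
      'SELECT * FROM outbox WHERE status = $1 ORDER BY created_at LIMIT 100',
      ['PENDING']
    );

    for (const event of events) {
      // Publish first, then mark as published. A crash between the two
      // statements republishes the event on the next poll (at-least-once).
      await kafka.produce(event.event_type, event.payload);
      await db.query(
        'UPDATE outbox SET status = $1 WHERE id = $2',
        ['PUBLISHED', event.id]
      );
    }
  }
}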

Idempotency is critical for saga reliability. Network failures can cause duplicate event delivery, so every saga step must be idempotent: processing the same event twice should produce the same result. Use unique saga IDs and check whether a step has already been completed before executing it again. A practical technique is to record processed event IDs in a table and reject duplicates: a payment handler that has already seen orderId=4711 simply re-publishes its previous result instead of charging the card a second time. Note also that the outbox poller delivers events at least once, never exactly once, which is precisely why downstream idempotency is mandatory rather than a nice-to-have.
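The duplicate check can be sketched as follows. This is an in-process illustration only: `IdempotentPaymentHandler`, the injected `chargeFn`, and the choice to key on `orderId` are assumptions, and in production the `processed` map would be a database table with a unique constraint so that the check survives restarts and works across instances.

```javascript
// Hypothetical payment handler that charges at most once per orderId.
class IdempotentPaymentHandler {
  constructor(chargeFn) {
    this.chargeFn = chargeFn;   // async ({ customerId, amount }) => ({ id })
    this.processed = new Map(); // orderId -> promise of the published result
  }

  // Returns a promise for the event this handler would publish.
  handleInventoryReserved(event) {
    const seen = this.processed.get(event.orderId);
    if (seen) return seen; // duplicate delivery: reuse the earlier outcome

    const outcome = this.chargeFn({
      customerId: event.customerId,
      amount: event.totalAmount
    }).then((payment) => ({
      type: 'payment.completed',
      orderId: event.orderId,
      paymentId: payment.id
    }));

    // Remember the in-flight promise so concurrent duplicates share it.
    this.processed.set(event.orderId, outcome);
    // Assumption: a failed charge is forgotten so a later retry can try again.
    outcome.catch(() => this.processed.delete(event.orderId));
    return outcome;
  }
}
```

Storing the promise rather than the finished result means two copies of the same event that arrive back to back still trigger only one call to `chargeFn`.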

Observability and Stuck Sagas

Because sagas are long-running and span services, you cannot operate them blind. Every saga instance needs a correlation ID that flows through each event and log line, so you can reconstruct the full timeline of a single order across all participants. In addition, you should emit metrics for sagas that enter the COMPENSATING or COMPENSATION_FAILED states, since the latter almost always requires human intervention. A saga that has been RUNNING for longer than its expected duration is “stuck” — perhaps a participant is down or an event was lost — and you want an alert and a timeout-driven compensation rather than an order silently frozen forever. Distributed tracing tools such as OpenTelemetry make this far easier by stitching spans across service boundaries automatically.
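A stuck-saga check can be as simple as a periodic scan over the persisted saga records. The sketch below assumes each record has a `status` and a `startedAt` timestamp in epoch milliseconds; `startedAt`, `findStuckSagas`, and `countByStatus` are hypothetical names introduced for this example.

```javascript
// Sagas still RUNNING after maxRunningMs are candidates for an alert and a
// timeout-driven compensation.
function findStuckSagas(sagas, now, maxRunningMs) {
  return sagas.filter(
    (saga) => saga.status === 'RUNNING' && now - saga.startedAt > maxRunningMs
  );
}

// Per-status counts, suitable for exporting as gauge metrics.
function countByStatus(sagas) {
  const counts = {};
  for (const saga of sagas) {
    counts[saga.status] = (counts[saga.status] ?? 0) + 1;
  }
  return counts;
}
```

A scheduler could call `findStuckSagas(allSagas, Date.now(), 60_000)` once a minute and page on any result, while `countByStatus` feeds the dashboard that tracks COMPENSATING and COMPENSATION_FAILED.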

Choosing Between Choreography and Orchestration

Use choreography when your saga has two to four steps, the services are owned by different teams who want autonomy, and the flow is unlikely to change frequently. Use orchestration when the saga has five or more steps, you need clear visibility into the complete flow, complex branching logic is required, or the business logic changes frequently. In practice, many production systems use orchestration because the debugging and monitoring benefits outweigh the additional infrastructure complexity. That said, the two styles are not mutually exclusive — large platforms often mix them, using orchestration for the critical checkout flow and choreography for loosely coupled side effects like analytics and email.

When NOT to Use the Saga Pattern

Sagas are powerful, but they are not free, and reaching for them by default is a mistake. If your operation fits inside a single service and a single database, use a normal local ACID transaction — it is simpler, faster, and strongly consistent. Do not introduce a saga, an event bus, and compensation logic to coordinate two tables that live in the same schema. Similarly, if your business genuinely cannot tolerate intermediate states — certain financial settlement or regulatory flows — eventual consistency may be the wrong model, and you should reconsider the service boundaries instead. Sagas also add real cost: more moving parts, harder testing, and the cognitive overhead of reasoning about partial failures. As a rule of thumb, only adopt a saga when a business process truly must span multiple independently owned services and you have accepted the eventual-consistency trade-off that comes with it.

Microservices architecture and event-driven patterns — Choose choreography for simple flows, orchestration for complex multi-step transactions

In conclusion, the saga pattern is essential for managing distributed transactions in microservices, but it should be applied deliberately rather than reflexively. Start with orchestration for most non-trivial use cases — the visibility and debuggability are worth the additional infrastructure. Always design compensations as real business operations, use the transactional outbox pattern for reliable event delivery, make every step idempotent, and invest in correlation IDs and alerting for stuck sagas. With those foundations in place, your distributed system will be more resilient and far easier to operate.