Using Postgres Transactions to Solve Distributed System Challenges

Using Postgres Transactions to Solve Distributed System Challenges

Co-locating Workflow State with Data

Co-locating application workflow state within a Postgres database allows developers to eliminate the "dual-write" problem—the risk that a system updates a database but fails to notify a message queue, or vice versa. By treating the database as the single source of truth for both the business data and the progression of the workflow, developers can ensure that a state transition and its corresponding side effect are committed atomically.

This approach transforms a distributed coordination problem into a local transaction problem. Instead of attempting to coordinate a transaction across two separate systems (e.g., a database and a message broker), the system writes both the data change and the intended action to the same database in a single ACID transaction.

The Transactional Outbox Pattern

The Transactional Outbox pattern is the primary mechanism for achieving atomicity between a database and an external message queue. It works by splitting the database's role into two parts: the primary state and an "outbox" table.

  1. Atomic Write: The application updates the business state and inserts a message into the outbox table within a single Postgres transaction.
  2. Asynchronous Dispatch: A separate process polls the outbox table or uses a database trigger/UDF to push the message to the external queue.
  3. Confirmation: Once the external system confirms receipt, the message is marked as processed or deleted from the outbox.

As noted by community members, this pattern effectively "pretends" there is a transaction across the database and the queue by ensuring the intent to send a message is persisted as durably as the data itself.

Eliminating the Dual-Write Bug in Financial Systems

In high-stakes environments like money-movement systems, co-locating checkpoints with writes is critical to prevent "half-committed" states. If a system crashes mid-workflow, the recovery process can look at the persisted workflow state in Postgres to determine exactly where the process stopped and resume from that point without duplicating or missing a transaction.

"This is the trick that kills the dual-write bug in money-movement systems: co-locate the checkpoint with the write so a mid-workflow crash can't leave you half-committed."

Trade-offs and Implementation Considerations

While using Postgres as a workflow engine provides strong guarantees, it introduces specific architectural trade-offs:

Coupling and Complexity

Aligning the workflow progression unit with the database commit unit creates a tight coupling between the database schema and the application workflow. While this simplifies the outbox pattern, it makes it difficult to separate the workflow engine from the database in the future. However, for many services, the database is the most stable part of the stack, making this trade-off acceptable.

Idempotency and Side Effects

Database transactions cannot roll back side effects that have already occurred in the physical world (e.g., sending an email). Therefore, idempotency is mandatory for any external service interaction.

For example, in email notification systems, a common strategy is to prioritize "at-least-once" delivery over "exactly-once" delivery. The system confirms the email was sent successfully before closing the transaction that removes the message from the pending list, accepting that a rare failure could result in a duplicate email rather than a lost one.

Centralization Risks

Using a single database for both state and coordination creates a single point of failure. While this simplifies operational overhead—since a database outage brings down the entire system consistently—it removes the ability to scale the coordination layer independently of the data layer.

Comparison with Distributed Alternatives

Some argue that this approach is not a "distributed system" in the strict sense because it relies on a central database. However, it serves as a practical alternative to complex distributed coordination tools like Temporal or replicated state machines. For systems where the volume of steps is manageable and permanent records of workflow progression are required, implementing durable workflows directly in Postgres is often more efficient than introducing additional infrastructure like Redis or Valkey.

Sources