Transaction Internals: Recovery

A transaction must be always consistent and durable. This is straightforward when everything is normal, whereas maintaining such properties despite of failures becomes nontrivial. Failures like system crashes can leave a system in an inconsistent state. Today we are treating failures as the norm. It’s critical to possess the capability of fault tolerance. One important remedy is to recover transactions after crashes.

The recovery of transactions has two goals:

  1. Restoring the lost dirty data due to failures, which brings a system back to the last normal state right before the crash.
  2. Rolling back partially completed transactions. From a user’s perspective, they are aborted because of the failure. To guarantee atomicity, an aborted transaction needs to be rolled back to the state right before that transaction starts.

The recovery solution we are focusing here is based on the ARIES recovery algorithm. Before diving into details, let’s look at two techniques lying behind ARIES, write-ahead logging and checkpoints.

Write-ahead logging

Write-ahead logging provides the foundation for recovery. It defines an order of writing data and logs. A log records the specific modification of the data. Under the write-ahead logging, the log needs to be first written to the storage before persisting the data modification. A request is considered completed once logs are safely persisted, while data modifications can be delayed. With that:

  1. The loss of dirty data can be recovered as long as the corresponding logs survive. The lost data can be restored by redoing the logs.
  2. Partially completed transactions can be rolled back by traversing the corresponding logs reversely and undoing the operation of each log.


Write-ahead logging also helps improve the I/O performance. First of all, logging increases the writing volume, since we need to write not only data but also logs. However, logs are usually generated sequentially. Therefore, the additional I/O introduced by logging incurs a low cost. On the other hand, data modifications are likely to be random and scattered. Logging allows their persistence to be delayed, resulting in an increased chance of the writing aggregation. Aggregated writing tends to be sequential. For an extreme strategy, we can gain very high I/O performance by only writing logs to the storage while leaving all data volatile, since we can always reproduce data by replaying logs. However, this is not practical. The recovery from a failure is then likely need to redo tons of logs, resulting in very long unavailability. Given that failures are the norm, it’s unaffordable to have a long recovery every time a system restarts.

The persistence of data modifications needs to happen at some point. A common strategy is to flush all dirty data periodically and maintain a balance between the cost of dirty data writing and the potential recovery time. We use the term checkpoint to indicate a system state where a data flush is just applied. A checkpoint tells the starting point from which a recovery begins, because we know that all dirty data before a checkpoint have been persisted. A checkpoint also captures the state of all active transactions at the time of the checkpoint, which is used during recovery to determine which transactions are partially completed.


ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is an algorithm of recovery for transactions. It consists of three phases based on the principle of write-ahead logging and checkpoints:

  1. Analysis–identifying abnormal states–the lost dirty data and open transactions.
  2. Redo–restoring the lost dirty data via redoing logs.
  3. Undo–rolling back partially completed transactions via undoing logs.


The analysis phase collects information from logs to identify the lost dirty data and partially completed transactions. Because all data modifications before the last checkpoint are already persisted, the analysis phase only needs to scan logs starting from the last checkpoint.

In order to identify the lost dirty data, we record the LSN (log sequence number) of the log to the data page when modifying that data. During the analysis phase, if we see a log that has a LSN higher than that of the corresponding data page, we know the data modification recorded by that log didn’t survive from the failure.

In order to identify partially completed transactions, we record all active transactions to the log when doing checkpoints. When the analysis phase starts, we extract such active transaction snapshot from the last checkpoint. In the end of the analysis phase, we determine the partially completed transactions if we don’t see any logs indicating the commitment of them.

You probably notice that the recovery introduces extra costs during the normal service time, for example, recording LSNs and active transaction snapshots. While failures do not always occur, we are always paying for such cost during the normal service. This is worthwhile because the cost of being nonrecoverable is much higher or even unaffordable.


The redo phase restores a system to the last normal state right before the crash. With the lost dirty data identified from the analysis phase, the redo phase reapplies those data modifications by redoing their logs.

The system becomes consistent once the redo phase completes. The service thus can become available. However, we still have partially completed transactions pending. If we would like to open the service from here, which is totally fine since the system is already consistent, we need to block the access (e.g., acquiring locks) to the data that were modified by partially completed transactions. Since those transactions are considered aborted, data modifications involved also become invalid.


During the undo phase, we roll back all partially completed transactions identified from the analysis phase. If the service is already open since the end of the redo phase, accesses are blocked for data involved in partially completed transactions. After a transaction is rolled back, the corresponding data access can be unblocked (e.g., releasing locks).

Recovery of recovery

By far that’s the whole process of recovery which looks perfect, right? Wait, how about failures in the middle of a recovery? Do we have the recovery of a recovery? The short answer is yes. We just need to reapply the same recovery process.

Before going into more details, let’s first understand physical vs. logical operations. A physical operation is repeatable while a logical operation isn’t. For example, to allocate a new I/O unit (e.g., a data page), repeating a physical operation will always choose the same I/O unit for allocation, while repeating a logical operation may end up with different candidates as long as it finishes the goal of allocating a new I/O unit.

Redoing logs are physical operations. The goal of log replay is to restore a system to the last normal state right before the crash. Therefore, we need to make sure the exact same operation can be redone. On the other hand, undoing logs are often logical operations. In some cases, it’s almost impossible to apply a physical operation for undoing. Let’s use the same example of the I/O unit allocation. Undoing a I/O unit allocation is to deallocate that I/O unit. If the I/O unit is also occupied by other transactions, a direct deallocation of the whole I/O unit would impact other transactions. A logical operation allows to undo a log by applying a new operation based on the current state.

Now let’s review the recovery of each recovery phase:

  1. The analysis phase does not change any state. Therefore, for a recovery disrupted during the analysis phase, a restart simply repeats the same analysis.

  2. The redo phase applies physical operations. For a recovery disrupted at the log replay, a restart first goes through the analysis phase and then continues doing that same log replay from where it’s left previously. The property of physical operations ensures that the effect of redoing a log is the same under either the original recovery or the restarted recovery.

  3. The undo phase applies logical operations, which is essentially the same as that during the normal time. Similarly, in order to restore a disrupted undo phase to its previous state right before the crash, the same principle of write-ahead logging is applied. We record logs during the undo phase and call them compensation logs for distinction. For a recovery disrupted when undoing logs, a restart first goes through the analysis phase and then replays both regular logs and compensation logs during the redo phase. After that the undo phase is able to continue from its previously disrupted state.