
Outages and Recovery Objectives (RTO / RPO)

When a cloud outage disrupts a Namespace, Temporal Cloud takes measures to maintain the Namespace's availability and data durability. The time it takes to recover from the outage is called the recovery time. The recovery point is how far back in time data must be recovered from after an outage. A durable system should have a short recovery time and a recovery point as close as possible to the moment of the outage.

Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For details on how each is measured, see How RTO and RPO are measured. These objectives are complementary to Temporal Cloud's Service Level Agreement (SLA).

The RTO and RPO for a Namespace depend on the type of outage and which High Availability features the Namespace has enabled.

RTO and RPO summary

The following table summarizes the RTO and RPO targets for each type of outage. These targets apply to Namespaces that have Temporal-initiated failovers enabled, which is the default. Temporal-initiated failovers are triggered by Temporal's tooling and on-call engineers without user action. Users can always initiate a failover independently. In an outage, a user-initiated failover will not cancel out or reverse a Temporal-initiated failover.

These targets are for unplanned cloud outages and do not apply to user-initiated failovers during healthy periods, such as DR drills. Read about triggering a failover to see how a Namespace failover performs during healthy periods.

| Outage type | Applicable Namespaces | RPO | RTO |
| --- | --- | --- | --- |
| Availability Zone outage | All Namespaces | Zero | Near-zero |
| Cell outage | Namespaces with Same-region, Multi-region, or Multi-cloud Replication | Under 1 minute | Under 20 minutes |
| Cloud Region outage | Namespaces with Multi-region or Multi-cloud Replication | Under 1 minute | Under 20 minutes |
| Cloud-wide outage | Namespaces with Multi-cloud Replication | Under 1 minute | Under 20 minutes |
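
The summary table can also be read as a lookup from outage type and replication feature to the published targets. The following is a minimal sketch of that lookup, not a Temporal API: the enum and function names are hypothetical, and it assumes Temporal-initiated failovers are enabled on the Namespace.

```python
from enum import Enum
from typing import Optional, Tuple


class Replication(Enum):
    NONE = "no replication"
    SAME_REGION = "Same-region Replication"
    MULTI_REGION = "Multi-region Replication"
    MULTI_CLOUD = "Multi-cloud Replication"


class Outage(Enum):
    AVAILABILITY_ZONE = "Availability Zone outage"
    CELL = "Cell outage"
    CLOUD_REGION = "Cloud Region outage"
    CLOUD_WIDE = "Cloud-wide outage"


# Which replication settings carry a published target for each outage type,
# transcribed from the summary table. AZ outages apply to all Namespaces.
COVERED_BY = {
    Outage.AVAILABILITY_ZONE: set(Replication),
    Outage.CELL: {
        Replication.SAME_REGION,
        Replication.MULTI_REGION,
        Replication.MULTI_CLOUD,
    },
    Outage.CLOUD_REGION: {Replication.MULTI_REGION, Replication.MULTI_CLOUD},
    Outage.CLOUD_WIDE: {Replication.MULTI_CLOUD},
}


def published_targets(
    outage: Outage, replication: Replication
) -> Optional[Tuple[str, str]]:
    """Return (RPO, RTO) from the summary table, or None if no target applies.

    Assumes Temporal-initiated failovers are enabled (the default).
    """
    if replication not in COVERED_BY[outage]:
        return None
    if outage is Outage.AVAILABILITY_ZONE:
        return ("Zero", "Near-zero")
    return ("Under 1 minute", "Under 20 minutes")
```

For example, `published_targets(Outage.CLOUD_REGION, Replication.SAME_REGION)` returns `None`, because a same-region replica sits inside the affected region.
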
tip

Temporal highly recommends keeping Temporal-initiated failovers enabled. When Temporal-initiated failovers are disabled, Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover.

As soon as a cloud outage resolves, Temporal's on-call engineers work to restore service to Namespaces that were not protected by High Availability. A cloud outage can leave lingering effects in Temporal's systems and applications, even after the cloud provider restores the underlying service. An affected Namespace's outage may last longer than the cloud provider's outage.

All Namespaces are backed up every 4 hours. If an outage causes data loss on a Namespace that was not protected by High Availability, Temporal uses the backup to restore as much data as feasible.

Outage types and their RTO/RPO

The following sections explain each type of outage in more detail, including the blast radius, Temporal Cloud features that mitigate the outage, and whether the outage is included in the SLA calculation.

Availability Zone outage

An Availability Zone (AZ) is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. Each cloud region contains multiple AZs, and an individual AZ can fail due to events such as hardware failure, power loss, or a localized network partition.

AZ outages are the most common type of outage, and Temporal Cloud has weathered many of them transparently.

Blast Radius: A single Availability Zone within a single cloud region. Because every Namespace's components are spread across at least three AZs, the blast radius to Temporal Cloud users is typically zero — Namespaces stay operational with little to no downtime.

caution

While Temporal Cloud can withstand single AZ outages without disruption, if you have Workers that are deployed in the impacted AZ, those Workers may be disrupted. To mitigate this risk, Temporal recommends deploying your Workers across multiple AZs.

Mitigation: Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. High Availability features are not required to keep Temporal Cloud operations running through an AZ outage.

SLA inclusion: Included in the SLA calculation. Any errors during an AZ outage count toward SLA credits, since AZ resilience is within Temporal's responsibility.

If two AZs fail simultaneously, Temporal Cloud treats the event as a Cloud Region outage. In that case, Namespaces in the region may be impacted, including those using Same-region Replication.

info

When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by disabling Temporal-managed failovers on the Namespace.

RTO and RPO

When using Temporal Cloud (no additional features required):

  • Near-zero RTO. When a single AZ fails, the remaining two AZs continue serving requests without a failover, so end users see little to no disruption.
  • Zero RPO. Writes to Workflow state are synchronously replicated across all three AZs before being acknowledged back to the Client, so an AZ failure cannot cause data loss.

Cell outage

Temporal Cloud runs on a cell architecture. Each cell contains the software and services necessary to host a Namespace, and components within a cell are distributed across at least three Availability Zones. Cells provide a strong unit of isolation: a problem inside one cell does not propagate to other cells. A cell outage occurs when a cell becomes degraded or unavailable, disrupting the Namespaces hosted within it.

Blast Radius: One cell, and the Namespaces hosted in it, within a single region. Even though your Workers will remain healthy, they will not be able to process Workflows because the Namespace is down.

Mitigation: Multi-region Replication and Multi-cloud Replication replicate a Namespace into another cell in a different region or different cloud provider. Same-region Replication replicates a Namespace into another cell within the same region. When any of these features are enabled for a Namespace, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica.

SLA inclusion: Included in the SLA calculation. Any errors during a cell outage count toward SLA credits, since mitigating cell outages is within Temporal's responsibility.

Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents.

RTO and RPO

When using Same-region Replication, Multi-region Replication, or Multi-cloud Replication for Temporal-managed failover:

  • RTO under 20 minutes. Temporal detects the disruption and fails the Namespace over to its replica cell.
  • RPO under 1 minute. Asynchronous replication keeps the replica close to the active cell.

Even though the RPO target is under 1 minute, data loss is virtually eliminated thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when the outage is over.

Cloud Region outage

A cloud region as a whole can become degraded, with effects that span beyond any single cell or Availability Zone.

Blast Radius: All Namespaces and Workers within a single cloud region are potentially affected.

Mitigation: Multi-region Replication and Multi-cloud Replication place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region.

SLA inclusion: Included in the SLA calculation only for Namespaces that have Multi-region Replication or Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without these features, a Cloud Region outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate.

If two or more regions in the same cloud provider experience an outage simultaneously, Temporal Cloud treats the event as a Cloud-wide outage.

Regional outages are less common than cell or AZ outages, but they do happen. During the AWS us-east-1 incident on October 20, 2025, Temporal Cloud's regional failover kept customer Namespaces running.

RTO and RPO

When using Multi-region Replication or Multi-cloud Replication for Temporal-managed failover:

  • RTO under 20 minutes. Temporal detects the regional disruption and fails the Namespace over to its replica in another region.
  • RPO under 1 minute. Asynchronous replication keeps the replica close to the active region.

Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs.

Cloud-wide outage

On rare occasions, an issue affects two or more regions of a single cloud provider at once. Any simultaneous outage of two or more regions in the same cloud provider is treated as a cloud-wide outage.
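
The escalation rules stated on this page (two failed AZs in a region are treated as a Cloud Region outage, and two or more regional outages in one cloud are treated as a cloud-wide outage) can be sketched as a small classifier. This is an illustrative simplification, not Temporal's detection logic: the function name and input shape are hypothetical, and it ignores cell outages and regional degradations that do not show up as AZ failures.

```python
from typing import Mapping


def classify_outage(failed_azs_by_region: Mapping[str, int]) -> str:
    """Classify an outage within ONE cloud provider.

    failed_azs_by_region maps a region name to the number of AZs currently
    failing in that region (a hypothetical input for illustration).
    """
    # Two or more failed AZs in a region are treated as a Cloud Region outage.
    regions_out = [r for r, n in failed_azs_by_region.items() if n >= 2]

    # Two or more regions out in the same cloud are treated as cloud-wide.
    if len(regions_out) >= 2:
        return "cloud-wide outage"
    if len(regions_out) == 1:
        return "cloud region outage"
    if any(n == 1 for n in failed_azs_by_region.values()):
        return "availability zone outage"
    return "no outage"
```

A single failed AZ in each of two regions stays an Availability Zone outage under this sketch, since neither region reaches the two-AZ threshold.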

Example causes: a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure, or two or more regions in the same cloud experiencing independent regional outages at the same time.

Blast Radius: Most or all regions of a single cloud provider. Every Namespace and every Worker hosted in that cloud is potentially affected.

Mitigation: Multi-cloud Replication places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down.

SLA inclusion: Included in the SLA calculation only for Namespaces that have Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without this feature, a cloud-wide outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate.

Cloud-wide outages are the rarest category, but they have occurred. Multi-cloud Replication is designed to keep Namespaces running through such events.

RTO and RPO

When using Multi-cloud Replication for Temporal-managed failover:

  • RTO under 20 minutes. Temporal detects the cloud-wide disruption and fails the Namespace over to its replica in a different cloud provider.
  • RPO under 1 minute. Asynchronous replication keeps the replica close to the active region, even across cloud providers.

Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs.

How RTO and RPO are measured

Temporal Cloud achieves its RTO and RPO targets through High Availability replication. The following sections explain how each metric is measured and what factors can affect them.

RPO

Unlike a traditional database where data within the recovery point window may be permanently lost, Temporal Cloud durably persists all acknowledged data. After an outage resolves, Temporal's Recovery and Conflict Resolution process automatically syncs data back into the Namespace. The RPO therefore reflects the maximum data that may be temporarily unavailable in the replica at the moment of failover, not data that is permanently lost.

Temporal keeps replicas up to date using asynchronous replication, with monitoring, alerting, and internal SLOs on replication lag for every Namespace.

User actions on a Namespace can affect the recovery point. For example, suddenly spiking into much higher throughput than a Namespace has seen before could create a period of replication lag where the replica falls behind the active.

Temporal provides a replication lag metric for each Namespace. This metric approximates the recovery point the Namespace would achieve in a worst-case failure at that moment. Temporal recommends monitoring the replication lag and alerting if it rises above 1 minute.
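
The alerting recommendation above can be sketched as a simple threshold check. This is a minimal example under stated assumptions: the lag samples are plain numbers in seconds that you have already collected from your metrics system, the function name is hypothetical, and the `consecutive` parameter is an added noise filter that this page does not prescribe.

```python
from typing import Sequence

RPO_TARGET_SECONDS = 60.0  # the 1-minute RPO target from this page


def should_alert(
    lag_seconds: Sequence[float],
    threshold_seconds: float = RPO_TARGET_SECONDS,
    consecutive: int = 1,
) -> bool:
    """Return True when the most recent `consecutive` samples all exceed the threshold.

    lag_seconds is ordered oldest to newest. With consecutive=1 this alerts as
    soon as the latest sample rises above the threshold.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    recent = list(lag_seconds)[-consecutive:]
    if len(recent) < consecutive:
        return False  # not enough data to decide
    return all(sample > threshold_seconds for sample in recent)
```

Raising `consecutive` trades slower alerts for fewer false alarms during short throughput spikes.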

RTO

The Recovery Time for a given incident is measured from the moment the incident begins to cause abnormal Namespace operation — for example, when unavailability or error rates rise above an acceptable level — to the moment the Namespace is restored to full functionality.
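
To make the measurement concrete, here is a rough sketch that derives a recovery time from a series of error-rate windows. It assumes a disruption threshold of 10%, which this page describes only as a rule of thumb subject to Temporal's judgment, and it treats the end of the last disrupted window as the moment of restoration. The function name and data shape are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# (window start, window end, fraction of requests that failed in the window)
Window = Tuple[datetime, datetime, float]


def recovery_time(
    windows: Iterable[Window], disruption_threshold: float = 0.10
) -> timedelta:
    """Time from the start of the first disrupted window to the end of the last.

    A window counts as disrupted when its error rate is at or above the
    threshold. Returns zero when no window is disrupted.
    """
    disrupted = [
        (start, end) for start, end, rate in windows if rate >= disruption_threshold
    ]
    if not disrupted:
        return timedelta(0)
    first_start = min(start for start, _ in disrupted)
    last_end = max(end for _, end in disrupted)
    return last_end - first_start


# Example: three 5-minute windows with 10%, 25%, and 50% error rates.
t0 = datetime(2025, 1, 1, 10, 0, 0)  # arbitrary illustrative date
five = timedelta(minutes=5)
example = [
    (t0, t0 + five, 0.10),
    (t0 + five, t0 + 2 * five, 0.25),
    (t0 + 2 * five, t0 + 3 * five, 0.50),
]
print(recovery_time(example))  # 0:15:00
```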

For most incidents, the vast majority of the Recovery Time is spent detecting the incident, determining the affected boundary (a single cell, a region, or an entire cloud), and deciding to fail Namespaces over to their replicas. The actual time to complete the failover is usually a very small piece of the Recovery Time.

This Recovery Time covers only the Temporal Namespace. Your application's overall Recovery Time also depends on having enough healthy Workers that can reach the Namespace and process Workflows. Maintaining sufficient Worker capacity that can reach the replica region (or replica cloud) during a failover is your responsibility. You are also responsible for failing over any other regional dependencies your application relies on, such as replicated application databases.

Tips for a lower Recovery Time

To achieve the lowest possible recovery times, Temporal recommends that you:

  • Keep Temporal-initiated failovers enabled on your Namespace (the default).
  • Invest in a process to detect outages and trigger a manual failover.

You can trigger manual failovers on your Namespaces even if Temporal-initiated failovers are enabled. There are several benefits to combining a manual failover process with Temporal-initiated failovers:

  • You can detect outages that Temporal doesn't. In the cloud, regional outages don't affect all services equally. It's possible that Temporal, and the services it depends on, are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you monitor services in your critical path and alert on unusual error rates, you may catch outages before Temporal Cloud does.

  • You can sequence your failovers in a particular order. Your cloud infrastructure probably contains more pieces than just your Temporal Namespace: Temporal Workers, compute pools, data stores, and other cloud services. If you manually fail over, you can choose the order in which these pieces switch to the replica region. You can then test that ordering with failover drills and ensure it executes smoothly without data consistency issues or bottlenecks.

  • You can proactively fail over more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before knowing whether there's a true regional outage.

  • Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively fail over your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yours, yet.

Comparing RTO and SLA

Temporal has both a Recovery Time Objective (RTO) and a Service Level Agreement (SLA). They serve complementary purposes and apply in different situations.

| Aspect | RTO | SLA |
| --- | --- | --- |
| What is it? | An objective, or high-priority goal, for the total time that an outage disrupts a Namespace. | A contractual agreement that sets an upper bound on the service error rate, with financial repercussions. |
| How is it measured? | The achieved recovery time is measured in terms of minutes per outage. | The achieved service error rate is measured in terms of error rate per month. |
| How is the calculation performed? | The achieved recovery time in a given outage is the total time between when a disruption to a Namespace began and when the Namespace was restored to full functionality, either after a failover to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a formula to get the final percentage for the month. |
| Do partial degradations count? | Most outages contain periods of partial degradation where some percentage of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time. | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate. |
| What is excluded? | For partial degradations, what counts as a disruption to a Namespace is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%. | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the SLA page. |

The following examples illustrate the RTO and SLA calculations for different Namespaces in a regional outage. These hypothetical Namespaces are based on actual Temporal Cloud performance in a real-world outage.

Suppose that region middle-earth-1 experienced a cascading failure starting at 10:00:00 UTC, causing various instances and machines to fail over time. Temporal's automatic failover triggered for all Namespaces and completed at 10:15:00 UTC.

  • Namespace 0 was in the region but its cell was not affected by the outage. The only downtime it had was for a few seconds during the failover operation. It experienced a near-zero Recovery Time, and its service error rate was negligible. Graceful failover was successful, and this Namespace achieved a recovery point of 0.

  • Namespace 1_A was in the region and its cell experienced a partial degradation that caused 10% of requests to fail in the first 5 minutes, 25% in the second five minutes, and 50% in the third five minutes. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( (1 - 10%) + (1 - 25%) + (1 - 50%) + 8925 * 100% ) / 8928 = 99.990%. (Note: there are 8928 5-minute periods in a 31-day month.) Graceful failover was successful, and this Namespace achieved a recovery point of 0.

  • Namespace 1_B was in the same cell as Namespace 1_A, so it also experienced a partial degradation that caused 10% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually fail over at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 10%) + 8927 * 100% ) / 8928 = 99.999%. Graceful failover was successful, and this Namespace achieved a recovery point of 0.

  • Namespace 2_A was in the region and its cell was fully network partitioned at the start of the outage, causing 100% of requests to fail. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( 3 * (1 - 100%) + 8925 * 100% ) / 8928 = 99.97%. Because the Namespace was network partitioned, graceful failover did not succeed, and forced failover was used. The recovery point achieved was equal to the replication lag at the time of the network partition, which was a few seconds.

  • Namespace 2_B was in the region and was fully network partitioned, causing 100% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually fail over at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 100%) + 8927 * 100% ) / 8928 = 99.99%. Because the Namespace was network partitioned, graceful failover did not succeed, and forced failover was used. The recovery point achieved was equal to the replication lag at the time of the network partition, which was a few seconds.

All of the above Namespaces were in the affected region and beat the 1-minute RPO. But they achieved varying recovery times and service error rates.

  • Notice how Namespace 1_A and Namespace 2_A were both automatically failed over with the same recovery time but different service error rates. Notice how Namespace 2_B and Namespace 1_A happen to have nearly the same service error rate (about 99.99%) but different recovery times. This illustrates how RTO and SLA can differ, even in the same outage. Both are valuable tools for Temporal Cloud users to measure the availability of their Namespaces.

  • Notice how the Namespaces that were manually failed over (Namespace 1_B and Namespace 2_B) achieved lower recovery times than the Namespaces that were automatically failed over (Namespace 1_A and Namespace 2_A). This illustrates how proactive, aggressive manual failover can achieve a better recovery time than automatic failover.
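
The monthly percentages in the examples above can be reproduced with a short calculation. This sketch assumes, as the examples do, that the month is divided into 5-minute windows and that each window contributes its success fraction equally. The actual SLA formula is defined on the SLA page and may differ, and the function name here is hypothetical.

```python
from typing import Sequence


def monthly_success_percent(
    window_error_rates: Sequence[float], days_in_month: int = 31
) -> float:
    """Average success percentage over all 5-minute windows in a month.

    window_error_rates lists the error rate (0.0 to 1.0) of each affected
    window. Every other window in the month is assumed to be error-free.
    """
    total_windows = days_in_month * 24 * 12  # 8928 windows in a 31-day month
    if len(window_error_rates) > total_windows:
        raise ValueError("more affected windows than windows in the month")
    lost = sum(window_error_rates)
    return 100.0 * (total_windows - lost) / total_windows


examples = {
    "Namespace 1_A": [0.10, 0.25, 0.50],  # auto failover after 15 minutes
    "Namespace 1_B": [0.10],              # manual failover after 5 minutes
    "Namespace 2_A": [1.0, 1.0, 1.0],     # auto failover after 15 minutes
    "Namespace 2_B": [1.0],               # manual failover after 5 minutes
}
for name, rates in examples.items():
    print(f"{name}: {monthly_success_percent(rates):.3f}%")
```

Under these assumptions, Namespace 1_A and Namespace 2_B land within about 0.002 percentage points of each other despite recovery times of 15 and 5 minutes, which is the contrast the bullets above describe.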