(LEGACY) High Availability and Disaster Recovery


This guide is for large-scale production-level deployments.

In the context of Observability Pipelines, high availability means that the Observability Pipelines Worker remains available when system issues occur.

A diagram showing availability zone one with load balancer one offline, and both agents sending data to load balancer two and then to Worker one and Worker two. In availability zone two, Worker three is down, so both load balancers are sending data to Worker N

To achieve high availability:

Deploy at least two Observability Pipelines Worker instances in each Availability Zone.
Deploy Observability Pipelines Worker in at least two Availability Zones.
Front your Observability Pipelines Worker instances with a load balancer that balances traffic across Observability Pipelines Worker instances. See Capacity Planning and Scaling for more information.
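The first two requirements above can be checked mechanically. The following is a minimal sketch, not part of the product: the function name `check_high_availability` and the Worker-to-zone mapping are hypothetical, and it assumes you can export which availability zone each Worker instance runs in.

```python
from collections import Counter


def check_high_availability(worker_zones, min_zones=2, min_workers_per_zone=2):
    """Return a list of problems; an empty list means the layout passes.

    worker_zones is a hypothetical mapping of Worker name -> availability zone.
    """
    counts = Counter(worker_zones.values())
    problems = []
    # Requirement: Workers are deployed in at least two availability zones.
    if len(counts) < min_zones:
        problems.append(
            f"only {len(counts)} availability zone(s) in use, need at least {min_zones}"
        )
    # Requirement: each availability zone in use runs at least two Workers.
    for zone, n in sorted(counts.items()):
        if n < min_workers_per_zone:
            problems.append(
                f"zone {zone} has {n} Worker(s), need at least {min_workers_per_zone}"
            )
    return problems


layout = {"worker-1": "az-1", "worker-2": "az-1", "worker-3": "az-2"}
for problem in check_high_availability(layout):
    print(problem)
```

A check like this does not cover the load balancer requirement, which depends on how your platform exposes the Worker instances.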

Mitigating failure scenarios

Handling Observability Pipelines Worker process issues

To mitigate a system process issue, distribute the Observability Pipelines Worker across multiple nodes and front them with a network load balancer that can redirect traffic to another Observability Pipelines Worker instance as needed. In addition, platform-level automated self-healing should eventually restart the process or replace the node.

A diagram showing three nodes, where each node has an Observability Pipelines Worker

Mitigating node failures

To mitigate node issues, distribute the Observability Pipelines Worker across multiple nodes and front them with a network load balancer that can redirect traffic to another Observability Pipelines Worker node. In addition, platform-level automated self-healing should eventually replace the node.

A diagram showing data going to node one's load balancer, but because the Observability Pipelines Worker is down in node one, the data is sent to the Workers in node two and node N
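To illustrate the redirect behavior described in this section, here is a simplified sketch of a load balancer that skips Workers marked as down. It is an illustration only: the `RoundRobinBalancer` class and its methods are hypothetical, and a real network load balancer relies on its own health checks rather than manual `mark_down` calls.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that only returns healthy Workers."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.healthy = set(self.workers)
        self._next = 0  # index of the next Worker to try

    def mark_down(self, worker):
        # In practice a failed health check would trigger this.
        self.healthy.discard(worker)

    def mark_up(self, worker):
        # Called once self-healing has restarted the process or replaced the node.
        if worker in self.workers:
            self.healthy.add(worker)

    def pick(self):
        # Try each Worker at most once, starting from the rotation pointer.
        for _ in range(len(self.workers)):
            candidate = self.workers[self._next]
            self._next = (self._next + 1) % len(self.workers)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy Workers available")


lb = RoundRobinBalancer(["worker-1", "worker-2", "worker-3"])
lb.mark_down("worker-1")              # the Worker on node one is down
print([lb.pick() for _ in range(4)])  # traffic goes only to worker-2 and worker-3
```

Once `mark_up("worker-1")` is called, the Worker rejoins the rotation, which mirrors platform-level self-healing bringing a node back.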

Handling availability zone failures

To mitigate issues with availability zones, deploy the Observability Pipelines Worker across multiple availability zones.

A diagram showing the load balancers and Observability Pipelines Worker down in availability zone one, but load balancers and Workers in zone N still receiving and sending data

Mitigating region failures

Observability Pipelines Worker is designed to route internal observability data, and it should not fail over to another region. Instead, deploy Observability Pipelines Worker in each of your regions. If an entire network or region fails, the Observability Pipelines Worker in that region fails with it. See Networking for more information.

Disaster recovery

Internal disaster recovery

Observability Pipelines Worker is an infrastructure-level tool designed to route internal observability data. It implements a shared-nothing architecture and does not manage state that should be replicated or transferred to a disaster recovery (DR) site. If your entire region fails, Observability Pipelines Worker fails with it. For this reason, install the Observability Pipelines Worker in your DR site as part of your broader DR plan.

External disaster recovery

If you’re using a managed destination, such as Datadog, Observability Pipelines Worker can automatically route data to your Datadog DR site using its circuit breaker feature.
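This page does not describe how the Worker's circuit breaker is configured, so the following is only a sketch of the general circuit breaker pattern, not the Worker's implementation. All names (`CircuitBreaker`, `send_with_failover`, `send_primary`, `send_dr`) and the threshold and cool-down values are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Simplified breaker: opens after repeated failures, closes after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_after = reset_after              # cool-down in seconds, hypothetical
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker so the primary is retried.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def send_with_failover(event, send_primary, send_dr, breaker):
    """Send to the primary site unless the breaker is open; otherwise use the DR site."""
    if not breaker.is_open():
        try:
            send_primary(event)
            breaker.record_success()
            return "primary"
        except Exception:
            breaker.record_failure()
    # Primary failed or the breaker is open: route the event to the DR site.
    send_dr(event)
    return "dr"
```

In this sketch, `send_primary` and `send_dr` stand in for whatever delivers data to the primary and DR destinations. While the breaker is open, events skip the primary entirely, which avoids waiting on a destination that is known to be failing.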