High Availability support of the Datadog Agent


High Availability support of the Datadog Agent is in Preview. Reach out to your Datadog representative to sign up.

Overview

High Availability (HA) support of the Datadog Agent enables seamless failover between a designated active Agent and a standby Agent. If the active Agent becomes unavailable, due to unexpected issues or planned events like OS patches or Agent upgrades, the standby Agent automatically takes over. This configuration eliminates the Agent as a single point of failure, ensuring uninterrupted monitoring and increased resilience across your infrastructure.

You can configure Agents as active-standby pairs for several supported integrations. Failover to the standby Agent completes within 90 seconds. You can also designate a preferred active Agent, which automatically resumes the active role when it becomes available again. This lets you switch Agents proactively ahead of scheduled maintenance.

Supported integrations

The following integrations are supported for High Availability:

Category              Supported Integrations
Network Monitoring    SNMP, Network Path, HTTP Check
Vendor-Specific       Cisco ACI, Cisco SD-WAN

Prerequisites

Supported Operating Systems:

  • Linux
  • Windows
  • macOS

Setup

Installation

  1. Install the Datadog Agent on two hosts, one Agent per host. This setup assumes that the hosts have similar capabilities (CPU, RAM, and networking) and configurations (including datadog.yaml and integration settings).

  2. Configure datadog.yaml on each host with the following settings:

    ha_agent:
      enabled: true
    config_id: <CONFIG-NAME>  # example: "my-ndm-agents"
                              # only use lowercase alphanumerics, hyphen and underscore
    
  3. Configure one of the supported integrations for High Availability:

    For example, to set up the SNMP integration, install it on both Agents using the SNMP Metrics setup guide.
    Note: Both individual device monitoring and Autodiscovery methods are supported for the SNMP integration.

    After the Agents are configured, they function as an HA pair:

    • The installed integration runs only on the active Agent.
    • If the active Agent or host fails (due to a crash or shutdown), the standby Agent automatically takes over, maintaining uninterrupted monitoring.
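A quick pre-deployment check of the settings from step 2 can be sketched as follows. This is a minimal sketch, not a Datadog tool: the function names are hypothetical, the regular expression is one reading of the "lowercase alphanumerics, hyphen and underscore" rule, and the check that both hosts share the same config_id is an assumption based on the pair being looked up by a single config_id. The configs are passed in as already-parsed dictionaries.

```python
import re

# Assumed reading of the rule in the config comment:
# lowercase letters, digits, hyphen, and underscore only.
CONFIG_ID_PATTERN = re.compile(r"[a-z0-9_-]+")


def is_valid_config_id(config_id: str) -> bool:
    """Return True if config_id is non-empty and uses only allowed characters."""
    return CONFIG_ID_PATTERN.fullmatch(config_id) is not None


def check_pair(config_a: dict, config_b: dict) -> list:
    """Return a list of problems found in two parsed datadog.yaml dictionaries."""
    problems = []
    for name, cfg in (("first", config_a), ("second", config_b)):
        # ha_agent.enabled must be true on each host.
        if not (cfg.get("ha_agent") or {}).get("enabled", False):
            problems.append(f"{name} host: ha_agent.enabled is not true")
        # config_id is a top-level key, not nested under ha_agent.
        if not is_valid_config_id(str(cfg.get("config_id", ""))):
            problems.append(f"{name} host: config_id is empty or has invalid characters")
    # Assumption: both Agents in a pair use the same config_id.
    if config_a.get("config_id") != config_b.get("config_id"):
        problems.append("config_id differs between the two hosts")
    return problems


if __name__ == "__main__":
    host_a = {"ha_agent": {"enabled": True}, "config_id": "my-ndm-agents"}
    host_b = {"ha_agent": {"enabled": True}, "config_id": "my-ndm-agents"}
    print(check_pair(host_a, host_b))
```

An empty list means no problems were found under these assumptions. It does not confirm that the Agents have actually formed an HA pair.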

Define a preferred active Agent

  1. Go to Integrations > Fleet Automation > View Agents.

  2. Search for your previously configured Agents using tags or hostname, for example, config_id:<CONFIG-NAME>.

    [Screenshot: Fleet Automation View Agents]

  3. Click the Agent you want to designate as the preferred active Agent to open a side panel.

  4. In the HA Preferred Active Agent dropdown, select the Agent you would like to define as preferred.

    [Screenshot: Fleet Automation View Agents, highlighting HA Preferred Active Agent]

Testing and validation

  1. Test failover by shutting down the active Agent or its host.
  2. Confirm that the standby Agent starts monitoring the configured integrations. This should happen within 1-3 minutes.

FAQ

How is the active Agent determined?

Without a preferred active Agent:

  • The active Agent is initially selected at random.
  • Failover occurs only when the current active Agent shuts down or crashes.
  • When the previously active Agent recovers, it does not automatically reclaim the active role.

With a preferred active Agent:

  • The preferred Agent always takes priority when available.
  • If it fails, the standby Agent becomes active.
  • When the preferred Agent recovers, it automatically resumes the active role, and the standby Agent returns to standby.
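The selection rules above can be modeled with a small function. This is an illustrative sketch of the behavior described in this FAQ, not the Agent's actual implementation: the function name, the Agent names, and the representation of availability as a set of hostnames are all hypothetical.

```python
import random
from typing import Optional, Set


def next_active(
    current_active: Optional[str],
    available: Set[str],
    preferred: Optional[str] = None,
) -> Optional[str]:
    """Return the Agent that should be active, given which Agents are available."""
    if not available:
        # No Agent is running, so nothing can be active.
        return None
    if preferred is not None and preferred in available:
        # A preferred Agent always takes priority when it is available.
        return preferred
    if current_active in available:
        # Without an available preferred Agent, the current active Agent
        # keeps its role; a recovered Agent does not reclaim it.
        return current_active
    # Initial selection or failover: pick among the available Agents.
    # The initial choice is described as random; in a pair, failover
    # leaves only one candidate.
    return random.choice(sorted(available))


if __name__ == "__main__":
    # agent-a failed, agent-b takes over.
    print(next_active("agent-a", {"agent-b"}))
    # agent-a recovers with no preference set: agent-b stays active.
    print(next_active("agent-b", {"agent-a", "agent-b"}))
    # agent-a recovers and is preferred: it resumes the active role.
    print(next_active("agent-b", {"agent-a", "agent-b"}, preferred="agent-a"))
```

The only difference between the two modes is the second branch: with a preferred Agent set, recovery triggers a switch back, and without one the roles stay as they are until the next failure.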

Why does my Agent have an unknown HA Agent state?
