High Availability support of the Datadog Agent
High Availability support of the Datadog Agent is in Preview. Reach out to your Datadog representative to sign up.
Overview
High Availability (HA) support of the Datadog Agent enables seamless failover between a designated active Agent and a standby Agent. If the active Agent becomes unavailable, due to unexpected issues or planned events like OS patches or Agent upgrades, the standby Agent automatically takes over. This configuration eliminates the Agent as a single point of failure, ensuring uninterrupted monitoring and increased resilience across your infrastructure.
You can configure Agents as active-standby pairs for several supported integrations. If the active Agent becomes unavailable, the standby Agent takes over within 90 seconds. You can also designate a preferred active Agent, which automatically resumes the active role when it becomes available again. This lets you switch Agents proactively ahead of scheduled maintenance.
Supported integrations
The following integrations are supported for High Availability:
Prerequisites
Supported Operating Systems:
Setup
Installation
Install the Datadog Agent on two hosts, one Agent per host. This setup assumes that both hosts have similar capabilities (CPU, RAM, and networking) and configurations (including datadog.yaml and integration settings).
Configure your datadog.yaml on each host with the following settings:
ha_agent:
  enabled: true
config_id: <CONFIG-NAME> # example: "my-ndm-agents"
                         # only use lowercase alphanumerics, hyphen and underscore
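Because both Agents in a pair need a config_id that follows the character rule in the comment above, it can help to check the value before rolling it out. The following is a minimal sketch, not part of the Agent: the function name is_valid_config_id is hypothetical, and the pattern is an assumption derived only from the comment (lowercase alphanumerics, hyphen, underscore).

```python
import re

# Assumption: this pattern encodes the rule stated in the datadog.yaml comment
# (lowercase alphanumerics, hyphen, underscore). It is not taken from the Agent.
_CONFIG_ID_PATTERN = re.compile(r"[a-z0-9_-]+")


def is_valid_config_id(config_id: str) -> bool:
    """Return True if config_id contains only the permitted characters."""
    # fullmatch requires the whole string to match, so empty strings,
    # uppercase letters, and whitespace are all rejected.
    return _CONFIG_ID_PATTERN.fullmatch(config_id) is not None
```

For example, is_valid_config_id("my-ndm-agents") passes, while a value such as "My NDM Agents" is rejected because of the uppercase letters and spaces.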
Configure one of the supported integrations for High Availability:
For example, to set up the SNMP integration, install it on both Agents using the SNMP Metrics setup guide.
Note: Both individual device monitoring and Autodiscovery methods are supported for the SNMP integration.
After the Agents are configured, they function as an HA pair:
- The installed integration runs only on the active Agent.
- If the active Agent or host fails (due to a crash or shutdown), the standby Agent automatically takes over, maintaining uninterrupted monitoring.
Define a preferred active Agent
Go to Integrations > Fleet Automation > View Agents.
Search for your previously configured Agents using tags or hostname, for example, config_id:<CONFIG-NAME>.
Click the Agent you want to designate as the preferred active Agent to open a side panel.
In the HA Preferred Active Agent dropdown, select the Agent you would like to define as preferred.
Testing and validation
- Test failover by shutting down the active Agent or its host.
- The standby Agent should start monitoring the configured integration(s) after 1-3 minutes.
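A failover test like this can be scripted as a polling loop with a deadline. The sketch below is illustrative only: wait_for_takeover and its parameters are hypothetical names, and the is_standby_active callable is something you supply yourself (for example, a check that fresh metrics from the integration are arriving). The default timeout of 180 seconds is an assumption taken from the upper end of the 1-3 minute window above.

```python
import time
from typing import Callable


def wait_for_takeover(
    is_standby_active: Callable[[], bool],
    timeout_s: float = 180.0,   # assumption: upper end of the 1-3 minute window
    interval_s: float = 10.0,   # assumption: arbitrary polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_standby_active until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_standby_active():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The sleep and clock parameters are injectable so that the loop can be exercised in a test without actually waiting.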
FAQ
How is the active Agent determined?
Without a preferred active Agent:
- The active Agent is initially selected at random.
- Failover occurs only when the current active Agent shuts down or crashes.
- When the previously active Agent recovers, it does not automatically reclaim the active role.
With a preferred active Agent:
- The preferred Agent always takes priority when available.
- If it fails, the standby Agent becomes active.
- When the preferred Agent recovers, it automatically resumes the active role, and the standby Agent returns to standby.
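The rules above can be summarized as a small decision function. This is a simplified model of the documented behavior, not Datadog's implementation: choose_active and its parameters are hypothetical names, and Agents are represented only by their hostnames.

```python
import random
from typing import Optional, Set


def choose_active(
    current_active: Optional[str],
    preferred: Optional[str],
    available: Set[str],
) -> Optional[str]:
    """Model which Agent in an HA pair holds the active role."""
    if not available:
        return None  # no Agent is reachable, so nothing can be active
    # A preferred Agent always takes priority when it is available.
    if preferred is not None and preferred in available:
        return preferred
    # Otherwise the current active Agent keeps its role while it is up,
    # so a recovered Agent does not reclaim the role on its own.
    if current_active is not None and current_active in available:
        return current_active
    # Initial selection is random; on failover with a pair, only the
    # standby remains, so the choice is forced.
    return random.choice(sorted(available))
```

For instance, with no preferred Agent, choose_active("agent-b", None, {"agent-a", "agent-b"}) keeps "agent-b" active even after "agent-a" has recovered, whereas passing preferred="agent-a" returns "agent-a".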
Why does my Agent have an unknown HA Agent state?