Troubleshooting Monitor Alerts

Overview

This guide covers foundational concepts that can help you determine whether your monitor’s alerting behavior is valid. If you suspect that your monitor’s evaluations do not accurately reflect the underlying data, use the sections below to inspect your monitor and troubleshoot its state, underlying data, alert conditions, groups, and notifications.

Monitor state and monitor status

While monitor evaluations are stateless, meaning that the result of a given evaluation does not depend on the results of previous evaluations, monitors themselves are stateful, and their state is updated based on the evaluation results of their queries and configurations. A monitor evaluation with a given status won’t necessarily cause the monitor’s state to change to the same status. See below for some potential causes:

Metrics are too sparse within a metric monitor’s evaluation window

If metrics are absent from a monitor’s evaluation window, and the monitor is not configured to anticipate no-data conditions, the evaluation may be skipped. In such a case, the monitor state is not updated, so a monitor previously in the OK state remains OK, and a monitor in the Alert state remains in Alert. To check for this, open the history graph on the monitor status page and select the group and time frame of interest. If data is sparsely populated, see monitor arithmetic and sparse metrics for more information.
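The effect of a skipped evaluation can be illustrated with a minimal sketch. This is a simplified model written for this guide, not Datadog's implementation: the function name `next_state` is hypothetical, and `None` stands in for an evaluation that was skipped because the window held no data. It models only the skipped-evaluation case and ignores the other reasons a state can differ from an evaluation status.

```python
from typing import Optional


def next_state(current_state: str, evaluation_status: Optional[str]) -> str:
    """Return the monitor state after one evaluation (simplified model).

    evaluation_status is None when the evaluation was skipped, for example
    because no data points fell inside the evaluation window.
    """
    if evaluation_status is None:
        # Skipped evaluation: the previous state is carried forward unchanged.
        return current_state
    return evaluation_status


# A monitor in Alert stays in Alert while evaluations are skipped,
# and only changes once an evaluation actually runs.
state = "Alert"
for status in ["Alert", None, None, "OK"]:
    state = next_state(state, status)
    print(state)
```

In this model, a sparse metric can keep a monitor in Alert long after the last data point that breached the threshold, which is why the history graph is the first thing to check.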

Monitor state updates due to external conditions

The state of a monitor can also update in the absence of a monitor evaluation, for example, due to auto-resolve.

Verify the presence of data

If your monitor’s state or status is not what you expect, confirm the behavior of the underlying data source. For a metric monitor, you can use the history graph to view the data points being pulled in by the metric query.

Alert conditions

Unexpected monitor behavior can sometimes be the result of misconfigured alert conditions, which vary by monitor type. If your monitor query uses the as_count() function, check the as_count() in Monitor Evaluations guide.

If you use recovery thresholds, check the conditions listed in the recovery thresholds guide to see if the behavior is expected.

Monitor status and groups

For both monitor evaluations and state, status is tracked by group.

For a multi alert monitor, a group is a set of tags with one value for each grouping key (for example, env:dev, host:myhost for a monitor grouped by env and host). For a simple alert, there is only one group (*), representing everything within the monitor’s scope.
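As an illustration of how groups are formed, the sketch below buckets tagged data points by grouping key. The helper names (`group_key`, `bucket`) and the sample values are hypothetical and exist only for this example; the sketch shows the concept of one group per combination of tag values, not how Datadog evaluates queries internally.

```python
from collections import defaultdict


def group_key(tags, group_by):
    """Build a group identifier with one value for each grouping key."""
    if not group_by:
        # Simple alert: a single group covering everything in scope.
        return "*"
    return ",".join(f"{key}:{tags[key]}" for key in group_by)


def bucket(points, group_by):
    """Collect data point values under the group they belong to."""
    groups = defaultdict(list)
    for tags, value in points:
        groups[group_key(tags, group_by)].append(value)
    return dict(groups)


# Illustrative data points: (tags, value)
points = [
    ({"env": "dev", "host": "myhost"}, 0.91),
    ({"env": "dev", "host": "otherhost"}, 0.42),
    ({"env": "dev", "host": "myhost"}, 0.95),
]

print(bucket(points, ["env", "host"]))  # multi alert grouped by env and host
print(bucket(points, []))               # simple alert
```

Because status is tracked per group, the two hosts in this example would each carry their own status in a multi alert monitor, while the simple alert has a single status for all three points.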

By default, Datadog keeps monitor groups available in the UI for 24 hours, or 48 hours for host monitors, unless the query is changed. See Monitor settings changes not taking effect for more information.

If you anticipate creating new monitor groups within the scope of your multi alert monitors, you may want to configure a delay for the evaluation of these new groups. This can help you avoid alerts from the expected behavior of new groups, such as high resource usage associated with the creation of a new container. Read new group delay for more information.

If your monitor queries for crawler-based cloud metrics, use an evaluation delay to ensure that the metrics have arrived before the monitor evaluates. Read cloud metric delay for more information about cloud integration crawler schedules.
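Both delays are typically set in the monitor's options. The sketch below builds a monitor definition as a plain Python dictionary, assuming the `new_group_delay` and `evaluation_delay` option names (both in seconds) from the Datadog monitor API. The monitor name, query, and delay values are illustrative assumptions, so verify field names and suitable values against the API reference before use.

```python
import json

# Illustrative monitor definition; the name, query, and values are assumptions.
monitor = {
    "name": "High CPU on EC2 hosts (example)",
    "type": "query alert",
    "query": "avg(last_5m):avg:aws.ec2.cpuutilization{*} by {host} > 90",
    "message": "CPU is high on {{host.name}}",
    "options": {
        # Wait this many seconds before evaluating a newly appearing group.
        "new_group_delay": 300,
        # Shift evaluation back by this many seconds so that crawler-based
        # cloud metrics have time to arrive.
        "evaluation_delay": 900,
    },
}

# Serialize the definition, for example to review it or send it to an API client.
print(json.dumps(monitor, indent=2))
```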

Notification issues

If your monitor is behaving as expected, but producing unwanted notifications, there are multiple options to reduce or suppress notifications:

  • For monitors that rapidly change between states, read reduce alert flapping for ways to minimize alert fatigue.
  • For alerts that are expected or otherwise not useful for your organization, use Downtimes to suppress unwanted notifications.
  • To control alert routing, use template variables, and use conditional variables to separate warning and alert states.

Absent notifications

If you suspect that notifications are not being delivered, check the following:

  • Check email preferences for the recipient and ensure that Notification from monitor alerts is checked.
  • Check the event stream for events with the string Error delivering notification.

Opsgenie multi-notification

If your monitor uses multiple @opsgenie-[...] notifications, Datadog sends them to Opsgenie with the same alias. Opsgenie treats notifications that share an alias as duplicates, so it discards all but one of them.
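The deduplication behavior can be pictured with a small sketch. This is a simplified model under stated assumptions, not Opsgenie's actual implementation: the `deliver` function, the handle names, and the alias value are all hypothetical.

```python
def deliver(notifications):
    """Keep the first notification per alias; treat later ones as duplicates.

    notifications is a list of (handle, alias) pairs in the order sent.
    """
    seen_aliases = set()
    created, discarded = [], []
    for handle, alias in notifications:
        if alias in seen_aliases:
            # Same alias as an earlier notification: treated as a duplicate.
            discarded.append(handle)
        else:
            seen_aliases.add(alias)
            created.append(handle)
    return created, discarded


# Two hypothetical handles notified by one monitor share one alias.
created, discarded = deliver([
    ("@opsgenie-team-a", "example-monitor-alias"),
    ("@opsgenie-team-b", "example-monitor-alias"),
])
print(created)
print(discarded)
```

In this model only the first handle results in an alert, which matches the symptom of a second Opsgenie target never being notified.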
