Incident Management Analytics

Overview

Use Incident Analytics to learn from past incidents and to understand how efficiently your incident response process performs. Incident Analytics lets you pull aggregated statistics on your incidents over time. You can use these statistics to create reports that help you:

  • Analyze whether your incident response process is improving over time
  • Assess your mean time to resolution
  • Identify areas for improvement that are worth investing in

Data collected

Incident Management Analytics is a queryable data source for aggregated incident statistics. You can query these analytics in a variety of graph widgets in both Dashboards and Notebooks to analyze the history of your incident response over time. To give you a starting point, Datadog provides out-of-the-box resources that you can clone and customize, including the Incident Report Notebook template described under Incident report.

Measures

Incident Management collects the following measures, which you can use to build analytics queries:

  • Incident Count
  • Customer Impact Duration
  • Status Active Duration
  • Status Stable Duration
  • Time to Detect
  • Time to Repair (customer impact end time - created time)
  • Time to Resolve (resolved time - created time)
  • Number of Users Impacted
  • Acknowledge

In addition to these defaults, you can create new measures by adding custom Number property fields in your Incident Settings.
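
The two measures that are defined by a formula in the list above can be illustrated with a minimal sketch. The `Incident` record and its field names are hypothetical and chosen only for this example; they are not Datadog's schema, and the timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record for illustration; not Datadog's incident schema.
    created: datetime
    customer_impact_end: Optional[datetime]
    resolved: Optional[datetime]


def time_to_repair(incident: Incident) -> Optional[timedelta]:
    """Time to Repair: customer impact end time - created time."""
    if incident.customer_impact_end is None:
        return None  # customer impact has not ended yet
    return incident.customer_impact_end - incident.created


def time_to_resolve(incident: Incident) -> Optional[timedelta]:
    """Time to Resolve: resolved time - created time."""
    if incident.resolved is None:
        return None  # incident is not resolved yet
    return incident.resolved - incident.created


# Made-up sample incident.
incident = Incident(
    created=datetime(2024, 3, 4, 10, 0),
    customer_impact_end=datetime(2024, 3, 4, 10, 45),
    resolved=datetime(2024, 3, 4, 12, 30),
)

print("Time to Repair:", time_to_repair(incident))
print("Time to Resolve:", time_to_resolve(incident))
```

Returning `None` for an incident that is still open is a design choice of this sketch, so that unfinished incidents can be excluded from an average rather than counted as zero.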

Visualize incident data in dashboards

To configure your graph using Incident Management Analytics data, follow these steps:

  1. Select your visualization.
  2. Select Incidents from the data source dropdown menu.
  3. Select a measure from the yellow dropdown menu.
    • Default Statistic: Counts the number of incidents.
  4. Select an aggregation for the measure.
  5. (Optional) Select a rollup for the measure.
  6. (Optional) Use the search bar to filter the statistic down to a specific subset of incidents.
  7. (Optional) Select a facet in the pink dropdown menu to group the measure by that facet, and limit the number of groups to display.
  8. Title the graph.
  9. Save your widget.

Example: Weekly outage customer impact duration grouped by service

Figure: Timeseries graph configuration showing the Incidents data source filtered by severity, with customer impact duration grouped by service.

This example configuration aggregates only the incidents with severity SEV-1 or SEV-2. The graph displays the average Customer Impact Duration of those incidents for each week, grouped by service and limited to the top five services.

  1. Widget: Timeseries Line Graph
  2. Data source: Incidents
  3. Measure: Customer Impact Duration
  4. Aggregation: avg
  5. Rollup: 1w
  6. Filter: severity:("SEV-1" OR "SEV-2")
  7. Group: Services, limit to top 5
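
To make the configuration above concrete, the following is a rough local sketch of the calculation it describes: filter by severity, bucket by week, average the customer impact duration, and group by service. It runs on made-up in-memory records and does not call any Datadog API. The field names, the Monday week boundary, and the way the top services are ranked (by overall average) are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Hypothetical sample records; field names and values are made up.
incidents = [
    {"day": date(2024, 3, 4), "severity": "SEV-1", "service": "checkout", "impact_minutes": 30},
    {"day": date(2024, 3, 6), "severity": "SEV-2", "service": "checkout", "impact_minutes": 50},
    {"day": date(2024, 3, 7), "severity": "SEV-3", "service": "checkout", "impact_minutes": 500},
    {"day": date(2024, 3, 5), "severity": "SEV-2", "service": "search", "impact_minutes": 20},
    {"day": date(2024, 3, 12), "severity": "SEV-1", "service": "search", "impact_minutes": 60},
]


def week_start(day):
    """Return the Monday of the week containing `day` (assumed 1w bucket)."""
    return day - timedelta(days=day.weekday())


def weekly_avg_impact(records, severities=("SEV-1", "SEV-2"), limit=5):
    # Filter: severity:("SEV-1" OR "SEV-2")
    matching = [r for r in records if r["severity"] in severities]

    # Group limit: keep the top `limit` services.
    # Ranking by overall average is an assumption of this sketch.
    by_service = defaultdict(list)
    for r in matching:
        by_service[r["service"]].append(r["impact_minutes"])
    top = sorted(by_service, key=lambda s: mean(by_service[s]), reverse=True)[:limit]

    # Rollup 1w + aggregation avg, keyed by (week start, service).
    buckets = defaultdict(list)
    for r in matching:
        if r["service"] in top:
            buckets[(week_start(r["day"]), r["service"])].append(r["impact_minutes"])
    return {key: mean(values) for key, values in buckets.items()}


result = weekly_avg_impact(incidents)
for (week, service), avg in sorted(result.items()):
    print(week, service, avg)
```

Note that the SEV-3 record is dropped by the filter before any averaging happens, which is why its large duration does not skew the weekly average for its service.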

Incident report

To get a summary report of incidents for your team or service, create an Incident Report from the out-of-the-box Notebook template, or build one from scratch.

  1. Open the Incident Report template.
  2. Click Use Template to edit and customize.
  3. Use the existing Incident cells as they are, or customize their queries to display values for each measure.
  4. Update the summary cells with the relevant values and share the report with the rest of your team.
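
If you want to sanity-check the values you enter in the summary cells, a small sketch like the following can help. It assumes you already have Time to Resolve values in seconds from some export; the input list, the function names, and the output format are all hypothetical and not part of the Notebook template.

```python
from statistics import mean


def format_duration(seconds):
    """Render a duration in seconds as hours and minutes, for example '1h 30m'."""
    total_minutes = int(round(seconds / 60))
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


def summarize(resolve_seconds):
    """Build summary values from a list of Time to Resolve durations (seconds)."""
    if not resolve_seconds:
        return {"incident_count": 0, "mean_time_to_resolve": "n/a"}
    return {
        "incident_count": len(resolve_seconds),
        "mean_time_to_resolve": format_duration(mean(resolve_seconds)),
    }


# Made-up durations: 30 minutes, 90 minutes, and 150 minutes.
summary = summarize([1800, 5400, 9000])
for name, value in summary.items():
    print(f"{name}: {value}")
```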
