Getting Started with Incident Management
Overview
Datadog Incident Management is for tracking and communicating about an issue you’ve identified with your metrics, traces, or logs.
This guide walks you through using the Datadog site to declare an incident, update it as investigation and remediation progress, and generate a postmortem once the incident is resolved. The example assumes the Slack integration is enabled.
Walking through an incident from issue detection to resolution
Declaring an incident
Scenario: A monitor is alerting on a high number of errors which may be slowing down several services. It’s unclear whether customers are being impacted.
This guide describes using the Datadog Clipboard to declare an incident. Using the Clipboard, you can gather information from different sources, such as graphs, monitors, entire dashboards, or notebooks. This helps you provide as much information as possible when declaring an incident.
- In Datadog, navigate to Dashboard List and select System - Metrics.
- Hover over one of the graphs and copy it to the Clipboard with one of the following commands:
- Ctrl/Cmd + C
- Click the Export icon on the graph and select Copy.
- In the Datadog menu on the left-hand side, go to Monitors > Monitors List and select [Auto] Clock in sync with NTP.
- Open the Clipboard: Ctrl/Cmd + Shift + K.
- In the Clipboard, click Add current page to add the monitor to the Clipboard.
- Click Select All and then Export items to…
- Select Declare Incident.
- Describe what’s happening:
| Field | Description |
|---|---|
| Title | Follow any naming conventions your team wants to use for incident titles. Because this is not a real incident, include the word TEST to make it clear that this is a test incident. An example title: [TEST] My incident test |
| Severity Level | Set to Unknown since it’s unclear whether customers are being impacted and how related services are being impacted. See the in-app description of what each severity level means and follow your team’s guidelines. |
| Incident Commander | Leave this assigned to you. In an actual incident this would be assigned to the leader of the incident investigation. You or others can update who the incident commander is as the incident investigation progresses. |
| Notifications | Leave blank because this is only a test, and you don’t want to alert anyone else or another service. For an actual incident, add people and services that should be notified to help with the investigation and remediation. You can send these notifications to Slack and PagerDuty as well. |
- Click Declare Incident to create the incident.
You can also declare an incident from a graph, monitor, or the incidents API. For APM users, you can click the incidents icon on any APM graph to declare an incident.
As part of the Slack integration, you can also use the /datadog incident shortcut to declare an incident and set the title, severity, and customer impact.
- Click Slack Channel on the incident’s page to go to the incident’s Slack channel.
A new Slack channel dedicated to the incident is automatically created for any new incident, so that you can consolidate communication with your team and begin troubleshooting. If your organization’s Slack integration is set up to update a global incident channel, then the channel is updated with the new incident.
In this example, you are the only one added to the new incident channel. When you add people or services in Notifications for an actual incident, all recipients are automatically added to the incident channel.
If you don’t have the Slack integration enabled, click Add Chat to add the link to the chat service you are using to discuss the incident.
Click Add Video Call to add a link to the call where discussions about the incident are happening.
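The incidents API mentioned earlier in this section can be used to declare the same test incident from a script. The following is a minimal sketch, not a definitive client: it only builds the request body and headers and does not send anything. It assumes the v2 incidents endpoint on the US1 site (other Datadog sites use a different host) and a body with a title and a customer_impacted flag; the helper names build_incident_payload and build_headers are hypothetical. Check the Incidents API reference for the exact fields your organization requires.

```python
import json

# Assumption: v2 incidents endpoint on the US1 site; other sites use a different host.
INCIDENTS_URL = "https://api.datadoghq.com/api/v2/incidents"


def build_incident_payload(title: str, customer_impacted: bool) -> dict:
    """Build a request body for declaring an incident (hypothetical helper)."""
    return {
        "data": {
            "type": "incidents",
            "attributes": {
                "title": title,
                "customer_impacted": customer_impacted,
            },
        }
    }


def build_headers(api_key: str, app_key: str) -> dict:
    """Build authentication headers from an API key and an application key."""
    return {
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }


# Mirror the walkthrough: a test incident with no confirmed customer impact yet.
payload = build_incident_payload("[TEST] My incident test", customer_impacted=False)
print(json.dumps(payload, indent=2))
```

To actually declare the incident, you would POST this body to INCIDENTS_URL with the headers from build_headers, using the HTTP client of your choice.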
Troubleshooting and updating the incident
The Incident page has four main sections: Overview, Timeline, Remediation, and Notifications. Update these sections as the incident progresses to keep everyone informed of the current status.
Overview
Scenario: After some investigation, you discover that the root cause is a host running out of memory. You’ve also been informed that a small subset of customers are being affected and seeing slow loading of pages. The first customer report came in 15 minutes ago. It is a SEV-3 incident.
In the Overview section, you can update incident fields and customer impact as the investigation continues.
To update the severity level and root cause:
- Click the Severity dropdown and select SEV-3.
- Under What happened, select Monitor in the Detection Method dropdown (it is currently set to Unknown), because you were first alerted to the issue by a monitor.
- Add to the Why it happened field:
TEST: Host is running out of memory.
- Click Save to update the properties.
From Slack, you can also update the title, severity, or status of an ongoing issue using the /datadog incident update command.
To add the customer impact:
- Click + Add in the Impact section.
- Change the timestamp to 15 minutes earlier, because that was when the first customer report came in.
- Add to the description field:
TEST: Some customers seeing pages loading slowly.
- Click Save to update the fields. The Impact section updates to show how long the customer impact has been going on. All changes made on the Overview page are added to the Timeline.
Timeline
The Timeline shows additions and changes to incident fields and information in chronological order.
- Click the Timeline tab.
- Find the Impact added event and mark it as Important by clicking the flag icon.
- Add a note to the timeline:
I found the host causing the issue.
- Hover over the note’s event and click the pencil icon to change the timestamp of the note because you actually found the host causing the issue 10 minutes ago.
- Flag the note as Important.
- Click Slack Channel to go back to the incident’s Slack channel.
- Post a message in the channel saying
I am working on a fix.
- Click the message’s actions command icon (three dots on the right after hovering over a message).
- Select Add to Incident to send the message to the timeline.
You can add any Slack comment in the incident channel to the timeline so that you can consolidate important communications related to the investigation and mitigation of the incident.
Remediation
Scenario: There’s a notebook on how to handle this kind of issue, which includes tasks that need to be done to fix it.
In the Remediation section, you can keep track of documents and tasks for investigating the issue or for post-incident remediation tasks.
- Click the Remediation tab.
- Click the plus icon + in the Documents box and add a link to a Datadog notebook. All updates to the Documents section are added to the timeline as an Incident Update type.
- Add a task by adding a description of a task in the Incident Tasks box, for example:
Run the steps in the notebook.
- Click Create Task.
- Click Assign To and assign yourself the task.
- Click Set Due Date and set the date for today.
All task additions and changes are recorded in the Timeline.
You can also add post-incident tasks in the Remediation section to keep track of them.
Notifications
Scenario: The issue has been mitigated, and the team is monitoring the situation. The incident status is stable.
In the Notifications section, you can send out a notification updating the status of the incident.
- Navigate back to the Overview section.
- Change the status in the dropdown menu from ACTIVE to STABLE.
- Go to the Notifications tab.
- Click New Notification.
The default message has the incident’s title in the subject and information about the current status of the incident in the body.
In an actual incident you would send updates to the people involved in the incident. For this example, send a notification to yourself only.
- Add yourself to the Recipients field.
- Click Send.
You should receive an email with the message.
You can create customized message templates. Group templates together using the Category field.
Resolution and postmortem
Scenario: It’s been confirmed that the issue no longer impacts customers and that you’ve resolved the issue. The team wants a postmortem to look back on what went wrong.
- Go to the Overview section.
- Change the status from STABLE to RESOLVED so that it’s no longer active. You can also change the date and time for when the customer impact ended if it occurred earlier.
- When an incident’s status is set to resolved, a Generate Postmortem button appears at the top. Click Generate Postmortem.
- For the timeline section, select Marked as Important so that only the Important events are added to the postmortem.
- Click Generate.
The postmortem is generated as a Datadog Notebook, and it includes the timeline events and resources referenced during the investigation and remediation. This makes it easier to review and further document what caused the issue and how to prevent it in the future. Datadog Notebook supports live collaboration so you can edit it with your teammates in real-time.
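The effect of selecting Marked as Important can be pictured as a filter over the timeline: only flagged events are kept, in chronological order. The sketch below is an illustration only, not Datadog's implementation. The TimelineEvent class, the postmortem_events function, and the timestamps are all hypothetical, and the events are borrowed from the walkthrough above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    """Hypothetical stand-in for one entry on the incident timeline."""
    timestamp: datetime
    description: str
    important: bool = False


def postmortem_events(events):
    """Keep only events flagged as Important, oldest first."""
    return sorted((e for e in events if e.important), key=lambda e: e.timestamp)


# Arbitrary fixed reference time so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0)

timeline = [
    TimelineEvent(now - timedelta(minutes=5), "Impact added", important=True),
    TimelineEvent(now - timedelta(minutes=2), "Severity changed to SEV-3"),
    # The note's timestamp was edited to 10 minutes ago, so it sorts first.
    TimelineEvent(now - timedelta(minutes=10), "I found the host causing the issue.", important=True),
    TimelineEvent(now - timedelta(minutes=1), "I am working on a fix."),
]

selected = postmortem_events(timeline)
for event in selected:
    print(event.timestamp.isoformat(), event.description)
```

Because the note was backdated, it appears before the Impact added event even though it was written later, which is why correcting timestamps during the incident pays off in the postmortem.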
If there are follow-up tasks that you and your team need to complete to ensure the issue doesn’t happen again, add and track them in the Incident Tasks box of the Remediation section.
Customizing your incident management workflow
Datadog Incident Management can be customized with different severity and status levels, based on your organization’s needs, and can also include additional information such as the APM services and teams related to the incident. For more information, see the Incident Management page.
You can also set up notification rules to automatically notify specific people or services based on an incident’s severity level. For more information, see the Incident Settings documentation.
To customize Incident Management, go to the incident settings page. From the Datadog menu on the left-hand side, go to Monitors > Incidents (if you get an Incident Management welcome screen, click Get Started). Then on the top, click Settings.
Create and Manage Incidents on Mobile
The Datadog Mobile App, available on the Apple App Store and Google Play Store, lets you create, view, search, and filter all incidents you have access to in your Datadog account, so you can respond to and resolve incidents quickly without opening your laptop.
You can also declare and edit incidents and quickly communicate to your teams through integrations with Slack, Zoom, and many more.