Any event that may lead to a disruption in your organization’s services can be described as an incident, and it is often necessary to have a set framework for handling these events. Datadog’s Incident Management feature provides a system through which your organization can effectively identify and mitigate incidents.
Incidents live in Datadog alongside the metrics, traces, and logs you are collecting. You can view and filter incidents that are relevant to you.
Get Started
Incident Management requires no installation. Get started by taking a Learning Center course, reading our guided walkthrough, or declaring an incident.
View your incidents
To view your incidents, go to the Incidents page to see a feed of all ongoing incidents.
- Filter your incidents through the properties listed on the left, including Status, Severity, and Time To Repair (hours).
- Use the Search field to enter tag attributes or keywords.
- Export your search results with the Export button at the top of the incident list.
- Configure additional fields that appear for all incidents in Incident Settings.
You can also view your Incidents list from your mobile device home screen, and create or manage incidents, with the Datadog Mobile App, available on the Apple App Store and Google Play Store.
Describing the incident
When declaring an incident, it is critical to provide a comprehensive description, detailing what happened, why it occurred, and related attributes to ensure all stakeholders in the incident management process are fully informed. The essential elements of an incident declaration include a title, severity level, incident commanders, and notifications. Effective incident management documentation includes:
- Updating incident details, including its status, impact, root cause, detection methods, and service impacts.
- Forming and managing a response team, using custom responder roles, and leveraging metadata attributes for detailed incident assessment.
- Configuring notifications to keep all stakeholders informed throughout the incident resolution process.
For more information, see the Describe an Incident documentation.
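As an illustration of the declaration elements described above (title, severity, incident commander), the following is a minimal sketch that assembles a declaration as a JSON-serializable structure. The helper name `build_incident_payload` is hypothetical, and the field layout is an assumption loosely modeled on the JSON:API style used by Datadog's v2 Incidents API. Check the exact attribute names against the API reference before sending anything.

```python
import json
from typing import Optional


def build_incident_payload(
    title: str,
    severity: str,
    customer_impacted: bool = False,
    commander_user_id: Optional[str] = None,
) -> dict:
    """Assemble an incident declaration body (hypothetical helper)."""
    if not title.strip():
        raise ValueError("an incident needs a non-empty title")

    payload = {
        "data": {
            "type": "incidents",
            "attributes": {
                "title": title,
                "customer_impacted": customer_impacted,
                # Assumed layout: severity expressed as a dropdown-style field.
                "fields": {"severity": {"type": "dropdown", "value": severity}},
            },
        }
    }
    # Attach an incident commander only when one is known at declaration time.
    if commander_user_id is not None:
        payload["data"]["relationships"] = {
            "commander_user": {"data": {"type": "users", "id": commander_user_id}}
        }
    return payload


if __name__ == "__main__":
    body = build_incident_payload("Checkout latency spike", "SEV-2", True)
    print(json.dumps(body, indent=2))
```

The sketch only builds the request body. Sending it, authenticating, and configuring notifications are left out on purpose.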
Evaluate incident data
Incident Analytics provides insights into the efficiency and performance of your incident response process by allowing you to aggregate and analyze statistics from past incidents. Key metrics, such as time to resolution and customer impact, can be tracked over time. You can query these analytics using graph widgets in dashboards and notebooks. Datadog offers customizable templates, such as the Incident Management Overview Dashboard and a Notebook Incident Report, to help you get started.
For more details on the measures collected and step-by-step graph configurations to visualize your data, see Incident Management Analytics.
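To make a metric such as time to resolution concrete, here is a minimal sketch that computes the mean hours to resolution over a handful of past incidents. The records, their field names (`severity`, `declared`, `resolved`), and the function `mean_hours_to_resolution` are all hypothetical and only illustrate the arithmetic. They do not reflect how Incident Analytics stores or computes its measures.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; an unresolved incident has resolved=None.
incidents = [
    {"severity": "SEV-1", "declared": datetime(2024, 1, 5, 10, 0),
     "resolved": datetime(2024, 1, 5, 12, 0)},
    {"severity": "SEV-1", "declared": datetime(2024, 1, 9, 9, 0),
     "resolved": datetime(2024, 1, 9, 13, 0)},
    {"severity": "SEV-2", "declared": datetime(2024, 1, 12, 14, 0),
     "resolved": datetime(2024, 1, 12, 15, 0)},
    {"severity": "SEV-2", "declared": datetime(2024, 1, 20, 8, 0),
     "resolved": None},
]


def mean_hours_to_resolution(records, severity=None):
    """Average declared-to-resolved time in hours, skipping open incidents."""
    durations = [
        (r["resolved"] - r["declared"]).total_seconds() / 3600
        for r in records
        if r.get("resolved") is not None
        and (severity is None or r["severity"] == severity)
    ]
    # Return None rather than raising when nothing matches the filter.
    return mean(durations) if durations else None


if __name__ == "__main__":
    print(mean_hours_to_resolution(incidents, severity="SEV-1"))
    print(mean_hours_to_resolution(incidents))
```

Grouping by severity, as the optional `severity` argument does here, mirrors the kind of breakdown you might chart in a dashboard widget.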
Integrations
In addition to Slack, Incident Management integrates with:
- PagerDuty and OpsGenie to send incident notifications to your on-call engineers.
- CoScreen to launch collaborative meetings with multi-user screen sharing, remote control, and built-in audio and video chat.
- Jira to create a Jira ticket for an incident.
- Webhooks to send incident notifications to external services (for example, sending SMS through Twilio).
- Statuspage to create and update Statuspage incidents.
- ServiceNow to create a ServiceNow ticket for an incident.
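For the webhook case, a receiving service usually has to turn the incoming notification into something short enough for its channel. The sketch below formats a notification as an SMS-sized message. The payload keys (`title`, `severity`, `status`) and the function `format_sms` are hypothetical: the real payload depends on how you configure the webhook, and the SMS provider call itself is omitted.

```python
def format_sms(payload: dict, limit: int = 160) -> str:
    """Render a hypothetical incident webhook payload as a short text message."""
    title = payload.get("title", "Untitled incident")
    severity = payload.get("severity", "UNKNOWN")
    status = payload.get("status", "active")

    message = f"[{severity}] {title} ({status})"
    # Truncate to the length limit, marking the cut with an ellipsis.
    if len(message) > limit:
        message = message[: limit - 3] + "..."
    return message


if __name__ == "__main__":
    example = {
        "title": "Checkout latency spike",
        "severity": "SEV-2",
        "status": "active",
    }
    print(format_sms(example))
```

Keeping the formatting in a pure function like this makes it straightforward to test separately from the HTTP handler and the provider client.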