How to Set Up Incident Data for DORA Metrics

DORA Metrics is in public beta.

Overview

Failed deployment events, currently interpreted through incident events, are used to compute change failure rate and mean time to restore (MTTR).

Selecting and configuring an incident data source

PagerDuty is an incident management platform that equips IT teams with immediate incident visibility, enabling proactive and effective responses to maintain operational stability and resilience.

To integrate your PagerDuty account with DORA Metrics:

  1. Navigate to Integrations > Developer Tools in PagerDuty and click Generic Webhooks (v3).

  2. Click + New Webhook and enter the following details:

    Variable: Description
    Webhook URL: Add https://webhook-intake.<DD_SITE>/api/v2/webhook/, where <DD_SITE> is your Datadog site.
    Scope Type: Select Account to send incidents for all PagerDuty services in your account. Alternatively, you can send incidents for specific services or teams by selecting a different scope type.
    Description: A description helps distinguish the webhook. Add something like Datadog DORA Metrics integration.
    Event Subscription: Select the following events:
      • incident.acknowledged
      • incident.annotated
      • incident.custom_field_values.updated
      • incident.delegated
      • incident.escalated
      • incident.priority_updated
      • incident.reassigned
      • incident.reopened
      • incident.resolved
      • incident.triggered
      • incident.unacknowledged
    Custom Headers: Click Add custom header, enter DD-API-KEY as the name, and input your Datadog API key as the value.

    Optionally, you can add an environment to all of the PagerDuty incidents sent from the webhook by creating an additional custom header with the name dd_env and the desired environment as the value.
  3. To save the webhook, click Add Webhook.

The severity of the incident in the DORA Metrics product is based on the incident priority in PagerDuty.

Note: Upon webhook creation, a new secret is created and used to sign all the webhook payloads. That secret is not needed for the integration to work, as the authentication is performed using the API key instead.

Mapping PagerDuty services to Datadog services

When an incident event is received for a specific PagerDuty service, Datadog attempts to retrieve the related Datadog service and team from the Service Catalog.

The matching algorithm works in the following scenarios:

  1. If the incident service URL matches the PagerDuty service URL configured for one or more services in the Service Catalog:

    • If the incident service URL matches a single Datadog service, the incident metrics and events are emitted with the Datadog service name and team retrieved from the Service Catalog.
    • If the incident service URL matches multiple Datadog services, the incident metrics and events are emitted with the Datadog team name.

    For more information about setting the PagerDuty service URL for a Datadog service, see Use Integrations with Service Catalog.

  2. If the PagerDuty service name of the incident matches a Datadog service name in the Service Catalog, the incident metrics and events are emitted with the Datadog service name and team retrieved from the Service Catalog.

  3. If the PagerDuty team name of the incident matches a Datadog team name in the Service Catalog, the incident metrics and events are emitted with the corresponding Datadog team name.

  4. If the PagerDuty service name of the incident matches a Datadog team name in the Service Catalog, the incident metrics and events are emitted with the Datadog team name.
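
The scenarios above can be sketched in Python. This is an illustration only, not Datadog's implementation: the CatalogService and Incident classes and the resolve_tags function are hypothetical names, the scenarios are assumed to be tried in the listed order, Datadog team names are assumed to come from the team field of catalog entries, and when several services share a PagerDuty URL the team of the first match is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogService:
    """Hypothetical, simplified view of a Service Catalog entry."""
    name: str
    team: Optional[str] = None
    pagerduty_url: Optional[str] = None


@dataclass
class Incident:
    """Hypothetical, simplified view of a PagerDuty incident event."""
    service_url: str
    service_name: str
    team_name: Optional[str] = None


def resolve_tags(incident, catalog):
    """Return (service, team) for an incident, or (None, None) if nothing matches."""
    # Assumption: the set of Datadog team names is derived from catalog entries.
    teams = {s.team for s in catalog if s.team}

    # Scenario 1: the incident service URL matches a configured PagerDuty service URL.
    by_url = [s for s in catalog if s.pagerduty_url == incident.service_url]
    if len(by_url) == 1:
        return by_url[0].name, by_url[0].team
    if len(by_url) > 1:
        # Several services match, so only a team is emitted.
        # Assumption: the matched services share a team; take the first one.
        return None, by_url[0].team

    # Scenario 2: the PagerDuty service name matches a Datadog service name.
    for s in catalog:
        if s.name == incident.service_name:
            return s.name, s.team

    # Scenario 3: the PagerDuty team name matches a Datadog team name.
    if incident.team_name in teams:
        return None, incident.team_name

    # Scenario 4: the PagerDuty service name matches a Datadog team name.
    if incident.service_name in teams:
        return None, incident.service_name

    return None, None


catalog = [
    CatalogService("checkout", "payments", "https://acme.pagerduty.com/services/P1"),
    CatalogService("cart", "payments", "https://acme.pagerduty.com/services/P1"),
    CatalogService("search", "discovery", "https://acme.pagerduty.com/services/P2"),
]

single = resolve_tags(Incident("https://acme.pagerduty.com/services/P2", "Search API"), catalog)
shared = resolve_tags(Incident("https://acme.pagerduty.com/services/P1", "Checkout"), catalog)
print(single)  # ('search', 'discovery')
print(shared)  # (None, 'payments')
```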

To send your own incident events, use the DORA Metrics API. Incident events are used to compute change failure rate and mean time to restore.

Include the finished_at attribute in an incident event to mark that the incident is resolved. You can send events at the start of the incident and after incident resolution. Incident events are matched by the env, service, and started_at attributes.

The following attributes are required:

  • services or team (at least one must be present)
  • started_at

You can optionally add the following attributes to the incident events:

  • finished_at for resolved incidents. This attribute is required for calculating the time to restore service.
  • id for identifying incidents when they are created and resolved. This attribute is user-generated; when not provided, the endpoint returns a Datadog-generated UUID.
  • name to describe the incident.
  • severity
  • env to filter your DORA metrics by environment on the DORA Metrics page.
  • repository_url
  • commit_sha

See the DORA Metrics API reference documentation for the full spec and additional code samples.
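
As a rough sketch of the attribute rules above, the following Python helper assembles an incident event body and rejects input that lacks the required attributes. The build_incident_event function is a hypothetical helper, not part of any Datadog library. It assumes timestamps are Unix times in nanoseconds (as their magnitude in the cURL example suggests) and nests the commit and repository fields under git, as that example does.

```python
NANOS_PER_SECOND = 1_000_000_000

# Optional attributes passed through unchanged (names taken from the list above;
# "git" holds commit_sha and repository_url, mirroring the cURL example).
OPTIONAL_ATTRIBUTES = {"id", "name", "severity", "env", "git"}


def build_incident_event(started_at_ns, services=None, team=None,
                         finished_at_ns=None, **optional):
    """Build the JSON body for one incident event (hypothetical helper)."""
    if not services and not team:
        raise ValueError("at least one of services or team must be present")
    unknown = set(optional) - OPTIONAL_ATTRIBUTES
    if unknown:
        raise ValueError(f"unexpected attributes: {sorted(unknown)}")

    attributes = {"started_at": started_at_ns}
    if services:
        attributes["services"] = list(services)
    if team:
        attributes["team"] = team
    if finished_at_ns is not None:
        if finished_at_ns < started_at_ns:
            raise ValueError("finished_at must not precede started_at")
        # Presence of finished_at marks the incident as resolved.
        attributes["finished_at"] = finished_at_ns
    attributes.update(optional)
    return {"data": {"attributes": attributes}}


# One event when the incident starts, one after it is resolved. Both carry the
# same env, service, and started_at so that they can be matched.
started = 1693491974 * NANOS_PER_SECOND
opened = build_incident_event(started, services=["shopist"], env="prod")
resolved = build_incident_event(
    started,
    services=["shopist"],
    env="prod",
    finished_at_ns=started + 10 * NANOS_PER_SECOND,
)
```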

API (cURL) Example

For the following configuration, replace <DD_SITE> with your Datadog site:

curl -X POST "https://api.<DD_SITE>/api/v2/dora/incident" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -d @- << EOF
  {
    "data": {
      "attributes": {
        "services": ["shopist"],
        "team": "shopist-devs",
        "started_at": 1693491974000000000,
        "finished_at": 1693491984000000000,
        "git": {
          "commit_sha": "66adc9350f2cc9b250b69abddab733dd55e1a588",
          "repository_url": "https://github.com/organization/example-repository"
        },
        "env": "prod",
        "name": "Web server is down failing all requests",
        "severity": "High"
      }
    }
  }
EOF
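
The same request can be sketched with the Python standard library. This mirrors the cURL call above and is not an official client. The build_incident_request function is a hypothetical helper, and datadoghq.com is used only as an example value for the site. The request is built but not sent, so the snippet runs without network access.

```python
import json
import os
import urllib.request


def build_incident_request(payload, site, api_key):
    """Build (but do not send) a POST request to the DORA incident endpoint."""
    return urllib.request.Request(
        url=f"https://api.{site}/api/v2/dora/incident",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
        },
        method="POST",
    )


payload = {
    "data": {
        "attributes": {
            "services": ["shopist"],
            "team": "shopist-devs",
            "started_at": 1693491974000000000,
            "finished_at": 1693491984000000000,
            "env": "prod",
            "name": "Web server is down failing all requests",
            "severity": "High",
        }
    }
}

request = build_incident_request(
    payload,
    site="datadoghq.com",  # example value; use your own Datadog site
    api_key=os.environ.get("DD_API_KEY", ""),
)

# To actually send the event, uncomment the following lines:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```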

Calculating change failure rate

Change failure rate requires both deployment data and incident data.

Change failure rate is calculated as the percentage of incident events out of the total number of deployments. Datadog divides dora.incidents.count by dora.deployments.count for the same services and/or teams associated with both a failure event and a deployment event.
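
As a worked example of this ratio, the following Python sketch computes the percentage from two counts. The change_failure_rate function is a hypothetical name, and returning None when there are no deployments is an assumption made here to avoid a division by zero.

```python
def change_failure_rate(incident_count, deployment_count):
    """Return incidents as a percentage of deployments, or None if undefined."""
    if deployment_count == 0:
        # Assumption: the rate is treated as undefined without deployments.
        return None
    return 100.0 * incident_count / deployment_count


# 3 incidents over 40 deployments for the same service and team.
rate = change_failure_rate(3, 40)
print(rate)  # 7.5
```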

Calculating time to restore

Time to restore is calculated as the duration distribution for resolved incident events.

DORA Metrics generates the dora.time_to_restore metric by recording the start and end times of each incident event. It calculates the mean time to restore (MTTR) as the average of these dora.time_to_restore data points over a selected time frame.
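
The averaging step can be illustrated with a short Python sketch. It assumes started_at and finished_at are Unix timestamps in nanoseconds, as in the cURL example, and reports durations in seconds purely for readability. The function names are hypothetical, and incidents without a finished_at are skipped because they are not yet resolved.

```python
from statistics import mean

NANOS_PER_SECOND = 1_000_000_000


def time_to_restore_seconds(started_at_ns, finished_at_ns):
    """Duration of one resolved incident, in seconds."""
    return (finished_at_ns - started_at_ns) / NANOS_PER_SECOND


def mean_time_to_restore(incidents):
    """Average duration over resolved incidents, or None if there are none."""
    durations = [
        time_to_restore_seconds(start, end)
        for start, end in incidents
        if end is not None  # unresolved incidents have no finished_at
    ]
    return mean(durations) if durations else None


incidents = [
    (1693491974000000000, 1693491984000000000),  # resolved after 10 s
    (1693492000000000000, 1693492030000000000),  # resolved after 30 s
    (1693493000000000000, None),                 # still open, ignored
]
mttr = mean_time_to_restore(incidents)
print(mttr)  # 20.0
```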
