How to Set Up Incident Data for DORA Metrics

DORA Metrics is in public beta.

Overview

Failed deployment events, currently represented by incident events, are used to compute change failure rate and mean time to restore (MTTR).

Selecting and configuring an incident data source

PagerDuty is an incident management platform that equips IT teams with immediate incident visibility, enabling proactive and effective responses to maintain operational stability and resilience.

To integrate your PagerDuty account with DORA Metrics:

  1. Navigate to Integrations > Developer Tools in PagerDuty and click Generic Webhooks (v3).

  2. Click + New Webhook and enter the following details:

    • Webhook URL: Add https://webhook-intake./api/v2/webhook/.
    • Scope Type: Select Account to send incidents for all PagerDuty services in your account. Alternatively, you can send incidents for specific services or teams by selecting a different scope type.
    • Description: A description helps distinguish the webhook. Add something like Datadog DORA Metrics integration.
    • Event Subscription: Select the following events:
      - incident.acknowledged
      - incident.annotated
      - incident.custom_field_values.updated
      - incident.delegated
      - incident.escalated
      - incident.priority_updated
      - incident.reassigned
      - incident.reopened
      - incident.resolved
      - incident.triggered
      - incident.unacknowledged
    • Custom Headers: Click Add custom header, enter DD-API-KEY as the name, and input your Datadog API key as the value.

    Optionally, you can add an environment to all of the PagerDuty incidents sent from the webhook by creating an additional custom header with the name dd_env and the desired environment as the value.
  3. To save the webhook, click Add Webhook.

The severity of the incident in the DORA Metrics product is based on the incident priority in PagerDuty.

Note: Upon webhook creation, a new secret is created and used to sign all the webhook payloads. That secret is not needed for the integration to work, as the authentication is performed using the API key instead.
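The same webhook can be created through PagerDuty's REST API instead of the UI. The sketch below builds the request body, assuming the shape documented for PagerDuty's v3 webhook subscriptions endpoint; the intake URL and API key values are placeholders you must substitute:

```python
# Sketch of a PagerDuty v3 webhook subscription body equivalent to the UI
# steps above. Assumes PagerDuty's documented `webhook_subscription` shape;
# substitute your own intake URL and Datadog API key.
EVENTS = [
    "incident.acknowledged", "incident.annotated",
    "incident.custom_field_values.updated", "incident.delegated",
    "incident.escalated", "incident.priority_updated", "incident.reassigned",
    "incident.reopened", "incident.resolved", "incident.triggered",
    "incident.unacknowledged",
]

def webhook_subscription(intake_url, dd_api_key, env=None):
    headers = [{"name": "DD-API-KEY", "value": dd_api_key}]
    if env:
        # Optional dd_env custom header described above.
        headers.append({"name": "dd_env", "value": env})
    return {
        "webhook_subscription": {
            "type": "webhook_subscription",
            "description": "Datadog DORA Metrics integration",
            "delivery_method": {
                "type": "http_delivery_method",
                "url": intake_url,
                "custom_headers": headers,
            },
            "events": EVENTS,
            "filter": {"type": "account_reference"},  # Scope Type: Account
        }
    }
```

POSTing this body to PagerDuty's webhook subscriptions endpoint (with your PagerDuty credentials) has the same effect as clicking Add Webhook.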

Mapping PagerDuty services to Datadog services

When an incident event is received for a specific PagerDuty service, Datadog attempts to retrieve the related Datadog service and team from any triggering Datadog monitors and from the Service Catalog.

The matching algorithm works in the following steps:

  1. If the PagerDuty incident event was triggered from a Datadog monitor:

    • If the monitor is in Multi Alert mode, the incident metrics and events are emitted with the env, service, and team from the alerted group.
    • If the monitor has tags for env, service, or team:
      • env: If the monitor has a single env tag, the incident metrics and events are emitted with the environment.
      • service: If the monitor has one or more service tags, the incident metrics and events are emitted with the provided services.
      • team: If the monitor has a single team tag, the incident metrics and events are emitted with the team.
  2. If the service URL of the incident matches the PagerDuty service URL for any services in the Service Catalog:

    • If a single Datadog service matches, the incident metrics and events are emitted with the service and team.
    • If multiple Datadog services match, the incident metrics and events are emitted with the team.

    For more information about setting the PagerDuty service URL for a Datadog service, see Use Integrations with Service Catalog.

  3. If the PagerDuty service name of the incident matches a service name in the Service Catalog, the incident metrics and events are emitted with the service and team.

  4. If the PagerDuty team name of the incident matches a team name in the Service Catalog, the incident metrics and events are emitted with the team.

  5. If the PagerDuty service name of the incident matches a team name in the Service Catalog, the incident metrics and events are emitted with the team.

  6. If there have been no matches up to this point, the incident metrics and events are emitted with the PagerDuty service and PagerDuty team provided in the incident.
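The catalog-matching steps above (steps 2 through 6; step 1, monitor-based attribution, is omitted) can be sketched as a resolution function. The `incident` and `catalog` structures are hypothetical stand-ins for the PagerDuty event fields and Service Catalog lookups, not real Datadog APIs:

```python
# Illustrative sketch of the matching precedence described above (steps 2-6).
# `incident` carries PagerDuty fields; `catalog` is a hypothetical stand-in
# for the Service Catalog, with a list of services and a set of team names.

def resolve_attribution(incident, catalog):
    """Return the (service, team) to attach to incident metrics and events."""
    # Step 2: match the incident's service URL against catalog services.
    by_url = [s for s in catalog["services"]
              if incident.get("service_url")
              and s.get("pagerduty_url") == incident["service_url"]]
    if len(by_url) == 1:
        return by_url[0]["name"], by_url[0]["team"]
    if len(by_url) > 1:
        return None, by_url[0]["team"]  # multiple services: team only

    # Step 3: match the PagerDuty service name against catalog service names.
    by_name = [s for s in catalog["services"]
               if s["name"] == incident.get("service_name")]
    if by_name:
        return by_name[0]["name"], by_name[0]["team"]

    # Steps 4-5: match the PagerDuty team name, then the PagerDuty service
    # name, against catalog team names.
    if incident.get("team_name") in catalog["teams"]:
        return None, incident["team_name"]
    if incident.get("service_name") in catalog["teams"]:
        return None, incident["service_name"]

    # Step 6: fall back to the PagerDuty-provided service and team.
    return incident.get("service_name"), incident.get("team_name")
```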

To send your own incident events, use the DORA Metrics API. Incident events are used to compute change failure rate and mean time to restore.

Include the finished_at attribute in an incident event to mark that the incident is resolved. You can send events at the start of the incident and after incident resolution. Incident events are matched by the env, service, and started_at attributes.

The following attributes are required:

  • services or team (at least one must be present)
  • started_at

You can optionally add the following attributes to the incident events:

  • finished_at for resolved incidents. This attribute is required for calculating the time to restore service.
  • id for identifying incidents when they are created and resolved. This attribute is user-generated; when not provided, the endpoint returns a Datadog-generated UUID.
  • name to describe the incident.
  • severity
  • env to filter your DORA metrics by environment on the DORA Metrics page.
  • repository_url
  • commit_sha

See the DORA Metrics API reference documentation for the full spec and additional code samples.

API (cURL) Example

In the following example, replace <DD_SITE> with your Datadog site:

curl -X POST "https://api.<DD_SITE>/api/v2/dora/incident" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -d @- << EOF
  {
    "data": {
      "attributes": {
        "services": ["shopist"],
        "team": "shopist-devs",
        "started_at": 1693491974000000000,
        "finished_at": 1693491984000000000,
        "git": {
          "commit_sha": "66adc9350f2cc9b250b69abddab733dd55e1a588",
          "repository_url": "https://github.com/organization/example-repository"
        },
        "env": "prod",
        "name": "Web server is down failing all requests",
        "severity": "High"
      }
    }
  }
EOF
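The request body above can also be assembled programmatically. The helper below is a hypothetical sketch (not part of any Datadog client library) that enforces the required attributes listed earlier:

```python
import json

def build_incident_event(started_at_ns, services=None, team=None, **optional):
    """Build a DORA incident event body; sketch of the rules described above."""
    # At least one of services or team is required, as is started_at.
    if not services and not team:
        raise ValueError("at least one of services or team is required")
    attrs = {"started_at": started_at_ns}
    if services:
        attrs["services"] = services
    if team:
        attrs["team"] = team
    # Optional attributes: finished_at, id, name, severity, env, and git info.
    for key in ("finished_at", "id", "name", "severity", "env"):
        if key in optional:
            attrs[key] = optional[key]
    if "repository_url" in optional or "commit_sha" in optional:
        attrs["git"] = {k: optional[k]
                        for k in ("repository_url", "commit_sha")
                        if k in optional}
    return json.dumps({"data": {"attributes": attrs}})
```

Sending one event at incident start and a second with the same env, service, and started_at plus a finished_at marks the incident as resolved.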

Calculating change failure rate

Change failure rate requires both deployment data and incident data.

Change failure rate is calculated as the percentage of incident events out of the total number of deployments. Datadog divides dora.incidents.count by dora.deployments.count for the same services and/or teams associated with both a failure event and a deployment event.
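In spirit, the computation is a simple ratio; the function below is an illustrative sketch, not Datadog's implementation:

```python
def change_failure_rate(incident_count, deployment_count):
    """Percentage of deployments associated with a failure (illustrative)."""
    if deployment_count == 0:
        return 0.0
    return 100.0 * incident_count / deployment_count
```

For example, 3 incidents over 120 deployments for the same service yields a change failure rate of 2.5%.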

Calculating time to restore

Time to restore is calculated as the duration distribution for resolved incident events.

DORA Metrics generates the dora.time_to_restore metric by recording the start and end times of each incident event. It calculates the mean time to restore (MTTR) as the average of these dora.time_to_restore data points over a selected time frame.
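As a sketch of that calculation (not Datadog's implementation), assuming incidents are (start, end) nanosecond-timestamp pairs and unresolved incidents have no end time:

```python
def mean_time_to_restore(incidents):
    """Average restore duration in seconds over resolved incidents (sketch).

    Each incident is a (started_at_ns, finished_at_ns) pair; unresolved
    incidents (finished_at is None) are excluded, since finished_at is
    required for the time-to-restore calculation described above.
    """
    durations = [(end - start) / 1e9
                 for start, end in incidents if end is not None]
    return sum(durations) / len(durations) if durations else None
```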
