Trace Metrics

Overview

Tracing application metrics are collected after you enable trace collection and instrument your application.

Trace Metrics

These metrics capture request counts, error counts, and latency measures. They are calculated based on 100% of the application’s traffic, regardless of any trace ingestion sampling configuration. Because they cover all traffic, you can use them to spot potential errors on a service or a resource, and to create dashboards, monitors, and SLOs.

Note: If your applications and services are instrumented with OpenTelemetry libraries and you set up sampling at the SDK level and/or at the collector level, APM metrics are calculated based on the sampled set of data.

Trace metrics are generated for service entry spans and certain operations depending on integration language. For example, the Django integration produces trace metrics from spans that represent various operations (1 root span for the Django request, 1 for each middleware, and 1 for the view).

The trace metrics namespace is formatted as:

  • trace.<SPAN_NAME>.<METRIC_SUFFIX>

With the following definitions:

<SPAN_NAME>
The name of the operation or span.name (examples: redis.command, pylons.request, rails.request, mysql.query).
<METRIC_SUFFIX>
The name of the metric (examples: hits, errors, apdex, duration). See the Metric suffix section below.
<TAGS>
Trace metric tags. Possible tags are: env, service, version, resource, http.status_code, http.status_class, and Datadog Agent tags (including the host and second primary tag). Note: Other tags set on spans are not available as tags on trace metrics.
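As a small illustration of the naming scheme above, the following Python sketch assembles metric names from a span name and a suffix. The helper trace_metric_name and the KNOWN_SUFFIXES set are hypothetical and not part of any Datadog library; the set only mirrors the suffixes documented on this page.

```python
# Hypothetical helper for illustration only; not part of any Datadog library.
KNOWN_SUFFIXES = {
    "hits",
    "hits.by_http_status",
    "errors",
    "errors.by_http_status",
    "apdex",
    "duration",
    "duration.by_http_status",
}


def trace_metric_name(span_name: str, suffix: str = "") -> str:
    """Build trace.<SPAN_NAME>.<METRIC_SUFFIX>.

    An empty suffix yields trace.<SPAN_NAME>, the latency distribution metric.
    """
    if not span_name:
        raise ValueError("span_name must not be empty")
    if suffix and suffix not in KNOWN_SUFFIXES:
        raise ValueError(f"unknown metric suffix: {suffix!r}")
    return f"trace.{span_name}.{suffix}" if suffix else f"trace.{span_name}"


print(trace_metric_name("rails.request", "hits"))
print(trace_metric_name("redis.command"))
```

Rejecting unknown suffixes is a design choice of this sketch, so that a typo fails early instead of silently producing a metric name that matches nothing.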

Metric suffix

Hits

trace.<SPAN_NAME>.hits
Prerequisite: This metric exists for any APM service.
Description: Represents the count of spans created with a specific name (for example, redis.command, pylons.request, rails.request, or mysql.query).
Metric type: COUNT.
Tags: env, service, version, resource, resource_name, http.status_code, all host tags from the Datadog Host Agent, and the second primary tag.
trace.<SPAN_NAME>.hits.by_http_status
Prerequisite: This metric exists for HTTP/WEB APM services if http metadata exists.
Description: Represents the count of hits for a given span, broken down by HTTP status code.
Metric type: COUNT.
Tags: env, service, version, resource, resource_name, http.status_class, http.status_code, all host tags from the Datadog Host Agent, and the second primary tag.
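To make the breakdown by HTTP status concrete, here is a minimal Python sketch that groups a list of observed status codes the way the http.status_class and http.status_code tags would split a hit count. The function names are hypothetical, and the "2xx"-style class label is an assumption about the tag value format rather than something stated on this page.

```python
from collections import Counter


def status_class(status_code: int) -> str:
    # Assumption: status classes are labelled like "2xx", "4xx", "5xx".
    return f"{status_code // 100}xx"


def hits_by_http_status(status_codes):
    """Count hits per (status class, status code) pair."""
    return Counter((status_class(code), code) for code in status_codes)


observed = [200, 200, 404, 500, 200, 503]
counts = hits_by_http_status(observed)

for (cls, code), n in sorted(counts.items()):
    print(cls, code, n)

# Summing every group gives back the overall hit count.
print(sum(counts.values()) == len(observed))
```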

Latency distribution

trace.<SPAN_NAME>
Prerequisite: This metric exists for any APM service.
Description: Represents the latency distribution for all services, resources, and versions across different environments and second primary tags.
Metric type: DISTRIBUTION.
Tags: env, service, version, resource, resource_name, http.status_code, synthetics, and the second primary tag.

Errors

trace.<SPAN_NAME>.errors
Prerequisite: This metric exists for any APM service.
Description: Represents the count of errors for a given span.
Metric type: COUNT.
Tags: env, service, version, resource, resource_name, http.status_code, all host tags from the Datadog Host Agent, and the second primary tag.
trace.<SPAN_NAME>.errors.by_http_status
Prerequisite: This metric exists for any APM service.
Description: Represents the count of errors for a given span, broken down by HTTP status code.
Metric type: COUNT.
Tags: env, service, version, resource, http.status_class, http.status_code, all host tags from the Datadog Host Agent, and the second primary tag.

Apdex

trace.<SPAN_NAME>.apdex
Prerequisite: This metric exists for any HTTP or web-based APM service.
Description: Measures the Apdex score for each web service.
Metric type: GAUGE.
Tags: env, service, version, resource / resource_name, synthetics, and the second primary tag.
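For readers unfamiliar with Apdex, the sketch below computes a score using the general Apdex definition: requests at or under a threshold T count as satisfied, requests between T and 4T count as tolerating (weighted one half), and slower requests count as frustrated. This page does not describe how the trace.<SPAN_NAME>.apdex value is computed internally, so treat the function, the 0.5 second threshold, and the sample latencies as assumptions for illustration.

```python
def apdex(latencies_s, threshold_s=0.5):
    """General Apdex score: (satisfied + tolerating / 2) / total.

    threshold_s is the assumed satisfaction threshold T, in seconds.
    """
    if not latencies_s:
        raise ValueError("at least one latency sample is required")
    satisfied = sum(1 for t in latencies_s if t <= threshold_s)
    tolerating = sum(1 for t in latencies_s if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(latencies_s)


# Two satisfied (0.1, 0.3), two tolerating (0.7, 1.5), one frustrated (3.0).
score = apdex([0.1, 0.3, 0.7, 1.5, 3.0], threshold_s=0.5)
print(round(score, 2))  # 0.6
```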

Duration

trace.<SPAN_NAME>.duration
Prerequisite: This metric exists for any APM service.
Description: Measures the total time for a collection of spans within a time interval, including child spans seen in the collecting service. For most use cases, Datadog recommends using the Latency Distribution to calculate average latency or percentiles. To calculate the average latency with host tag filters, you can use this metric with the following formula:
sum:trace.<SPAN_NAME>.duration{<FILTER>}.rollup(sum).fill(zero) / sum:trace.<SPAN_NAME>.hits{<FILTER>}.rollup(sum).fill(zero)
This metric does not support percentile aggregations. Read the Latency Distribution section for more information.
Metric type: GAUGE.
Tags: env, service, resource, http.status_code, all host tags from the Datadog Host Agent, and the second primary tag.
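The arithmetic behind the average latency formula above can be sketched in Python for a single window: total duration divided by total hits, with missing intervals treated as zero in the spirit of fill(zero). The function name and the sample values are hypothetical, and the sketch does not reproduce the query engine's rollup behavior.

```python
def average_latency(duration_by_interval, hits_by_interval):
    """Return sum(duration) / sum(hits), counting missing (None) points as zero."""
    total_duration = sum(v or 0.0 for v in duration_by_interval)
    total_hits = sum(v or 0 for v in hits_by_interval)
    if total_hits == 0:
        return None  # No traffic in the window, so the average is undefined.
    return total_duration / total_hits


# Hypothetical per-interval sums; the second interval reported no data.
durations_s = [1.2, None, 0.8, 2.0]
hits = [10, None, 5, 25]

avg = average_latency(durations_s, hits)
print(round(avg, 3))  # 0.1 seconds per request
```

Dividing the two sums, rather than averaging per-interval averages, weights every request equally regardless of which interval it fell in.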

Duration by

trace.<SPAN_NAME>.duration.by_http_status
Prerequisite: This metric exists for HTTP/WEB APM services if http metadata exists.
Description: Measures the total time for a collection of spans for each HTTP status. Specifically, it is the relative share of time spent by all spans over an interval for a given HTTP status, including time spent waiting on child processes.
Metric type: GAUGE.
Tags: env, service, resource, http.status_class, http.status_code, all host tags from the Datadog Host Agent, and the second primary tag.

Sampling impact on trace metrics

In most cases, trace metrics are calculated based on all application traffic. However, with certain trace ingestion sampling configurations, the metrics represent only a subset of all requests.

Application-side sampling

Some tracing libraries support application-side sampling, which reduces the number of spans before they are sent to the Datadog Agent. For example, the Ruby tracing library offers application-side sampling to lower performance overhead. However, this can affect trace metrics, as the Datadog Agent needs all spans to calculate accurate metrics.

Very few tracing libraries support this setting, and using it is generally not recommended.
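The effect of dropping spans before they reach the Agent can be shown with a toy Python model. Everything here is hypothetical: the span dictionaries, the keep-one-in-four rule, and the counting functions stand in for whatever a real tracing library and the Datadog Agent do.

```python
def spans_reaching_agent(spans, keep_every):
    """Toy application-side sampler: forward only every Nth span."""
    return spans[::keep_every]


def count_hits(spans):
    return len(spans)


def count_errors(spans):
    return sum(1 for span in spans if span["error"])


# 1000 requests, of which every 10th one fails.
all_spans = [{"id": i, "error": i % 10 == 0} for i in range(1000)]
sampled = spans_reaching_agent(all_spans, keep_every=4)

print(count_hits(all_spans), count_errors(all_spans))  # 1000 100
print(count_hits(sampled), count_errors(sampled))      # 250 50
```

Counts computed from the sampled spans are lower than the true traffic, which is why metrics derived after application-side sampling no longer describe all requests.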

OpenTelemetry sampling

The OpenTelemetry SDK’s native sampling mechanisms lower the number of spans sent to the Datadog collector, resulting in sampled and potentially inaccurate trace metrics.

AWS X-Ray sampling

AWS X-Ray spans are sampled before they are sent to Datadog, which means trace metrics might not reflect all traffic.
