Trace Sampling and Storage

This page describes deprecated features with configuration information relevant to legacy App Analytics, useful for troubleshooting or modifying some old setups. To have full control over your traces, use ingestion controls and retention filters instead.

Trace sampling

Trace sampling applies to high-volume, web-scale applications: Datadog keeps a sampled proportion of traces, selected according to the rules described on this page.

Statistics (requests, errors, latency, and so on) are calculated at the Agent level from the full volume of traces, and are therefore always accurate.

Statistics

Datadog APM computes the following aggregate statistics over all instrumented traces, regardless of sampling:

  • Total requests and requests per second
  • Total errors and errors per second
  • Latency
  • Breakdown of time spent by service/type
  • Apdex score (web services only)
Aggregate statistics are generated on un-sampled data.
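
To illustrate why statistics are unaffected by sampling, here is a minimal Python sketch. It is not the Agent's implementation: it only shows counters being updated for every trace before any sampling decision is made. The `Trace` record, the `StatsAggregator` class, and the 10% keep rate are hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical, simplified trace record (not a Datadog type)."""
    service: str
    duration_ms: float
    is_error: bool


@dataclass
class StatsAggregator:
    """Hypothetical aggregator: counts every trace, keeps only a sample."""
    keep_rate: float
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    requests: int = 0
    errors: int = 0
    total_duration_ms: float = 0.0
    kept: list = field(default_factory=list)

    def process(self, trace: Trace) -> None:
        # Step 1: update statistics on the un-sampled stream.
        self.requests += 1
        self.errors += int(trace.is_error)
        self.total_duration_ms += trace.duration_ms
        # Step 2: only afterwards decide whether to keep the trace itself.
        if self.rng.random() < self.keep_rate:
            self.kept.append(trace)

    @property
    def mean_latency_ms(self) -> float:
        return self.total_duration_ms / self.requests if self.requests else 0.0


agg = StatsAggregator(keep_rate=0.1)
for i in range(1000):
    agg.process(Trace("my_web_service", duration_ms=20.0, is_error=(i % 50 == 0)))
```

In this sketch `agg.requests` and `agg.errors` reflect all 1000 traces, while `agg.kept` holds only the sampled subset.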

Goal of sampling

The goal of sampling is to keep the traces that matter the most:

  • Distributed traces
  • Low QPS Services
  • A representative variety of traces
Individual traces are sampled at the Client, Agent, and Server level.

Sampling rules

Over the lifecycle of a trace, sampling decisions are made at the tracing client, Agent, and backend levels, in the following order.

  1. Tracing Client: The tracing client adds a context attribute, sampling.priority, to traces, which allows a single trace to be propagated across a distributed architecture through language-agnostic request headers. The sampling.priority attribute is a hint to the Datadog Agent to do its best to prioritize the trace or to drop unimportant ones.

    Value        | Type                        | Action
    MANUAL_DROP  | User input                  | The Agent drops the trace.
    AUTO_DROP    | Automatic sampling decision | The Agent drops the trace.
    AUTO_KEEP    | Automatic sampling decision | The Agent keeps the trace.
    MANUAL_KEEP  | User input                  | The Agent keeps the trace, and the backend will only apply sampling if above maximum volume allowed. Note that when used with App Analytics filtering, all spans marked for MANUAL_KEEP are counted as billable spans.

    Traces are automatically assigned a priority of AUTO_DROP or AUTO_KEEP, in a proportion that ensures the Agent does not have to sample more than it is allowed to. Users can manually adjust this attribute to give priority to specific types of traces, or to drop uninteresting ones entirely.

  2. Trace Agent (host or container level): The Agent receives traces from various tracing clients and filters requests based on two rules:

    • Ensure that a variety of traces is kept (across services, resources, HTTP status codes, and errors).
    • Ensure that traces are kept for low-volume resources (web endpoints, DB queries).

    The Agent computes a signature for every trace reported, based on its services, resources, errors, and so on. Traces with the same signature are considered similar. For example, a signature could be:

    • env=prod, my_web_service, is_error=true, resource=/login
    • env=staging, my_database_service, is_error=false, query=SELECT...

    A proportion of traces with each signature is then kept, so you get full visibility into all the different kinds of traces happening in your system. This method ensures traces for resources with low volumes are still kept.

    Moreover, the Agent provides a service-based rate for the prioritized traces from the tracing client, to ensure that traces from low-QPS services are prioritized to be kept.

    Users can manually drop entire uninteresting resource endpoints at the Agent level by using resource filtering.

  3. Datadog Backend/Server: The server receives traces from the various Agents running on hosts and applies sampling to ensure representation from every reporting Agent. It does so by keeping traces on the basis of the signature marked by the Agent.
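
The signature-based rule in step 2 can be sketched as follows. This is a simplified illustration, not the Agent's algorithm: it uses a fixed per-signature quota as a stand-in for the proportion the Agent actually computes, which this page does not specify. The `Signature` tuple, the `sample_by_signature` function, and the quota of 10 are hypothetical.

```python
import random
from collections import defaultdict
from typing import NamedTuple


class Signature(NamedTuple):
    """Hypothetical signature; the Agent's real signature is computed differently."""
    env: str
    service: str
    resource: str
    is_error: bool


def sample_by_signature(traces, per_signature_quota, seed=0):
    """Keep at most `per_signature_quota` traces for each signature.

    Low-volume signatures are kept in full and high-volume ones are
    down-sampled, so every kind of trace stays represented.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for trace in traces:
        signature = Signature(
            trace["env"], trace["service"], trace["resource"], trace["is_error"]
        )
        groups[signature].append(trace)

    kept = []
    for group in groups.values():
        if len(group) <= per_signature_quota:
            kept.extend(group)  # rare signature: keep everything
        else:
            kept.extend(rng.sample(group, per_signature_quota))
    return kept


home = {"env": "prod", "service": "my_web_service", "resource": "/home", "is_error": False}
login_error = {"env": "prod", "service": "my_web_service", "resource": "/login", "is_error": True}

traces = [home] * 500 + [login_error] * 3
kept = sample_by_signature(traces, per_signature_quota=10)
```

Here the 500 similar `/home` traces are reduced to 10, while all 3 of the rare `/login` error traces survive.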

Manually control trace priority

APM enables distributed tracing by default, which allows traces to be propagated through tracing headers across multiple services and hosts. Tracing headers include a priority tag to ensure that traces remain complete between upstream and downstream services during propagation. You can override this tag to manually keep a trace (critical transaction, debug mode, and so on) or drop a trace (health checks, static assets, and so on).

Manually keep a trace (Java):

import datadog.trace.api.DDTags;
import datadog.trace.api.interceptor.MutableSpan;
import datadog.trace.api.Trace;
import io.opentracing.util.GlobalTracer;

public class MyClass {
    @Trace
    public static void myMethod() {
        // grab the active span out of the traced method
        MutableSpan ddspan = (MutableSpan) GlobalTracer.get().activeSpan();
        // Always keep the trace
        ddspan.setTag(DDTags.MANUAL_KEEP, true);
        // method impl follows
    }
}

Manually drop a trace (Java):

import datadog.trace.api.DDTags;
import datadog.trace.api.interceptor.MutableSpan;
import datadog.trace.api.Trace;
import io.opentracing.util.GlobalTracer;

public class MyClass {
    @Trace
    public static void myMethod() {
        // grab the active span out of the traced method
        MutableSpan ddspan = (MutableSpan) GlobalTracer.get().activeSpan();
        // Always Drop the trace
        ddspan.setTag(DDTags.MANUAL_DROP, true);
        // method impl follows
    }
}

Manually keep a trace (Python):

from ddtrace import tracer
from ddtrace.constants import MANUAL_DROP_KEY, MANUAL_KEEP_KEY

@tracer.wrap()
def handler():
    span = tracer.current_span()
    # Always keep the trace
    span.set_tag(MANUAL_KEEP_KEY)
    # method impl follows

Manually drop a trace (Python):

from ddtrace import tracer
from ddtrace.constants import MANUAL_DROP_KEY, MANUAL_KEEP_KEY

@tracer.wrap()
def handler():
    span = tracer.current_span()
    # Always drop the trace
    span.set_tag(MANUAL_DROP_KEY)
    # method impl follows

Manually keep a trace (Ruby):

Datadog::Tracing.trace(name, options) do |span|
  Datadog::Tracing.keep! # Affects the active span

  # Method implementation follows
end

Manually drop a trace (Ruby):

Datadog::Tracing.trace(name, options) do |span|
  Datadog::Tracing.reject! # Affects the active span

  # Method implementation follows
end

Manually keep a trace (Go):

package main

import (
    "net/http"

    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/ext"
    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func handler(w http.ResponseWriter, r *http.Request) {
    // Create a span for a web request at the /posts URL.
    span := tracer.StartSpan("web.request", tracer.ResourceName("/posts"))
    defer span.Finish()

    // Always keep this trace:
    span.SetTag(ext.ManualKeep, true)
    //method impl follows

}

Manually drop a trace (Go):

package main

import (
    "net/http"

    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/ext"
    "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func handler(w http.ResponseWriter, r *http.Request) {
    // Create a span for a web request at the /posts URL.
    span := tracer.StartSpan("web.request", tracer.ResourceName("/posts"))
    defer span.Finish()

    // Always drop this trace:
    span.SetTag(ext.ManualDrop, true)
    //method impl follows
}

Manually keep a trace (Node.js):

const tracer = require('dd-trace')
const tags = require('dd-trace/ext/tags')

const span = tracer.startSpan('web.request')

// Always keep the trace
span.setTag(tags.MANUAL_KEEP)
//method impl follows

Manually drop a trace (Node.js):

const tracer = require('dd-trace')
const tags = require('dd-trace/ext/tags')

const span = tracer.startSpan('web.request')

// Always drop the trace
span.setTag(tags.MANUAL_DROP)
//method impl follows

Manually keep a trace (.NET):

using Datadog.Trace;

using(var scope = Tracer.Instance.StartActive(operationName))
{
    var span = scope.Span;

    // Always keep this trace
    span.SetTag(Tags.ManualKeep, "true");
    //method impl follows
}

Manually drop a trace (.NET):

using Datadog.Trace;

using(var scope = Tracer.Instance.StartActive(operationName))
{
    var span = scope.Span;

    // Always drop this trace
    span.SetTag(Tags.ManualDrop, "true");
    //method impl follows
}

Manually keep a trace (PHP):

<?php
  $tracer = \OpenTracing\GlobalTracer::get();
  $span = $tracer->getActiveSpan();

  if (null !== $span) {
    // Always keep this trace
    $span->setTag(\DDTrace\Tag::MANUAL_KEEP, true);
    //method impl follows
  }
?>

Manually drop a trace (PHP):

<?php
  $tracer = \OpenTracing\GlobalTracer::get();
  $span = $tracer->getActiveSpan();

  if (null !== $span) {
    // Always drop this trace
    $span->setTag(\DDTrace\Tag::MANUAL_DROP, true);
    //method impl follows
  }
?>

Manually keep a trace (C++):

...
#include <datadog/tags.h>
...

auto tracer = ...
auto span = tracer->StartSpan("operation_name");
// Always keep this trace
span->SetTag(datadog::tags::manual_keep, {});
//method impl follows

Manually drop a trace (C++):

...
#include <datadog/tags.h>
...

auto tracer = ...
auto another_span = tracer->StartSpan("operation_name");
// Always drop this trace
another_span->SetTag(datadog::tags::manual_drop, {});
//method impl follows

Note that trace priority should be manually controlled only before any context propagation. If it is set after a context has been propagated, the system can’t ensure that the entire trace is kept across services. Manually controlled trace priority is set at the tracing client; the trace can still be dropped by the Agent or the server based on the sampling rules.
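
The ordering constraint can be illustrated with a small, library-free Python sketch. It does not use the Datadog tracer: the `TraceContext` class, the `inject` and `extract` helpers, the header names, and the numeric priority values are all hypothetical stand-ins. The sketch only shows that headers capture the priority at the moment of propagation, so a later change never reaches the downstream service.

```python
from dataclasses import dataclass

# Hypothetical priority values and header names, for illustration only.
AUTO_KEEP = 1
MANUAL_KEEP = 2
TRACE_ID_HEADER = "x-example-trace-id"
PRIORITY_HEADER = "x-example-sampling-priority"


@dataclass
class TraceContext:
    trace_id: int
    sampling_priority: int = AUTO_KEEP


def inject(context: TraceContext) -> dict:
    """Copy the current priority into outgoing request headers."""
    return {
        TRACE_ID_HEADER: str(context.trace_id),
        PRIORITY_HEADER: str(context.sampling_priority),
    }


def extract(headers: dict) -> TraceContext:
    """Rebuild the context on the downstream service from the headers."""
    return TraceContext(
        trace_id=int(headers[TRACE_ID_HEADER]),
        sampling_priority=int(headers[PRIORITY_HEADER]),
    )


# Correct order: set the priority first, then propagate.
early = TraceContext(trace_id=1)
early.sampling_priority = MANUAL_KEEP
downstream_early = extract(inject(early))

# Wrong order: the headers were already sent with the old priority.
late = TraceContext(trace_id=2)
headers_already_sent = inject(late)
late.sampling_priority = MANUAL_KEEP
downstream_late = extract(headers_already_sent)
```

In the second case the upstream service marks the trace as `MANUAL_KEEP`, but the downstream service still sees `AUTO_KEEP`, so the two halves of the trace can receive different sampling decisions.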

Trace storage

Individual traces are stored for 30 days: all sampled traces are retained for 30 days, and at the end of the 30th day the entire set of expired traces is deleted. In addition, once a trace has been viewed by opening its full page, it remains available through its trace ID in the URL: /apm/trace/<TRACE_ID>. This is true even if it “expires” from the UI. This behavior is independent of the UI retention time buckets.
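
As a worked example of the retention arithmetic, the following Python sketch computes when a sampled trace leaves the 30-day window and builds the direct trace path. The function names and the sample trace ID are hypothetical. The sketch models only the time window, not the exception for traces that have already been viewed.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # retention period stated on this page


def expires_at(sampled_at: datetime) -> datetime:
    """Return the moment the 30-day retention window ends."""
    return sampled_at + RETENTION


def is_expired(sampled_at: datetime, now: datetime) -> bool:
    """True once the retention window has fully elapsed."""
    return now >= expires_at(sampled_at)


def trace_path(trace_id: str) -> str:
    """Build the direct path to a trace from its ID."""
    return f"/apm/trace/{trace_id}"


sampled = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiry = expires_at(sampled)
path = trace_path("1234")  # "1234" is a placeholder trace ID
```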
