Understand Datadog retention policy to efficiently retain trace data

Ingesting and retaining the traces you care about

Most traces generated by your applications are repetitive, and it is rarely useful to ingest and retain all of them. For successful requests, retaining a representative sample of your applications’ traffic is enough, since you can’t possibly scan through dozens of individual traced requests every second.

What’s most important are the traces that contain symptoms of potential issues in your infrastructure, that is, traces with errors or unusual latency. In addition, for specific endpoints that are critical to your business, you might want to retain 100% of the traffic to ensure that you are able to investigate and troubleshoot any customer problem in detail.

In practice, this means retaining a combination of high-latency traces, error traces, and business-critical traces.

How Datadog’s retention policy helps you retain what matters

Datadog provides two main ways of retaining trace data past 15 minutes: the Intelligent retention filter, which captures relevant error and latency traces, and custom retention filters, which capture business-critical traces.

Diversity sampling algorithm: Intelligent retention filter

By default, the Intelligent retention filter keeps a representative selection of traces without requiring you to create dozens of custom retention filters.

It keeps at least one span (and the associated distributed trace) for each combination of environment, service, operation, and resource, at most every 15 minutes, for the p75, p90, and p95 latency percentiles, as well as a representative selection of errors for each distinct response status code.
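
To make the diversity guarantee concrete, here is a minimal Python sketch of the core idea: keeping one span per environment, service, operation, and resource combination per 15-minute window. This is an illustration only, not Datadog’s implementation; it omits the percentile and error selection, and the span field names are hypothetical:

    WINDOW_SECONDS = 15 * 60  # one retention decision window

    class DiversitySampler:
        """Keep at least one span per (env, service, operation, resource)
        combination every 15 minutes, the simplest form of diversity sampling."""

        def __init__(self):
            # combination -> timestamp of the last span retained for it
            self.last_kept = {}

        def should_retain(self, span):
            key = (span["env"], span["service"], span["operation"], span["resource"])
            last = self.last_kept.get(key)
            if last is None or span["timestamp"] - last >= WINDOW_SECONDS:
                self.last_kept[key] = span["timestamp"]
                return True  # retain this span and its distributed trace
            return False

    sampler = DiversitySampler()
    span = {"env": "prod", "service": "checkout", "operation": "http.request",
            "resource": "POST /cart", "timestamp": 1_700_000_000}
    assert sampler.should_retain(span)      # first combination seen: retained
    assert not sampler.should_retain(span)  # same window: not retained again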

To learn more, read the Intelligent retention filter documentation.

Tag-based retention filters

Tag-based retention filters provide the flexibility to keep traces that are the most critical to your business. When indexing spans with retention filters, the associated trace is also stored, which ensures that you keep visibility over the entire request and its distributed context.
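
Retention filters can be created in the Datadog UI or programmatically. The following is a minimal sketch against the v2 APM Retention Filters API endpoint; the service name, merchant-tier tag, and attribute values are illustrative assumptions, to be checked against the current API reference:

    import os
    import requests

    # Sketch: keep 100% of spans for enterprise checkouts (hypothetical tags).
    response = requests.post(
        "https://api.datadoghq.com/api/v2/apm/retention-filters",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        json={
            "data": {
                "type": "apm_retention_filter",
                "attributes": {
                    "name": "Enterprise checkouts",
                    "enabled": True,
                    "filter": {"query": "service:checkout @merchant.tier:enterprise"},
                    "filter_type": "spans-sampling-processor",
                    "rate": 1.0,  # retain all matching spans
                },
            }
        },
    )
    response.raise_for_status()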

Searching and analyzing indexed span data effectively

The set of data captured by diversity sampling is not uniformly sampled (that is, it is not proportionally representative of the full traffic): it is biased toward errors and high-latency traces. If you want to build analytics on top of a uniformly sampled dataset only, exclude diversity-sampled spans by adding the -retained_by:diversity_sampling filter to your query in the Trace Explorer, as shown below.
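
For instance, a Trace Explorer query along these lines (the checkout-specific names are illustrative) restricts the analysis to spans that were not retained by diversity sampling:

    service:checkout resource_name:"POST /checkout" -retained_by:diversity_sampling

Grouping the matching spans by a tag such as @merchant.tier is then done in the analytics view; dropping the -retained_by:diversity_sampling term includes the diversity-sampled dataset again.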

For example, to measure the number of checkout operations on your application grouped by merchant tier, excluding the diversity sampling dataset ensures that the analysis runs on a representative set of data, so that the proportions of basic, enterprise, and premium checkouts are realistic:

Number of checkout operations by tier, analytics that exclude diversity-sampled data

On the other hand, if you want to measure the number of unique merchants by merchant tier, include the diversity sampling dataset which might capture additional merchant IDs not caught by custom retention filters:

Number of unique merchants by tier, analytics that include diversity-sampled data

