APM Troubleshooting

If you experience unexpected behavior while using Datadog APM, read the information on this page to help resolve the issue. Datadog recommends regularly updating to the latest version of the Datadog tracing libraries you use, as each release contains improvements and fixes. If you continue to experience issues, reach out to Datadog support.

The following components are involved in sending APM data to Datadog:

Diagram: APM Troubleshooting Pipeline

For more information, see Additional support.

Trace retention

This section addresses issues related to trace data retention and filtering across Datadog.

If spans you can find in the Trace Explorer are not available in your monitors and you haven’t set up custom retention filters, this is expected behavior. Here’s why:

The Trace Explorer page allows you to search all ingested or indexed spans using any tag. Here, you can query any of your traces.

By default, after spans have been ingested, they are retained by the Datadog intelligent filter. Datadog also has other retention filters that are enabled by default to give you visibility over your services, endpoints, errors, and high-latency traces.

However, to use these traces in your monitors, you must set custom retention filters.

Custom retention filters allow you to decide which spans are indexed and retained by creating, modifying, and disabling additional filters based on tags. You can also set a percentage of spans matching each filter to be retained. These indexed traces can then be used in your monitors.

Span sources by product:

  • Monitors: spans from custom retention filters
  • Other products (Dashboards, Notebooks, etc.): spans from custom retention filters + the Datadog intelligent filter

Trace metrics

This section covers troubleshooting discrepancies and inconsistencies with trace metrics.

Trace metrics and custom span-based metrics can have different values because they are calculated based on different datasets:

  • Trace metrics are calculated based on 100% of the application’s traffic, regardless of your trace ingestion sampling configuration. The trace metrics namespace follows this format: trace.<SPAN_NAME>.<METRIC_SUFFIX>.
  • Custom span-based metrics are generated based on your ingested spans, which depend on your trace ingestion sampling. For example, if you are ingesting 50% of your traces, your custom span-based metrics are based on the 50% ingested spans.

To ensure that your trace metrics and custom span-based metrics have the same value, configure a 100% ingestion rate for your application or service.
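For example, most Datadog tracing libraries let you set the ingestion sampling rate to 100% through an environment variable such as DD_TRACE_SAMPLE_RATE=1.0 (or an equivalent sampling-rules setting); check your library's configuration reference for the exact option and whether Agent-level sampling also applies.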

Metric names must follow the metric naming convention. Metric names that start with trace.* are not permitted and are not saved.

Services

This section covers strategies to troubleshoot service-related issues.

A single service appearing as multiple services in Datadog usually means the service name is not consistent across all spans.

For example, you might have a single service such as service:test showing up as multiple services in Datadog:

  • service:test
  • service:test-mongodb
  • service:test-postgresdb

To address this, you can use Inferred Service dependencies (beta). Inferred external APIs use the default naming scheme net.peer.name. For example: api.stripe.com, api.twilio.com, and us6.api.mailchimp.com. Inferred databases use the default naming scheme db.instance.

Or, you can merge the service names using an environment variable such as DD_SERVICE_MAPPING or DD_TRACE_SERVICE_MAPPING, depending on the language.

For more information, see Configure the Datadog Tracing Library or choose your language here:

Java
  • System property: dd.service.mapping
  • Environment variable: DD_SERVICE_MAPPING
  • Default: null
  • Example: mysql:my-mysql-service-name-db, postgresql:my-postgres-service-name-db
  • Dynamically rename services via configuration. Useful for making databases have distinct names across different services.

Python
  • Environment variable: DD_SERVICE_MAPPING
  • Define service name mappings to allow renaming services in traces, for example: postgres:postgresql,defaultdb:postgresql. Available in version 0.47+.

Go
  • Environment variable: DD_SERVICE_MAPPING
  • Default: null
  • Dynamically rename services through configuration. Services can be separated by commas or spaces, for example: mysql:mysql-service-name,postgres:postgres-service-name or mysql:mysql-service-name postgres:postgres-service-name.

Node.js
  • Environment variable: DD_SERVICE_MAPPING
  • Configuration option: serviceMapping
  • Default: N/A
  • Example: mysql:my-mysql-service-name-db,pg:my-pg-service-name-db
  • Provide service names for each plugin. Accepts comma-separated plugin:service-name pairs, with or without spaces.

.NET
  • Environment variable: DD_TRACE_SERVICE_MAPPING
  • Example: mysql:main-mysql-db, mongodb:offsite-mongodb-service
  • Rename services using configuration. Accepts a comma-separated list of key-value pairs of service name keys to rename and the name to use instead, in the format [from-key]:[to-name]. The from-key value is specific to the integration type and should exclude the application name prefix. For example, to rename my-application-sql-server to main-db, use sql-server:main-db. Added in version 1.23.0.

PHP
  • Environment variable: DD_SERVICE_MAPPING
  • INI: datadog.service_mapping
  • Default: null
  • Change the default name of an APM integration. Rename one or more integrations at a time, for example: DD_SERVICE_MAPPING=pdo:payments-db,mysqli:orders-db (see Integration names).

Ruby does not support DD_SERVICE_MAPPING or DD_TRACE_SERVICE_MAPPING. See Additional Ruby configuration for code options to change the service name.
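In the languages that do support these variables, the value is a comma-separated list of from-key:new-service-name pairs. For example, to fold the database spans from the earlier service:test example back into a single service, you could set something like DD_SERVICE_MAPPING=mongodb:test,postgresql:test; this is illustrative, and the exact from-keys depend on the integration names your tracer uses, as described above.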

Spikes in data ingestion and indexing can be caused by various factors. To investigate the cause of an increase, use the APM Traces Estimated Usage metrics:

  • APM Indexed Spans (datadog.estimated_usage.apm.indexed_spans): total number of spans indexed by tag-based retention filters.
  • APM Ingested Spans (datadog.estimated_usage.apm.ingested_spans): total number of ingested spans.

The APM Traces Usage dashboard contains several widget groups displaying high-level KPIs and additional usage information.
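For example, to see which services drive ingestion, you can graph a query such as sum:datadog.estimated_usage.apm.ingested_spans{*} by {service}.as_count() in a dashboard or notebook, assuming the metric carries a service tag in your organization.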

In some traces with an error status, the Errors tab shows Missing error message and stack trace rather than exception details.

A span can show this message for two possible reasons:

  • The span contains an unhandled exception.
  • An HTTP response within the span returned an HTTP status code between 400 and 599.

When an exception is handled in a try/catch block, error.message, error.type, and error.stack span tags are not populated. To populate the detailed error span tags, use Custom Instrumentation code.
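For instance, in Python you can copy the exception details onto the active span from inside the except block. The following is a minimal sketch; process_payment and PaymentError are hypothetical application code:

```python
import sys

from ddtrace import tracer


def charge(order):
    with tracer.trace("orders.charge") as span:
        try:
            process_payment(order)  # hypothetical business logic
        except PaymentError:        # hypothetical exception type
            # The exception is handled here, so the tracer does not record it
            # automatically. Attach the exception info to the span so the
            # error.message, error.type, and error.stack tags are populated.
            span.set_exc_info(*sys.exc_info())
            # ... recover, retry, or re-raise as appropriate
```

Other tracing libraries expose equivalent APIs for recording a handled exception on a span; see the Custom Instrumentation documentation for your language.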

Data volume guidelines

If you encounter any of the following issues, you may be exceeding Datadog’s volume guidelines:

  • Your trace metrics are not reporting as you would expect in the Datadog platform.
  • You are missing some of your resources that you expected to see in the Datadog platform.
  • You are seeing traces from your service but are not able to find this service on the Service Catalog page.

Your instrumented application can submit spans with timestamps up to 18 hours in the past and two hours in the future from the current time.

Datadog accepts the following combinations for a given 40-minute interval:

  • 5000 unique environments and service combinations
  • 30 unique second primary tag values per environment
  • 100 unique operation names per environment and service
  • 1000 unique resources per environment, service, and operation name
  • 30 unique versions per environment and service

If you need to accommodate larger volumes, contact Datadog support with your use case.

Datadog truncates the following strings if they exceed the indicated number of characters:

  • service: 100
  • operation: 100
  • type: 100
  • resource: 5000
  • tag key: 200
  • tag value: 25000

Additionally, the number of span tags present on any span cannot exceed 1024.

If the number of services exceeds what is specified in the data volume guidelines, try following these best practices for service naming conventions.

Exclude environment tag values from service names

By default, the environment (env) is the primary tag for Datadog APM.

Environment is the default primary tag

A service is typically deployed in multiple environments, such as prod, staging, and dev. Performance metrics like request counts, latency, and error rate differ across various environments. The environment dropdown in the Service Catalog allows you to scope the data in the Performance tab to a specific environment.

Choose a specific environment using the `env` dropdown in the Service Catalog

One pattern that often leads to an overwhelming number of services is including the environment value in service names. For example, instead of one web-store service you end up with two, prod-web-store and dev-web-store, because the service is deployed in two separate environments.

Datadog recommends tuning your instrumentation by renaming your services.
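For example, instead of naming the service prod-web-store or dev-web-store, set DD_SERVICE=web-store in every environment and let DD_ENV (prod, staging, dev) carry the environment value, so all deployments report under a single web-store service.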

Trace metrics are unsampled, which means they reflect all of your application's traffic rather than a sampled subset. The volume guidelines above still apply.

Use the second primary tag instead of putting metric partitions or grouping variables into service names

Second primary tags are additional tags that you can use to group and aggregate your trace metrics. You can use the dropdown to scope the performance data to a given cluster name or data center value.

Use the dropdown menu to select a specific cluster or data center value

Including metric partitions or grouping variables in service names instead of applying the second primary tag unnecessarily inflates the number of unique services in an account and results in potential delay or data loss.

For example, instead of the service web-store, you might decide to name different instances of a service web-store-us-1, web-store-eu-1, and web-store-eu-2 to see performance metrics for these partitions side-by-side. Datadog recommends implementing the region value (us-1, eu-1, eu-2) as a second primary tag.
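In that case, keep DD_SERVICE=web-store everywhere and attach the partition as a tag instead, for example DD_TAGS="region:us-1" (region is an illustrative tag name); you can then configure that tag as the second primary tag in your APM settings to compare the partitions side by side.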

Connection errors

This section provides guidance on diagnosing and resolving connection and communication issues between your applications and the Datadog Agent.

Read about how to find and fix these problems in Connection Errors.

Resource usage

This section contains information on troubleshooting performance issues related to resource utilization.

Read about detecting trace collection CPU usage and about calculating adequate resource limits for the Agent in Agent Resource Usage.

If you see error messages about rate limits or max events per second in the Datadog Agent logs, you can change these limits by following these instructions. If you have questions, consult the Datadog support team before you change the limits.

Security

This section covers approaches for addressing security concerns in APM, including protecting sensitive data and managing traffic.

Several configuration options are available, in the Datadog Agent or, for some languages, in the tracing client, to scrub sensitive data or discard traces corresponding to health checks and other unwanted traffic. For details on the available options, see Security and Agent Customization. These pages offer representative examples; if you require assistance applying these options to your environment, reach out to Datadog Support.
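For example, the Agent's apm_config settings include options such as ignore_resources, which drops traces whose root resource name matches a given pattern (useful for health checks), and replace_tags, which rewrites or redacts span tag values; see the pages above for the exact syntax and per-language alternatives.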

Debugging and logging

This section explains how to use debug and startup logs to identify and resolve issues with your Datadog tracer.

To capture full details on the Datadog tracer, enable debug mode on your tracer by using the DD_TRACE_DEBUG environment variable. You might enable it for your own investigation or if Datadog support has recommended it for triage purposes. However, be sure to disable debug logging when you are finished testing to avoid the logging overhead it introduces.
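For example, setting DD_TRACE_DEBUG=true in the application's environment enables debug mode in most Datadog tracing libraries.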

These logs can surface instrumentation errors or integration-specific errors. For details on enabling and capturing these debug logs, see the debug mode troubleshooting page.

During startup, Datadog tracing libraries emit logs that report the applied configuration as a JSON object, as well as any errors encountered, including whether the Agent can be reached (in languages where this check is possible). Some languages require these startup logs to be enabled with the environment variable DD_TRACE_STARTUP_LOGS=true. For more information, see Startup logs.

Additional support

If you still need additional support, open a ticket with Datadog Support.

When you open a support ticket, the Datadog support team may ask for the following types of information:

  1. Links to a trace or screenshots of the issue: This helps reproduce your issues for troubleshooting purposes.

  2. Tracer startup logs: Startup logs help identify tracer misconfiguration or communication issues between the tracer and the Datadog Agent. By comparing the tracer’s configuration with the application or container settings, support teams can pinpoint improperly applied settings.

  3. Tracer debug logs: Tracer debug logs provide deeper insights than startup logs, revealing:

    • Proper integration instrumentation during application traffic flow
    • Contents of spans created by the tracer
    • Connection errors when sending spans to the Agent
  4. Datadog Agent flare: Datadog Agent flares enable you to see what is happening within the Datadog Agent, for example, if traces are being rejected or malformed. This does not help if traces are not reaching the Datadog Agent, but does help identify the source of an issue, or any metric discrepancies.

  5. A description of your environment: Understanding your application’s deployment configuration helps the Support team identify potential tracer-Agent communication issues and identify misconfigurations. For complex problems, support may request Kubernetes manifests, ECS task definitions, or similar deployment configuration files.

  6. Custom tracing code: Custom instrumentation, configuration, and adding span tags can significantly impact trace visualizations in Datadog.

  7. Version information: Knowing what language, framework, Datadog Agent, and Datadog tracer versions you are using allows Support to verify Compatibility Requirements, check for known issues, or recommend a version upgrade.

Further reading
