Observability Pipelines is not available on the US1-FED Datadog site.

Overview

In Observability Pipelines, your pipelines are composed of components that collect, process, and route your observability data. The health of your pipelines and components is indicated by health statuses and graphs, as well as resource utilization and data delivery graphs.

Health statuses are determined by specific metrics based on thresholds and default time windows. The available statuses are as follows:

  • Healthy: Indicates the Worker is not falling behind.
  • Warning: Indicates the Worker is not performing optimally and is at risk of falling behind. This can happen when a downstream destination or service causes back pressure to build up, or when not enough resources are provisioned for the Workers.
  • Critical: Indicates that the Worker is falling behind. If the Worker is falling behind, it may be at risk of dropping data; however, the Worker will not drop data unintentionally as long as your pipelines are architected and configured correctly.

Internal metrics, which are grouped by health, data delivery, and resource utilization, drive the overall health status of your pipeline and its components.

Health graphs are available for the following metrics:

  • Events unintentionally dropped
  • Errors
  • Lag time (only available for sources)
  • Lag time rate of change (only available for sources)
  • Utilization

Data delivery graphs are available for the following metrics:

  • Events in/out per second
  • Bytes in/out per second

Resource utilization graphs are available for the following metrics:

  • CPU usage
  • Memory usage
  • Disk usage (only available for destinations)

See the status of your pipelines and components

  1. Navigate to Observability Pipelines.
  2. Click on a pipeline.
  3. Hover over the graphs to see specific data points.

Pipeline resource utilization health metrics

| Metric | OK | Warning | Critical | Description |
|---|---|---|---|---|
| CPU usage | <= 0.85 | > 0.85 | N/A | Tracks how much CPU a Worker process is using. A value of 1 indicates that a Worker process does not have any more headroom in the host or compute units running it. This can lead to possible issues such as processing latency going out of bounds, upstream/downstream overload, and so on. |
| Memory usage | >= 0.15 | < 0.15 | N/A | Tracks the amount of used and free memory on the host. The Worker is not memory bound but high memory usage can indicate leaks. |
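
The thresholds in this table can be expressed as a small classifier. The following is a minimal sketch, not part of the Worker or any Datadog API: the names `Status`, `cpu_status`, and `memory_status` are hypothetical, and the inputs are assumed to be the same 0 to 1 values that the table compares against.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARNING = "warning"


def cpu_status(cpu_usage: float) -> Status:
    """CPU usage: OK at or below 0.85, Warning above 0.85 (no Critical level)."""
    return Status.OK if cpu_usage <= 0.85 else Status.WARNING


def memory_status(memory_value: float) -> Status:
    """Memory usage: OK at or above 0.15, Warning below 0.15 (no Critical level)."""
    return Status.OK if memory_value >= 0.15 else Status.WARNING


if __name__ == "__main__":
    # A Worker at 90% CPU is past the 0.85 threshold.
    print(cpu_status(0.90))
    # A memory value of 0.15 sits exactly on the OK boundary.
    print(memory_status(0.15))
```

Neither metric has a Critical level in the table, so the sketch only ever returns OK or Warning.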

Component health metrics

| Metric | OK | Warning | Critical | Description |
|---|---|---|---|---|
| Events dropped | == 0 | N/A | > 0 | Expected to always be 0. If you configured the Worker to intentionally drop data, for example using the filter transform, that data is not counted here. Therefore, a single unintentionally dropped event indicates that the Worker is not in a healthy state. |
| Total errors | == 0 | > 0 | N/A | The total number of errors encountered by the component. These errors are also emitted as Diagnostic Logs, which provide more information about specific internal error logs. |
| Utilization | <= 0.95 | > 0.95 | N/A | Tracks the component's activity. A value of 0 indicates an idle component that is waiting for input. A value of 1 indicates a component that is never idle. A value greater than 0.95 indicates that the component is busy and likely a bottleneck in the processing topology. |
| Lag time (sources only) | N/A | N/A | N/A | The raw time difference (in milliseconds) between the timestamp on the event and the timestamp of when the event was ingested by the Worker. A high lag time or a change in the lag time (see Lag time rate of change) indicates that the Worker may be falling behind due to back pressure from a downstream service, a lack of resources provisioned to the Worker, or a bottleneck in the pipeline. |
| Lag time rate of change (sources only) | <= 0 | > 0 | > 1 | Indicates whether there is a substantial delay between when the event is generated and when the Worker receives the data. If there is a delay, then the Worker is falling behind in receiving data from the source. A value of 0 indicates there is no additional lag between when the observability data is generated and when the Worker receives the data. A value equal to or greater than 1 indicates that there is back pressure and a bottleneck. |
| Disk usage (destinations only) | >= 0.20 | > 0.20 | N/A | Measures how full a given disk is. A value of 1 indicates that no data can be stored in the disk. A value of 0 indicates that the disk is empty. |
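
To make the lag metrics and thresholds above concrete, here is a minimal sketch in Python. It is illustrative only: every function name is hypothetical, and the rate-of-change formula (change in lag divided by elapsed wall-clock time, both in milliseconds) is an assumption, because this page does not state how the Worker computes it. Disk usage is left out of the sketch.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def lag_time_ms(event_ts_ms: int, ingested_ts_ms: int) -> int:
    """Raw lag: Worker ingestion timestamp minus event timestamp, in ms."""
    return ingested_ts_ms - event_ts_ms


def lag_rate_of_change(prev_lag_ms: float, curr_lag_ms: float, elapsed_ms: float) -> float:
    """ASSUMPTION: rate of change = growth in lag per unit of elapsed time.

    Under this assumption, 0 means lag is stable, and 1 means lag grows by
    one second for every second that passes.
    """
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return (curr_lag_ms - prev_lag_ms) / elapsed_ms


def lag_rate_status(rate: float) -> Status:
    """Table thresholds: OK <= 0, Warning > 0, Critical > 1."""
    if rate > 1:
        return Status.CRITICAL
    if rate > 0:
        return Status.WARNING
    return Status.OK


def events_dropped_status(dropped: int) -> Status:
    """Any unintentionally dropped event is Critical."""
    return Status.CRITICAL if dropped > 0 else Status.OK


def total_errors_status(errors: int) -> Status:
    """Any error is a Warning; there is no Critical level."""
    return Status.WARNING if errors > 0 else Status.OK


def utilization_status(utilization: float) -> Status:
    """Above 0.95 the component is likely a bottleneck."""
    return Status.WARNING if utilization > 0.95 else Status.OK


if __name__ == "__main__":
    # Lag grew from 2000 ms to 5000 ms over a 10000 ms window.
    rate = lag_rate_of_change(2000, 5000, 10000)
    print(rate, lag_rate_status(rate))
```

In the usage example, lag is growing more slowly than time passes, so the source lands in Warning rather than Critical.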