Datadog Cluster Agent

Supported OS: Linux, macOS, Windows

Integration version: 3.2.0

Overview

This check monitors the Datadog Cluster Agent through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The Datadog Cluster Agent check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

The Datadog Cluster Agent check uses Autodiscovery to automatically configure itself in most scenarios. The check runs in the Datadog Agent pod on the same node as the Cluster Agent pod. It will not run in the Cluster Agent itself.

If you need to further configure the check:

  1. Edit the datadog_cluster_agent.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your datadog_cluster_agent performance data. See the sample datadog_cluster_agent.d/conf.yaml for all available configuration options.

  2. Restart the Agent.

Validation

Run the Agent’s status subcommand and look for datadog_cluster_agent under the Checks section.
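
As a rough illustration of this validation step, the sketch below scans the text printed by the Agent’s status subcommand for the check name. The helper names check_is_listed and agent_status_text are hypothetical, the sample text is made up for the example, and the command name datadog-agent is an assumption about a host install (containerized Agents may expose the binary under a different name).

```python
import subprocess

CHECK_NAME = "datadog_cluster_agent"


def check_is_listed(status_output: str, check_name: str = CHECK_NAME) -> bool:
    """Return True if the check name appears on any line of the status text."""
    return any(check_name in line for line in status_output.splitlines())


def agent_status_text() -> str:
    """Run the Agent's status subcommand and return its standard output.

    Assumes a `datadog-agent` binary is on PATH; adjust for your install.
    """
    result = subprocess.run(
        ["datadog-agent", "status"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Illustrative sample only; real status output is longer and laid out differently.
SAMPLE_STATUS = """
Checks
  datadog_cluster_agent
  cpu
"""

if __name__ == "__main__":
    print(check_is_listed(SAMPLE_STATUS))
```

To validate a live Agent, pass the result of agent_status_text() to check_is_listed() instead of the sample text. A plain substring match is deliberately loose, so it only confirms that the check is mentioned, not that it is running without errors.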

Data Collected

Metrics

datadog.cluster_agent.admission_webhooks.certificate_expiry
(gauge)
Time left before the certificate expires
Shown as hour
datadog.cluster_agent.admission_webhooks.cws_exec_instrumentation_attempts.count
(count)
CWS exec Instrumentation attempts count
datadog.cluster_agent.admission_webhooks.cws_exec_instrumentation_attempts.sum
(count)
CWS exec Instrumentation attempts sum
datadog.cluster_agent.admission_webhooks.cws_pod_instrumentation_attempts.count
(count)
CWS pod Instrumentation attempts count
datadog.cluster_agent.admission_webhooks.cws_pod_instrumentation_attempts.sum
(count)
CWS pod Instrumentation attempts sum
datadog.cluster_agent.admission_webhooks.library_injection_attempts
(count)
Number of library injection attempts by language
datadog.cluster_agent.admission_webhooks.library_injection_errors
(count)
Number of library injection failures by language
datadog.cluster_agent.admission_webhooks.mutation_attempts
(gauge)
Number of pod mutation attempts by mutation type
datadog.cluster_agent.admission_webhooks.mutation_errors
(gauge)
Number of mutation failures by mutation type
datadog.cluster_agent.admission_webhooks.patcher.attempts
(count)
Number of patch attempts
datadog.cluster_agent.admission_webhooks.patcher.completed
(count)
Number of completed patch attempts
datadog.cluster_agent.admission_webhooks.patcher.errors
(count)
Number of patch errors
datadog.cluster_agent.admission_webhooks.rc_provider.configs
(gauge)
Number of valid remote configurations
datadog.cluster_agent.admission_webhooks.rc_provider.invalid_configs
(gauge)
Number of invalid remote configurations
datadog.cluster_agent.admission_webhooks.reconcile_errors
(gauge)
Number of reconcile errors per controller
datadog.cluster_agent.admission_webhooks.reconcile_success
(gauge)
Number of reconcile successes per controller
Shown as success
datadog.cluster_agent.admission_webhooks.response_duration.count
(count)
Webhook response duration count
datadog.cluster_agent.admission_webhooks.response_duration.sum
(count)
Webhook response duration sum
Shown as second
datadog.cluster_agent.admission_webhooks.webhooks_received
(gauge)
Number of mutation webhook requests received
datadog.cluster_agent.aggregator.flush
(count)
Number of metrics/service checks/events flushed by (data_type, state)
datadog.cluster_agent.aggregator.processed
(count)
Number of metrics/service checks/events processed by the aggregator by data type
datadog.cluster_agent.api_requests
(count)
Requests made to the cluster agent API by (handler, status)
Shown as request
datadog.cluster_agent.autodiscovery.errors
(gauge)
Number of Autodiscovery errors
datadog.cluster_agent.autodiscovery.poll_duration.count
(count)
Autodiscovery poll duration count
datadog.cluster_agent.autodiscovery.poll_duration.sum
(count)
Autodiscovery poll duration sum
Shown as second
datadog.cluster_agent.autodiscovery.watched_resources
(gauge)
Number of watched resources (Services and Endpoints)
datadog.cluster_agent.cluster_checks.busyness
(gauge)
Busyness of a node per the number of metrics submitted and average duration of all checks run
datadog.cluster_agent.cluster_checks.configs_dangling
(gauge)
Number of check configurations not dispatched
datadog.cluster_agent.cluster_checks.configs_dispatched
(gauge)
Number of check configurations dispatched by node
datadog.cluster_agent.cluster_checks.configs_info
(gauge)
Information about check configurations dispatched (node and check ID)
datadog.cluster_agent.cluster_checks.failed_stats_collection
(count)
Total number of unsuccessful stats collection attempts
datadog.cluster_agent.cluster_checks.nodes_reporting
(gauge)
Number of node agents reporting
datadog.cluster_agent.cluster_checks.rebalancing_decisions
(count)
Total number of check rebalancing decisions
datadog.cluster_agent.cluster_checks.rebalancing_duration_seconds
(gauge)
Duration of the last execution of the check rebalancing algorithm
Shown as second
datadog.cluster_agent.cluster_checks.successful_rebalancing_moves
(count)
Total number of successful check rebalancing decisions
Shown as check
datadog.cluster_agent.cluster_checks.updating_stats_duration_seconds
(gauge)
Duration of collecting stats from check runners and updating cache
Shown as second
datadog.cluster_agent.datadog.rate_limit_queries.limit
(gauge)
Maximum number of queries to the Datadog API allowed in the period by endpoint
Shown as query
datadog.cluster_agent.datadog.rate_limit_queries.period
(gauge)
Period of rate limiting for the Datadog API by endpoint
Shown as second
datadog.cluster_agent.datadog.rate_limit_queries.remaining
(gauge)
Number of queries to the Datadog API remaining before next reset by endpoint
Shown as query
datadog.cluster_agent.datadog.rate_limit_queries.remaining_min
(gauge)
Minimum number of queries remaining before next reset observed during an expiration interval of 2*refresh period
Shown as query
datadog.cluster_agent.datadog.rate_limit_queries.reset
(gauge)
Number of seconds before next reset applied to the Datadog API by endpoint
Shown as second
datadog.cluster_agent.datadog.requests
(count)
Requests made to Datadog by status
Shown as request
datadog.cluster_agent.endpoint_checks.configs_dispatched
(gauge)
Number of endpoint-check configurations dispatched by node
datadog.cluster_agent.external_metrics
(gauge)
Number of external metrics tagged
datadog.cluster_agent.external_metrics.api_elapsed.count
(count)
Count of API Requests received
datadog.cluster_agent.external_metrics.api_elapsed.sum
(count)
Count of API Requests received
datadog.cluster_agent.external_metrics.api_requests
(gauge)
Count of API Requests received
datadog.cluster_agent.external_metrics.datadog_metrics
(gauge)
The label valid is true if the DatadogMetric CR is valid, false otherwise
datadog.cluster_agent.external_metrics.delay_seconds
(gauge)
Freshness of the metric evaluated from querying Datadog
Shown as second
datadog.cluster_agent.external_metrics.processed_value
(gauge)
Value processed from querying Datadog by metric
datadog.cluster_agent.go.goroutines
(gauge)
Number of goroutines that currently exist
datadog.cluster_agent.go.memstats.alloc_bytes
(gauge)
Number of bytes allocated and still in use
Shown as byte
datadog.cluster_agent.go.threads
(gauge)
Number of OS threads created
Shown as thread
datadog.cluster_agent.kubernetes_apiserver.emitted_events
(count)
Datadog events emitted by the kubernetes_apiserver check
datadog.cluster_agent.kubernetes_apiserver.kube_events
(count)
Kubernetes events processed by the kubernetes_apiserver check
datadog.cluster_agent.language_detection_dca_handler.processed_requests
(count)
The number of process language detection requests processed by the handler
datadog.cluster_agent.language_detection_patcher.patches
(count)
The number of patch requests sent by the patcher to the kube api server
datadog.cluster_agent.secret_backend.elapsed
(gauge)
The elapsed time of secret backend invocation
Shown as millisecond
datadog.cluster_agent.tagger.stored_entities
(gauge)
Number of entities stored in the tagger
datadog.cluster_agent.tagger.updated_entities
(count)
Number of updates made to entities in the tagger
datadog.cluster_agent.workloadmeta.events_received
(count)
Number of events received by workloadmeta
datadog.cluster_agent.workloadmeta.notifications_sent
(count)
Number of notifications sent by workloadmeta to its subscribers
datadog.cluster_agent.workloadmeta.stored_entities
(gauge)
Number of entities stored in workloadmeta
datadog.cluster_agent.workloadmeta.subscribers
(gauge)
Number of workloadmeta subscribers
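
Several metrics in this list come in .count and .sum pairs, for example datadog.cluster_agent.admission_webhooks.response_duration.count and datadog.cluster_agent.admission_webhooks.response_duration.sum. Assuming these pairs follow the usual Prometheus convention (the sum is the total of the observed values and the count is the number of observations), dividing one by the other over the same interval gives a mean value. The sketch below shows that arithmetic; the function name mean_duration is hypothetical and the input numbers are invented.

```python
from typing import Optional


def mean_duration(sum_delta: float, count_delta: float) -> Optional[float]:
    """Mean observed value over an interval, or None if nothing was observed.

    sum_delta:   increase of the .sum metric over the interval (e.g. seconds)
    count_delta: increase of the .count metric over the same interval
    """
    if count_delta <= 0:
        return None  # no observations, so a mean is undefined
    return sum_delta / count_delta


if __name__ == "__main__":
    # Invented numbers: 1.5 seconds of total response time across 3 requests.
    print(mean_duration(1.5, 3))
```

Returning None rather than 0 for an empty interval keeps "no requests" distinguishable from "instant responses".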

Events

The Datadog Cluster Agent integration does not include any events.

Service Checks

datadog.cluster_agent.prometheus.health
Returns CRITICAL if the check cannot access the metrics endpoint. Returns OK otherwise.
Statuses: ok, critical
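
The rule behind this service check can be sketched as a small function: report critical when the metrics endpoint cannot be reached, ok otherwise. This is only an illustration of the described behavior, not the check’s actual implementation. The function name health_status, the injectable opener parameter, and the URL in the usage line are all hypothetical.

```python
import urllib.request

OK = "ok"
CRITICAL = "critical"


def health_status(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> str:
    """Return "critical" if the endpoint cannot be accessed, "ok" otherwise.

    `opener` defaults to urllib.request.urlopen and can be swapped for a stub.
    """
    try:
        with opener(url, timeout=timeout) as response:
            response.read(1)  # touch the body to confirm the endpoint answers
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers malformed URLs.
        return CRITICAL
    return OK


if __name__ == "__main__":
    # Placeholder URL for illustration; substitute your own metrics endpoint.
    print(health_status("http://localhost:9/metrics", timeout=1.0))
```

Passing the opener as a parameter is a design choice for this sketch only: it lets the rule be exercised without a running Cluster Agent.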

Troubleshooting

Need help? Contact Datadog support.
