Kubernetes State Core

Supported OS: Linux, Mac OS, Windows

Overview

Get metrics from the Kubernetes service in real time to:

  • Visualize and monitor Kubernetes states.
  • Be notified about Kubernetes failovers and events.

The Kubernetes State Metrics Core check leverages kube-state-metrics version 2+ and includes major performance and tagging improvements compared to the legacy kubernetes_state check.

As opposed to the legacy check, with the Kubernetes State Metrics Core check, you no longer need to deploy kube-state-metrics in your cluster.

Kubernetes State Metrics Core is a better alternative to the legacy kubernetes_state check because it offers more granular metrics and tags. See Major Changes and Data Collected for more details.

Setup

Installation

The Kubernetes State Metrics Core check is included in the Datadog Cluster Agent image, so you don’t need to install anything else on your Kubernetes servers.

Requirements

  • Datadog Cluster Agent v1.12+

Configuration

In your Helm values.yaml, add the following:

datadog:
  # (...)
  kubeStateMetricsCore:
    enabled: true

To enable the kubernetes_state_core check, the setting spec.features.kubeStateMetricsCore.enabled must be set to true in the DatadogAgent resource:

kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
  features:
    kubeStateMetricsCore:
      enabled: true

Note: Datadog Operator v0.7.0 or greater is required.

Migration from kubernetes_state to kubernetes_state_core

Tags removal

In the original kubernetes_state check, several tags have been flagged as deprecated and replaced by new tags. To determine your migration path, check which tags are submitted with your metrics.

In the kubernetes_state_core check, only the non-deprecated tags are submitted. Before migrating from kubernetes_state to kubernetes_state_core, verify that only official tags are used in monitors and dashboards.

Here is the mapping between deprecated tags and the official tags that have replaced them:

Deprecated tag           Official tag
cluster_name             kube_cluster_name
container                kube_container_name
cronjob                  kube_cronjob
daemonset                kube_daemon_set
deployment               kube_deployment
hpa                      horizontalpodautoscaler
image                    image_name
job                      kube_job
job_name                 kube_job
namespace                kube_namespace
phase                    pod_phase
pod                      pod_name
replicaset               kube_replica_set
replicationcontroller    kube_replication_controller
statefulset              kube_stateful_set
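
When auditing monitors and dashboards before the migration, the mapping above can be applied mechanically. The following Python sketch rewrites deprecated tag keys inside the {...} scopes of a query string. The helper name migrate_query and the simplified query syntax (comma-separated key:value filters and bare group-by keys) are assumptions made for illustration and are not part of any Datadog API, so review each rewritten query by hand.

```python
import re

# Deprecated -> official tag keys, copied from the mapping table above.
TAG_MAPPING = {
    "cluster_name": "kube_cluster_name",
    "container": "kube_container_name",
    "cronjob": "kube_cronjob",
    "daemonset": "kube_daemon_set",
    "deployment": "kube_deployment",
    "hpa": "horizontalpodautoscaler",
    "image": "image_name",
    "job": "kube_job",
    "job_name": "kube_job",
    "namespace": "kube_namespace",
    "phase": "pod_phase",
    "pod": "pod_name",
    "replicaset": "kube_replica_set",
    "replicationcontroller": "kube_replication_controller",
    "statefulset": "kube_stateful_set",
}

# Longest keys first so "job_name" is tried before "job".
_KEYS = "|".join(sorted((re.escape(k) for k in TAG_MAPPING), key=len, reverse=True))

# A key must not continue an existing identifier, a tag value, or a template
# variable (lookbehind), and must be followed by ":", "," or the end of the scope.
_KEY_RE = re.compile(r"(?<![\w.:/$-])(" + _KEYS + r")(?=\s*(?::|,|$))")

# Only text inside {...} is touched, so metric names are left alone.
_SCOPE_RE = re.compile(r"\{([^{}]*)\}")


def migrate_query(query: str) -> str:
    """Hypothetical helper: replace deprecated tag keys in every {...} scope."""
    def fix_scope(match: re.Match) -> str:
        inner = _KEY_RE.sub(lambda m: TAG_MAPPING[m.group(1)], match.group(1))
        return "{" + inner + "}"
    return _SCOPE_RE.sub(fix_scope, query)


query = "avg:kubernetes_state.pod.ready{namespace:default,pod:web-1} by {deployment}"
print(migrate_query(query))
# avg:kubernetes_state.pod.ready{kube_namespace:default,pod_name:web-1} by {kube_deployment}
```

Keys that are already official, such as kube_namespace or image_name, are left unchanged, so the helper can be run repeatedly on the same query.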

Backward incompatibility changes

The Kubernetes State Metrics Core check is not backward compatible. Read the following changes carefully before migrating from the legacy kubernetes_state check.

kubernetes_state.node.by_condition
A new metric with node name granularity. The legacy metric kubernetes_state.nodes.by_condition is deprecated in favor of this one. Note: This metric is backported into the legacy check, where both metrics (it and the legacy metric it replaces) are available.
kubernetes_state.persistentvolume.by_phase
A new metric with persistentvolume name granularity. It replaces kubernetes_state.persistentvolumes.by_phase.
kubernetes_state.pod.status_phase
The metric is tagged with pod-level tags, like pod_name.
kubernetes_state.node.count
The metric is no longer tagged with host. It aggregates the node count by kernel_version, os_image, container_runtime_version, and kubelet_version.
kubernetes_state.container.waiting and kubernetes_state.container.status_report.count.waiting
These metrics no longer emit a 0 value if no pods are waiting. They only report non-zero values.
kube_job
In kubernetes_state, the kube_job tag value is the CronJob name if the Job had CronJob as an owner, otherwise it is the Job name. In kubernetes_state_core, the kube_job tag value is always the Job name, and a new kube_cronjob tag key is added with the CronJob name as the tag value. When migrating to kubernetes_state_core, it’s recommended to use the new tag or kube_job:foo*, where foo is the CronJob name, for query filters.
kubernetes_state.job.succeeded
In kubernetes_state, kubernetes_state.job.succeeded was a count type metric. In kubernetes_state_core, it is a gauge type metric.
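
The kube_job change described above can be modeled in a few lines of Python. This is only an illustration of the tagging rules stated in this section, not the check's implementation. The function names, the Job name foo-28391040, and the glob-style matching are assumptions.

```python
from fnmatch import fnmatchcase


def legacy_job_tags(job_name, cronjob_owner=None):
    """kubernetes_state: kube_job holds the CronJob name when a CronJob owns the Job."""
    return {"kube_job": cronjob_owner or job_name}


def core_job_tags(job_name, cronjob_owner=None):
    """kubernetes_state_core: kube_job is always the Job name; kube_cronjob is added."""
    tags = {"kube_job": job_name}
    if cronjob_owner:
        tags["kube_cronjob"] = cronjob_owner
    return tags


def matches(tags, key, pattern):
    """Glob-style match of a single tag filter such as kube_job:foo*."""
    return key in tags and fnmatchcase(tags[key], pattern)


# Hypothetical Job created by a CronJob named "foo".
legacy = legacy_job_tags("foo-28391040", cronjob_owner="foo")
core = core_job_tags("foo-28391040", cronjob_owner="foo")

print(matches(legacy, "kube_job", "foo"))    # True: the legacy filter works
print(matches(core, "kube_job", "foo"))      # False: the same filter no longer matches
print(matches(core, "kube_job", "foo*"))     # True: wildcard on the Job name
print(matches(core, "kube_cronjob", "foo"))  # True: new dedicated tag
```

In this model, the kube_job:foo* wildcard would also match Jobs of any other CronJob whose name starts with foo, so the kube_cronjob tag is the more precise filter.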

Node-level tag assignment

Host or node-level tags no longer appear on cluster-centric metrics. Only metrics that relate to an actual node in the cluster, like kubernetes_state.node.by_condition or kubernetes_state.container.restarts, continue to inherit their respective host or node-level tags.

To add tags globally, use the DD_TAGS environment variable, or use the respective Helm or Operator configurations. Instance-only level tags can be specified by mounting a custom kubernetes_state_core.yaml into the Cluster Agent.

Helm values.yaml:

datadog:
  kubeStateMetricsCore:
    enabled: true
  tags:
    - "<TAG_KEY>:<TAG_VALUE>"

DatadogAgent resource (Operator):

kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
    tags:
      - "<TAG_KEY>:<TAG_VALUE>"
  features:
    kubeStateMetricsCore:
      enabled: true

Metrics like kubernetes_state.container.memory_limit.total or kubernetes_state.node.count are aggregate counts of groups within a cluster, and host or node-level tags are not added.

Legacy check

Enabling kubeStateMetricsCore in your Helm values.yaml configures the Agent to ignore the auto configuration file for the legacy kubernetes_state check. This avoids running both checks simultaneously.

If you still want to run both checks simultaneously during the migration phase, set the ignoreLegacyKSMCheck field to false in your values.yaml.

Note: ignoreLegacyKSMCheck makes the Agent only ignore the auto configuration for the legacy kubernetes_state check. Custom kubernetes_state configurations need to be removed manually.

Because the Kubernetes State Metrics Core check no longer requires kube-state-metrics to be deployed in your cluster, you can disable its deployment as part of the Datadog Helm Chart. To do this, add the following in your Helm values.yaml:

datadog:
  # (...)
  kubeStateMetricsEnabled: false

Important Note: The Kubernetes State Metrics Core check is an alternative to the legacy kubernetes_state check. Datadog recommends not enabling both checks simultaneously to guarantee consistent metrics.

Data Collected

Metrics

kubernetes_state.apiservice.condition
(gauge)
The current condition of this apiservice. Tags:kube_namespace apiservice condition status.
kubernetes_state.apiservice.count
(gauge)
The current count of apiservices.
kubernetes_state.configmap.count
(gauge)
Number of ConfigMaps. Requires ConfigMaps to be added to Cluster Agent collector. Tags: kube_namespace.
kubernetes_state.container.cpu_limit
(gauge)
The value of CPU limit by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
Shown as cpu
kubernetes_state.container.cpu_limit.total
(gauge)
The total value of CPU limits by all containers in the cluster. Tags:kube_namespace kube_container_name kube_<owner kind>.
Shown as cpu
kubernetes_state.container.cpu_requested
(gauge)
The value of CPU requested by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
Shown as cpu
kubernetes_state.container.cpu_requested.total
(gauge)
The total value of CPU requested by all containers in the cluster. Tags:kube_namespace kube_container_name kube_<owner kind>.
Shown as cpu
kubernetes_state.container.gpu_limit
(gauge)
The value of GPU limit by a container. Tags:kube_namespace pod_name kube_container_name node resource mig_profile unit (env service version from standard labels).
kubernetes_state.container.gpu_limit.total
(gauge)
The total value of GPU limits by all containers in the cluster. Tags:kube_namespace kube_container_name kube_<owner kind>.
kubernetes_state.container.gpu_requested
(gauge)
The value of GPU requested by a container. Tags:kube_namespace pod_name kube_container_name node resource mig_profile unit (env service version from standard labels).
kubernetes_state.container.gpu_requested.total
(gauge)
The total value of GPU requested by all containers in the cluster. Tags:kube_namespace kube_container_name kube_<owner kind>.
kubernetes_state.container.memory_limit
(gauge)
The value of memory limit by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
Shown as byte
kubernetes_state.container.memory_limit.total
(gauge)
The total value of memory limits by all containers in the cluster. Tags:kube_namespace kube_container_name kube_<owner kind>.
Shown as byte
kubernetes_state.container.memory_requested
(gauge)
The value of memory requested by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
Shown as byte
kubernetes_state.container.memory_requested.total
(gauge)
The total value of memory requested by all containers in the cluster. Tags:kube_namespace kube_container_name kube_<owner kind>.
Shown as byte
kubernetes_state.container.network_bandwidth_limit
(gauge)
The value of network bandwidth limit for a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.network_bandwidth_requested
(gauge)
The value of network bandwidth requested by a container. Tags:kube_namespace pod_name kube_container_name node resource unit (env service version from standard labels).
kubernetes_state.container.ready
(gauge)
Describes whether the container's readiness check succeeded. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.restarts
(gauge)
The number of container restarts per container. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.running
(gauge)
Describes whether the container is currently in running state. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.status_report.count.terminated
(gauge)
Describes the reason the container is currently in terminated state. Tags:kube_namespace pod_name kube_container_name reason (env service version from standard labels).
kubernetes_state.container.status_report.count.waiting
(gauge)
Describes the reason the container is currently in waiting state. Tags:kube_namespace pod_name kube_container_name reason (env service version from standard labels).
kubernetes_state.container.terminated
(gauge)
Describes whether the container is currently in terminated state. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.container.waiting
(gauge)
Describes whether the container is currently in waiting state. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.crd.condition
(gauge)
The current condition of this custom resource definition. Tags: customresourcedefinition condition status.
kubernetes_state.crd.count
(gauge)
Number of custom resource definitions.
kubernetes_state.cronjob.count
(gauge)
Number of cronjobs. Tags:kube_namespace.
kubernetes_state.cronjob.duration_since_last_schedule
(gauge)
The duration since the last time the cronjob was scheduled. Tags:kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.cronjob.spec_suspend
(gauge)
Suspend flag tells the controller to suspend subsequent executions. Tags:kube_namespace kube_cronjob (env service version from standard labels).
kubernetes_state.daemonset.count
(gauge)
Number of DaemonSets. Tags:kube_namespace.
kubernetes_state.daemonset.daemons_available
(gauge)
The number of nodes that should be running the daemon pod and have one or more of the daemon pod running and available. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.daemons_unavailable
(gauge)
The number of nodes that should be running the daemon pod and have none of the daemon pod running and available. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.desired
(gauge)
The number of nodes that should be running the daemon pod. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.misscheduled
(gauge)
The number of nodes that are running a daemon pod but are not supposed to. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.ready
(gauge)
The number of nodes that should be running the daemon pod and have one or more of the daemon pod running and ready. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.scheduled
(gauge)
The number of nodes that are running at least one daemon pod and are supposed to. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.daemonset.updated
(gauge)
The total number of nodes that are running an updated daemon pod. Tags:kube_daemon_set kube_namespace (env service version from standard labels).
kubernetes_state.deployment.condition
(gauge)
The current status conditions of a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.count
(gauge)
Number of deployments. Tags:kube_namespace.
kubernetes_state.deployment.paused
(gauge)
Whether the deployment is paused and will not be processed by the deployment controller. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas
(gauge)
The number of replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_available
(gauge)
The number of available replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_desired
(gauge)
Number of desired pods for a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_ready
(gauge)
The number of ready replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_unavailable
(gauge)
The number of unavailable replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.replicas_updated
(gauge)
The number of updated replicas per deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.rollingupdate.max_surge
(gauge)
Maximum number of replicas that can be scheduled above the desired number of replicas during a rolling update of a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.deployment.rollingupdate.max_unavailable
(gauge)
Maximum number of unavailable replicas during a rolling update of a deployment. Tags:kube_deployment kube_namespace (env service version from standard labels).
kubernetes_state.endpoint.address_available
(gauge)
Number of addresses available in endpoint. Tags:endpoint kube_namespace.
kubernetes_state.endpoint.address_not_ready
(gauge)
Number of addresses not ready in endpoint. Tags:endpoint kube_namespace.
kubernetes_state.endpoint.count
(gauge)
Number of endpoints. Tags:kube_namespace.
kubernetes_state.hpa.condition
(gauge)
The condition of this autoscaler. Tags:kube_namespace horizontalpodautoscaler condition status.
kubernetes_state.hpa.count
(gauge)
Number of horizontal pod autoscalers. Tags: kube_namespace.
kubernetes_state.hpa.current_replicas
(gauge)
Current number of replicas of pods managed by this autoscaler. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.desired_replicas
(gauge)
Desired number of replicas of pods managed by this autoscaler. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.max_replicas
(gauge)
Upper limit for the number of pods that can be set by the autoscaler; cannot be smaller than MinReplicas. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.min_replicas
(gauge)
Lower limit for the number of pods that can be set by the autoscaler; defaults to 1. Tags:kube_namespace horizontalpodautoscaler.
kubernetes_state.hpa.spec_target_metric
(gauge)
The metric specifications used by this autoscaler when calculating the desired replica count. Tags:kube_namespace horizontalpodautoscaler metric_name metric_target_type.
kubernetes_state.hpa.status_target_metric
(gauge)
The current metric status used by this autoscaler when calculating the desired replica count. Tags:kube_namespace horizontalpodautoscaler metric_name metric_target_type.
kubernetes_state.ingress.count
(gauge)
Number of ingresses. Tags:kube_namespace.
kubernetes_state.ingress.path
(gauge)
Information about the ingress path. Tags:kube_namespace kube_ingress_path kube_ingress kube_service kube_service_port kube_ingress_host.
kubernetes_state.initcontainer.restarts
(gauge)
The number of restarts for the init container. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.initcontainer.waiting
(gauge)
Describes whether the init container is currently in waiting state. Tags:kube_namespace pod_name kube_container_name (env service version from standard labels).
kubernetes_state.job.completion.failed
(gauge)
The job has failed its execution. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.completion.succeeded
(gauge)
The job has completed its execution. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.count
(gauge)
Number of jobs. Tags:kube_namespace kube_cronjob.
kubernetes_state.job.duration
(gauge)
Time elapsed between the start and completion time of the job or the current time if the job is still running. Tags:kube_job kube_namespace (env service version from standard labels).
kubernetes_state.job.failed
(gauge)
The number of pods which reached Phase Failed. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.succeeded
(gauge)
The number of pods which reached Phase Succeeded. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.limitrange.cpu.default
(gauge)
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as cpu
kubernetes_state.limitrange.cpu.default_request
(gauge)
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as cpu
kubernetes_state.limitrange.cpu.max
(gauge)
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as cpu
kubernetes_state.limitrange.cpu.max_limit_request_ratio
(gauge)
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as cpu
kubernetes_state.limitrange.cpu.min
(gauge)
Information about CPU limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as cpu
kubernetes_state.limitrange.memory.default
(gauge)
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as byte
kubernetes_state.limitrange.memory.default_request
(gauge)
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as byte
kubernetes_state.limitrange.memory.max
(gauge)
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as byte
kubernetes_state.limitrange.memory.max_limit_request_ratio
(gauge)
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as byte
kubernetes_state.limitrange.memory.min
(gauge)
Information about memory limit range usage by constraint. Tags:kube_namespace limitrange type.
Shown as byte
kubernetes_state.namespace.count
(gauge)
Number of namespaces. Tags:phase.
kubernetes_state.node.age
(gauge)
The time in seconds since the creation of the node. Tags:node.
Shown as second
kubernetes_state.node.by_condition
(gauge)
The condition of a cluster node. Tags:condition node status.
kubernetes_state.node.count
(gauge)
Number of nodes. Tags:kernel_version os_image container_runtime_version kubelet_version.
kubernetes_state.node.cpu_allocatable
(gauge)
The allocatable CPU of a node that is available for scheduling. Tags:node resource unit.
Shown as cpu
kubernetes_state.node.cpu_allocatable.total
(gauge)
The total allocatable CPU of all nodes in the cluster that is available for scheduling.
Shown as cpu
kubernetes_state.node.cpu_capacity
(gauge)
The CPU capacity of a node. Tags:node resource unit.
Shown as cpu
kubernetes_state.node.cpu_capacity.total
(gauge)
The total CPU capacity of all nodes in the cluster.
Shown as cpu
kubernetes_state.node.ephemeral_storage_allocatable
(gauge)
The allocatable ephemeral-storage of a node that is available for scheduling. Tags:node resource unit.
kubernetes_state.node.ephemeral_storage_capacity
(gauge)
The ephemeral-storage capacity of a node. Tags:node resource unit.
kubernetes_state.node.gpu_allocatable
(gauge)
The allocatable GPU of a node that is available for scheduling. Tags:node resource mig_profile unit.
kubernetes_state.node.gpu_allocatable.total
(gauge)
The total allocatable GPU of all nodes in the cluster that is available for scheduling.
kubernetes_state.node.gpu_capacity
(gauge)
The GPU capacity of a node. Tags:node resource mig_profile unit.
kubernetes_state.node.gpu_capacity.total
(gauge)
The total GPU capacity of all nodes in the cluster.
kubernetes_state.node.memory_allocatable
(gauge)
The allocatable memory of a node that is available for scheduling. Tags:node resource unit.
Shown as byte
kubernetes_state.node.memory_allocatable.total
(gauge)
The total allocatable memory of all nodes in the cluster that is available for scheduling.
Shown as byte
kubernetes_state.node.memory_capacity
(gauge)
The memory capacity of a node. Tags:node resource unit.
Shown as byte
kubernetes_state.node.memory_capacity.total
(gauge)
The total memory capacity of all nodes in the cluster.
Shown as byte
kubernetes_state.node.network_bandwidth_allocatable
(gauge)
The allocatable network bandwidth of a node that is available for scheduling. Tags:node resource unit.
kubernetes_state.node.network_bandwidth_capacity
(gauge)
The network bandwidth capacity of a node. Tags:node resource unit.
kubernetes_state.node.pods_allocatable
(gauge)
The allocatable pods of a node that are available for scheduling. Tags:node resource unit.
kubernetes_state.node.pods_capacity
(gauge)
The pods capacity of a node. Tags:node resource unit.
kubernetes_state.node.status
(gauge)
Whether the node can schedule new pods. Tags:node status.
kubernetes_state.pdb.disruptions_allowed
(gauge)
Number of pod disruptions that are currently allowed. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.pdb.pods_desired
(gauge)
Minimum desired number of healthy pods. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.pdb.pods_healthy
(gauge)
Current number of healthy pods. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.pdb.pods_total
(gauge)
Total number of pods counted by this disruption budget. Tags:kube_namespace poddisruptionbudget.
kubernetes_state.persistentvolume.by_phase
(gauge)
The phase indicates whether a volume is available, bound to a claim, or released by a claim. Tags:persistentvolume storageclass phase.
kubernetes_state.persistentvolume.capacity
(gauge)
Persistentvolume capacity in bytes. Tags:persistentvolume storageclass.
kubernetes_state.persistentvolumeclaim.access_mode
(gauge)
The access mode(s) specified by the persistent volume claim. Tags:kube_namespace persistentvolumeclaim access_mode storageclass.
kubernetes_state.persistentvolumeclaim.request_storage
(gauge)
The capacity of storage requested by the persistent volume claim. Tags:kube_namespace persistentvolumeclaim storageclass.
kubernetes_state.persistentvolumeclaim.status
(gauge)
The phase the persistent volume claim is currently in. Tags:kube_namespace persistentvolumeclaim phase storageclass.
kubernetes_state.pod.age
(gauge)
The time in seconds since the creation of the pod. Tags:node kube_namespace pod_name pod_phase (env service version from standard labels).
Shown as second
kubernetes_state.pod.count
(gauge)
Number of Pods. Tags:node kube_namespace kube_<owner kind>.
kubernetes_state.pod.ready
(gauge)
Describes whether the pod is ready to serve requests. Tags:node kube_namespace pod_name condition (env service version from standard labels).
kubernetes_state.pod.scheduled
(gauge)
Describes the status of the scheduling process for the pod. Tags:node kube_namespace pod_name condition (env service version from standard labels).
kubernetes_state.pod.status_phase
(gauge)
The pod's current phase. Tags:node kube_namespace pod_name pod_phase (env service version from standard labels).
kubernetes_state.pod.tolerations
(gauge)
Information about the pod tolerations.
kubernetes_state.pod.unschedulable
(gauge)
Describes the unschedulable status for the pod. Tags:kube_namespace pod_name (env service version from standard labels).
kubernetes_state.pod.uptime
(gauge)
The time in seconds since the pod has been scheduled and acknowledged by the Kubelet. Tags:node kube_namespace pod_name pod_phase (env service version from standard labels).
kubernetes_state.pod.volumes.persistentvolumeclaims_readonly
(gauge)
Describes whether a persistentvolumeclaim is mounted read only. Tags:node kube_namespace pod_name volume persistentvolumeclaim (env service version from standard labels).
kubernetes_state.replicaset.count
(gauge)
Number of ReplicaSets. Tags:kube_namespace kube_deployment.
kubernetes_state.replicaset.fully_labeled_replicas
(gauge)
The number of fully labeled replicas per ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicaset.replicas
(gauge)
The number of replicas per ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicaset.replicas_desired
(gauge)
Number of desired pods for a ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicaset.replicas_ready
(gauge)
The number of ready replicas per ReplicaSet. Tags:kube_namespace kube_replica_set (env service version from standard labels).
kubernetes_state.replicationcontroller.fully_labeled_replicas
(gauge)
The number of fully labeled replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.replicas
(gauge)
The number of replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.replicas_available
(gauge)
The number of available replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.replicas_desired
(gauge)
Number of desired pods for a ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.replicationcontroller.replicas_ready
(gauge)
The number of ready replicas per ReplicationController. Tags:kube_namespace kube_replication_controller.
kubernetes_state.resourcequota.count_configmaps.limit
(gauge)
Information about resource quota limits by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.count_configmaps.used
(gauge)
Information about resource quota usage by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.count_secrets.limit
(gauge)
Information about resource quota limits by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.count_secrets.used
(gauge)
Information about resource quota usage by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.pods.limit
(gauge)
Information about resource quota limits by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.pods.used
(gauge)
Information about resource quota usage by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.requests.cpu.limit
(gauge)
Information about resource quota limits by resource. Tags:kube_namespace resourcequota.
kubernetes_state.resourcequota.requests.cpu.used
(gauge)
Information about resource quota usage by resource. Tags:kube_namespace resourcequota.
kubernetes_state.secret.count
(gauge)
Number of Secrets. Requires Secrets to be added to Cluster Agent collector. Tags: kube_namespace.
kubernetes_state.secret.type
(gauge)
The type of the secret. Tags:kube_namespace secret type.
kubernetes_state.service.count
(gauge)
Number of services. Tags:kube_namespace type.
kubernetes_state.service.type
(gauge)
Service types. Tags:kube_namespace kube_service type.
kubernetes_state.statefulset.count
(gauge)
Number of StatefulSets. Tags:kube_namespace.
kubernetes_state.statefulset.replicas
(gauge)
The number of replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas_current
(gauge)
The number of current replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas_desired
(gauge)
Number of desired pods for a StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas_ready
(gauge)
The number of ready replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.statefulset.replicas_updated
(gauge)
The number of updated replicas per StatefulSet. Tags:kube_namespace kube_stateful_set (env service version from standard labels).
kubernetes_state.vpa.count
(gauge)
Number of vertical pod autoscalers. Tags: kube_namespace.
kubernetes_state.vpa.lower_bound
(gauge)
Minimum resources the container can use before the VerticalPodAutoscaler updater evicts it. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.spec_container_maxallowed
(gauge)
Maximum resources the VerticalPodAutoscaler can set for containers matching the name. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.spec_container_minallowed
(gauge)
Minimum resources the VerticalPodAutoscaler can set for containers matching the name. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.target
(gauge)
Target resources the VerticalPodAutoscaler recommends for the container. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.uncapped_target
(gauge)
Target resources the VerticalPodAutoscaler recommends for the container ignoring bounds. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.
kubernetes_state.vpa.update_mode
(gauge)
Update mode of the VerticalPodAutoscaler. Tags:kube_namespace verticalpodautoscaler target_api_version target_kind target_name update_mode.
kubernetes_state.vpa.upperbound
(gauge)
Maximum resources the container can use before the VerticalPodAutoscaler updater evicts it. Tags:kube_namespace verticalpodautoscaler kube_container_name resource target_api_version target_kind target_name unit.

Note: You can configure Datadog Standard labels on your Kubernetes objects to get the env, service, and version tags.

Events

The Kubernetes State Metrics Core check does not include any events.

Default labels as tags

Recommended Label                 Tag
app.kubernetes.io/name            kube_app_name
app.kubernetes.io/instance        kube_app_instance
app.kubernetes.io/version         kube_app_version
app.kubernetes.io/component       kube_app_component
app.kubernetes.io/part-of         kube_app_part_of
app.kubernetes.io/managed-by      kube_app_managed_by
helm.sh/chart                     helm_chart

Recommended Label                            Tag
topology.kubernetes.io/region                kube_region
topology.kubernetes.io/zone                  kube_zone
failure-domain.beta.kubernetes.io/region     kube_region
failure-domain.beta.kubernetes.io/zone       kube_zone

Datadog labels (Unified Service Tagging)

Datadog Label                 Tag
tags.datadoghq.com/env        env
tags.datadoghq.com/service    service
tags.datadoghq.com/version    version

Service Checks

kubernetes_state.cronjob.complete
Whether the last job of the cronjob failed. Tags:kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.cronjob.on_schedule_check
Alert if the cronjob’s next schedule is in the past. Tags:kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.job.complete
Whether the job failed. Tags:kube_job or kube_cronjob kube_namespace (env service version from standard labels).
kubernetes_state.node.ready
Whether the node is ready. Tags:node condition status.
kubernetes_state.node.out_of_disk
Whether the node is out of disk. Tags:node condition status.
kubernetes_state.node.disk_pressure
Whether the node is under disk pressure. Tags:node condition status.
kubernetes_state.node.network_unavailable
Whether the node network is unavailable. Tags:node condition status.
kubernetes_state.node.memory_pressure
Whether the node is under memory pressure. Tags:node condition status.

Validation

Run the Cluster Agent’s status subcommand inside your Cluster Agent container and look for kubernetes_state_core under the Checks section.
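
The validation step can also be scripted. The Python sketch below assumes that kubectl is configured for your cluster and that the status subcommand is available as datadog-cluster-agent status inside the Cluster Agent container. The function names are illustrative, and the pod name and namespace are placeholders you supply.

```python
import subprocess


def status_command(pod, namespace):
    """Build the kubectl invocation that runs the status subcommand in the pod."""
    return ["kubectl", "exec", "-n", namespace, pod, "--",
            "datadog-cluster-agent", "status"]


def check_is_listed(status_output, check_name="kubernetes_state_core"):
    """Return True if the check name appears anywhere in the status text."""
    return check_name in status_output


def validate(pod, namespace):
    """Run the status subcommand and report whether the check shows up.

    Raises subprocess.CalledProcessError if kubectl exits with an error.
    """
    result = subprocess.run(status_command(pod, namespace),
                            capture_output=True, text=True, check=True)
    return check_is_listed(result.stdout)
```

Calling validate("<CLUSTER_AGENT_POD_NAME>", "<NAMESPACE>") returns True when kubernetes_state_core appears in the output. The substring test does not confirm that the name is under the Checks section, so inspect the full status output if the result is unexpected.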

Troubleshooting

Timeout errors

By default, the Kubernetes State Metrics Core check waits 10 seconds for a response from the Kubernetes API server. For large clusters, the request may time out, resulting in missing metrics.

You can avoid this by setting the environment variable DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT to a higher value than the default 10 seconds.

Update your datadog-agent.yaml with the following configuration:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  override:
    clusterAgent:
      env:
        - name: DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT
          value: "<value_greater_than_10>"  # keep the number quoted: Kubernetes env values must be strings

Then apply the new configuration:

kubectl apply -n $DD_NAMESPACE -f datadog-agent.yaml

Update your datadog-values.yaml with the following configuration:

clusterAgent:
  env:
    - name: DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT
      value: "<value_greater_than_10>"  # keep the number quoted: Kubernetes env values must be strings

Then upgrade your Helm chart:

helm upgrade -f datadog-values.yaml <RELEASE_NAME> datadog/datadog

Need help? Contact Datadog support.
