Kubelet

Supported OS: Linux, Mac OS, Windows

Integration version: 9.1.0

Overview

This integration collects container metrics from the kubelet. Use it to:

  • Visualize and monitor kubelet stats.
  • Be notified about kubelet failovers and events.

Setup

Installation

The Kubelet check is included in the Datadog Agent package, so you don’t need to install anything else on your servers.

Configuration

Edit the kubelet.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample kubelet.d/conf.yaml for all available configuration options.
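As a rough illustration of the file described above, a minimal kubelet.d/conf.yaml could look like the sketch below. It assumes the usual init_config/instances layout of Agent check configuration files; the tag value is hypothetical, and the sample kubelet.d/conf.yaml remains the reference for the real option list.

```yaml
# conf.d/kubelet.d/conf.yaml (illustrative sketch, not the shipped sample)
init_config:

instances:
    # A single instance; options that are not set keep their defaults.
  - tags:
      - "team:platform"   # hypothetical custom tag attached to this check's data
```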

Validation

Run the Agent’s status subcommand and look for kubelet under the Checks section.

Compatibility

The kubelet check can run in two modes:

  • The default prometheus mode is compatible with Kubernetes version 1.7.6 or later.
  • The cAdvisor mode (enabled by setting the cadvisor_port option) should be compatible with Kubernetes versions 1.3 and later. Consistent tagging and filtering requires at least version 6.2 of the Agent.
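To make the second mode concrete, the sketch below enables cAdvisor mode by setting cadvisor_port on the instance. The port value 4194 is an assumption taken from the cAdvisor port this page mentions for OpenShift; use whichever port your kubelet actually exposes.

```yaml
# conf.d/kubelet.d/conf.yaml (illustrative sketch)
init_config:

instances:
    # Setting cadvisor_port switches the check to cAdvisor mode.
    # 4194 is an assumed value; adjust it to your cluster.
  - cadvisor_port: 4194
```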

OpenShift <3.7 support

cAdvisor port 4194 is disabled by default on OpenShift. To enable it, add the following lines to your node-config file:

kubeletArguments:
  cadvisor-port: ["4194"]

If you cannot open the port, disable both sources of container metric collection by setting:

  • cadvisor_port to 0
  • metrics_endpoint to ""
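Put together, the two settings above might appear in kubelet.d/conf.yaml as in this sketch. It assumes both options are set on the check instance; the rest of the file is left at its defaults.

```yaml
# conf.d/kubelet.d/conf.yaml (illustrative sketch)
init_config:

instances:
  - cadvisor_port: 0        # disables cAdvisor mode
    metrics_endpoint: ""    # disables the prometheus metrics endpoint
```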

The check can still collect:

  • kubelet health service checks
  • pod running/stopped metrics
  • pod limits and requests
  • node capacity metrics

Data Collected

Metrics

kubernetes.containers.last_state.terminated
(gauge)
The number of containers that were previously terminated
kubernetes.pods.running
(gauge)
The number of running pods
kubernetes.pods.expired
(gauge)
The number of expired pods the check ignored
kubernetes.containers.running
(gauge)
The number of running containers
kubernetes.containers.restarts
(gauge)
The number of times the container has been restarted
kubernetes.containers.state.terminated
(gauge)
The number of currently terminated containers
kubernetes.containers.state.waiting
(gauge)
The number of currently waiting containers
kubernetes.cpu.load.10s.avg
(gauge)
Container cpu load average over the last 10 seconds
kubernetes.cpu.system.total
(gauge)
The number of cores used for system time
Shown as core
kubernetes.cpu.user.total
(gauge)
The number of cores used for user time
Shown as core
kubernetes.cpu.cfs.periods
(gauge)
Number of elapsed enforcement period intervals
kubernetes.cpu.cfs.throttled.periods
(gauge)
Number of throttled period intervals
kubernetes.cpu.cfs.throttled.seconds
(gauge)
Total time duration the container has been throttled
kubernetes.cpu.capacity
(gauge)
The number of cores in this machine (available until kubernetes v1.18)
Shown as core
kubernetes.cpu.usage.total
(gauge)
The number of cores used
Shown as nanocore
kubernetes.cpu.limits
(gauge)
The limit of cpu cores set
Shown as core
kubernetes.cpu.requests
(gauge)
The requested cpu cores
Shown as core
kubernetes.filesystem.usage
(gauge)
The amount of disk used
Shown as byte
kubernetes.filesystem.usage_pct
(gauge)
The percentage of disk used
Shown as fraction
kubernetes.io.read_bytes
(gauge)
The amount of bytes read from the disk
Shown as byte
kubernetes.io.write_bytes
(gauge)
The amount of bytes written to the disk
Shown as byte
kubernetes.memory.capacity
(gauge)
The amount of memory (in bytes) in this machine (available until kubernetes v1.18)
Shown as byte
kubernetes.memory.limits
(gauge)
The limit of memory set
Shown as byte
kubernetes.memory.sw_limit
(gauge)
The limit of swap space set
Shown as byte
kubernetes.memory.requests
(gauge)
The requested memory
Shown as byte
kubernetes.memory.usage
(gauge)
Current memory usage in bytes including all memory regardless of when it was accessed
Shown as byte
kubernetes.memory.working_set
(gauge)
Current working set in bytes - this is what the OOM killer is watching for
Shown as byte
kubernetes.memory.cache
(gauge)
The amount of memory that is being used to cache data from disk (e.g. memory contents that can be associated precisely with a block on a block device)
Shown as byte
kubernetes.memory.rss
(gauge)
Size of RSS in bytes
Shown as byte
kubernetes.memory.swap
(gauge)
The amount of swap currently used by processes in this cgroup
Shown as byte
kubernetes.memory.usage_pct
(gauge)
The percentage of memory used per pod (memory limit must be set)
Shown as fraction
kubernetes.memory.sw_in_use
(gauge)
The percentage of swap space used
Shown as fraction
kubernetes.network.rx_bytes
(gauge)
The amount of bytes per second received
Shown as byte
kubernetes.network.rx_dropped
(gauge)
The amount of rx packets dropped per second
Shown as packet
kubernetes.network.rx_errors
(gauge)
The amount of rx errors per second
Shown as error
kubernetes.network.tx_bytes
(gauge)
The amount of bytes per second transmitted
Shown as byte
kubernetes.network.tx_dropped
(gauge)
The amount of tx packets dropped per second
Shown as packet
kubernetes.network.tx_errors
(gauge)
The amount of tx errors per second
Shown as error
kubernetes.diskio.io_service_bytes.stats.total
(gauge)
The amount of disk space the container uses
Shown as byte
kubernetes.apiserver.certificate.expiration.count
(gauge)
The count of remaining lifetime on the certificate used to authenticate a request
Shown as second
kubernetes.apiserver.certificate.expiration.sum
(gauge)
The sum of remaining lifetime on the certificate used to authenticate a request
Shown as second
kubernetes.rest.client.requests
(gauge)
The number of HTTP requests
Shown as operation
kubernetes.rest.client.latency.count
(gauge)
The count of request latency in seconds broken down by verb and URL
kubernetes.rest.client.latency.sum
(gauge)
The sum of request latency in seconds broken down by verb and URL
Shown as second
kubernetes.kubelet.pleg.discard_events
(count)
The number of discard events in PLEG
kubernetes.kubelet.pleg.last_seen
(gauge)
Timestamp in seconds when PLEG was last seen active
Shown as second
kubernetes.kubelet.pleg.relist_duration.count
(gauge)
The count of relisting pods in PLEG
kubernetes.kubelet.pleg.relist_duration.sum
(gauge)
The sum of duration in seconds for relisting pods in PLEG
Shown as second
kubernetes.kubelet.pleg.relist_interval.count
(gauge)
The count of relisting pods in PLEG
Shown as second
kubernetes.kubelet.pleg.relist_interval.sum
(gauge)
The sum of interval in seconds between relisting in PLEG
kubernetes.kubelet.runtime.operations
(count)
The number of runtime operations
Shown as operation
kubernetes.kubelet.runtime.errors
(gauge)
Cumulative number of runtime operations errors
Shown as operation
kubernetes.kubelet.runtime.operations.duration.sum
(gauge)
The sum of duration of operations
Shown as operation
kubernetes.kubelet.runtime.operations.duration.count
(gauge)
The count of operations
kubernetes.kubelet.network_plugin.latency.sum
(gauge)
The sum of latency in microseconds of network plugin operations
Shown as microsecond
kubernetes.kubelet.network_plugin.latency.count
(gauge)
The count of network plugin operations by latency
kubernetes.kubelet.network_plugin.latency.quantile
(gauge)
The quantiles of network plugin operations by latency
kubernetes.kubelet.volume.stats.available_bytes
(gauge)
The number of available bytes in the volume
Shown as byte
kubernetes.kubelet.volume.stats.capacity_bytes
(gauge)
The capacity in bytes of the volume
Shown as byte
kubernetes.kubelet.volume.stats.used_bytes
(gauge)
The number of used bytes in the volume
Shown as byte
kubernetes.kubelet.volume.stats.inodes
(gauge)
The maximum number of inodes in the volume
Shown as inode
kubernetes.kubelet.volume.stats.inodes_free
(gauge)
The number of free inodes in the volume
Shown as inode
kubernetes.kubelet.volume.stats.inodes_used
(gauge)
The number of used inodes in the volume
Shown as inode
kubernetes.ephemeral_storage.limits
(gauge)
Ephemeral storage limit of the container (requires kubernetes v1.8+)
Shown as byte
kubernetes.ephemeral_storage.requests
(gauge)
Ephemeral storage request of the container (requires kubernetes v1.8+)
Shown as byte
kubernetes.ephemeral_storage.usage
(gauge)
Ephemeral storage usage of the pod
Shown as byte
kubernetes.kubelet.evictions
(count)
The number of pods that have been evicted from the kubelet (ALPHA in kubernetes v1.16)
kubernetes.kubelet.cpu.usage
(gauge)
The number of cores used by kubelet
Shown as nanocore
kubernetes.kubelet.memory.usage
(gauge)
Current kubelet memory usage in bytes
Shown as byte
kubernetes.kubelet.memory.rss
(gauge)
Size of kubelet RSS in bytes
Shown as byte
kubernetes.runtime.cpu.usage
(gauge)
The number of cores used by the runtime
Shown as nanocore
kubernetes.runtime.memory.usage
(gauge)
Current runtime memory usage in bytes
Shown as byte
kubernetes.runtime.memory.rss
(gauge)
Size of runtime RSS in bytes
Shown as byte
kubernetes.kubelet.container.log_filesystem.used_bytes
(gauge)
Bytes used by the container’s logs on the filesystem (requires kubernetes 1.14+)
Shown as byte
kubernetes.kubelet.pod.start.duration
(gauge)
Duration in microseconds for a single pod to go from pending to running
Shown as microsecond
kubernetes.kubelet.pod.worker.duration
(gauge)
Duration in microseconds to sync a single pod. Broken down by operation type: create, update, or sync
Shown as microsecond
kubernetes.kubelet.pod.worker.start.duration
(gauge)
Duration in microseconds from seeing a pod to starting a worker
Shown as microsecond
kubernetes.kubelet.docker.operations
(count)
The number of docker operations
Shown as operation
kubernetes.kubelet.docker.errors
(count)
The number of docker operations errors
Shown as operation
kubernetes.kubelet.docker.operations.duration.sum
(gauge)
The sum of duration of docker operations
Shown as operation
kubernetes.kubelet.docker.operations.duration.count
(gauge)
The count of docker operations
kubernetes.go_threads
(gauge)
Number of OS threads created
kubernetes.go_goroutines
(gauge)
Number of goroutines that currently exist
kubernetes.liveness_probe.success.total
(gauge)
Cumulative number of successful liveness probes for a container (ALPHA in kubernetes v1.15)
kubernetes.liveness_probe.failure.total
(gauge)
Cumulative number of failed liveness probes for a container (ALPHA in kubernetes v1.15)
kubernetes.readiness_probe.success.total
(gauge)
Cumulative number of successful readiness probes for a container (ALPHA in kubernetes v1.15)
kubernetes.readiness_probe.failure.total
(gauge)
Cumulative number of failed readiness probes for a container (ALPHA in kubernetes v1.15)
kubernetes.startup_probe.success.total
(gauge)
Cumulative number of successful startup probes for a container (ALPHA in kubernetes v1.15)
kubernetes.startup_probe.failure.total
(gauge)
Cumulative number of failed startup probes for a container (ALPHA in kubernetes v1.15)
kubernetes.node.filesystem.usage
(gauge)
The amount of disk used at node level
Shown as byte
kubernetes.node.filesystem.usage_pct
(gauge)
The percentage of disk space used at node level
Shown as fraction
kubernetes.node.image.filesystem.usage
(gauge)
The amount of disk used on image filesystem (node level)
Shown as byte
kubernetes.node.image.filesystem.usage_pct
(gauge)
The percentage of disk used (node level)
Shown as fraction

Service Checks

kubernetes.kubelet.check.ping

Returns CRITICAL if the Kubelet doesn’t respond to ping, OK otherwise.

Statuses: ok, critical

kubernetes.kubelet.check.docker

Returns CRITICAL if the Docker service doesn’t run on the Kubelet, OK otherwise.

Statuses: ok, critical

kubernetes.kubelet.check.syncloop

Returns CRITICAL if the syncloop health check is down, OK otherwise.

Statuses: ok, critical

kubernetes.kubelet.check

Returns CRITICAL if the overall Kubelet health check is down, OK otherwise.

Statuses: ok, critical

Excluded containers

To restrict the data collected to a subset of the containers deployed, set the DD_CONTAINER_EXCLUDE environment variable. Metrics from the containers matched by that variable are not collected.

Network metrics are reported at the pod level, so containers cannot be excluded from them by name or image name, since other containers can be part of the same pod. If DD_CONTAINER_EXCLUDE matches a namespace, pod-level metrics are not reported for pods in that namespace. If it matches only a container name or image name, pod-level metrics are still reported, even when the exclusion rules apply to some containers in the pod.
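The difference between the two kinds of rule can be sketched with a fragment of an Agent container spec in a Kubernetes manifest. It assumes the space-separated kube_namespace:/name:/image: regex filter syntax of the Agent's container discovery settings; the namespace and container names are hypothetical.

```yaml
# Fragment of the Agent container spec (illustrative sketch)
env:
  - name: DD_CONTAINER_EXCLUDE
    # kube_namespace rule: pods in "staging" report no container metrics
    # and no pod-level network metrics.
    # name rule: containers named "debug-sidecar" are excluded, but their
    # pods still report pod-level network metrics.
    value: "kube_namespace:^staging$ name:^debug-sidecar$"
```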

Troubleshooting

Need help? Contact Datadog support.
