This check monitors Karpenter through the Datadog Agent. For more information, see Karpenter monitoring.
Follow the instructions below to install and configure this check for an Agent running in your Kubernetes environment. For more information about configuration in containerized environments, see the Autodiscovery Integration Templates.
Starting from Agent release 7.50.0, the Karpenter check is included in the Datadog Agent package. No additional installation is needed in your environment.
This check uses OpenMetrics to collect metrics from the endpoint that Karpenter exposes. OpenMetrics-based checks require Python 3.
Make sure that Karpenter exposes Prometheus-formatted metrics in your cluster, and note the port they are served on. You can configure the port by following the instructions on the Metrics page in the Karpenter documentation. For the Agent to start collecting metrics, the Karpenter pods need to be annotated. For more information about annotations, see the Autodiscovery Integration Templates. You can find additional configuration options in the sample karpenter.d/conf.yaml.
Note: The listed metrics can only be collected if they are available. Some metrics are generated only when certain actions are performed. For example, the karpenter.nodes.terminated metric is exposed only after a node is terminated.
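Before annotating the pods, it can help to confirm which metrics the endpoint currently exposes. The following is a minimal sketch, not part of the integration: `exposed_metric_names`, `fetch_metrics`, and the `SAMPLE` payload are hypothetical names and data made up for illustration, and the parser handles only the simple `name{labels} value` line shape of the Prometheus text format.

```python
from urllib.request import urlopen


def exposed_metric_names(exposition_text: str) -> set:
    """Return the metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{label="v"} 1.0` or `name 1.0`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics payload (URL is supplied by the caller)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Made-up payload for illustration only; a real endpoint returns many more series.
SAMPLE = """\
# HELP karpenter_build_info Sample help text
# TYPE karpenter_build_info gauge
karpenter_build_info{version="x"} 1
karpenter_nodes_created{provisioner="default"} 3
"""

names = exposed_metric_names(SAMPLE)
# A metric that has not been generated yet is simply absent from the payload.
print("karpenter_nodes_terminated" in names)
```

Against a live cluster, you would pass the output of `fetch_metrics("http://<HOST>:<PORT>/metrics")` to `exposed_metric_names` instead of the sample text.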
The only parameter required for configuring the Karpenter check is:
openmetrics_endpoint: This parameter should be set to the location where the Prometheus-formatted metrics are exposed. The default port is 8000, but it can be configured using the METRICS_PORT environment variable. In containerized environments, %%host%% should be used for host autodetection.
apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "karpenter": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8000/metrics"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'controller'
# (...)
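If the metrics port differs from the default, the annotation value has to change with it. The sketch below shows one way to generate the annotation programmatically; `karpenter_check_annotation` is a hypothetical helper, and the `controller` container name is taken from the example above and may differ in your deployment.

```python
import json


def karpenter_check_annotation(port: int = 8000, container: str = "controller") -> dict:
    """Build the Autodiscovery check annotation for a Karpenter pod."""
    check_config = {
        "karpenter": {
            "init_config": {},
            "instances": [
                # %%host%% is left as a literal template variable for the Agent to resolve.
                {"openmetrics_endpoint": f"http://%%host%%:{port}/metrics"}
            ],
        }
    }
    key = f"ad.datadoghq.com/{container}.checks"
    return {key: json.dumps(check_config, indent=2)}


annotation = karpenter_check_annotation(port=8080)
for key, value in annotation.items():
    print(key)
    print(value)
```

Serializing with `json.dumps` avoids hand-editing the JSON embedded in the YAML block scalar, which is an easy place to introduce a stray comma or quote.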
Available for Agent versions >6.0
Karpenter logs can be collected from the different Karpenter pods through Kubernetes. Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.
See the Autodiscovery Integration Templates for guidance on applying the parameters below.
Parameter | Value |
---|---|
<LOG_CONFIG> | {"source": "karpenter", "service": "<SERVICE_NAME>"} |
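As an illustration of how the log configuration above might be applied, the sketch below builds a pod annotation from it. It assumes the Autodiscovery convention of an `ad.datadoghq.com/<CONTAINER_NAME>.logs` annotation holding a JSON list; `karpenter_logs_annotation`, the `controller` container name, and the `karpenter-prod` service name are hypothetical.

```python
import json


def karpenter_logs_annotation(service: str, container: str = "controller") -> dict:
    """Build a log-collection annotation from the <LOG_CONFIG> parameters."""
    # The logs annotation value is a JSON list with one entry per log configuration.
    log_config = [{"source": "karpenter", "service": service}]
    return {f"ad.datadoghq.com/{container}.logs": json.dumps(log_config)}


print(karpenter_logs_annotation("karpenter-prod"))
```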
Run the Agent’s status subcommand and look for karpenter under the Checks section.
karpenter.build_info (gauge) | A metric with a constant '1' value labeled by version from which Karpenter was built. |
karpenter.certwatcher.read.certificate.count (count) | The count of certificate reads Shown as read |
karpenter.certwatcher.read.certificate.errors.count (count) | The count of certificate read errors Shown as error |
karpenter.cloudprovider.batcher.batch.time_seconds.bucket (count) | The count of observations in the batching window histogram by upper_bound buckets |
karpenter.cloudprovider.batcher.batch.time_seconds.count (count) | The count of observations in the batching window histogram |
karpenter.cloudprovider.batcher.batch.time_seconds.sum (count) | The sum of the duration of the batching window per batcher Shown as second |
karpenter.cloudprovider.batcher.batch_size.bucket (count) | The count of observations in the request batch histogram by upper_bound buckets |
karpenter.cloudprovider.batcher.batch_size.count (count) | The count of observations in the request batch histogram |
karpenter.cloudprovider.batcher.batch_size.sum (count) | The sum of the size of the request batch per batcher |
karpenter.cloudprovider.duration_seconds.bucket (count) | The count of observations in the duration of cloud provider histogram by upper_bound buckets, method name, and provider |
karpenter.cloudprovider.duration_seconds.count (count) | The count of observations in the duration of cloud provider histogram |
karpenter.cloudprovider.duration_seconds.sum (count) | The sum of the duration of cloud provider method calls. Labeled by the controller Shown as second |
karpenter.cloudprovider.errors.count (count) | The count of errors returned from CloudProvider calls Shown as error |
karpenter.cloudprovider.instance.type.cpu_cores (gauge) | vCPU cores for a given instance type Shown as core |
karpenter.cloudprovider.instance.type.memory_bytes (gauge) | Memory, in bytes, for a given instance type Shown as byte |
karpenter.cloudprovider.instance.type.offering_available (gauge) | Instance type offering availability, based on instance type, capacity type, and zone |
karpenter.cloudprovider.instance.type.price_estimate (gauge) | Estimated hourly price used when making informed decisions on node cost calculation. This is updated once on startup and then every 12 hours |
karpenter.cluster_state.node_count (gauge) | Current count of nodes in cluster state. Shown as node |
karpenter.cluster_state.synced (gauge) | Returns 1 if cluster state is synced and 0 otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter's cluster state |
karpenter.consistency.errors (gauge) | Number of consistency checks that have failed Shown as error |
karpenter.controller.runtime.active_workers (gauge) | Number of currently used workers per controller Shown as worker |
karpenter.controller.runtime.max.concurrent_reconciles (gauge) | Maximum number of concurrent reconciles per controller |
karpenter.controller.runtime.reconcile.count (count) | The count of reconciliations per controller |
karpenter.controller.runtime.reconcile.time_seconds.bucket (count) | The count of observations in the reconciliation per controller histogram by upper_bound buckets |
karpenter.controller.runtime.reconcile.time_seconds.count (count) | The count of observations in the reconciliation per controller histogram |
karpenter.controller.runtime.reconcile.time_seconds.sum (count) | The sum of time per reconciliation per controller Shown as second |
karpenter.controller.runtime.reconcile_errors.count (count) | The count of reconciliation errors per controller Shown as error |
karpenter.deprovisioning.actions_performed.count (count) | The count of deprovisioning actions performed. Labeled by deprovisioner Shown as execution |
karpenter.deprovisioning.consolidation_timeouts (gauge) | Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type Shown as timeout |
karpenter.deprovisioning.eligible_machines (gauge) | Number of machines eligible for deprovisioning by Karpenter. Labeled by deprovisioner |
karpenter.deprovisioning.evaluation.duration_seconds.bucket (count) | The count of observations in the deprovisioning evaluation histogram by upper_bound buckets |
karpenter.deprovisioning.evaluation.duration_seconds.count (count) | The count of observations in the deprovisioning evaluation histogram |
karpenter.deprovisioning.evaluation.duration_seconds.sum (count) | The sum of the duration of the deprovisioning evaluation process in seconds Shown as second |
karpenter.deprovisioning.replacement.machine.initialized_seconds.bucket (count) | The count of observations in the replacement machine histogram by upper_bound buckets |
karpenter.deprovisioning.replacement.machine.initialized_seconds.count (count) | The count of observations in the replacement machine histogram |
karpenter.deprovisioning.replacement.machine.initialized_seconds.sum (count) | The sum of the time required for a replacement machine to become initialized Shown as second |
karpenter.deprovisioning.replacement.machine.launch.failure_counter.count (count) | The count of times that Karpenter failed to launch a replacement node for deprovisioning. Labeled by deprovisioner Shown as attempt |
karpenter.disruption.actions_performed.count (count) | The count of disruption actions performed. Labeled by disruption method Shown as execution |
karpenter.disruption.budgets.allowed_disruptions (gauge) | The number of nodes for a given NodePool that can be disrupted at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point. Shown as node |
karpenter.disruption.consolidation_timeouts.count (count) | The count of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type Shown as timeout |
karpenter.disruption.eligible_nodes (gauge) | Number of nodes eligible for disruption by Karpenter. Labeled by disruption method Shown as node |
karpenter.disruption.evaluation.duration_seconds.bucket (count) | The count of observations in the disruption evaluation histogram by upper_bound buckets |
karpenter.disruption.evaluation.duration_seconds.count (count) | The count of observations in the disruption evaluation histogram |
karpenter.disruption.evaluation.duration_seconds.sum (count) | The sum of the duration of the disruption evaluation process in seconds Shown as second |
karpenter.disruption.nodes.disrupted.count (count) | Total number of nodes disrupted. Labeled by NodePool, disruption action, method, and consolidation type. Shown as node |
karpenter.disruption.pods.disrupted.count (count) | Total number of reschedulable pods disrupted on nodes. Labeled by NodePool, disruption action, method, and consolidation type. |
karpenter.disruption.queue_depth (gauge) | The number of commands that are being waited on in the disruption orchestration queue. Shown as command |
karpenter.disruption.replacement.nodeclaim.failures.count (count) | The number of times that Karpenter failed to launch a replacement node for disruption. Labeled by disruption method Shown as attempt |
karpenter.disruption.replacement.nodeclaim.initialized_seconds.bucket (count) | The count of observations in the replacement nodeclaim histogram by upper_bound buckets |
karpenter.disruption.replacement.nodeclaim.initialized_seconds.count (count) | The count of observations in the replacement nodeclaim histogram |
karpenter.disruption.replacement.nodeclaim.initialized_seconds.sum (count) | The sum of the amount of time required for a replacement nodeclaim to become initialized Shown as second |
karpenter.go.gc.duration_seconds.count (count) | The summary count of garbage collection cycles in the Karpenter instance |
karpenter.go.gc.duration_seconds.quantile (gauge) | The pause duration of garbage collection cycles in the Karpenter instance by quantile |
karpenter.go.gc.duration_seconds.sum (count) | The sum of the pause duration of garbage collection cycles in the Karpenter instance Shown as second |
karpenter.go.memstats.alloc_bytes (gauge) | Number of bytes allocated and still in use Shown as byte |
karpenter.go.memstats.alloc_bytes.count (count) | Count of bytes allocated, even if freed. Shown as byte |
karpenter.go.memstats.buck.hash.sys_bytes (gauge) | Number of bytes used by the profiling bucket hash table Shown as byte |
karpenter.go.memstats.frees.count (count) | The count of frees |
karpenter.go.memstats.gc.sys_bytes (gauge) | Number of bytes used for garbage collection system metadata Shown as byte |
karpenter.go.memstats.heap.alloc_bytes (gauge) | Number of heap bytes allocated and still in use Shown as byte |
karpenter.go.memstats.heap.idle_bytes (gauge) | Number of heap bytes waiting to be used Shown as byte |
karpenter.go.memstats.heap.inuse_bytes (gauge) | Number of heap bytes that are in use Shown as byte |
karpenter.go.memstats.heap.objects (gauge) | Number of allocated objects Shown as object |
karpenter.go.memstats.heap.released_bytes (gauge) | Number of heap bytes released to OS Shown as byte |
karpenter.go.memstats.heap.sys_bytes (gauge) | Number of heap bytes obtained from system Shown as byte |
karpenter.go.memstats.last.gc.time_seconds (gauge) | Number of seconds since 1970 of last garbage collection Shown as second |
karpenter.go.memstats.lookups.count (count) | The count of pointer lookups |
karpenter.go.memstats.mallocs.count (count) | The count of mallocs |
karpenter.go.memstats.mcache.inuse_bytes (gauge) | Number of bytes in use by mcache structures Shown as byte |
karpenter.go.memstats.mcache.sys_bytes (gauge) | Number of bytes used for mcache structures obtained from system Shown as byte |
karpenter.go.memstats.mspan.inuse_bytes (gauge) | Number of bytes in use by mspan structures Shown as byte |
karpenter.go.memstats.mspan.sys_bytes (gauge) | Number of bytes used for mspan structures obtained from system Shown as byte |
karpenter.go.memstats.next.gc_bytes (gauge) | Number of heap bytes when next garbage collection will take place Shown as byte |
karpenter.go.memstats.other.sys_bytes (gauge) | Number of bytes used for other system allocations Shown as byte |
karpenter.go.memstats.stack.inuse_bytes (gauge) | Number of bytes in use by the stack allocator Shown as byte |
karpenter.go.memstats.stack.sys_bytes (gauge) | Number of bytes obtained from system for stack allocator Shown as byte |
karpenter.go.memstats.sys_bytes (gauge) | Number of bytes obtained from system Shown as byte |
karpenter.go_goroutines (gauge) | Number of goroutines that currently exist |
karpenter.go_info (gauge) | Information about the Go environment |
karpenter.go_threads (gauge) | Number of OS threads created Shown as thread |
karpenter.interruption.actions_performed.count (count) | The count of notification actions performed. Labeled by action Shown as execution |
karpenter.interruption.deleted_messages.count (count) | The count of messages deleted from the SQS queue Shown as message |
karpenter.interruption.message.latency.time_seconds.bucket (count) | The count of observations in the interruption message latency histogram by upper_bound buckets |
karpenter.interruption.message.latency.time_seconds.count (count) | The count of observations in the interruption message latency histogram |
karpenter.interruption.message.latency.time_seconds.sum (count) | The sum of the length of time between message creation in queue and an action taken on the message by the controller Shown as second |
karpenter.interruption.received_messages.count (count) | The count of messages received from the SQS queue. Broken down by message type and whether the message was actionable Shown as message |
karpenter.leader_election.master_status (gauge) | Whether the reporting system is master of the relevant lease: 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. |
karpenter.machines_created.count (count) | The count of machines created in total by Karpenter. Labeled by reason the machine was created and the owning provisioner |
karpenter.machines_disrupted.count (count) | The count of machines disrupted in total by Karpenter. Labeled by disruption type of the machine and the owning provisioner |
karpenter.machines_drifted.count (count) | The count of machine drifted reasons in total by Karpenter. Labeled by drift type of the machine and the owning provisioner |
karpenter.machines_initialized.count (count) | The count of machines initialized in total by Karpenter. Labeled by the owning provisioner |
karpenter.machines_launched.count (count) | The count of machines launched in total by Karpenter. Labeled by the owning provisioner |
karpenter.machines_registered.count (count) | The count of machines registered in total by Karpenter. Labeled by the owning provisioner |
karpenter.machines_terminated.count (count) | The count of machines terminated in total by Karpenter. Labeled by reason the machine was terminated and the owning provisioner |
karpenter.nodeclaims_created (gauge) | Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool |
karpenter.nodeclaims_disrupted (gauge) | Number of nodeclaims disrupted in total by Karpenter. Labeled by disruption type of the nodeclaim and the owning nodepool |
karpenter.nodeclaims_drifted (gauge) | Number of nodeclaims drifted reasons in total by Karpenter. Labeled by drift type of the nodeclaim and the owning nodepool |
karpenter.nodeclaims_initialized (gauge) | Number of nodeclaims initialized in total by Karpenter. Labeled by the owning nodepool |
karpenter.nodeclaims_launched (gauge) | Number of nodeclaims launched in total by Karpenter. Labeled by the owning nodepool |
karpenter.nodeclaims_registered (gauge) | Number of nodeclaims registered in total by Karpenter. Labeled by the owning nodepool |
karpenter.nodeclaims_terminated (gauge) | Number of nodeclaims terminated in total by Karpenter. Labeled by reason the nodeclaim was terminated and the owning nodepool |
karpenter.nodepool_limit (gauge) | The nodepool limits are the limits specified on the provisioner that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type |
karpenter.nodepool_usage (gauge) | The nodepool usage is the amount of resources that have been provisioned by a particular nodepool. Labeled by nodepool name and resource type |
karpenter.nodes.allocatable (gauge) | The amount of resources allocatable by nodes |
karpenter.nodes.created.count (count) | The count of nodes created in total by Karpenter. Labeled by owning provisioner Shown as node |
karpenter.nodes.eviction.queue_depth (gauge) | The number of pods that are waiting for a successful eviction in the eviction queue. |
karpenter.nodes.leases_deleted.count (count) | The count of deleted leaked leases |
karpenter.nodes.system_overhead (gauge) | The resources reserved for system overhead: the difference between the node's capacity and allocatable values, as reported by the status. |
karpenter.nodes.terminated.count (count) | The count of nodes terminated in total by Karpenter. Labeled by owning provisioner Shown as node |
karpenter.nodes.termination.time_seconds.count (count) | The count of observations in the nodes termination time seconds summary |
karpenter.nodes.termination.time_seconds.quantile (gauge) | The time taken between a node's deletion request and the removal of its finalizer by quantile |
karpenter.nodes.termination.time_seconds.sum (count) | The sum of the time taken between a node's deletion request and the removal of its finalizer Shown as second |
karpenter.nodes.total.daemon_limits (gauge) | Total resources specified by DaemonSet pod limits |
karpenter.nodes.total.daemon_requests (gauge) | Total resources requested by DaemonSet pods |
karpenter.nodes.total.pod_limits (gauge) | Total pod resources specified by non-DaemonSet pod limits |
karpenter.nodes.total.pod_requests (gauge) | Total pod resources requested by non-DaemonSet pods bound |
karpenter.pods.startup.time_seconds.count (count) | The count of the observations in the pod startup summary |
karpenter.pods.startup.time_seconds.quantile (gauge) | The time taken between pod creation and the pod being in a running state by quantile |
karpenter.pods.startup.time_seconds.sum (count) | The sum of the time from pod creation and the pod being in a running state Shown as second |
karpenter.pods.state (gauge) | Pod state is the current state of pods. This metric can be used several ways as it is labeled by the pod name, namespace, owner, node, provisioner name, zone, architecture, capacity type, instance type and pod phase. |
karpenter.process.cpu_seconds.count (count) | Total user and system CPU time spent in seconds Shown as second |
karpenter.process.max_fds (gauge) | Maximum number of open file descriptors |
karpenter.process.open_fds (gauge) | Number of open file descriptors |
karpenter.process.resident.memory_bytes (gauge) | Resident memory size in bytes Shown as byte |
karpenter.process.start.time_seconds (gauge) | Start time of the process since unix epoch in seconds Shown as second |
karpenter.process.virtual.memory.max_bytes (gauge) | Maximum amount of virtual memory available in bytes Shown as byte |
karpenter.process.virtual.memory_bytes (gauge) | Virtual memory size in bytes Shown as byte |
karpenter.provisioner.limit (gauge) | The limits specified on the provisioner that restrict the quantity of resources provisioned. Labeled by provisioner name and resource type |
karpenter.provisioner.scheduling.duration_seconds.bucket (count) | The count of observations in the provisioner scheduling histogram by upper_bound buckets |
karpenter.provisioner.scheduling.duration_seconds.count (count) | The count of observations in the provisioner scheduling histogram |
karpenter.provisioner.scheduling.duration_seconds.sum (count) | The sum of the duration of scheduling process in seconds. Broken down by provisioner and error Shown as second |
karpenter.provisioner.scheduling.queue_depth (gauge) | The number of pods that are waiting to be scheduled. |
karpenter.provisioner.scheduling.simulation.duration_seconds.bucket (count) | The count of observations in the provisioner scheduling simulation histogram by upper_bound buckets |
karpenter.provisioner.scheduling.simulation.duration_seconds.count (count) | The count of observations in the provisioner scheduling simulation histogram |
karpenter.provisioner.scheduling.simulation.duration_seconds.sum (count) | The sum of the duration of scheduling simulations used for deprovisioning and provisioning in seconds Shown as second |
karpenter.provisioner.usage (gauge) | The amount of resources that have been provisioned by a particular provisioner. Labeled by provisioner name and resource type |
karpenter.provisioner.usage.pct (gauge) | The percentage of each resource used based on the resources provisioned and the limits that have been configured in the range [0,100]. Labeled by provisioner name and resource type Shown as percent |
karpenter.rest.client_requests.count (count) | Count of HTTP requests, partitioned by status code, method, and host. Shown as request |
karpenter.workqueue.longest.running.processor_seconds (gauge) | The number of seconds the longest-running processor for the workqueue has been running Shown as second |
karpenter.workqueue.queue.duration_seconds.bucket (count) | The count of observations in the workqueue queue duration histogram by upper_bound buckets |
karpenter.workqueue.queue.duration_seconds.count (count) | The count of observations in the workqueue queue duration histogram |
karpenter.workqueue.queue.duration_seconds.sum (count) | The sum of the duration of how long in seconds an item stays in workqueue before being requested Shown as second |
karpenter.workqueue.unfinished.work_seconds (gauge) | The number of seconds of work that is in progress and hasn't yet been observed by work_duration. Large values indicate stuck threads. The number of stuck threads can be deduced from the rate at which this value increases |
karpenter.workqueue.work.duration_seconds.bucket (count) | The count of observations in the workqueue work duration histogram by upper_bound buckets |
karpenter.workqueue.work.duration_seconds.count (count) | The count of observations in the workqueue work duration histogram |
karpenter.workqueue.work.duration_seconds.sum (count) | The sum of the time, in seconds, spent processing items from the workqueue Shown as second |
karpenter.workqueue_adds.count (count) | The count of adds handled by workqueue |
karpenter.workqueue_depth (gauge) | Current depth of workqueue |
karpenter.workqueue_retries.count (count) | The count of retries handled by workqueue Shown as attempt |
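Many of the metrics above come in `.sum` and `.count` pairs. As a rough sketch of how such a pair can be combined, dividing the sum by the count over the same time window gives the average value per observation, for example the mean duration of a cloud provider call from karpenter.cloudprovider.duration_seconds.sum and karpenter.cloudprovider.duration_seconds.count. The function name and the numbers below are hypothetical.

```python
from typing import Optional


def mean_per_observation(sum_delta: float, count_delta: float) -> Optional[float]:
    """Average value per observation over an interval.

    Both arguments must cover the same time window. Returns None when
    nothing was observed, since the mean is undefined in that case.
    """
    if count_delta <= 0:
        return None
    return sum_delta / count_delta


# Hypothetical interval: 4 calls that took 12 seconds in total.
print(mean_per_observation(12.0, 4))  # 3.0
```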
The Karpenter integration does not include any events.
karpenter.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Karpenter OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical
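The rule behind this service check can be sketched as follows. This is an illustration of the described behavior, not the Agent's implementation: `openmetrics_health` is a hypothetical function, and the `fetch` callable stands in for whatever performs the HTTP request to the endpoint.

```python
from typing import Callable


def openmetrics_health(fetch: Callable[[], str]) -> str:
    """Map the outcome of a connection attempt to a service check status."""
    try:
        fetch()
    except OSError:
        # Connection failures (refused, timed out, unreachable) derive from OSError.
        return "critical"
    return "ok"


def refused() -> str:
    """Stand-in fetcher that simulates an unreachable endpoint."""
    raise ConnectionRefusedError("endpoint unreachable")


print(openmetrics_health(lambda: "karpenter_build_info 1"))  # ok
print(openmetrics_health(refused))  # critical
```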
Need help? Contact Datadog support.