This check submits metrics exposed by the NVIDIA DCGM Exporter in Datadog Agent format. For more information on NVIDIA Data Center GPU Manager (DCGM), see NVIDIA DCGM.
Starting with Agent release 7.47.0, the DCGM check is included in the Datadog Agent package. However, you need to spin up the DCGM Exporter container to expose the GPU metrics in order for the Agent to collect this data. Because the default counters are not sufficient, Datadog recommends using the following DCGM configuration to cover the same ground as the NVML integration, as well as other useful metrics. This integration is fully supported by Agent 7.59+; some telemetry may not be available on older Agent versions.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge,   Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL,      gauge, Decoder utilization (in %).

# Errors and violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.

# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# NVLink
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

# DCP metrics
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge,   Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge,   The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge,   The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge,   Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge,   Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge,   Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge,   Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge,   Ratio of cycles the fp16 pipes are active (in %).

# Datadog additional recommended fields
DCGM_FI_DEV_COUNT,                  counter, Number of Devices on the node.
DCGM_FI_DEV_FAN_SPEED,              gauge,   Fan speed for the device in percent 0-100.
DCGM_FI_DEV_SLOWDOWN_TEMP,          gauge,   Slowdown temperature for the device.
DCGM_FI_DEV_POWER_MGMT_LIMIT,       gauge,   Current power limit for the device.
DCGM_FI_DEV_PSTATE,                 gauge,   Performance state (P-State) 0-15. 0=highest
DCGM_FI_DEV_FB_TOTAL,               gauge,
DCGM_FI_DEV_FB_RESERVED,            gauge,
DCGM_FI_DEV_FB_USED_PERCENT,        gauge,
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, gauge,   Current clock throttle reasons (bitmask of DCGM_CLOCKS_THROTTLE_REASON_*)
DCGM_FI_PROCESS_NAME,               label,   The Process Name.
DCGM_FI_CUDA_DRIVER_VERSION,        label,
DCGM_FI_DEV_NAME,                   label,
DCGM_FI_DEV_MINOR_NUMBER,           label,
DCGM_FI_DRIVER_VERSION,             label,
DCGM_FI_DEV_BRAND,                  label,
DCGM_FI_DEV_SERIAL,                 label,
To configure the exporter in a Docker environment:
Create the file $PWD/default-counters.csv, which contains the default fields from NVIDIA's etc/default-counters.csv as well as the additional Datadog-recommended fields. To add more fields for collection, follow these instructions. For the complete list of fields, see the DCGM API reference manual.
Run the Docker container using the following command:
sudo docker run --pid=host --privileged \
  -e DCGM_EXPORTER_INTERVAL=5000 \
  --gpus all -d \
  -v /proc:/proc \
  -v $PWD/default-counters.csv:/etc/dcgm-exporter/default-counters.csv \
  -p 9400:9400 \
  --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
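Once the container is running, a quick sanity check (assuming the port mapping in the command above) is to query the exporter locally and confirm that DCGM_FI_* series are returned:

# Sanity check: the exporter should serve Prometheus-format DCGM_FI_* metrics on port 9400.
curl -s http://localhost:9400/metrics | head -n 20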
The DCGM exporter can quickly be installed in a Kubernetes environment using the NVIDIA DCGM Exporter Helm chart. The instructions below are derived from the template provided by NVIDIA here.
Add the NVIDIA DCGM Exporter Helm repository and ensure it is up to date:
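For example, using the repository name and URL published in NVIDIA's DCGM Exporter documentation (verify them against the current NVIDIA docs if your setup differs):

# Add the NVIDIA DCGM Exporter chart repository and refresh the local index.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update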
Create a ConfigMap containing the Datadog-recommended metrics from the Installation section, as well as the Role and RoleBinding used by the DCGM pods to retrieve the ConfigMap, using the manifest below:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-read-datadog-cm
  namespace: default
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["datadog-dcgm-exporter-configmap"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-exporter-datadog
  namespace: default
subjects:
- kind: ServiceAccount
  name: dcgm-datadog-dcgm-exporter
  namespace: default
roleRef:
  kind: Role
  name: dcgm-exporter-read-datadog-cm
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-dcgm-exporter-configmap
  namespace: default
data:
  metrics: |
    # Copy the content from the Installation section.
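Save the manifest above to a file and apply it; the filename dcgm-datadog-rbac.yaml below is only an example:

# Apply the Role, RoleBinding, and ConfigMap in the default namespace.
kubectl apply -f dcgm-datadog-rbac.yaml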
Create your DCGM Exporter Helm chart values file dcgm-values.yaml with the following content:
# Exposing more metrics than the default for additional monitoring - this requires the use of a dedicated ConfigMap
# for which the Kubernetes ServiceAccount used by the exporter has access thanks to step 1.
# Ref: https://github.com/NVIDIA/dcgm-exporter/blob/e55ec750def325f9f1fdbd0a6f98c932672002e4/deployment/values.yaml#L38
arguments: ["-m", "default:datadog-dcgm-exporter-configmap"]

# Datadog Autodiscovery V2 annotations
podAnnotations:
  ad.datadoghq.com/exporter.checks: |-
    {
      "dcgm": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:9400/metrics"
          }
        ]
      }
    }

# Optional - Disabling the ServiceMonitor which requires Prometheus CRD - can be re-enabled if Prometheus CRDs are installed in your cluster
serviceMonitor:
  enabled: false
Install the DCGM Exporter Helm chart in the default namespace with the following command, run from the directory containing your dcgm-values.yaml:
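For example, assuming the release name dcgm-datadog so that the generated ServiceAccount matches the dcgm-datadog-dcgm-exporter subject used in the RoleBinding above:

# Install the DCGM Exporter chart with the custom values file.
helm install dcgm-datadog gpu-helm-charts/dcgm-exporter -f dcgm-values.yaml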
The DCGM exporter can be installed in a Kubernetes environment by using the NVIDIA GPU Operator. The instructions below are derived from the template provided by NVIDIA here.
Add the NVIDIA GPU Operator Helm repository and ensure it is up to date:
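For example, using the repository name and URL from NVIDIA's GPU Operator documentation (verify them against the current NVIDIA docs):

# Add the NVIDIA chart repository that hosts the GPU Operator and refresh the local index.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update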
Fetch the metrics file and save as dcgm-metrics.csv: curl https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv > dcgm-metrics.csv
Edit the metrics file, replacing its content with the Datadog-recommended configuration from the Installation section.
Create a namespace gpu-operator if one is not already present: kubectl create namespace gpu-operator.
Create a ConfigMap using the file edited above: kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
Create your GPU Operator Helm chart dcgm-values.yaml with the following content:
# Refer to NVIDIA documentation for the driver and toolkit for your GPU-enabled nodes - example below for Amazon Linux 2 g5.xlarge
driver:
  enabled: true
toolkit:
  version: v1.13.5-centos7

# Using custom metrics configuration to collect recommended Datadog additional metrics - requires the creation of the metrics-config ConfigMap from the previous step
# Ref: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#custom-metrics-config
dcgmExporter:
  config:
    name: metrics-config
  env:
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcgm-metrics.csv

# Adding Datadog Autodiscovery V2 annotations
daemonsets:
  annotations:
    ad.datadoghq.com/nvidia-dcgm-exporter.checks: |-
      {
        "dcgm": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9400/metrics"
            }
          ]
        }
      }
Install the GPU Operator Helm chart in the gpu-operator namespace with the following command, run from the directory containing your dcgm-values.yaml:
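For example, where datadog-dcgm-gpu-operator is an arbitrary release name:

# Install the GPU Operator in the gpu-operator namespace with the custom values file.
helm install datadog-dcgm-gpu-operator -n gpu-operator nvidia/gpu-operator -f dcgm-values.yaml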
Edit the dcgm.d/conf.yaml file (located in the conf.d/ folder at the root of your Agent’s configuration directory) to start collecting your GPU Metrics. See the sample dcgm.d/conf.yaml for all available configuration options.
instances:
## @param openmetrics_endpoint - string - required
## The URL exposing metrics in the OpenMetrics format.
##
## Set this to <listenAddress>/<handlerPath> as configured in your DCGM Server
#
- openmetrics_endpoint: http://localhost:9400/metrics
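After updating the configuration, restart the Agent for the change to take effect. For example, on a Linux host managed by systemd (adjust for your platform):

# Restart the Agent, then confirm the dcgm check is running and reporting no errors.
sudo systemctl restart datadog-agent
sudo datadog-agent status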
Note: If you followed the instructions for the DCGM Exporter Helm chart or GPU Operator, the annotations are already applied to the pods and the instructions below can be ignored.
To configure this check for an Agent running on Kubernetes:
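As an illustration, if you deploy the exporter pod yourself rather than through the Helm chart or GPU Operator, you can apply the Autodiscovery V2 annotation from the earlier sections directly to the exporter pod. The pod spec below is a minimal sketch; the pod and container names are placeholders and must match your own deployment:

apiVersion: v1
kind: Pod
metadata:
  name: dcgm-exporter
  annotations:
    # The annotation key must reference the container name (here: dcgm-exporter).
    ad.datadoghq.com/dcgm-exporter.checks: |-
      {
        "dcgm": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9400/metrics"
            }
          ]
        }
      }
spec:
  containers:
    - name: dcgm-exporter
      image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
      ports:
        - containerPort: 9400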
The out-of-the-box monitors that come with this integration use default values for their alert thresholds. For example, the GPU temperature threshold is based on an acceptable range for industrial devices.
However, Datadog recommends that you check to make sure these values suit your particular needs.
If you have added some metrics that don’t appear in the metadata.csv above but appear in your account with the format DCGM_FI_DEV_NEW_METRIC, remap these metrics in the dcgm.d/conf.yaml configuration file:
## @param extra_metrics - (list of string or mapping) - optional
## This list defines metrics to collect from the `openmetrics_endpoint`, in addition to
## what the check collects by default. If the check already collects a metric, then
## metric definitions here take precedence. Metrics may be defined in 3 ways:
...
The example below appends the part in NEW_METRIC to the namespace (dcgm.), giving dcgm.new_metric:
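A minimal sketch of such a remapping in dcgm.d/conf.yaml (DCGM_FI_DEV_NEW_METRIC is a placeholder for your own field):

instances:
  - openmetrics_endpoint: http://localhost:9400/metrics
    extra_metrics:
      # Maps the exposed DCGM_FI_DEV_NEW_METRIC field to dcgm.new_metric in Datadog.
      - DCGM_FI_DEV_NEW_METRIC: new_metric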
If a field is not being collected even after enabling it in default-counters.csv and performing a curl request to host:9400/metrics, the dcgm-exporter developers recommend checking the log file at /var/log/nv-hostengine.log.
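If the exporter runs as the Docker container from the setup above, that log can be read from inside the container (the container name dcgm-exporter matches the docker run command earlier):

# Inspect the nv-hostengine log inside the running exporter container.
sudo docker exec dcgm-exporter cat /var/log/nv-hostengine.log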
Note: The dcgm-exporter is a thin wrapper around lower-level libraries and drivers which do the actual reporting.