Nvidia NVML


Supported OS: Linux, Windows, macOS

Integration version: 1.0.9

Overview

This check monitors NVIDIA Management Library (NVML) metrics exposed through the Datadog Agent and can correlate them with the exposed Kubernetes devices.

Setup

The NVML check is not included in the Datadog Agent package, so you need to install it.

Installation

For Agent v7.21+ / v6.21+, follow the instructions below to install the NVML check on your host. See Use Community Integrations to install with the Docker Agent or earlier versions of the Agent.

Run the following command to install the Agent integration:

For Linux:

datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml

For Windows (in PowerShell run as administrator):

& "$env:ProgramFiles\Datadog\Datadog Agent\bin\agent.exe" integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
& "$env:ProgramFiles\Datadog\Datadog Agent\embedded3\python" -m pip install grpcio pynvml

Configure your integration like a core integration.

If you are using Docker, there is an example Dockerfile in the NVML repository.

docker build -t dd-agent-nvml .

If you are using Docker and Kubernetes, you need to expose the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. See the included Dockerfile for an example.
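As a rough illustration of passing these variables at run time, the sketch below starts the image built above with both variables set. The values `all` and `utility`, the `--gpus all` flag, and the `run_nvml_agent` function name are assumptions for illustration (they presume the NVIDIA Container Toolkit is installed on the host); the Dockerfile in the NVML repository remains the reference.

```shell
# Sketch only: start the dd-agent-nvml image with the NVIDIA variables exposed.
# Assumes the NVIDIA Container Toolkit is installed and DD_API_KEY is exported.
# Set DOCKER=echo to print the command instead of running it (dry run).
run_nvml_agent() {
  # Abort with a message if the API key is not set.
  : "${DD_API_KEY:?export DD_API_KEY before calling run_nvml_agent}"
  "${DOCKER:-docker}" run -d --name dd-agent-nvml \
    --gpus all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=utility \
    -e DD_API_KEY \
    dd-agent-nvml
}
```

Passing `-e DD_API_KEY` without a value forwards the variable from the calling environment, which keeps the key out of the command line.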

To correlate reserved Kubernetes NVIDIA devices with the Kubernetes pod using the device, mount the Unix domain socket /var/lib/kubelet/pod-resources/kubelet.sock into your Agent's configuration. More information about this socket is available on the Kubernetes website. Note: Support for this device is in beta for version 1.15.
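One way to confirm the mount worked is to check, from inside the Agent container, that the path exists and is a Unix socket. The helper name `pod_resources_socket_present` is hypothetical; only the socket path comes from the text above.

```shell
# Sketch only: succeed if the given path is a Unix domain socket.
# With no argument, checks the kubelet pod-resources socket path.
pod_resources_socket_present() {
  [ -S "${1:-/var/lib/kubelet/pod-resources/kubelet.sock}" ]
}

# Example, run inside the Agent container:
#   pod_resources_socket_present && echo "socket mounted" || echo "socket missing"
```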

Configuration

Edit the nvml.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your NVML performance data. See the sample nvml.d/conf.yaml for all available configuration options.
Restart the Agent.
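The two steps above could be sketched as follows on a Linux host. The function name `write_nvml_conf`, the default directory /etc/datadog-agent/conf.d, and the 15-second `min_collection_interval` are assumptions for illustration; the sample nvml.d/conf.yaml is the authoritative list of options.

```shell
# Sketch only: write a minimal nvml.d/conf.yaml under a conf.d directory.
# The first argument overrides the assumed default Linux location.
write_nvml_conf() {
  conf_dir="${1:-/etc/datadog-agent/conf.d}"
  mkdir -p "$conf_dir/nvml.d" || return 1
  cat > "$conf_dir/nvml.d/conf.yaml" <<'EOF'
init_config:

instances:
  - min_collection_interval: 15
EOF
}

# Example (needs write access to the Agent configuration directory):
#   write_nvml_conf
#   sudo systemctl restart datadog-agent
```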

Validation

Run the Agent's status subcommand and look for nvml under the Checks section.
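To avoid scanning the full status output by eye, a small filter such as the one below can help. The helper name `nvml_status_section` and the default of 10 context lines are arbitrary choices for illustration; the exact layout of the status output may vary between Agent versions.

```shell
# Sketch only: print each line mentioning nvml plus N following lines
# (default 10) from the text on stdin. Fails if nvml is not mentioned.
nvml_status_section() {
  grep -A "${1:-10}" 'nvml'
}

# Example, on a host with the Agent installed:
#   sudo datadog-agent status | nvml_status_section
```

A non-zero exit status means no line mentioned nvml, which usually indicates the check is not installed or not configured.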

Data Collected

Metrics

nvml.device_count (gauge)	Number of GPUs on this instance.
nvml.gpu_utilization (gauge)	Percent of time over the past sample period during which one or more kernels were executing on the GPU. Shown as percent.
nvml.mem_copy_utilization (gauge)	Percent of time over the past sample period during which global (device) memory was being read or written. Shown as percent.
nvml.fb_free (gauge)	Unallocated FB memory. Shown as byte.
nvml.fb_used (gauge)	Allocated FB memory. Shown as byte.
nvml.fb_total (gauge)	Total installed FB memory. Shown as byte.
nvml.power_usage (gauge)	Power usage, in milliwatts, for this GPU and its associated circuitry (e.g. memory).
nvml.total_energy_consumption (count)	Total energy consumption for this GPU, in millijoules (mJ), since the driver was last reloaded.
nvml.enc_utilization (gauge)	The current utilization for the encoder. Shown as percent.
nvml.dec_utilization (gauge)	The current utilization for the decoder. Shown as percent.
nvml.pcie_tx_throughput (gauge)	PCIe TX utilization. Shown as kibibyte.
nvml.pcie_rx_throughput (gauge)	PCIe RX utilization. Shown as kibibyte.
nvml.temperature (gauge)	Current temperature for this GPU, in degrees Celsius.
nvml.fan_speed (gauge)	The current utilization for the fan. Shown as percent.
nvml.compute_running_process (gauge)	The current usage of GPU memory by process. Shown as byte.

Authoritative metric documentation is available on the NVIDIA website.

Where possible, metric names have been chosen to match NVIDIA's Data Center GPU Manager (DCGM) exporter.
