Nvidia NVML

Supported OS Linux Windows Mac OS

Integration version 1.0.9

Overview

This check monitors metrics exposed by the NVIDIA Management Library (NVML) through the Datadog Agent and can correlate them with the exposed Kubernetes devices.

Setup

The NVML check is not included in the Datadog Agent package, so you need to install it yourself.

Installation

For Agent v7.21+ / v6.21+, follow the instructions below to install the NVML check on your host. To install it with the Docker Agent or with earlier Agent versions, see Use Community Integrations.

Run the following command to install the Agent integration.

For Linux:

datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml
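
On Linux, one way to confirm that the package landed in the Agent's Python environment is to look for it in the output of the Agent's integration freeze subcommand. The sketch below assumes that output uses pip-style name==version lines; the helper name nvml_installed_version is hypothetical and not part of the Agent.

```shell
# Hypothetical helper: reads pip-style "name==version" lines on stdin and
# prints the version of the datadog-nvml package, if it is listed.
nvml_installed_version() {
  sed -n 's/^datadog-nvml==//p'
}
```

For example, `datadog-agent integration freeze | nvml_installed_version` would print the installed version, or nothing if the package is missing.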

For Windows (using PowerShell run as administrator):

& "$env:ProgramFiles\Datadog\Datadog Agent\bin\agent.exe" integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
& "$env:ProgramFiles\Datadog\Datadog Agent\embedded3\python" -m pip install grpcio pynvml

Configure the integration the same way as a core integration.

If you are using Docker, there is an example Dockerfile in the NVML repository.

docker build -t dd-agent-nvml .

If you are using Docker and Kubernetes, you need to expose the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. See the included Dockerfile for an example.

To correlate reserved Kubernetes NVIDIA devices with the Kubernetes pod using them, mount the Unix domain socket /var/lib/kubelet/pod-resources/kubelet.sock into your Agent's configuration. More information about this socket is available on the Kubernetes website. Note that this socket is in beta support for Kubernetes version 1.15.
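
Once the socket is mounted, a quick check from inside the Agent container can confirm that the path exists and is a Unix domain socket. This is a minimal sketch; the function name kubelet_socket_mounted is illustrative and not part of the Agent.

```shell
# Illustrative helper: succeeds only if the given path (default: the
# pod-resources socket named above) exists and is a Unix domain socket.
kubelet_socket_mounted() {
  [ -S "${1:-/var/lib/kubelet/pod-resources/kubelet.sock}" ]
}

if kubelet_socket_mounted; then
  echo "pod-resources socket found"
else
  echo "pod-resources socket not found"
fi
```

If the check reports that the socket is not found, the mount is the first thing to revisit before looking at the check's own configuration.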

Configuration

To start collecting NVML performance data, edit the nvml.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory. See the sample nvml.d/conf.yaml for all available configuration options.
Restart the Agent.
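
The two steps above can be sketched for a Linux host as follows. The function name write_nvml_conf is hypothetical, and min_collection_interval is a generic Agent instance option used here only as a placeholder; consult the sample nvml.d/conf.yaml for the options the check actually supports.

```shell
# Hypothetical helper: writes a minimal nvml.d/conf.yaml with a single
# instance under the given Agent configuration root.
write_nvml_conf() {
  conf_root="$1"
  mkdir -p "$conf_root/conf.d/nvml.d" || return 1
  cat > "$conf_root/conf.d/nvml.d/conf.yaml" <<'EOF'
init_config:

instances:
  - min_collection_interval: 15
EOF
}
```

Assuming the default configuration root and a systemd-managed Agent, this might be run as `write_nvml_conf /etc/datadog-agent` by a user who can write to that directory, followed by `sudo systemctl restart datadog-agent`. The resulting file must be readable by the dd-agent user.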

Validation

Run the Agent's status subcommand and look for nvml under the Checks section.
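
For scripted validation, the status output can be searched for the check name. This is a minimal sketch that assumes the check is listed under the name nvml in the status text; the helper name nvml_check_listed is hypothetical.

```shell
# Hypothetical helper: reads Agent status text on stdin and succeeds if
# the word "nvml" appears anywhere in it.
nvml_check_listed() {
  grep -qw 'nvml'
}
```

On a Linux host this could be used as `sudo datadog-agent status | nvml_check_listed && echo "nvml check is listed"`. A match only shows that the check is loaded; the status output itself still needs to be read for instance errors.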

Data Collected

Metrics

nvml.device_count (gauge)	Number of GPUs on this instance.
nvml.gpu_utilization (gauge)	Percent of time over the past sample period during which one or more kernels were executing on the GPU. Shown as percent
nvml.mem_copy_utilization (gauge)	Percent of time over the past sample period during which global (device) memory was being read or written. Shown as percent
nvml.fb_free (gauge)	Unallocated FB memory. Shown as byte
nvml.fb_used (gauge)	Allocated FB memory. Shown as byte
nvml.fb_total (gauge)	Total installed FB memory. Shown as byte
nvml.power_usage (gauge)	Power usage for this GPU and its associated circuitry (e.g. memory), in milliwatts.
nvml.total_energy_consumption (count)	Total energy consumption for this GPU in millijoules (mJ) since the driver was last reloaded.
nvml.enc_utilization (gauge)	The current utilization of the encoder. Shown as percent
nvml.dec_utilization (gauge)	The current utilization of the decoder. Shown as percent
nvml.pcie_tx_throughput (gauge)	PCIe TX throughput. Shown as kibibyte
nvml.pcie_rx_throughput (gauge)	PCIe RX throughput. Shown as kibibyte
nvml.temperature (gauge)	Current temperature for this GPU in degrees Celsius.
nvml.fan_speed (gauge)	The current utilization of the fan. Shown as percent
nvml.compute_running_process (gauge)	The current usage of GPU memory by process. Shown as byte

The authoritative metric documentation is on the NVIDIA website.

Where possible, metric names are matched to those of NVIDIA's Data Center GPU Manager (DCGM) exporter.
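
Some of these metrics use units that differ from what tools such as nvidia-smi print by default: the check reports power in milliwatts and framebuffer memory in bytes. When comparing values by hand, small conversion helpers like the following can help. The function names are illustrative.

```shell
# Illustrative helpers for comparing metric values by hand.

# Convert milliwatts (the unit of nvml.power_usage) to watts.
mw_to_w() {
  awk -v mw="$1" 'BEGIN { printf "%.1f\n", mw / 1000 }'
}

# Convert bytes (the unit of nvml.fb_used, nvml.fb_free, nvml.fb_total) to MiB.
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.0f\n", b / 1048576 }'
}

mw_to_w 150000           # prints 150.0
bytes_to_mib 1073741824  # prints 1024
```

The converted values can then be compared against, for example, `nvidia-smi --query-gpu=power.draw,memory.used --format=csv`, which reports watts and MiB.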

Events

NVML does not include any events.

Service Checks

Troubleshooting

Need help? Contact Datadog support.