Google Cloud TPU

Overview

Google Cloud TPU products make Tensor Processing Units (TPUs) available through scalable, easy-to-use cloud computing resources. ML researchers, ML engineers, developers, and data scientists can all use them to run cutting-edge machine learning (ML) models.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud TPU.

Setup

Installation

If you haven't already, set up the Google Cloud Platform integration first. There are no other installation steps.

Log collection

Google Cloud TPU logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Cloud TPU logs from Google Cloud Logging to the Pub/Sub topic:

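1. Go to the Google Cloud Logging page and filter the Google Cloud TPU logs.
2. Click Create Export and name the sink.
3. Choose “Cloud Pub/Sub” as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
4. Click Create and wait for the confirmation message to show up.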
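
The export steps above amount to defining a log sink with three pieces: a name, a filter, and a Pub/Sub destination. The following Python sketch only assembles that definition as a plain dictionary, using the `name`, `destination`, and `filter` fields of a Cloud Logging sink; it does not call any Google Cloud API. The project ID, topic ID, and sink name are hypothetical placeholders, and the filter `resource.type="tpu_worker"` is an assumption about how TPU logs are labeled, so verify it against your own logs before relying on it.

```python
import json

# Assumption: TPU worker logs carry the "tpu_worker" resource type.
# Check this in the Logs Explorer before using it in a real sink.
TPU_LOG_FILTER = 'resource.type="tpu_worker"'


def pubsub_destination(topic_project_id: str, topic_id: str) -> str:
    """Build the sink destination string for a Pub/Sub topic.

    The topic may live in a different project than the logs,
    so the topic's project ID is passed explicitly.
    """
    return f"pubsub.googleapis.com/projects/{topic_project_id}/topics/{topic_id}"


def build_sink(sink_name: str, topic_project_id: str, topic_id: str,
               log_filter: str = TPU_LOG_FILTER) -> dict:
    """Assemble a sink definition: name, destination, and filter."""
    return {
        "name": sink_name,
        "destination": pubsub_destination(topic_project_id, topic_id),
        "filter": log_filter,
    }


# Hypothetical names, for illustration only.
sink = build_sink(
    sink_name="tpu-logs-to-datadog",
    topic_project_id="my-logging-project",
    topic_id="export-logs-to-datadog",
)
print(json.dumps(sink, indent=2))
```

Keeping the topic's project separate from the sink name reflects the note in step 3 that the Pub/Sub topic can be located in a different project.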

Data Collected

Metrics

gcp.tpu.cpu.utilization (gauge)	Utilization of CPUs on the TPU Worker as a percent. Shown as percent
gcp.tpu.memory.usage (gauge)	Memory usage in bytes. Shown as byte
gcp.tpu.network.received_bytes_count (count)	Cumulative bytes of data this server has received over the network. Shown as byte
gcp.tpu.network.sent_bytes_count (count)	Cumulative bytes of data this server has sent over the network. Shown as byte
gcp.tpu.accelerator.duty_cycle (count)	Percentage of time over the sample period during which the accelerator was actively processing. Shown as percent
gcp.tpu.instance.uptime_total (count)	Elapsed time since the VM was started, in seconds. Shown as second
gcp.gke.node.accelerator.tensorcore_utilization (count)	Current percentage of the Tensorcore that is utilized. Shown as percent
gcp.gke.node.accelerator.duty_cycle (count)	Percent of time over the past sample period (10s) during which the accelerator was actively processing. Shown as percent
gcp.gke.node.accelerator.memory_used (count)	Total accelerator memory allocated in bytes. Shown as byte
gcp.gke.node.accelerator.memory_total (count)	Total accelerator memory in bytes. Shown as byte
gcp.gke.node.accelerator.memory_bandwidth_utilization (count)	Current percentage of the accelerator memory bandwidth that is being used. Shown as percent
gcp.gke.container.accelerator.tensorcore_utilization (count)	Current percentage of the Tensorcore that is utilized. Shown as percent
gcp.gke.container.accelerator.duty_cycle (count)	Percent of time over the past sample period (10s) during which the accelerator was actively processing. Shown as percent
gcp.gke.container.accelerator.memory_used (count)	Total accelerator memory allocated in bytes. Shown as byte
gcp.gke.container.accelerator.memory_total (count)	Total accelerator memory in bytes. Shown as byte
gcp.gke.container.accelerator.memory_bandwidth_utilization (count)	Current percentage of the accelerator memory bandwidth that is being used. Shown as percent
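
The `memory_used` and `memory_total` metrics are both reported in bytes, so a utilization percentage has to be derived from the pair. The sketch below shows that arithmetic on two sample values; the function name and the sample numbers are hypothetical, and in practice the values would come from a query against the metrics listed above.

```python
def memory_utilization_percent(memory_used_bytes: float,
                               memory_total_bytes: float) -> float:
    """Return accelerator memory utilization as a percentage.

    Both arguments are byte values, such as samples of
    gcp.gke.node.accelerator.memory_used and
    gcp.gke.node.accelerator.memory_total taken at the same time.
    """
    if memory_total_bytes <= 0:
        raise ValueError("memory_total_bytes must be positive")
    return 100.0 * memory_used_bytes / memory_total_bytes


GIB = 1024 ** 3

# Hypothetical sample: 12 GiB allocated out of 16 GiB total.
utilization = memory_utilization_percent(12 * GIB, 16 * GIB)
print(f"{utilization:.1f}%")  # prints "75.0%"
```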

Events

The Google Cloud TPU integration does not include any events.

Service Checks

The Google Cloud TPU integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.