Google Cloud TPU

Overview

Google Cloud TPU makes the benefits of Tensor Processing Units (TPUs) available to ML researchers, ML engineers, developers, and data scientists running cutting-edge ML models, through scalable and easy-to-use cloud computing resources.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud TPU.

Setup

Installation

If you haven't already, set up the Google Cloud Platform integration first. There are no other installation steps.

Log collection

Google Cloud TPU logs are collected with Google Cloud Logging and sent to a Cloud Pub/Sub topic through an HTTP Push forwarder. If you haven't already, set up a Cloud Pub/Sub topic with an HTTP Push forwarder.

Once this is done, export your Google Cloud TPU logs from Google Cloud Logging to the Pub/Sub topic:

1. Go to the Google Cloud Logging page and filter the Google Cloud TPU logs.
2. Click Create Export and name the sink.
3. Choose Cloud Pub/Sub as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
4. Click Create and wait for the confirmation message to show up.
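
The export steps above amount to defining a sink with a name, a destination, and a filter. The sketch below assembles those three fields in Python to show how they fit together. The sink name, project ID, and topic ID are hypothetical, and the `resource.type="tpu_worker"` filter is an assumption; check it against the logs you actually see in Google Cloud Logging before relying on it.

```python
import json


def pubsub_destination(project_id: str, topic_id: str) -> str:
    """Return the sink destination string for a Pub/Sub topic."""
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


def build_tpu_log_sink(sink_name: str, project_id: str, topic_id: str,
                       resource_type: str = "tpu_worker") -> dict:
    """Assemble the name, destination, and filter of a log sink.

    The default resource type is an assumption; adjust it to match your logs.
    """
    return {
        "name": sink_name,
        "destination": pubsub_destination(project_id, topic_id),
        "filter": f'resource.type="{resource_type}"',
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    sink = build_tpu_log_sink(
        "tpu-logs-to-datadog", "my-project", "export-logs-to-datadog"
    )
    print(json.dumps(sink, indent=2))
```

Because the destination string carries the project ID of the topic, it can point to a topic in a different project than the one producing the logs.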

Data collected

Metrics

gcp.tpu.cpu.utilization (gauge)	Utilization of CPUs on the TPU Worker as a percent. Shown as percent
gcp.tpu.memory.usage (gauge)	Memory usage in bytes. Shown as byte
gcp.tpu.network.received_bytes_count (count)	Cumulative bytes of data this server has received over the network. Shown as byte
gcp.tpu.network.sent_bytes_count (count)	Cumulative bytes of data this server has sent over the network. Shown as byte
gcp.tpu.accelerator.duty_cycle (count)	Percentage of time over the sample period during which the accelerator was actively processing. Shown as percent
gcp.tpu.instance.uptime_total (count)	Elapsed time since the VM was started, in seconds. Shown as second
gcp.gke.node.accelerator.tensorcore_utilization (count)	Current percentage of the Tensorcore that is utilized. Shown as percent
gcp.gke.node.accelerator.duty_cycle (count)	Percent of time over the past sample period (10s) during which the accelerator was actively processing. Shown as percent
gcp.gke.node.accelerator.memory_used (count)	Total accelerator memory allocated in bytes. Shown as byte
gcp.gke.node.accelerator.memory_total (count)	Total accelerator memory in bytes. Shown as byte
gcp.gke.node.accelerator.memory_bandwidth_utilization (count)	Current percentage of the accelerator memory bandwidth that is being used. Shown as percent
gcp.gke.container.accelerator.tensorcore_utilization (count)	Current percentage of the Tensorcore that is utilized. Shown as percent
gcp.gke.container.accelerator.duty_cycle (count)	Percent of time over the past sample period (10s) during which the accelerator was actively processing. Shown as percent
gcp.gke.container.accelerator.memory_used (count)	Total accelerator memory allocated in bytes. Shown as byte
gcp.gke.container.accelerator.memory_total (count)	Total accelerator memory in bytes. Shown as byte
gcp.gke.container.accelerator.memory_bandwidth_utilization (count)	Current percentage of the accelerator memory bandwidth that is being used. Shown as percent
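
Since the `memory_used` and `memory_total` accelerator metrics are both reported in bytes, a memory utilization percentage can be derived from them. The helper below is a minimal sketch of that arithmetic; the function name and the sample values are hypothetical and not part of the integration.

```python
def memory_utilization_percent(used_bytes: float, total_bytes: float) -> float:
    """Return accelerator memory utilization as a percentage of total memory."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


if __name__ == "__main__":
    # Hypothetical sample: 12 GiB allocated out of 16 GiB.
    gib = 1024 ** 3
    print(memory_utilization_percent(12 * gib, 16 * gib))  # 75.0
```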

Events

The Google Cloud TPU integration does not include any events.

Service checks

The Google Cloud TPU integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.