Google Cloud TPU

Overview

Google Cloud TPU products make the benefits of Tensor Processing Units (TPUs) available through a scalable and easy-to-use cloud computing resource for ML researchers, ML engineers, developers, and data scientists running cutting-edge ML models.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud TPU.

Setup

Installation

To use Google Cloud TPU, you only need to set up the Google Cloud Platform integration.

Log collection

Google Cloud TPU logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven’t already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Cloud TPU logs from Google Cloud Logging to the Pub/Sub topic:

  1. Go to the Google Cloud Logging page and filter the Google Cloud TPU logs.
  2. Click Create Export and name the sink.
  3. Choose “Cloud Pub/Sub” as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
  4. Click Create and wait for the confirmation message to show up.
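The export steps above can also be expressed as a sink definition. The following is a minimal sketch, not an official Datadog or Google Cloud snippet: the project ID, topic ID, and sink name are hypothetical placeholders, and the filter assumes that TPU worker logs are reported under the `tpu_worker` resource type, which you should verify in your own Google Cloud Logging view.

```python
# Hypothetical placeholder values; replace them with your own.
PROJECT_ID = "my-project"
TOPIC_ID = "export-logs-to-datadog"
SINK_NAME = "tpu-logs-to-datadog"

# Assumption: TPU worker logs carry the "tpu_worker" resource type.
TPU_LOG_FILTER = 'resource.type="tpu_worker"'


def pubsub_destination(project_id: str, topic_id: str) -> str:
    """Return the sink destination string for a Cloud Pub/Sub topic."""
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


def build_sink(name: str, project_id: str, topic_id: str, log_filter: str) -> dict:
    """Assemble a sink definition with the name, destination, and filter fields."""
    return {
        "name": name,
        "destination": pubsub_destination(project_id, topic_id),
        "filter": log_filter,
    }


sink = build_sink(SINK_NAME, PROJECT_ID, TOPIC_ID, TPU_LOG_FILTER)
print(sink)
```

Because the destination is a full resource path, the topic's project can differ from the project that owns the sink, as mentioned in step 3.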

Data Collected

Metrics

gcp.tpu.cpu.utilization
(gauge)
Utilization of CPUs on the TPU Worker as a percent.
Shown as percent
gcp.tpu.memory.usage
(gauge)
Memory usage in bytes.
Shown as byte
gcp.tpu.network.received_bytes_count
(count)
Cumulative bytes of data this server has received over the network.
Shown as byte
gcp.tpu.network.sent_bytes_count
(count)
Cumulative bytes of data this server has sent over the network.
Shown as byte
gcp.tpu.accelerator.duty_cycle
(count)
Percentage of time over the sample period during which the accelerator was actively processing.
Shown as percent
gcp.tpu.instance.uptime_total
(count)
Elapsed time since the VM was started, in seconds.
Shown as second
gcp.gke.node.accelerator.tensorcore_utilization
(count)
Current percentage of the TensorCore that is utilized.
Shown as percent
gcp.gke.node.accelerator.duty_cycle
(count)
Percent of time over the past sample period (10s) during which the accelerator was actively processing.
Shown as percent
gcp.gke.node.accelerator.memory_used
(count)
Total accelerator memory allocated in bytes.
Shown as byte
gcp.gke.node.accelerator.memory_total
(count)
Total accelerator memory in bytes.
Shown as byte
gcp.gke.node.accelerator.memory_bandwidth_utilization
(count)
Current percentage of the accelerator memory bandwidth that is being used.
Shown as percent
gcp.gke.container.accelerator.tensorcore_utilization
(count)
Current percentage of the TensorCore that is utilized.
Shown as percent
gcp.gke.container.accelerator.duty_cycle
(count)
Percent of time over the past sample period (10s) during which the accelerator was actively processing.
Shown as percent
gcp.gke.container.accelerator.memory_used
(count)
Total accelerator memory allocated in bytes.
Shown as byte
gcp.gke.container.accelerator.memory_total
(count)
Total accelerator memory in bytes.
Shown as byte
gcp.gke.container.accelerator.memory_bandwidth_utilization
(count)
Current percentage of the accelerator memory bandwidth that is being used.
Shown as percent
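
The `memory_used` and `memory_total` metrics above are both reported in bytes, so a memory utilization percentage can be derived from them. The following is a minimal sketch of that arithmetic; the function name and the sample values are hypothetical and are not part of the integration.

```python
def memory_utilization_percent(memory_used_bytes: float, memory_total_bytes: float) -> float:
    """Return accelerator memory utilization as a percentage of total memory."""
    if memory_total_bytes <= 0:
        raise ValueError("memory_total_bytes must be positive")
    return 100.0 * memory_used_bytes / memory_total_bytes


# Hypothetical sample: 12 GiB allocated out of 16 GiB total.
used = 12 * 1024**3
total = 16 * 1024**3
pct = memory_utilization_percent(used, total)
print(f"{pct:.1f}%")  # prints "75.0%"
```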

Events

The Google Cloud TPU integration does not include any events.

Service Checks

The Google Cloud TPU integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.
