AWS Inferentia and AWS Trainium Monitoring

Supported OS: Linux, Windows, macOS

Integration version: 2.0.1

Overview

This check monitors AWS Neuron through the Datadog Agent. It enables monitoring of the Inferentia and Trainium devices and delivers insights into your machine learning model’s performance.

Setup

Follow the instructions below to install and configure this check for an Agent running on an EC2 instance. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The AWS Neuron check is included in the Datadog Agent package.

You also need to install the AWS Neuron Tools package.

Beyond that, no additional installation is needed on your server.

Configuration

Metrics

  1. Ensure that Neuron Monitor is running and exposing its metrics through a Prometheus endpoint.

  2. Edit the aws_neuron.d/conf.yaml file, which is located in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your AWS Neuron performance data. See the sample aws_neuron.d/conf.yaml for all available configuration options.

  3. Restart the Agent.
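
As a rough illustration of step 2, a minimal aws_neuron.d/conf.yaml might look like the sketch below. The endpoint value is an assumption: it presumes the Neuron Monitor Prometheus endpoint is reachable on localhost port 8000, so adjust the host and port to match your setup. The sample aws_neuron.d/conf.yaml remains the authoritative list of options.

```yaml
# aws_neuron.d/conf.yaml (minimal sketch, not a complete configuration)
init_config:

instances:
    # Assumed URL of the Prometheus endpoint exposed by Neuron Monitor.
    # Replace the host and port with the values used in your environment.
  - openmetrics_endpoint: http://localhost:8000
```

Each entry under instances describes one endpoint to scrape, so additional endpoints can be listed as further items.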

Logs

The AWS Neuron integration can collect logs from the Neuron containers and forward them to Datadog.

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Uncomment and edit the logs configuration block in your aws_neuron.d/conf.yaml file. Here’s an example:

    logs:
      - type: docker
        source: aws_neuron
        service: aws_neuron
    

In Kubernetes environments, collecting logs is likewise disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.

Then, set Log Integrations as pod annotations. This can also be configured with a file, a configmap, or a key-value store. For more information, see the configuration section of Kubernetes Log Collection.
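
For illustration, a pod annotation carrying the same source and service values as the host example could look like the sketch below. The pod and container name neuron-app and the image placeholder are hypothetical; the container name in the annotation key must match the name of the container whose logs you want to collect.

```yaml
# Pod manifest fragment (sketch). "neuron-app" is a hypothetical container name.
apiVersion: v1
kind: Pod
metadata:
  name: neuron-app
  annotations:
    # The key format is ad.datadoghq.com/<CONTAINER_NAME>.logs
    ad.datadoghq.com/neuron-app.logs: '[{"source": "aws_neuron", "service": "aws_neuron"}]'
spec:
  containers:
    - name: neuron-app
      image: <YOUR_IMAGE>
```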

Validation

Run the Agent’s status subcommand and look for aws_neuron under the Checks section.

Data Collected

Metrics

aws_neuron.execution.errors.count
(count)
Execution errors total
aws_neuron.execution.errors_created
(gauge)
Execution errors total
aws_neuron.execution.latency_seconds
(gauge)
Execution latency in seconds
Shown as second
aws_neuron.execution.status.count
(count)
Execution status total
aws_neuron.execution.status_created
(gauge)
Execution status total
aws_neuron.hardware_ecc_events.count
(count)
Hardware ECC events total
aws_neuron.hardware_ecc_events_created
(gauge)
Hardware ECC events total
aws_neuron.instance_info
(gauge)
EC2 instance information
aws_neuron.neuron_hardware_info
(gauge)
Neuron Hardware Information
aws_neuron.neuron_runtime.memory_used_bytes
(gauge)
Runtime memory used bytes
Shown as byte
aws_neuron.neuron_runtime.vcpu_usage_ratio
(gauge)
Runtime vCPU utilization ratio
Shown as fraction
aws_neuron.neuroncore.memory_usage.constants
(gauge)
NeuronCore memory utilization for constants
Shown as byte
aws_neuron.neuroncore.memory_usage.model.code
(gauge)
NeuronCore memory utilization for model_code
Shown as byte
aws_neuron.neuroncore.memory_usage.model.shared_scratchpad
(gauge)
NeuronCore memory utilization for model_shared_scratchpad
Shown as byte
aws_neuron.neuroncore.memory_usage.runtime_memory
(gauge)
NeuronCore memory utilization for runtime_memory
Shown as byte
aws_neuron.neuroncore.memory_usage.tensors
(gauge)
NeuronCore memory utilization for tensors
Shown as byte
aws_neuron.neuroncore.utilization_ratio
(gauge)
NeuronCore utilization ratio
Shown as fraction
aws_neuron.process.cpu_seconds.count
(count)
Total user and system CPU time spent in seconds.
Shown as second
aws_neuron.process.max_fds
(gauge)
Maximum number of open file descriptors.
aws_neuron.process.open_fds
(gauge)
Number of open file descriptors.
aws_neuron.process.resident_memory_bytes
(gauge)
Resident memory size in bytes.
Shown as byte
aws_neuron.process.start_time_seconds
(gauge)
Start time of the process since unix epoch in seconds.
Shown as second
aws_neuron.process.virtual_memory_bytes
(gauge)
Virtual memory size in bytes.
Shown as byte
aws_neuron.python_gc.collections.count
(count)
Number of times this generation was collected
aws_neuron.python_gc.objects_collected.count
(count)
Objects collected during GC
aws_neuron.python_gc.objects_uncollectable.count
(count)
Uncollectable objects found during GC
aws_neuron.python_info
(gauge)
Python platform information
aws_neuron.system.memory.total_bytes
(gauge)
Total system memory in bytes
Shown as byte
aws_neuron.system.memory.used_bytes
(gauge)
Used system memory in bytes
Shown as byte
aws_neuron.system.swap.total_bytes
(gauge)
Total system swap in bytes
Shown as byte
aws_neuron.system.swap.used_bytes
(gauge)
Used system swap in bytes
Shown as byte
aws_neuron.system.vcpu.count
(gauge)
System vCPU count
aws_neuron.system.vcpu.usage_ratio
(gauge)
System CPU utilization ratio
Shown as fraction

Events

The AWS Neuron integration does not include any events.

Service Checks

aws_neuron.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Neuron Monitor OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical

Troubleshooting

In containerized environments, ensure that the Agent has network access to the endpoints specified in the aws_neuron.d/conf.yaml file.
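
In a containerized setup, localhost usually refers to the Agent's own container, so the endpoint typically has to point at the container that runs Neuron Monitor. The sketch below shows one way this could be expressed with Autodiscovery pod annotations. The container name neuron-monitor and port 8000 are assumptions for illustration; %%host%% is the Autodiscovery template variable that resolves to the container's IP address.

```yaml
# Pod metadata fragment (sketch). "neuron-monitor" is a hypothetical container name.
metadata:
  annotations:
    # The key format is ad.datadoghq.com/<CONTAINER_NAME>.checks
    ad.datadoghq.com/neuron-monitor.checks: |
      {
        "aws_neuron": {
          "init_config": {},
          "instances": [
            {"openmetrics_endpoint": "http://%%host%%:8000"}
          ]
        }
      }
```

If the service check aws_neuron.openmetrics.health still reports CRITICAL with a configuration like this, check that network policies or security groups allow the Agent to reach that port.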

Need help? Contact Datadog support.
