This check monitors AWS Neuron through the Datadog Agent. It enables monitoring of the Inferentia and Trainium devices and delivers insights into your machine learning model’s performance.
Follow the instructions below to install and configure this check for an Agent running on an EC2 instance. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
The AWS Neuron check is included in the Datadog Agent package, so no additional installation is needed on your server.
You also need to install the AWS Neuron Tools package, which provides Neuron Monitor.
Ensure that Neuron Monitor is being used to expose the Prometheus endpoint.
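As a minimal sketch, assuming the neuron-monitor and neuron-monitor-prometheus.py tools from the AWS Neuron Tools package and an assumed default port of 8000, the endpoint can be exposed like this:

# Pipe Neuron Monitor's output into the bundled Prometheus exporter.
# The port is an assumption; use whatever port your Agent is configured to scrape.
neuron-monitor | neuron-monitor-prometheus.py --port 8000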
Edit the aws_neuron.d/conf.yaml file, which is located in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your AWS Neuron performance data. See the sample aws_neuron.d/conf.yaml for all available configuration options.
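For reference, a minimal instance configuration could look like the following sketch. The endpoint URL is an assumption based on a Neuron Monitor Prometheus exporter listening locally on port 8000; point it at your actual exporter address.

instances:
  ## Assumed address of the Neuron Monitor Prometheus exporter; adjust host and port to your setup.
  - openmetrics_endpoint: http://localhost:8000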
The AWS Neuron integration can collect logs from the Neuron containers and forward them to Datadog.
Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
logs_enabled: true
Uncomment and edit the logs configuration block in your aws_neuron.d/conf.yaml file. Here’s an example:
logs:
  - type: docker
    source: aws_neuron
    service: aws_neuron
In containerized environments, collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.
Then, set Log Integrations as pod annotations. This can also be configured with a file, a configmap, or a key-value store. For more information, see the configuration section of Kubernetes Log Collection.
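For example, a pod annotation for log collection might look like the following sketch, where neuron-app is a hypothetical container name standing in for yours:

metadata:
  annotations:
    ## "neuron-app" is a placeholder; replace it with your container's name.
    ad.datadoghq.com/neuron-app.logs: '[{"source": "aws_neuron", "service": "aws_neuron"}]'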
Run the Agent’s status subcommand and look for aws_neuron under the Checks section.
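For example, on a Linux host:

sudo datadog-agent status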
aws_neuron.execution.errors.count (count) | Execution errors total |
aws_neuron.execution.errors_created (gauge) | Execution errors total |
aws_neuron.execution.latency_seconds (gauge) | Execution latency in seconds Shown as second |
aws_neuron.execution.status.count (count) | Execution status total |
aws_neuron.execution.status_created (gauge) | Execution status total |
aws_neuron.hardware_ecc_events.count (count) | Hardware ECC events total |
aws_neuron.hardware_ecc_events_created (gauge) | Hardware ECC events total |
aws_neuron.instance_info (gauge) | EC2 instance information |
aws_neuron.neuron_hardware_info (gauge) | Neuron Hardware Information |
aws_neuron.neuron_runtime.memory_used_bytes (gauge) | Runtime memory used bytes Shown as byte |
aws_neuron.neuron_runtime.vcpu_usage_ratio (gauge) | Runtime vCPU utilization ratio Shown as fraction |
aws_neuron.neuroncore.memory_usage.constants (gauge) | NeuronCore memory utilization for constants Shown as byte |
aws_neuron.neuroncore.memory_usage.model.code (gauge) | NeuronCore memory utilization for model_code Shown as byte |
aws_neuron.neuroncore.memory_usage.model.shared_scratchpad (gauge) | NeuronCore memory utilization for model_shared_scratchpad Shown as byte |
aws_neuron.neuroncore.memory_usage.runtime_memory (gauge) | NeuronCore memory utilization for runtime_memory Shown as byte |
aws_neuron.neuroncore.memory_usage.tensors (gauge) | NeuronCore memory utilization for tensors Shown as byte |
aws_neuron.neuroncore.utilization_ratio (gauge) | NeuronCore utilization ratio Shown as fraction |
aws_neuron.process.cpu_seconds.count (count) | Total user and system CPU time spent in seconds. Shown as second |
aws_neuron.process.max_fds (gauge) | Maximum number of open file descriptors. |
aws_neuron.process.open_fds (gauge) | Number of open file descriptors. |
aws_neuron.process.resident_memory_bytes (gauge) | Resident memory size in bytes. Shown as byte |
aws_neuron.process.start_time_seconds (gauge) | Start time of the process since unix epoch in seconds. Shown as second |
aws_neuron.process.virtual_memory_bytes (gauge) | Virtual memory size in bytes. Shown as byte |
aws_neuron.python_gc.collections.count (count) | Number of times this generation was collected |
aws_neuron.python_gc.objects_collected.count (count) | Objects collected during gc |
aws_neuron.python_gc.objects_uncollectable.count (count) | Uncollectable objects found during GC |
aws_neuron.python_info (gauge) | Python platform information |
aws_neuron.system.memory.total_bytes (gauge) | Total system memory in bytes Shown as byte |
aws_neuron.system.memory.used_bytes (gauge) | Used system memory in bytes Shown as byte |
aws_neuron.system.swap.total_bytes (gauge) | Total system swap in bytes Shown as byte |
aws_neuron.system.swap.used_bytes (gauge) | Used system swap in bytes Shown as byte |
aws_neuron.system.vcpu.count (gauge) | System vCPU count |
aws_neuron.system.vcpu.usage_ratio (gauge) | System CPU utilization ratio Shown as fraction |
The AWS Neuron integration does not include any events.
aws_neuron.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Neuron Monitor OpenMetrics endpoint; otherwise returns OK.
Statuses: ok, critical
In containerized environments, ensure that the Agent has network access to the endpoints specified in the aws_neuron.d/conf.yaml file.
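A quick way to check reachability from the Agent's host or pod is to query the exporter directly. The URL below is an assumption (a local exporter on port 8000); substitute the openmetrics_endpoint value from your configuration:

curl http://localhost:8000/metrics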
Need help? Contact Datadog support.