Integration version 3.1.0
Overview
This check monitors TorchServe through the Datadog Agent.
Setup
Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
Installation
Starting from Agent release 7.47.0, the TorchServe check is included in the Datadog Agent package. No additional installation is needed on your server.
This check uses OpenMetrics to collect metrics from the OpenMetrics endpoint that TorchServe can expose, which requires Python 3.
Prerequisites
The TorchServe check collects TorchServe's metrics and performance data using three different endpoints:
- The OpenMetrics endpoint
- The Inference API
- The Management API
You can configure these endpoints using the config.properties file, as described in the TorchServe documentation. For example:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus
number_of_netty_threads=32
default_workers_per_model=10
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
load_models=all
This configuration file exposes the three different endpoints that can be used by the integration to monitor your instance.
OpenMetrics endpoint
To enable the Prometheus endpoint, you need to configure two options:
- metrics_address: Metrics API binding address. Defaults to http://127.0.0.1:8082.
- metrics_mode: TorchServe supports two metric modes, log and prometheus. Defaults to log. Set this option to prometheus to collect metrics from this endpoint.
For instance:
metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus
In this case, the OpenMetrics endpoint is exposed at this URL: http://<TORCHSERVE_ADDRESS>:8082/metrics.
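The payload returned by this endpoint follows the Prometheus exposition format. As a rough illustration (the sample line below is hypothetical, not actual TorchServe output), each line can be split into a metric name, a set of labels, and a value:

```python
import re

# Illustrative sample line in Prometheus exposition format; the metric
# name and labels are assumptions for demonstration purposes only.
sample = 'ts_inference_requests_total{model_name="my_model",model_version="1.0"} 42.0'

def parse_metric(line):
    """Split a single exposition-format line into (name, labels, value)."""
    m = re.match(r'(\w+)\{(.*)\}\s+([\d.]+)', line)
    name, labels_raw, value = m.groups()
    labels = dict(re.findall(r'(\w+)="([^"]*)"', labels_raw))
    return name, labels, float(value)

print(parse_metric(sample))
```

This is only a sketch of the wire format; the integration itself handles parsing through its OpenMetrics base check.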
Configuration
These three different endpoints can be monitored independently and must be configured separately in the configuration file, one API per instance. See the sample torchserve.d/conf.yaml for all available configuration options.
Configuration options for the OpenMetrics endpoint can be found in the configuration file under the TorchServe OpenMetrics endpoint configuration section. The minimal configuration only requires the openmetrics_endpoint option:
init_config:
...
instances:
- openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
For more options, see the sample torchserve.d/conf.yaml file.
TorchServe allows custom service code to emit metrics that are made available based on the configured metrics_mode. You can configure this integration to collect these metrics using the extra_metrics option. These metrics have the torchserve.openmetrics prefix, just like any other metrics coming from this endpoint.
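For example, assuming a hypothetical custom metric named my_custom_torchserve_metric emitted by your handler code, the instance could look like:

```yaml
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
    extra_metrics:
      # Hypothetical custom metric name; replace with your own
      - my_custom_torchserve_metric
```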
These custom TorchServe metrics are considered standard metrics in Datadog.
This integration relies on the Inference API to get the overall status of your TorchServe instance. Configuration options for the Inference API can be found in the configuration file under the TorchServe Inference API endpoint configuration section. The minimal configuration only requires the inference_api_url option:
init_config:
...
instances:
- inference_api_url: http://<TORCHSERVE_ADDRESS>:8080
This integration leverages the Ping endpoint to collect the overall health status of your TorchServe server.
You can collect metrics related to the models that are currently running in your TorchServe server using the Management API. Configuration options for the Management API can be found in the configuration file under the TorchServe Management API endpoint configuration section. The minimal configuration only requires the management_api_url option:
init_config:
...
instances:
- management_api_url: http://<TORCHSERVE_ADDRESS>:8081
By default, the integration collects data from every model, up to 100 models. This can be modified using the limit, include, and exclude options. For example:
init_config:
...
instances:
- management_api_url: http://<TORCHSERVE_ADDRESS>:8081
limit: 25
include:
- my_model.*
This configuration only collects metrics for model names that match the my_model.* regular expression, up to 25 models.
You can also exclude some models:
init_config:
...
instances:
- management_api_url: http://<TORCHSERVE_ADDRESS>:8081
exclude:
- test.*
This configuration collects metrics for every model name that does not match the test.* regular expression, up to 100 models.
You can use the `include` and `exclude` options in the same configuration. The `exclude` filters are applied after the `include` ones.
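The combined filtering behavior can be sketched as follows (a hypothetical illustration of the semantics, not the integration's actual code): include patterns are applied first, then exclude patterns, then the limit.

```python
import re

def filter_models(models, include=None, exclude=None, limit=100):
    """Sketch of include/exclude/limit filtering semantics."""
    selected = models
    if include:
        # Keep only model names matching at least one include pattern
        selected = [m for m in selected if any(re.match(p, m) for p in include)]
    if exclude:
        # Exclude filters are applied after the include ones
        selected = [m for m in selected if not any(re.match(p, m) for p in exclude)]
    return selected[:limit]

models = ["my_model_a", "my_model_b-test", "other"]
print(filter_models(models, include=["my_model.*"], exclude=[".*-test"]))
# → ['my_model_a']
```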
By default, the integration retrieves the full list of models every time the check runs. You can cache this list using the interval option to improve the performance of this check. Note that using the `interval` option can also delay some metrics and events.
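For example, to refresh the cached model list at most once per hour:

```yaml
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    # Refresh the list of models only every hour (in seconds)
    interval: 3600
```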
Complete configuration
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections:
init_config:
...
instances:
- openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
# Also collect your own TorchServe metrics
extra_metrics:
- my_custom_torchserve_metric
- inference_api_url: http://<TORCHSERVE_ADDRESS>:8080
- management_api_url: http://<TORCHSERVE_ADDRESS>:8081
# Include all the model names that match this regex
include:
- my_models.*
# But exclude all the ones that finish with `-test`
exclude:
- .*-test
# Refresh the list of models only every hour
interval: 3600
Restart the Agent after modifying the configuration.
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections as a Docker label inside docker-compose.yml:
labels:
com.datadoghq.ad.checks: '{"torchserve":{"instances":[{"openmetrics_endpoint":"http://%%host%%:8082/metrics","extra_metrics":["my_custom_torchserve_metric"]},{"inference_api_url":"http://%%host%%:8080"},{"management_api_url":"http://%%host%%:8081","include":["my_models.*"],"exclude":[".*-test"],"interval":3600}]}}'
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections as Kubernetes annotations on your Torchserve pods:
apiVersion: v1
kind: Pod
metadata:
name: '<POD_NAME>'
annotations:
ad.datadoghq.com/torchserve.checks: |-
{
"torchserve": {
"instances": [
{
"openmetrics_endpoint": "http://%%host%%:8082/metrics",
"extra_metrics": [
"my_custom_torchserve_metric"
]
},
{
"inference_api_url": "http://%%host%%:8080"
},
{
"management_api_url": "http://%%host%%:8081",
"include": [
".*"
],
"exclude": [
".*-test"
],
"interval": 3600
}
]
}
}
# (...)
spec:
containers:
- name: 'torchserve'
# (...)
Validation
Run the Agent's status subcommand and look for torchserve under the Checks section.
Data Collected
Metrics
torchserve.management_api.model.batch_size (gauge) | Maximum batch size that a model is expected to handle. |
torchserve.management_api.model.is_loaded_at_startup (gauge) | Whether or not the model was loaded when TorchServe started. 1 if true, 0 otherwise. |
torchserve.management_api.model.max_batch_delay (gauge) | The maximum batch delay time in ms TorchServe waits to receive batch_size number of requests. Shown as millisecond |
torchserve.management_api.model.version.is_default (gauge) | Whether or not this version of the model is the default one. 1 if true, 0 otherwise. |
torchserve.management_api.model.versions (gauge) | Total number of versions for a given model. |
torchserve.management_api.model.worker.is_gpu (gauge) | Whether or not this worker is using a GPU. 1 if true, 0 otherwise. |
torchserve.management_api.model.worker.memory_usage (gauge) | Memory used by the worker in byte. Shown as byte |
torchserve.management_api.model.worker.status (gauge) | The status of a given worker. 1 if ready, 2 if loading, 3 if unloading, 0 otherwise. |
torchserve.management_api.model.workers.current (gauge) | Current number of workers of a given model. |
torchserve.management_api.model.workers.max (gauge) | Maximum number of workers defined of a given model. |
torchserve.management_api.model.workers.min (gauge) | Minimum number of workers defined of a given model. |
torchserve.management_api.models (gauge) | Total number of models. |
torchserve.openmetrics.cpu.utilization (gauge) | CPU utilization on host. Shown as percent |
torchserve.openmetrics.disk.available (gauge) | Disk available on host. Shown as gigabyte |
torchserve.openmetrics.disk.used (gauge) | Disk used on host. Shown as gigabyte |
torchserve.openmetrics.disk.utilization (gauge) | Disk utilization on host. Shown as percent |
torchserve.openmetrics.gpu.memory.used (gauge) | GPU memory used on host. Shown as megabyte |
torchserve.openmetrics.gpu.memory.utilization (gauge) | GPU memory utilization on host. Shown as percent |
torchserve.openmetrics.gpu.utilization (gauge) | GPU utilization on host. Shown as percent |
torchserve.openmetrics.handler_time (gauge) | Time spent in backend handler. Shown as millisecond |
torchserve.openmetrics.inference.count (count) | Total number of inference requests received. Shown as request |
torchserve.openmetrics.inference.latency.count (count) | Total inference latency in Microseconds. Shown as microsecond |
torchserve.openmetrics.memory.available (gauge) | Memory available on host. Shown as megabyte |
torchserve.openmetrics.memory.used (gauge) | Memory used on host. Shown as megabyte |
torchserve.openmetrics.memory.utilization (gauge) | Memory utilization on host. Shown as percent |
torchserve.openmetrics.prediction_time (gauge) | Backend prediction time. Shown as millisecond |
torchserve.openmetrics.queue.latency.count (count) | Total queue latency in Microseconds. Shown as microsecond |
torchserve.openmetrics.queue.time (gauge) | Time spent by a job in request queue in Milliseconds. Shown as millisecond |
torchserve.openmetrics.requests.2xx.count (count) | Total number of requests with response in 200-300 status code range. Shown as request |
torchserve.openmetrics.requests.4xx.count (count) | Total number of requests with response in 400-500 status code range. Shown as request |
torchserve.openmetrics.requests.5xx.count (count) | Total number of requests with response status code above 500. Shown as request |
torchserve.openmetrics.worker.load_time (gauge) | Time taken by worker to load model in Milliseconds. Shown as millisecond |
torchserve.openmetrics.worker.thread_time (gauge) | Time spent in worker thread excluding backend response time in Milliseconds. Shown as millisecond |
Metrics are prefixed using the API they are coming from:
- torchserve.openmetrics.* for metrics coming from the OpenMetrics endpoint.
- torchserve.inference_api.* for metrics coming from the Inference API.
- torchserve.management_api.* for metrics coming from the Management API.
Events
The TorchServe integration includes three events using the Management API:
- torchserve.management_api.model_added: This event fires when a new model is added.
- torchserve.management_api.model_removed: This event fires when a model is removed.
- torchserve.management_api.default_version_changed: This event fires when a default version is set for a given model.
You can disable these events by setting the `submit_events` option to `false` in your configuration file.
Service Checks
torchserve.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical
torchserve.inference_api.health
Returns CRITICAL if the Agent is unable to connect to the Inference API endpoint or if it is unhealthy, otherwise returns OK.
Statuses: ok, critical
torchserve.management_api.health
Returns CRITICAL if the Agent is unable to connect to the Management API endpoint, otherwise returns OK.
Statuses: ok, critical
Logs
The TorchServe integration can collect logs from the TorchServe service and forward them to Datadog.
Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
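In datadog.yaml:

```yaml
logs_enabled: true
```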
Uncomment and edit the logs configuration block in your torchserve.d/conf.yaml file. Here's an example:
logs:
- type: file
path: /var/log/torchserve/model_log.log
source: torchserve
service: torchserve
- type: file
path: /var/log/torchserve/ts_log.log
source: torchserve
service: torchserve
See the example configuration file for details on how to collect all logs.
For more information about the logging configuration with TorchServe, see the official TorchServe documentation.
You can also collect logs from the `access_log.log` file. However, these logs are included in the `ts_log.log` file, which leads to duplicated logs in Datadog if you configure both files.
Troubleshooting
Need help? Contact Datadog support.