vLLM

Supported OS: Linux, Windows, macOS

Integration version 1.0.0

Overview

This check monitors vLLM through the Datadog Agent.
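vLLM exposes Prometheus-format metrics on the /metrics route of its OpenAI-compatible API server, and the check scrapes that endpoint. As a minimal sketch, assuming you start the server with vLLM's module entry point and keep the default port 8000 (the model name is a placeholder):

python -m vllm.entrypoints.openai.api_server --model <model-name>

Once the server is running, the metrics listed under Data Collected are available at http://localhost:8000/metrics for the Agent to scrape.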

Setup

Follow the instructions below to install and configure this check for an Agent running on a host.

Installation

The vLLM check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

  1. Edit the vllm.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your vLLM performance data. See the sample vllm.d/conf.yaml for all available configuration options; a minimal example is shown after these steps.

  2. Restart the Agent.
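A minimal vllm.d/conf.yaml might look like the sketch below. The endpoint URL assumes a vLLM server running locally on its default port 8000; adjust the host and port to match your deployment.

init_config:

instances:
  - openmetrics_endpoint: http://localhost:8000/metrics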

Validation

Run the Agent’s status subcommand and look for vllm under the Checks section.
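For example, on a Linux host you can run the command below (the exact invocation varies by platform and installation method) and confirm that vllm is listed with no errors:

sudo datadog-agent status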

Data Collected

Metrics

vllm.avg.generation_throughput.toks_per_s
(gauge)
Average generation throughput in tokens/s
vllm.avg.prompt.throughput.toks_per_s
(gauge)
Average prefill throughput in tokens/s
vllm.cache_config_info
(gauge)
Information on cache config
vllm.cpu_cache_usage_perc
(gauge)
CPU KV-cache usage. 1 means 100 percent usage
Shown as percent
vllm.e2e_request_latency.seconds.bucket
(count)
The observations of end-to-end request latency bucketed by seconds.
vllm.e2e_request_latency.seconds.count
(count)
The total number of observations of end-to-end request latency.
vllm.e2e_request_latency.seconds.sum
(count)
The sum of end-to-end request latency in seconds.
Shown as second
vllm.generation_tokens.count
(count)
Number of generation tokens processed.
vllm.gpu_cache_usage_perc
(gauge)
GPU KV-cache usage. 1 means 100 percent usage
Shown as percent
vllm.num_preemptions.count
(count)
Cumulative number of preemptions from the engine.
vllm.num_requests.running
(gauge)
Number of requests currently running on GPU.
vllm.num_requests.swapped
(gauge)
Number of requests swapped to CPU.
vllm.num_requests.waiting
(gauge)
Number of requests waiting.
vllm.process.cpu_seconds.count
(count)
Total user and system CPU time spent in seconds.
Shown as second
vllm.process.max_fds
(gauge)
Maximum number of open file descriptors.
Shown as file
vllm.process.open_fds
(gauge)
Number of open file descriptors.
Shown as file
vllm.process.resident_memory_bytes
(gauge)
Resident memory size in bytes.
Shown as byte
vllm.process.start_time_seconds
(gauge)
Start time of the process since the Unix epoch, in seconds.
Shown as second
vllm.process.virtual_memory_bytes
(gauge)
Virtual memory size in bytes.
Shown as byte
vllm.prompt_tokens.count
(count)
Number of prefill tokens processed.
vllm.python.gc.collections.count
(count)
Number of times this generation was collected
vllm.python.gc.objects.collected.count
(count)
Objects collected during GC
vllm.python.gc.objects.uncollectable.count
(count)
Uncollectable objects found during GC
vllm.python.info
(gauge)
Python platform information
vllm.request.generation_tokens.bucket
(count)
Number of generation tokens processed.
vllm.request.generation_tokens.count
(count)
Number of generation tokens processed.
vllm.request.generation_tokens.sum
(count)
Number of generation tokens processed.
vllm.request.params.best_of.bucket
(count)
Histogram of the best_of request parameter.
vllm.request.params.best_of.count
(count)
Histogram of the best_of request parameter.
vllm.request.params.best_of.sum
(count)
Histogram of the best_of request parameter.
vllm.request.params.n.bucket
(count)
Histogram of the n request parameter.
vllm.request.params.n.count
(count)
Histogram of the n request parameter.
vllm.request.params.n.sum
(count)
Histogram of the n request parameter.
vllm.request.prompt_tokens.bucket
(count)
Number of prefill tokens processed.
vllm.request.prompt_tokens.count
(count)
Number of prefill tokens processed.
vllm.request.prompt_tokens.sum
(count)
Number of prefill tokens processed.
vllm.request.success.count
(count)
Count of successfully processed requests.
vllm.time_per_output_token.seconds.bucket
(count)
The observations of time per output token bucketed by seconds.
vllm.time_per_output_token.seconds.count
(count)
The total number of observations of time per output token.
vllm.time_per_output_token.seconds.sum
(count)
The sum of time per output token in seconds.
Shown as second
vllm.time_to_first_token.seconds.bucket
(count)
The observations of time to first token bucketed by seconds.
vllm.time_to_first_token.seconds.count
(count)
The total number of observations of time to first token.
vllm.time_to_first_token.seconds.sum
(count)
The sum of time to first token in seconds.
Shown as second

Events

The vLLM integration does not include any events.

Service Checks

vllm.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the vLLM OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical

vllm.health.status
Returns CRITICAL if the server returns a 4xx or 5xx response, OK if the response is 200, and UNKNOWN for everything else.
Statuses: ok, warning, critical
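If either service check reports CRITICAL, it can help to query the vLLM endpoints directly. The URLs below assume a local vLLM server on its default port 8000:

curl -i http://localhost:8000/health
curl -s http://localhost:8000/metrics | head

A 200 response from /health and Prometheus-format output from /metrics indicate that the vLLM side is healthy, which points the investigation toward the Agent's connectivity or configuration.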

Logs

Log collection is disabled by default in the Datadog Agent. If you are running your Agent as a container, see the container installation instructions to enable log collection. If you are running a host Agent, see the host Agent instructions instead. In either case, make sure that the source value for your logs is vllm so that the built-in processing pipeline finds them. To set your log configuration for a container, see log integrations.
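As a sketch of a host-based setup, enable log collection in datadog.yaml and add a logs section to vllm.d/conf.yaml. The log file path below is a hypothetical example; replace it with wherever your vLLM process writes its logs.

# datadog.yaml
logs_enabled: true

# vllm.d/conf.yaml
logs:
  - type: file
    path: /var/log/vllm/vllm.log
    source: vllm
    service: vllm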

Troubleshooting

Need help? Contact Datadog support.
