This check monitors vLLM through the Datadog Agent.
Follow the instructions below to install and configure this check for an Agent running on a host.
The vLLM check is included in the Datadog Agent package. No additional installation is needed on your server.
Edit the vllm.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your vLLM performance data. See the sample vllm.d/conf.yaml for all available configuration options. Restart the Agent after saving your changes.
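For orientation, a minimal instance configuration might look like the sketch below. This is a hedged example, not the authoritative sample file: it assumes the check uses an OpenMetrics-style `openmetrics_endpoint` instance option and that vLLM serves metrics on its default port, 8000. Verify both against the sample vllm.d/conf.yaml shipped with your Agent.

```yaml
## Minimal sketch of conf.d/vllm.d/conf.yaml (hedged example).
## Assumes the OpenMetrics-based `openmetrics_endpoint` instance option
## and vLLM's default serving port (8000); adjust for your deployment.
instances:
  - openmetrics_endpoint: http://localhost:8000/metrics
```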
Run the Agent’s status subcommand and look for vllm under the Checks section.
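A quick way to do this from a shell (the invocation below assumes a standard Linux install; adjust the command path for your platform):

```shell
# Print the Agent status and show the vllm entry under the "Checks" section.
sudo datadog-agent status | grep -A 5 "vllm"
```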
| Metric | Type | Description |
| --- | --- | --- |
| vllm.avg.generation_throughput.toks_per_s | gauge | Average generation throughput in tokens/s. |
| vllm.avg.prompt.throughput.toks_per_s | gauge | Average prefill throughput in tokens/s. |
| vllm.cache_config_info | gauge | Information on cache config. |
| vllm.cpu_cache_usage_perc | gauge | CPU KV-cache usage. 1 means 100 percent usage. Shown as percent. |
| vllm.e2e_request_latency.seconds.bucket | count | The observations of end-to-end request latency, bucketed by seconds. |
| vllm.e2e_request_latency.seconds.count | count | The total number of observations of end-to-end request latency. |
| vllm.e2e_request_latency.seconds.sum | count | The sum of end-to-end request latency in seconds. Shown as second. |
| vllm.generation_tokens.count | count | Number of generation tokens processed. |
| vllm.gpu_cache_usage_perc | gauge | GPU KV-cache usage. 1 means 100 percent usage. Shown as percent. |
| vllm.num_preemptions.count | count | Cumulative number of preemptions from the engine. |
| vllm.num_requests.running | gauge | Number of requests currently running on GPU. |
| vllm.num_requests.swapped | gauge | Number of requests swapped to CPU. |
| vllm.num_requests.waiting | gauge | Number of requests waiting. |
| vllm.process.cpu_seconds.count | count | Total user and system CPU time spent, in seconds. Shown as second. |
| vllm.process.max_fds | gauge | Maximum number of open file descriptors. Shown as file. |
| vllm.process.open_fds | gauge | Number of open file descriptors. Shown as file. |
| vllm.process.resident_memory_bytes | gauge | Resident memory size in bytes. Shown as byte. |
| vllm.process.start_time_seconds | gauge | Start time of the process since the Unix epoch, in seconds. Shown as second. |
| vllm.process.virtual_memory_bytes | gauge | Virtual memory size in bytes. Shown as byte. |
| vllm.prompt_tokens.count | count | Number of prefill tokens processed. |
| vllm.python.gc.collections.count | count | Number of times this generation was collected. |
| vllm.python.gc.objects.collected.count | count | Objects collected during GC. |
| vllm.python.gc.objects.uncollectable.count | count | Uncollectable objects found during GC. |
| vllm.python.info | gauge | Python platform information. |
| vllm.request.generation_tokens.bucket | count | Number of generation tokens processed. |
| vllm.request.generation_tokens.count | count | Number of generation tokens processed. |
| vllm.request.generation_tokens.sum | count | Number of generation tokens processed. |
| vllm.request.params.best_of.bucket | count | Histogram of the best_of request parameter. |
| vllm.request.params.best_of.count | count | Histogram of the best_of request parameter. |
| vllm.request.params.best_of.sum | count | Histogram of the best_of request parameter. |
| vllm.request.params.n.bucket | count | Histogram of the n request parameter. |
| vllm.request.params.n.count | count | Histogram of the n request parameter. |
| vllm.request.params.n.sum | count | Histogram of the n request parameter. |
| vllm.request.prompt_tokens.bucket | count | Number of prefill tokens processed. |
| vllm.request.prompt_tokens.count | count | Number of prefill tokens processed. |
| vllm.request.prompt_tokens.sum | count | Number of prefill tokens processed. |
| vllm.request.success.count | count | Count of successfully processed requests. |
| vllm.time_per_output_token.seconds.bucket | count | The observations of time per output token, bucketed by seconds. |
| vllm.time_per_output_token.seconds.count | count | The total number of observations of time per output token. |
| vllm.time_per_output_token.seconds.sum | count | The sum of time per output token in seconds. Shown as second. |
| vllm.time_to_first_token.seconds.bucket | count | The observations of time to first token, bucketed by seconds. |
| vllm.time_to_first_token.seconds.count | count | The total number of observations of time to first token. |
| vllm.time_to_first_token.seconds.sum | count | The sum of time to first token in seconds. Shown as second. |
The vLLM integration does not include any events.
vllm.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the vLLM OpenMetrics endpoint; otherwise returns OK.
Statuses: ok, critical
vllm.health.status
Returns CRITICAL if the server responds with a 4xx or 5xx status, OK if the response is 200, and UNKNOWN for everything else.
Statuses: ok, warning, critical
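When either check reports CRITICAL, probing the endpoints the Agent depends on can narrow down the cause. The sketch below assumes vLLM's OpenAI-compatible server is running on its default port, 8000, where it exposes /health and /metrics; adjust the host and port to match your deployment.

```shell
# Probe the health endpoint behind vllm.health.status; expect HTTP 200.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

# Probe the OpenMetrics endpoint behind vllm.openmetrics.health;
# expect Prometheus-format lines (vLLM's are prefixed with "vllm:").
curl -s http://localhost:8000/metrics | head -n 10
```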
Log collection is disabled by default in the Datadog Agent. If you are running your Agent as a container, see container installation to enable log collection. If you are running a host Agent, see host Agent instead.
In either case, make sure that the source value for your logs is vllm. This setting ensures that the built-in processing pipeline finds your logs. To set your log configuration for a container, see log integrations.
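As a hedged illustration for a host Agent, the logs section of vllm.d/conf.yaml could look like the following. The file path is a hypothetical placeholder: vLLM writes to stdout/stderr by default, so the actual path depends on how you redirect or manage its output.

```yaml
## In datadog.yaml, log collection must first be enabled globally:
# logs_enabled: true

## In conf.d/vllm.d/conf.yaml; the path below is a placeholder —
## point it at wherever your deployment writes vLLM's output.
logs:
  - type: file
    path: /var/log/vllm/vllm.log
    source: vllm
    service: vllm
```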
Need help? Contact Datadog support.