vllm.avg.generation_throughput.toks_per_s (gauge) | Average generation throughput in tokens/s
vllm.avg.prompt.throughput.toks_per_s (gauge) | Average prefill throughput in tokens/s
vllm.cache_config_info (gauge) | Information about the cache configuration
vllm.cpu_cache_usage_perc (gauge) | CPU KV-cache usage (1 means 100% usage). Shown as percent
vllm.e2e_request_latency.seconds.bucket (count) | The observations of end-to-end request latency, bucketed by seconds.
vllm.e2e_request_latency.seconds.count (count) | The total number of observations of end-to-end request latency.
vllm.e2e_request_latency.seconds.sum (count) | The sum of end-to-end request latency in seconds. Shown as second
vllm.generation_tokens.count (count) | Number of generation tokens processed. |
vllm.gpu_cache_usage_perc (gauge) | GPU KV-cache usage (1 means 100% usage). Shown as percent
vllm.num_preemptions.count (count) | Cumulative number of preemptions from the engine.
vllm.num_requests.running (gauge) | Number of requests currently running on GPU. |
vllm.num_requests.swapped (gauge) | Number of requests swapped to CPU. |
vllm.num_requests.waiting (gauge) | Number of requests waiting. |
vllm.process.cpu_seconds.count (count) | Total user and system CPU time spent in seconds. Shown as second |
vllm.process.max_fds (gauge) | Maximum number of open file descriptors. Shown as file |
vllm.process.open_fds (gauge) | Number of open file descriptors. Shown as file |
vllm.process.resident_memory_bytes (gauge) | Resident memory size in bytes. Shown as byte |
vllm.process.start_time_seconds (gauge) | Start time of the process since the Unix epoch in seconds. Shown as second
vllm.process.virtual_memory_bytes (gauge) | Virtual memory size in bytes. Shown as byte |
vllm.prompt_tokens.count (count) | Number of prefill tokens processed. |
vllm.python.gc.collections.count (count) | Number of times this GC generation was collected.
vllm.python.gc.objects.collected.count (count) | Objects collected during GC.
vllm.python.gc.objects.uncollectable.count (count) | Uncollectable objects found during GC.
vllm.python.info (gauge) | Python platform information.
vllm.request.generation_tokens.bucket (count) | The observations of generation tokens processed per request, bucketed.
vllm.request.generation_tokens.count (count) | The total number of observations of generation tokens processed per request.
vllm.request.generation_tokens.sum (count) | The sum of generation tokens processed across requests.
vllm.request.params.best_of.bucket (count) | The observations of the best_of request parameter, bucketed.
vllm.request.params.best_of.count (count) | The total number of observations of the best_of request parameter.
vllm.request.params.best_of.sum (count) | The sum of observed best_of request parameter values.
vllm.request.params.n.bucket (count) | The observations of the n request parameter, bucketed.
vllm.request.params.n.count (count) | The total number of observations of the n request parameter.
vllm.request.params.n.sum (count) | The sum of observed n request parameter values.
vllm.request.prompt_tokens.bucket (count) | The observations of prefill tokens processed per request, bucketed.
vllm.request.prompt_tokens.count (count) | The total number of observations of prefill tokens processed per request.
vllm.request.prompt_tokens.sum (count) | The sum of prefill tokens processed across requests.
vllm.request.success.count (count) | Count of successfully processed requests. |
vllm.time_per_output_token.seconds.bucket (count) | The observations of time per output token bucketed by seconds. |
vllm.time_per_output_token.seconds.count (count) | The total number of observations of time per output token. |
vllm.time_per_output_token.seconds.sum (count) | The sum of time per output token in seconds. Shown as second |
vllm.time_to_first_token.seconds.bucket (count) | The observations of time to first token bucketed by seconds. |
vllm.time_to_first_token.seconds.count (count) | The total number of observations of time to first token. |
vllm.time_to_first_token.seconds.sum (count) | The sum of time to first token in seconds. Shown as second |
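The `.sum` and `.count` pairs above come from Prometheus histograms, so dividing sum by count yields an average (for example, mean end-to-end latency). A minimal sketch of that computation, assuming the raw series follow Prometheus naming such as `vllm:e2e_request_latency_seconds_sum`; the sample payload below is illustrative, not real vLLM output:

```python
def avg_from_histogram(metrics_text: str, base_name: str) -> float:
    """Return sum/count for a Prometheus histogram, aggregated over labels."""
    total = count = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        # Strip any {label="..."} suffix before matching the series name.
        series = name.split("{", 1)[0]
        if series == f"{base_name}_sum":
            total += float(value)
        elif series == f"{base_name}_count":
            count += float(value)
    return total / count if count else float("nan")

# Illustrative scrape of the metrics endpoint (values are made up).
sample = """\
# HELP vllm:e2e_request_latency_seconds Histogram of e2e request latency.
vllm:e2e_request_latency_seconds_sum{model_name="m"} 12.0
vllm:e2e_request_latency_seconds_count{model_name="m"} 4
"""
print(avg_from_histogram(sample, "vllm:e2e_request_latency_seconds"))  # 3.0
```

The same sum/count division applies to any histogram metric in this list, e.g. `vllm.time_to_first_token.seconds` or `vllm.request.prompt_tokens`.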