ray.actors (gauge) | Current number of actors in a particular state.
ray.cluster.active_nodes (gauge) | Active nodes on the cluster Shown as node |
ray.cluster.failed_nodes (gauge) | Failed nodes on the cluster Shown as node |
ray.cluster.pending_nodes (gauge) | Pending nodes on the cluster Shown as node |
ray.component.cpu_percentage (gauge) | Total CPU usage of the components on a node. Shown as percent |
ray.component.mem_shared (gauge) | SHM usage of all components of the node. It is equivalent to the top command's SHR column. Shown as byte |
ray.component.rss (gauge) | RSS usage of all components on the node. Shown as megabyte |
ray.component.uss (gauge) | USS usage of all components on the node. Shown as megabyte |
ray.gcs.actors (gauge) | Number of actors per state {Created, Destroyed, Unresolved, Pending} |
ray.gcs.placement_group (gauge) | Number of placement groups broken down by state in {Registered, Pending, Infeasible} |
ray.gcs.storage_operation.count (count) | Number of operations invoked on GCS storage
ray.gcs.storage_operation.latency.bucket (count) | Time to invoke an operation on GCS storage Shown as millisecond
ray.gcs.storage_operation.latency.count (count) | Time to invoke an operation on GCS storage
ray.gcs.storage_operation.latency.sum (count) | Time to invoke an operation on GCS storage Shown as millisecond
ray.gcs.task_manager.task_events.dropped (gauge) | Number of task events dropped per type {PROFILEEVENT, STATUSEVENT} Shown as event |
ray.gcs.task_manager.task_events.reported (gauge) | Number of all task events reported to GCS. Shown as event
ray.gcs.task_manager.task_events.stored (gauge) | Number of task events stored in GCS. Shown as event |
ray.gcs.task_manager.task_events.stored_bytes (gauge) | Number of bytes of all task events stored in GCS. Shown as byte |
ray.grpc_server.req.finished.count (count) | Number of finished requests in the gRPC server. Shown as request
ray.grpc_server.req.handling.count (count) | Number of requests currently being handled in the gRPC server. Shown as request
ray.grpc_server.req.new.count (count) | Number of new requests in the gRPC server. Shown as request
ray.grpc_server.req.process_time (gauge) | Request latency in the gRPC server. Shown as millisecond
ray.health_check.rpc_latency.bucket (count) | Latency of RPC requests for health checks. Shown as millisecond
ray.health_check.rpc_latency.count (count) | Latency of RPC requests for health checks.
ray.health_check.rpc_latency.sum (count) | Latency of RPC requests for health checks. Shown as millisecond
ray.internal_num.infeasible_scheduling_classes (gauge) | The number of unique scheduling classes that are infeasible. |
ray.internal_num.processes.skipped.job_mismatch (gauge) | The total number of cached workers skipped due to job mismatch. Shown as process |
ray.internal_num.processes.skipped.runtime_environment_mismatch (gauge) | The total number of cached workers skipped due to runtime environment mismatch. Shown as process |
ray.internal_num.processes.started (gauge) | The total number of worker processes the worker pool has created. Shown as process |
ray.internal_num.processes.started.from_cache (gauge) | The total number of workers started from a cached worker process. Shown as process |
ray.internal_num.spilled_tasks (gauge) | The cumulative number of lease requests that this raylet has spilled to other raylets. Shown as request |
ray.memory_manager.worker_eviction (count) | The number of tasks and actors killed by the Ray out-of-memory killer, broken down by type (task or actor) and name (the name of the task or actor).
ray.node.cpu (gauge) | Total CPUs available on a ray node |
ray.node.cpu_utilization (gauge) | Total CPU usage on a ray node |
ray.node.disk.free (gauge) | Total disk free (bytes) on a ray node Shown as byte |
ray.node.disk.io.read (gauge) | Total read from disk |
ray.node.disk.io.read.count (gauge) | Total read ops from disk Shown as operation |
ray.node.disk.io.read.speed (gauge) | Disk read speed |
ray.node.disk.io.write (gauge) | Total written to disk |
ray.node.disk.io.write.count (gauge) | Total write ops to disk |
ray.node.disk.io.write.speed (gauge) | Disk write speed |
ray.node.disk.read.iops (gauge) | Disk read IOPS
ray.node.disk.usage (gauge) | Total disk usage (bytes) on a ray node Shown as byte |
ray.node.disk.utilization (gauge) | Total disk utilization (percentage) on a ray node Shown as percent |
ray.node.disk.write.iops (gauge) | Disk write IOPS
ray.node.gpus_utilization (gauge) | The GPU utilization per GPU as a percentage quantity (0..NGPU*100). GpuDeviceName is a name of a GPU device (e.g., Nvidia A10G) and GpuIndex is the index of the GPU. Shown as percent |
ray.node.gram_used (gauge) | The amount of GPU memory used per GPU, in bytes. Shown as byte |
ray.node.mem.available (gauge) | Memory available on a ray node Shown as byte |
ray.node.mem.shared (gauge) | Total shared memory usage on a ray node Shown as byte |
ray.node.mem.total (gauge) | Total memory on a ray node Shown as byte |
ray.node.mem.used (gauge) | Memory usage on a ray node Shown as byte |
ray.node.network.receive.speed (gauge) | Network receive speed |
ray.node.network.received (gauge) | Total network received |
ray.node.network.send.speed (gauge) | Network send speed |
ray.node.network.sent (gauge) | Total network sent |
ray.object_directory.added_locations (gauge) | Number of object locations added per second. If this is high, a lot of objects have been added on this node.
ray.object_directory.lookups (gauge) | Number of object location lookups per second. If this is high, the raylet is waiting on a lot of objects. |
ray.object_directory.removed_locations (gauge) | Number of object locations removed per second. If this is high, a lot of objects have been removed from this node. |
ray.object_directory.subscriptions (gauge) | Number of object location subscriptions. If this is high, the raylet is attempting to pull a lot of objects. |
ray.object_directory.updates (gauge) | Number of object location updates per second. If this is high, the raylet is attempting to pull a lot of objects and/or the locations for objects are frequently changing (e.g. due to many object copies or evictions). Shown as update
ray.object_manager.bytes (gauge) | Number of bytes pushed or received by type {PushedFromLocalPlasma, PushedFromLocalDisk, Received}. Shown as byte |
ray.object_manager.num_pull_requests (gauge) | Number of active pull requests for objects. |
ray.object_manager.received_chunks (gauge) | Number of object chunks received, broken down by type {Total, FailedTotal, FailedCancelled, FailedPlasmaFull}.
ray.object_store.available_memory (gauge) | Amount of memory currently available in the object store. Shown as byte |
ray.object_store.fallback_memory (gauge) | Amount of memory in fallback allocations in the filesystem. Shown as byte |
ray.object_store.memory (gauge) | Object store memory on this node, broken down by sub-kind. Shown as byte
ray.object_store.num_local_objects (gauge) | Number of objects currently in the object store. Shown as object |
ray.object_store.size.bucket (count) | The distribution of object size in bytes Shown as byte |
ray.object_store.size.count (count) | The distribution of object size in bytes |
ray.object_store.size.sum (count) | The distribution of object size in bytes Shown as byte |
ray.object_store.used_memory (gauge) | Amount of memory currently occupied in the object store. Shown as byte |
ray.placement_groups (gauge) | Current number of placement groups by state. The State label (e.g., PENDING, CREATED, REMOVED) describes the state of the placement group. |
ray.process.cpu_seconds.count (count) | Total user and system CPU time spent in seconds. Shown as second |
ray.process.max_fds (gauge) | Maximum number of open file descriptors. Shown as file |
ray.process.open_fds (gauge) | Number of open file descriptors. Shown as file |
ray.process.resident_memory (gauge) | Resident memory size in bytes. Shown as byte |
ray.process.start_time (gauge) | Start time of the process since the Unix epoch, in seconds. Shown as second
ray.process.virtual_memory (gauge) | Virtual memory size in bytes. Shown as byte |
ray.pull_manager.active_bundles (gauge) | Number of active bundle requests Shown as request |
ray.pull_manager.num_object_pins (gauge) | Number of object pin attempts by the pull manager, broken down by result {Success, Failure}. Shown as attempt
ray.pull_manager.object_request_time.bucket (count) | Time between initial object pull request and local pinning of the object. Shown as millisecond |
ray.pull_manager.object_request_time.count (count) | Time between initial object pull request and local pinning of the object. |
ray.pull_manager.object_request_time.sum (count) | Time between initial object pull request and local pinning of the object. Shown as millisecond |
ray.pull_manager.requested_bundles (gauge) | Number of requested bundles, broken down by type {Get, Wait, TaskArgs}.
ray.pull_manager.requests (gauge) | Number of pull requests, broken down by type {Queued, Active, Pinned}. Shown as request
ray.pull_manager.retries_total (gauge) | Number of cumulative pull retries. |
ray.pull_manager.usage (gauge) | Total bytes of usage, broken down by type {Available, BeingPulled, Pinned}. Shown as byte
ray.push_manager.chunks (gauge) | Number of object chunks transferred, broken down by type {InFlight, Remaining}.
ray.push_manager.in_flight_pushes (gauge) | Number of in flight object push requests. Shown as request |
ray.python.gc.collections.count (count) | Number of times this generation was collected |
ray.python.gc.objects_collected.count (count) | Objects collected during gc Shown as object |
ray.python.gc.objects_uncollectable.count (count) | Uncollectable objects found during GC Shown as object |
ray.resources (gauge) | Logical Ray resources, broken down by state {AVAILABLE, USED}. Shown as resource
ray.scheduler.failed_worker_startup (gauge) | Number of tasks that fail to be scheduled because workers were not available, broken down by reason {JobConfigMissing, RegistrationTimedOut, RateLimited}. Shown as task
ray.scheduler.placement_time.bucket (count) | The time it takes for a workload (task, actor, placement group) to be placed. This is the time from when the workload's dependencies are resolved to when it actually reserves resources on a node to run. Shown as second
ray.scheduler.placement_time.count (count) | The time it takes for a workload (task, actor, placement group) to be placed. This is the time from when the workload's dependencies are resolved to when it actually reserves resources on a node to run.
ray.scheduler.placement_time.sum (count) | The time it takes for a workload (task, actor, placement group) to be placed. This is the time from when the workload's dependencies are resolved to when it actually reserves resources on a node to run. Shown as second
ray.scheduler.tasks (gauge) | Number of tasks waiting for scheduling, broken down by state {Cancelled, Executing, Waiting, Dispatched, Received}. Shown as task
ray.scheduler.unscheduleable_tasks (gauge) | Number of pending tasks (as opposed to schedulable tasks), broken down by reason {Infeasible, WaitingForResources, WaitingForPlasmaMemory, WaitingForRemoteResources, WaitingForWorkers}. Shown as task
ray.serve.deployment.error (gauge) | The number of exceptions that have occurred in this replica. Shown as exception |
ray.serve.deployment.processing_latency.bucket (count) | The latency for queries to be processed. Shown as millisecond |
ray.serve.deployment.processing_latency.count (count) | The latency for queries to be processed. |
ray.serve.deployment.processing_latency.sum (count) | The latency for queries to be processed. Shown as millisecond |
ray.serve.deployment.queued_queries (gauge) | The current number of queries to this deployment waiting to be assigned to a replica. Shown as query |
ray.serve.deployment.replica.healthy (gauge) | Tracks whether this deployment replica is healthy. 1 means healthy, 0 means unhealthy. |
ray.serve.deployment.replica.starts (gauge) | The number of times this replica has been restarted due to failure. |
ray.serve.deployment.request.counter (gauge) | The number of queries that have been processed in this replica. Shown as query |
ray.serve.grpc_request_latency.bucket (count) | The end-to-end latency of gRPC requests (measured from the Serve gRPC proxy).
ray.serve.grpc_request_latency.count (count) | The end-to-end latency of gRPC requests (measured from the Serve gRPC proxy).
ray.serve.grpc_request_latency.sum (count) | The end-to-end latency of gRPC requests (measured from the Serve gRPC proxy).
ray.serve.handle_request (gauge) | The number of handle.remote() calls that have been made on this handle. Shown as request |
ray.serve.http_request_latency.bucket (count) | The end-to-end latency of HTTP requests (measured from the Serve HTTP proxy). Shown as millisecond |
ray.serve.http_request_latency.count (count) | The end-to-end latency of HTTP requests (measured from the Serve HTTP proxy). |
ray.serve.http_request_latency.sum (count) | The end-to-end latency of HTTP requests (measured from the Serve HTTP proxy). Shown as millisecond |
ray.serve.multiplexed_get_model_requests.count (count) | The counter for get model requests on the current replica. |
ray.serve.multiplexed_model_load_latency.bucket (count) | The time it takes to load a model. Shown as millisecond |
ray.serve.multiplexed_model_load_latency.count (count) | The time it takes to load a model. |
ray.serve.multiplexed_model_load_latency.sum (count) | The time it takes to load a model. Shown as millisecond |
ray.serve.multiplexed_model_unload_latency.bucket (count) | The time it takes to unload a model. Shown as millisecond |
ray.serve.multiplexed_model_unload_latency.count (count) | The time it takes to unload a model. |
ray.serve.multiplexed_model_unload_latency.sum (count) | The time it takes to unload a model. Shown as millisecond |
ray.serve.multiplexed_models_load.count (count) | The counter for loaded models on the current replica. |
ray.serve.multiplexed_models_unload.count (count) | The counter for unloaded models on the current replica. |
ray.serve.num_deployment_grpc_error_requests (gauge) | The number of errored gRPC responses returned by each deployment.
ray.serve.num_deployment_http_error_requests (gauge) | The number of non-200 HTTP responses returned by each deployment. Shown as response |
ray.serve.num_grpc_error_requests (gauge) | The number of errored gRPC responses.
ray.serve.num_grpc_requests (gauge) | The number of gRPC requests processed.
ray.serve.num_http_error_requests (gauge) | The number of non-200 HTTP responses. Shown as response |
ray.serve.num_http_requests (gauge) | The number of HTTP requests processed. Shown as request |
ray.serve.num_multiplexed_models (gauge) | The number of models loaded on the current replica. |
ray.serve.num_router_requests (gauge) | The number of requests processed by the router. Shown as request |
ray.serve.registered_multiplexed_model_id (gauge) | The model id registered on the current replica. |
ray.serve.replica.pending_queries (gauge) | The current number of pending queries. Shown as query |
ray.serve.replica.processing_queries (gauge) | The current number of queries being processed. Shown as query |
ray.server.num_ongoing_grpc_requests (gauge) | The number of ongoing requests in this gRPC proxy.
ray.server.num_ongoing_http_requests (gauge) | The number of ongoing requests in this HTTP proxy. |
ray.server.num_scheduling_tasks (gauge) | The number of request scheduling tasks in the router. |
ray.server.num_scheduling_tasks_in_backoff (gauge) | The number of request scheduling tasks in the router that are undergoing backoff. |
ray.spill_manager.objects (gauge) | Number of local objects, broken down by state {Pinned, PendingRestore, PendingSpill}. Shown as object
ray.spill_manager.objects_size (gauge) | Byte size of local objects, broken down by state {Pinned, PendingSpill}. Shown as byte
ray.spill_manager.request_total (gauge) | Number of {spill, restore} requests. Shown as request |
ray.tasks (gauge) | Current number of tasks in a particular state. Shown as task
ray.unintentional_worker_failures.count (count) | Number of worker failures that are not intentional. For example, worker failures due to system-related errors. Shown as error
ray.worker.register_time.bucket (count) | End-to-end latency of registering a worker process. Shown as millisecond
ray.worker.register_time.count (count) | End-to-end latency of registering a worker process.
ray.worker.register_time.sum (count) | End-to-end latency of registering a worker process. Shown as millisecond
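The names above are the dotted, Datadog-side metric names. When checking what a node actually emits before it reaches Datadog, it can help to scrape Ray's Prometheus endpoint directly and inspect the raw series and their labels (for example, the State label behind ray.tasks and ray.actors). The sketch below is a minimal example under two assumptions that are not stated in the table: that the node exports metrics on a fixed port (e.g. started with ray start --head --metrics-export-port=8080), and that each dotted name corresponds to an underscore-separated Prometheus name with a ray_ prefix (ray.scheduler.tasks -> ray_scheduler_tasks); verify both against your deployment.

```python
# Minimal sketch: scrape a Ray node's Prometheus endpoint and print ray_* series.
# Assumptions (hypothetical, adjust for your cluster): the node exports metrics at
# http://127.0.0.1:8080 because it was started with --metrics-export-port=8080,
# and Prometheus names map to the dotted names above (ray_tasks -> ray.tasks).
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://127.0.0.1:8080"  # hypothetical node address and port


def dump_ray_metrics(url: str = METRICS_URL) -> None:
    text = requests.get(url, timeout=5).text
    for family in text_string_to_metric_families(text):
        if not family.name.startswith("ray_"):
            continue  # the endpoint may expose non-Ray series as well
        for sample in family.samples:
            # One line per sample: name, label set (e.g. State="RUNNING"), value.
            labels = ",".join(f'{k}="{v}"' for k, v in sorted(sample.labels.items()))
            print(f"{sample.name}{{{labels}}} {sample.value}")


if __name__ == "__main__":
    dump_ray_metrics()
```

Filtering the output to a single family (for example ray_tasks) and grouping by its State label is a quick way to confirm the tag breakdowns described above before building monitors on the ray.* metrics.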