Yarn

Supported OS: Linux, Windows, Mac OS

Integration version: 7.0.0

Overview

This check collects metrics from your YARN ResourceManager, including (but not limited to):

Cluster-wide metrics, such as number of running apps, running containers, unhealthy nodes, and more.
Per-application metrics, such as app progress, elapsed running time, running containers, memory use, and more.
Node metrics, such as available vCores, time of last health update, and more.

Deprecation notice

The yarn.apps.<METRIC> metrics are deprecated in favor of the yarn.apps.<METRIC>_gauge metrics, because the yarn.apps.<METRIC> metrics are incorrectly reported as a RATE instead of a GAUGE.
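When updating dashboards or monitors that still query the deprecated names, the renaming rule in the notice above can be applied mechanically. The following is a minimal sketch; the helper name `gauge_name` is hypothetical and is not part of the Agent or the check.

```python
def gauge_name(metric: str) -> str:
    """Return the _gauge replacement for a deprecated yarn.apps.* metric.

    Hypothetical helper: it only applies the naming rule from the
    deprecation notice and leaves every other metric name unchanged.
    """
    if metric.startswith("yarn.apps.") and not metric.endswith("_gauge"):
        return metric + "_gauge"
    return metric


print(gauge_name("yarn.apps.allocated_mb"))
```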

Setup

Installation

The YARN check is included in the Datadog Agent package, so you don’t need to install anything else on your YARN ResourceManager.

Configuration

Host

To configure this check for an Agent running on a host:

Edit the yarn.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory.

init_config:

instances:
  ## @param resourcemanager_uri - string - required
  ## The YARN check retrieves metrics from YARN's ResourceManager. This
  ## check must be run from the Master Node and the ResourceManager URI must
  ## be specified below. The ResourceManager URI is composed of the
  ## ResourceManager's hostname and port.
  ## The ResourceManager hostname can be found in the yarn-site.xml conf file
  ## under the property yarn.resourcemanager.address
  ##
  ## The ResourceManager port can be found in the yarn-site.xml conf file under
  ## the property yarn.resourcemanager.webapp.address
  #
  - resourcemanager_uri: http://localhost:8088

    ## @param cluster_name - string - required - default: default_cluster
    ## A friendly name for the cluster.
    #
    cluster_name: default_cluster

See the example check configuration for a comprehensive list and description of all check options.

Restart the Agent to start sending YARN metrics to Datadog.
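Before restarting the Agent, it can help to confirm that the value you set for resourcemanager_uri is reachable. The sketch below assumes the ResourceManager serves its standard REST API at /ws/v1/cluster/metrics and returns a JSON document with a top-level clusterMetrics object. The function names are hypothetical, and this is not necessarily how the check itself queries the ResourceManager.

```python
import json
from urllib.request import urlopen

# Assumption: standard ResourceManager REST endpoint for cluster metrics.
CLUSTER_METRICS_PATH = "/ws/v1/cluster/metrics"


def cluster_metrics_url(resourcemanager_uri: str) -> str:
    """Build the cluster metrics URL from the configured resourcemanager_uri."""
    return resourcemanager_uri.rstrip("/") + CLUSTER_METRICS_PATH


def summarize(payload: dict) -> dict:
    """Pick a few fields out of a decoded clusterMetrics response."""
    metrics = payload["clusterMetrics"]
    keys = ("appsRunning", "activeNodes", "unhealthyNodes")
    return {key: metrics.get(key) for key in keys}


def fetch_summary(resourcemanager_uri: str, timeout: float = 5.0) -> dict:
    """Query the ResourceManager and return the summarized metrics."""
    with urlopen(cluster_metrics_url(resourcemanager_uri), timeout=timeout) as resp:
        return summarize(json.load(resp))
```

For example, `fetch_summary("http://localhost:8088")` raises an error from `urllib` if the ResourceManager cannot be reached at that address.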

Containerized

For containerized environments, see the Autodiscovery Integration Templates for guidance on applying the parameters below.

Parameter	Value
`<INTEGRATION_NAME>`	`yarn`
`<INIT_CONFIG>`	blank or `{}`
`<INSTANCE_CONFIG>`	`{"resourcemanager_uri": "http://%%host%%:%%port%%", "cluster_name": "<CLUSTER_NAME>"}`
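As one illustration of applying these parameters, the sketch below renders them as Docker Autodiscovery labels, which take JSON-encoded lists. The helper `autodiscovery_labels` is hypothetical. It assumes the `com.datadoghq.ad.*` label form of Autodiscovery; other orchestrators use different mechanisms, so consult the Autodiscovery Integration Templates for your platform.

```python
import json


def autodiscovery_labels(cluster_name: str) -> dict:
    """Render the table's parameters as Docker label key/value pairs.

    The %%host%% and %%port%% template variables are left untouched so
    that the Agent can resolve them at runtime.
    """
    instance = {
        "resourcemanager_uri": "http://%%host%%:%%port%%",
        "cluster_name": cluster_name,
    }
    return {
        "com.datadoghq.ad.check_names": json.dumps(["yarn"]),
        "com.datadoghq.ad.init_configs": json.dumps([{}]),
        "com.datadoghq.ad.instances": json.dumps([instance]),
    }


for key, value in autodiscovery_labels("default_cluster").items():
    print(f"{key}={value}")
```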

Log collection

Log collection is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
```
logs_enabled: true
```

Uncomment and edit the logs configuration block in your yarn.d/conf.yaml file. Change the type, path, and service parameter values based on your environment. See the sample yarn.d/conf.yaml for all available configuration options.

logs:
  - type: file
    path: <LOG_FILE_PATH>
    source: yarn
    service: <SERVICE_NAME>
    # To handle multi-line logs that start with yyyy-mm-dd, use the following pattern
    # log_processing_rules:
    #   - type: multi_line
    #     pattern: \d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2},\d{3}
    #     name: new_log_start_with_date

Restart the Agent.

To enable logs for Docker environments, see Docker Log Collection.
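To check that the multi_line pattern from the commented log_processing_rules block fits your log format before enabling it, you can try it against a few lines locally. The sketch below uses Python's `re` module and made-up sample lines. It assumes the pattern is tested against the start of each line and that non-matching lines are appended to the previous event, which approximates the rule's purpose but is not the Agent's implementation.

```python
import re

# Same pattern as in the commented log_processing_rules block.
NEW_LOG_START = re.compile(r"\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def group_events(lines):
    """Group raw lines into events; a line matching the pattern starts a new event."""
    events = []
    for line in lines:
        if NEW_LOG_START.match(line) or not events:
            events.append(line)
        else:
            # Continuation line (for example, a stack trace frame).
            events[-1] += "\n" + line
    return events


# Hypothetical sample lines, for illustration only.
sample = [
    "2024-01-15 10:32:07,123 ERROR Container launch failed",
    "java.lang.RuntimeException: example",
    "\tat com.example.Launcher.run(Launcher.java:42)",
    "2024-01-15 10:32:08,456 INFO Retrying container launch",
]

for event in group_events(sample):
    print(repr(event))
```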

Validation

Run the Agent’s status subcommand and look for yarn under the Checks section.

Data Collected

Metrics

yarn.apps.allocated_mb (rate)	Deprecated; use yarn.apps.allocated_mb_gauge instead Shown as mebibyte
yarn.apps.allocated_mb_gauge (gauge)	The sum of memory in MB allocated to the application's running containers Shown as mebibyte
yarn.apps.allocated_vcores (rate)	Deprecated; use yarn.apps.allocated_vcores_gauge instead Shown as core
yarn.apps.allocated_vcores_gauge (gauge)	The sum of virtual cores allocated to the application's running containers Shown as core
yarn.apps.elapsed_time (rate)	Deprecated; use yarn.apps.elapsed_time_gauge instead Shown as second
yarn.apps.elapsed_time_gauge (gauge)	The elapsed time since the application started (in ms) Shown as millisecond
yarn.apps.finished_time (rate)	Deprecated; use yarn.apps.finished_time_gauge instead Shown as second
yarn.apps.finished_time_gauge (gauge)	The time at which the application finished (in ms since epoch) Shown as millisecond
yarn.apps.memory_seconds (rate)	Deprecated; use yarn.apps.memory_seconds_gauge instead Shown as second
yarn.apps.memory_seconds_gauge (gauge)	The amount of memory the application has allocated (megabyte-seconds) Shown as mebibyte
yarn.apps.progress (rate)	Deprecated; use yarn.apps.progress_gauge instead Shown as percent
yarn.apps.progress_gauge (gauge)	The progress of the application, reported as 0, 10, or 100, representing three states: not started, in progress, and completed Shown as percent
yarn.apps.running_containers (rate)	Deprecated; use yarn.apps.running_containers_gauge instead
yarn.apps.running_containers_gauge (gauge)	The number of containers currently running for the application Shown as container
yarn.apps.started_time (rate)	Deprecated; use yarn.apps.started_time_gauge instead Shown as second
yarn.apps.started_time_gauge (gauge)	The time at which the application started (in ms since epoch) Shown as millisecond
yarn.apps.vcore_seconds (rate)	Deprecated; use yarn.apps.vcore_seconds_gauge instead Shown as second
yarn.apps.vcore_seconds_gauge (gauge)	The amount of CPU resources the application has allocated (virtual core-seconds) Shown as core
yarn.metrics.active_nodes (gauge)	The number of active nodes Shown as node
yarn.metrics.allocated_mb (gauge)	The amount of allocated memory Shown as mebibyte
yarn.metrics.allocated_virtual_cores (gauge)	The number of allocated virtual cores Shown as core
yarn.metrics.apps_completed (gauge)	The number of completed apps Shown as task
yarn.metrics.apps_failed (gauge)	The number of failed apps Shown as task
yarn.metrics.apps_killed (gauge)	The number of killed apps Shown as task
yarn.metrics.apps_pending (gauge)	The number of pending apps Shown as task
yarn.metrics.apps_running (gauge)	The number of running apps Shown as task
yarn.metrics.apps_submitted (gauge)	The number of submitted apps Shown as task
yarn.metrics.available_mb (gauge)	The amount of available memory Shown as mebibyte
yarn.metrics.available_virtual_cores (gauge)	The number of available virtual cores Shown as core
yarn.metrics.containers_allocated (gauge)	The number of containers allocated
yarn.metrics.containers_pending (gauge)	The number of containers pending
yarn.metrics.containers_reserved (gauge)	The number of containers reserved
yarn.metrics.decommissioned_nodes (gauge)	The number of decommissioned nodes Shown as node
yarn.metrics.decommissioning_nodes (gauge)	The number of decommissioning nodes Shown as node
yarn.metrics.lost_nodes (gauge)	The number of lost nodes Shown as node
yarn.metrics.rebooted_nodes (gauge)	The number of rebooted nodes Shown as node
yarn.metrics.reserved_mb (gauge)	The size of reserved memory Shown as mebibyte
yarn.metrics.reserved_virtual_cores (gauge)	The number of reserved virtual cores Shown as core
yarn.metrics.total_mb (gauge)	The amount of total memory Shown as mebibyte
yarn.metrics.total_nodes (gauge)	The total number of nodes Shown as node
yarn.metrics.total_virtual_cores (gauge)	The total number of virtual cores Shown as core
yarn.metrics.unhealthy_nodes (gauge)	The number of unhealthy nodes Shown as node
yarn.node.avail_memory_mb (gauge)	The total amount of memory currently available on the node (in MB) Shown as mebibyte
yarn.node.available_virtual_cores (gauge)	The total number of vCores available on the node Shown as core
yarn.node.last_health_update (gauge)	The last time the node reported its health (in ms since epoch) Shown as millisecond
yarn.node.num_containers (gauge)	The total number of containers currently running on the node
yarn.node.used_memory_mb (gauge)	The total amount of memory currently used on the node (in MB) Shown as mebibyte
yarn.node.used_virtual_cores (gauge)	The total number of vCores currently used on the node Shown as core
yarn.queue.absolute_capacity (gauge)	The absolute capacity percentage this queue can use of the entire cluster Shown as percent
yarn.queue.absolute_max_capacity (gauge)	The absolute maximum capacity percentage this queue can use of the entire cluster Shown as percent
yarn.queue.absolute_used_capacity (gauge)	The absolute used capacity percentage this queue is using of the entire cluster Shown as percent
yarn.queue.am_resource_limit.memory (gauge)	The maximum memory resources this queue can use for Application Masters (in MB) Shown as mebibyte
yarn.queue.am_resource_limit.vcores (gauge)	The maximum vCpus this queue can use for Application Masters Shown as core
yarn.queue.capacity (gauge)	The configured queue capacity in percentage relative to its parent queue Shown as percent
yarn.queue.max_active_applications (gauge)	The maximum number of active applications this queue can have Shown as task
yarn.queue.max_active_applications_per_user (gauge)	The maximum number of active applications per user this queue can have Shown as task
yarn.queue.max_applications (gauge)	The maximum number of applications this queue can have Shown as task
yarn.queue.max_applications_per_user (gauge)	The maximum number of applications per user this queue can have Shown as task
yarn.queue.max_capacity (gauge)	The configured maximum queue capacity in percentage relative to its parent queue Shown as percent
yarn.queue.num_active_applications (gauge)	The number of active applications in this queue Shown as task
yarn.queue.num_applications (gauge)	The number of applications currently in the queue Shown as task
yarn.queue.num_containers (gauge)	The number of containers being used
yarn.queue.num_pending_applications (gauge)	The number of pending applications in this queue Shown as task
yarn.queue.resources_used.memory (gauge)	The total memory resources this queue is using (in MB) Shown as mebibyte
yarn.queue.resources_used.vcores (gauge)	The total vCpus this queue is using Shown as core
yarn.queue.root.capacity (gauge)	The configured queue capacity in percentage for root queue Shown as percent
yarn.queue.root.max_capacity (gauge)	The configured maximum queue capacity in percentage for root queue Shown as percent
yarn.queue.root.used_capacity (gauge)	The used queue capacity in percentage for root queue Shown as percent
yarn.queue.used_am_resource.memory (gauge)	The memory resources used for Application Masters (in MB) Shown as mebibyte
yarn.queue.used_am_resource.vcores (gauge)	The vCpus used for Application Masters Shown as core
yarn.queue.used_capacity (gauge)	The used queue capacity in percentage Shown as percent
yarn.queue.user_am_resource_limit.memory (gauge)	The maximum memory resources a user can use for Application Masters (in MB) Shown as mebibyte
yarn.queue.user_am_resource_limit.vcores (gauge)	The maximum vCpus a user can use for Application Masters Shown as core
yarn.queue.user_limit (gauge)	The minimum user limit percent set in the configuration
yarn.queue.user_limit_factor (gauge)	The user limit factor set in the configuration

Events

The YARN check does not include any events.

Service Checks

yarn.can_connect
Returns CRITICAL if the Agent cannot connect to the ResourceManager URI to collect metrics, otherwise OK.
Statuses: ok, critical

yarn.application.status
By default, returns OK if the YARN application state is NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, or FINISHED; UNKNOWN if the application state is ALL; and CRITICAL if the YARN application state is FAILED or KILLED.
Statuses: ok, unknown, critical
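The default mapping described for yarn.application.status can be written out as a lookup table, which is convenient when reasoning about which application states trigger a monitor. This is a sketch of the documented defaults only. The names `DEFAULT_STATUS_MAPPING` and `application_status` are hypothetical, and states not listed above raise a `KeyError` because their handling is not described here.

```python
OK, UNKNOWN, CRITICAL = "ok", "unknown", "critical"

# Default state-to-status mapping, as documented for yarn.application.status.
DEFAULT_STATUS_MAPPING = {
    "NEW": OK,
    "NEW_SAVING": OK,
    "SUBMITTED": OK,
    "ACCEPTED": OK,
    "RUNNING": OK,
    "FINISHED": OK,
    "ALL": UNKNOWN,
    "FAILED": CRITICAL,
    "KILLED": CRITICAL,
}


def application_status(state: str) -> str:
    """Return the service check status for a YARN application state."""
    return DEFAULT_STATUS_MAPPING[state.upper()]


print(application_status("RUNNING"))
print(application_status("KILLED"))
```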

Troubleshooting

Need help? Contact Datadog support.