HDFS

HDFS DataNode Integration

HDFS Dashboard

Overview

Track disk utilization and failed volumes on each of your HDFS DataNodes. This Agent check collects metrics for these, as well as block- and cache-related metrics.

Use this check (hdfs_datanode) and its counterpart check (hdfs_namenode), not the older two-in-one check (hdfs); that check is deprecated.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The HDFS DataNode check is included in the Datadog Agent package, so you don’t need to install anything else on your DataNodes.

Configuration

Connect the Agent

Host

To configure this check for an Agent running on a host:

  1. Edit the hdfs_datanode.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample hdfs_datanode.d/conf.yaml for all available configuration options:

    init_config:
    
    instances:
      ## @param hdfs_datanode_jmx_uri - string - required
      ## The HDFS DataNode check retrieves metrics from the HDFS DataNode's JMX
      ## interface via HTTP(S) (not a JMX remote connection). This check must be installed on
      ## an HDFS DataNode. The HDFS DataNode JMX URI is composed of the DataNode's hostname and port.
      ##
      ## The hostname and port can be found in the hdfs-site.xml conf file under
      ## the property dfs.datanode.http.address
      ## https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
      #
      - hdfs_datanode_jmx_uri: http://localhost:9864
    
  2. Restart the Agent.
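Before restarting, it can help to confirm that the URI you configured actually serves JMX data over HTTP. The following is a minimal sketch, not the check's own implementation. It assumes the DataNode exposes Hadoop's JSON servlet at the `/jmx` path and that the dataset bean matches the pattern in `DATANODE_BEAN_QUERY`; both the pattern and the helper names are illustrative and may need adjusting for your Hadoop version.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: bean name pattern for the DataNode dataset state.
DATANODE_BEAN_QUERY = "Hadoop:service=DataNode,name=FSDatasetState*"


def build_jmx_url(jmx_uri, bean_query):
    """Return the JMX JSON URL for a base URI such as http://localhost:9864."""
    return f"{jmx_uri.rstrip('/')}/jmx?{urlencode({'qry': bean_query})}"


def extract_beans(payload):
    """Return the list of beans from a decoded JMX JSON response."""
    return payload.get("beans", [])


def fetch_beans(jmx_uri, bean_query=DATANODE_BEAN_QUERY, timeout=5):
    """Fetch and decode the beans; raises URLError if the endpoint is unreachable."""
    with urlopen(build_jmx_url(jmx_uri, bean_query), timeout=timeout) as response:
        return extract_beans(json.load(response))


# Usage (requires a reachable DataNode):
#   beans = fetch_beans("http://localhost:9864")
#   print(len(beans), "bean(s) returned")
```

An empty list or a connection error suggests the hostname, port, or scheme in `hdfs_datanode_jmx_uri` does not match `dfs.datanode.http.address`.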

Containerized

For containerized environments, see the Autodiscovery Integration Templates for guidance on applying the parameters below.

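| Parameter          | Value                                             |
| ------------------ | ------------------------------------------------- |
| <INTEGRATION_NAME> | hdfs_datanode                                     |
| <INIT_CONFIG>      | blank or {}                                       |
| <INSTANCE_CONFIG>  | {"hdfs_datanode_jmx_uri": "http://%%host%%:9864"} |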
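As an illustration of how these three parameters are typically applied, the sketch below renders them as Autodiscovery Docker labels. It assumes the `com.datadoghq.ad.*` label convention, in which each value is a JSON-encoded list with one entry per check instance; the helper name is hypothetical, and the Autodiscovery documentation remains the authority on the exact format for your Agent version.

```python
import json


def autodiscovery_docker_labels(integration_name, init_config, instance_config):
    """Render the template parameters as Autodiscovery Docker labels.

    Each label value is a JSON list so that several checks could share a container.
    """
    return {
        "com.datadoghq.ad.check_names": json.dumps([integration_name]),
        "com.datadoghq.ad.init_configs": json.dumps([init_config]),
        "com.datadoghq.ad.instances": json.dumps([instance_config]),
    }


labels = autodiscovery_docker_labels(
    "hdfs_datanode",
    {},  # <INIT_CONFIG> is blank
    {"hdfs_datanode_jmx_uri": "http://%%host%%:9864"},
)
```

The `%%host%%` template variable is left untouched so that the Agent can substitute the container's address at runtime.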

Log collection

Available for Agent versions later than 6.0

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in the datadog.yaml file with:

      logs_enabled: true
    
  2. Add this configuration block to your hdfs_datanode.d/conf.yaml file to start collecting your DataNode logs:

      logs:
        - type: file
          path: /var/log/hadoop-hdfs/*.log
          source: hdfs_datanode
          service: <SERVICE_NAME>
    

    Change the path and service parameter values to match your environment.

  3. Restart the Agent.
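A common reason for missing logs is a `path` pattern that matches no files, or files the Agent's user cannot read. The sketch below is one way to check this, assuming Python's `glob` semantics are close enough to the Agent's wildcard handling for a simple `*.log` pattern; the function name is illustrative.

```python
import glob
import os


def readable_log_files(path_pattern):
    """Return the files matching a logs `path` pattern that the current user can read."""
    return sorted(
        path
        for path in glob.glob(path_pattern)
        if os.path.isfile(path) and os.access(path, os.R_OK)
    )


# Usage: run as the same user as the Agent so the permission check is meaningful.
#   print(readable_log_files("/var/log/hadoop-hdfs/*.log"))
```

An empty result points to either a wrong directory or a permissions problem on the log files.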

Validation

Run the Agent’s status subcommand and look for hdfs_datanode under the Checks section.

Data Collected

Metrics

hdfs.datanode.cache_capacity
(gauge)
Cache capacity in bytes
Shown as byte
hdfs.datanode.cache_used
(gauge)
Cache used in bytes
Shown as byte
hdfs.datanode.dfs_capacity
(gauge)
Disk capacity in bytes
Shown as byte
hdfs.datanode.dfs_remaining
(gauge)
Remaining disk space in bytes
Shown as byte
hdfs.datanode.dfs_used
(gauge)
Disk usage in bytes
Shown as byte
hdfs.datanode.estimated_capacity_lost_total
(gauge)
The estimated capacity lost in bytes
Shown as byte
hdfs.datanode.last_volume_failure_date
(gauge)
The date/time of the last volume failure in milliseconds since epoch
Shown as millisecond
hdfs.datanode.num_blocks_cached
(gauge)
The number of blocks cached
Shown as block
hdfs.datanode.num_blocks_failed_to_cache
(gauge)
The number of blocks that failed to cache
Shown as block
hdfs.datanode.num_blocks_failed_to_uncache
(gauge)
The number of blocks that failed to be removed from the cache
Shown as block
hdfs.datanode.num_failed_volumes
(gauge)
Number of failed volumes
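
The raw gauges above are often easier to act on once combined. As a rough sketch, the helpers below derive a disk utilization percentage from `hdfs.datanode.dfs_used` and `hdfs.datanode.dfs_capacity`, and convert `hdfs.datanode.last_volume_failure_date` from milliseconds since epoch to a UTC timestamp. The function names are illustrative, and treating a value of 0 as "no failure recorded" is an assumption.

```python
from datetime import datetime, timezone


def disk_utilization_percent(dfs_used, dfs_capacity):
    """Return used space as a percentage of capacity, or None if capacity is unknown."""
    if dfs_capacity <= 0:
        return None
    return 100.0 * dfs_used / dfs_capacity


def volume_failure_time(last_volume_failure_date_ms):
    """Convert milliseconds since epoch to an aware UTC datetime.

    Assumption: a value of 0 means no volume failure has been recorded.
    """
    if not last_volume_failure_date_ms:
        return None
    return datetime.fromtimestamp(last_volume_failure_date_ms / 1000, tz=timezone.utc)
```

For example, `disk_utilization_percent(250, 1000)` evaluates to `25.0`.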

Events

The HDFS DataNode check does not include any events.

Service Checks

hdfs.datanode.jmx.can_connect
Returns CRITICAL if the Agent cannot connect to the DataNode’s JMX interface for any reason. Returns OK otherwise.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.

Further Reading

HDFS NameNode Integration

Overview

Monitor your primary and standby HDFS NameNodes to know when your cluster falls into a precarious state: when you’re down to one NameNode remaining, or when it’s time to add more capacity to the cluster. This Agent check collects metrics for remaining capacity, corrupt/missing blocks, dead DataNodes, filesystem load, under-replicated blocks, total volume failures (across all DataNodes), and many more.

Use this check (hdfs_namenode) and its counterpart check (hdfs_datanode), not the older two-in-one check (hdfs); that check is deprecated.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The HDFS NameNode check is included in the Datadog Agent package, so you don’t need to install anything else on your NameNodes.

Configuration

Connect the Agent

Host

To configure this check for an Agent running on a host:

  1. Edit the hdfs_namenode.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample hdfs_namenode.d/conf.yaml for all available configuration options:

    init_config:
    
    instances:
      ## @param hdfs_namenode_jmx_uri - string - required
      ## The HDFS NameNode check retrieves metrics from the HDFS NameNode's JMX
      ## interface via HTTP(S) (not a JMX remote connection). This check must be installed on
      ## an HDFS NameNode. The HDFS NameNode JMX URI is composed of the NameNode's hostname and port.
      ##
      ## The hostname and port can be found in the hdfs-site.xml conf file under
      ## the property dfs.namenode.http-address
      ## https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
      #
      - hdfs_namenode_jmx_uri: http://localhost:9870
    
  2. Restart the Agent.
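The configuration comments above point to `hdfs-site.xml` as the source of the hostname and port. The sketch below shows one way to derive the URI from that file, assuming the standard Hadoop `<property><name>…</name><value>…</value></property>` layout. The function name is hypothetical, and mapping the wildcard bind address `0.0.0.0` to `localhost` is a convenience that assumes the Agent runs on the NameNode itself.

```python
import xml.etree.ElementTree as ET


def jmx_uri_from_hdfs_site(xml_text, property_name="dfs.namenode.http-address", scheme="http"):
    """Return the JMX URI built from an hdfs-site.xml property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() != property_name:
            continue
        value = prop.findtext("value", default="").strip()
        if not value:
            return None
        host, sep, port = value.rpartition(":")
        if not sep:
            # No port given; use the value as the host unchanged.
            return f"{scheme}://{value}"
        if host == "0.0.0.0":
            # Bound to all interfaces, so a local Agent can use localhost.
            host = "localhost"
        return f"{scheme}://{host}:{port}"
    return None


SAMPLE = """<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:9870</value>
  </property>
</configuration>"""

uri = jmx_uri_from_hdfs_site(SAMPLE)
```

If the property is absent from your file, Hadoop falls back to its built-in default, so the function returning `None` does not by itself mean the NameNode has no HTTP endpoint.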

Containerized

For containerized environments, see the Autodiscovery Integration Templates for guidance on applying the parameters below.

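| Parameter          | Value                                             |
| ------------------ | ------------------------------------------------- |
| <INTEGRATION_NAME> | hdfs_namenode                                     |
| <INIT_CONFIG>      | blank or {}                                       |
| <INSTANCE_CONFIG>  | {"hdfs_namenode_jmx_uri": "http://%%host%%:9870"} |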
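On Kubernetes, the same three parameters are usually supplied as pod annotations. The following sketch assumes the `ad.datadoghq.com/<container>.*` annotation convention with JSON-encoded list values; the container name `namenode` and the helper name are hypothetical. The URI is passed through unchanged, so use whichever scheme and port your NameNode actually serves.

```python
import json


def autodiscovery_pod_annotations(container_name, integration_name, init_config, instance_config):
    """Render the template parameters as Autodiscovery pod annotations for one container."""
    prefix = f"ad.datadoghq.com/{container_name}"
    return {
        f"{prefix}.check_names": json.dumps([integration_name]),
        f"{prefix}.init_configs": json.dumps([init_config]),
        f"{prefix}.instances": json.dumps([instance_config]),
    }


annotations = autodiscovery_pod_annotations(
    "namenode",  # hypothetical container name; must match the pod spec
    "hdfs_namenode",
    {},
    {"hdfs_namenode_jmx_uri": "http://%%host%%:9870"},
)
```

The resulting dictionary can be merged into the `metadata.annotations` of the pod template that runs the NameNode container.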

Log collection

Available for Agent versions later than 6.0

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in the datadog.yaml file with:

      logs_enabled: true
    
  2. Add this configuration block to your hdfs_namenode.d/conf.yaml file to start collecting your NameNode logs:

      logs:
        - type: file
          path: /var/log/hadoop-hdfs/*.log
          source: hdfs_namenode
          service: <SERVICE_NAME>
    

    Change the path and service parameter values to match your environment.

  3. Restart the Agent.

Validation

Run the Agent’s status subcommand and look for hdfs_namenode under the Checks section.

Data Collected

Metrics

hdfs.namenode.blocks_total
(gauge)
Total number of blocks
Shown as block
hdfs.namenode.capacity_remaining
(gauge)
Remaining disk space in bytes
Shown as byte
hdfs.namenode.capacity_total
(gauge)
Total disk capacity in bytes
Shown as byte
hdfs.namenode.capacity_used
(gauge)
Disk usage in bytes
Shown as byte
hdfs.namenode.corrupt_blocks
(gauge)
Number of corrupt blocks
Shown as block
hdfs.namenode.estimated_capacity_lost_total
(gauge)
Estimated capacity lost in bytes
Shown as byte
hdfs.namenode.files_total
(gauge)
Total number of files
Shown as file
hdfs.namenode.fs_lock_queue_length
(gauge)
Lock queue length
hdfs.namenode.max_objects
(gauge)
Maximum number of files HDFS supports
Shown as object
hdfs.namenode.missing_blocks
(gauge)
Number of missing blocks
Shown as block
hdfs.namenode.num_dead_data_nodes
(gauge)
Total number of dead data nodes
Shown as node
hdfs.namenode.num_decom_dead_data_nodes
(gauge)
Number of decommissioned dead data nodes
Shown as node
hdfs.namenode.num_decom_live_data_nodes
(gauge)
Number of decommissioned live data nodes
Shown as node
hdfs.namenode.num_decommissioning_data_nodes
(gauge)
Number of decommissioning data nodes
Shown as node
hdfs.namenode.num_live_data_nodes
(gauge)
Total number of live data nodes
Shown as node
hdfs.namenode.num_stale_data_nodes
(gauge)
Number of stale data nodes
Shown as node
hdfs.namenode.num_stale_storages
(gauge)
Number of stale storages
hdfs.namenode.pending_deletion_blocks
(gauge)
Number of pending deletion blocks
Shown as block
hdfs.namenode.pending_replication_blocks
(gauge)
Number of blocks pending replication
Shown as block
hdfs.namenode.scheduled_replication_blocks
(gauge)
Number of blocks scheduled for replication
Shown as block
hdfs.namenode.total_load
(gauge)
Total load on the file system
hdfs.namenode.under_replicated_blocks
(gauge)
Number of under-replicated blocks
Shown as block
hdfs.namenode.volume_failures_total
(gauge)
Total volume failures
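
To connect these metrics to the conditions named in the overview (shrinking capacity, missing or corrupt blocks, dead DataNodes), the sketch below turns a snapshot of metric values into a list of warnings. It is an illustration only: the function name is hypothetical, and the 10% remaining-capacity threshold is an arbitrary assumption, not a recommended setting.

```python
def namenode_warnings(metrics, min_remaining_percent=10.0):
    """Return human-readable warnings for a dict of NameNode metric values."""
    warnings = []

    total = metrics.get("hdfs.namenode.capacity_total", 0)
    remaining = metrics.get("hdfs.namenode.capacity_remaining", 0)
    if total > 0:
        remaining_percent = 100.0 * remaining / total
        if remaining_percent < min_remaining_percent:
            warnings.append(f"only {remaining_percent:.1f}% capacity remaining")

    # Any non-zero value for these gauges usually deserves attention.
    for name in ("missing_blocks", "corrupt_blocks", "num_dead_data_nodes"):
        value = metrics.get(f"hdfs.namenode.{name}", 0)
        if value > 0:
            warnings.append(f"{name} = {value}")

    return warnings


snapshot = {
    "hdfs.namenode.capacity_total": 1000,
    "hdfs.namenode.capacity_remaining": 50,
    "hdfs.namenode.missing_blocks": 3,
    "hdfs.namenode.corrupt_blocks": 0,
    "hdfs.namenode.num_dead_data_nodes": 0,
}
result = namenode_warnings(snapshot)
```

In practice the same conditions would more likely be expressed as monitors on the metrics themselves; the function only makes the logic explicit.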

Events

The HDFS NameNode check does not include any events.

Service Checks

hdfs.namenode.jmx.can_connect
Returns CRITICAL if the Agent cannot connect to the NameNode’s JMX interface for any reason. Returns OK otherwise.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.

Further Reading
