Ceph

Supported OS: Linux, macOS

Integration version: 4.0.0

Ceph dashboard

Overview

Enable the Datadog-Ceph integration to:

  • Track disk usage across storage pools
  • Receive service checks in case of issues
  • Monitor I/O performance metrics

Setup

Installation

The Ceph check is included in the Datadog Agent package, so you don’t need to install anything else on your Ceph servers.

Configuration

Edit the file ceph.d/conf.yaml in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample ceph.d/conf.yaml for all available configuration options:

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true # only if the ceph binary needs sudo on your nodes

If you enabled use_sudo, add a line like the following to your sudoers file:

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph
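Use visudo when editing the sudoers file, since it validates the syntax before saving, and then confirm what the dd-agent user may run. A minimal sketch, assuming the default dd-agent user and your actual ceph path:

# Edit the sudoers file with syntax validation
sudo visudo
# List the commands the dd-agent user is allowed to run with sudo
sudo -l -U dd-agent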

Log collection

Available for Agent versions >6.0

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Next, edit ceph.d/conf.yaml by uncommenting the logs lines at the bottom. Update the logs path with the correct path to your Ceph log files.

    logs:
      - type: file
        path: /var/log/ceph/*.log
        source: ceph
        service: "<APPLICATION_NAME>"
    
  3. Restart the Agent.
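On Linux hosts where the Agent runs under systemd, for example, the restart in step 3 typically looks like the following (service management differs on other platforms):

sudo systemctl restart datadog-agent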

Validation

Run the Agent’s status subcommand and look for ceph under the Checks section.
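On most Linux installs, for example:

sudo datadog-agent status

You can also run the check once on its own to see the metrics and service checks it would submit:

sudo -u dd-agent datadog-agent check ceph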

Data Collected

Metrics

ceph.aggregate_pct_used (gauge): Overall capacity usage. Shown as percent.
ceph.apply_latency_ms (gauge): Time taken to flush an update to disks. Shown as millisecond.
ceph.commit_latency_ms (gauge): Time taken to commit an operation to the journal. Shown as millisecond.
ceph.misplaced_objects (gauge): Number of misplaced objects. Shown as item.
ceph.misplaced_total (gauge): Total number of objects if there are misplaced objects. Shown as item.
ceph.num_full_osds (gauge): Number of full OSDs. Shown as item.
ceph.num_in_osds (gauge): Number of participating storage daemons. Shown as item.
ceph.num_mons (gauge): Number of monitor daemons. Shown as item.
ceph.num_near_full_osds (gauge): Number of nearly full OSDs. Shown as item.
ceph.num_objects (gauge): Object count for a given pool. Shown as item.
ceph.num_osds (gauge): Number of known storage daemons. Shown as item.
ceph.num_pgs (gauge): Number of placement groups available. Shown as item.
ceph.num_pools (gauge): Number of pools. Shown as item.
ceph.num_up_osds (gauge): Number of online storage daemons. Shown as item.
ceph.op_per_sec (gauge): I/O operations per second for a given pool. Shown as operation.
ceph.osd.pct_used (gauge): Percentage used of full/near-full OSDs. Shown as percent.
ceph.pgstate.active_clean (gauge): Number of active+clean placement groups. Shown as item.
ceph.read_bytes (gauge): Per-pool read bytes. Shown as byte.
ceph.read_bytes_sec (gauge): Bytes per second being read. Shown as byte.
ceph.read_op_per_sec (gauge): Per-pool read operations per second. Shown as operation.
ceph.recovery_bytes_per_sec (gauge): Rate of recovered bytes. Shown as byte.
ceph.recovery_keys_per_sec (gauge): Rate of recovered keys. Shown as item.
ceph.recovery_objects_per_sec (gauge): Rate of recovered objects. Shown as item.
ceph.total_objects (gauge): Object count from the underlying object store. [v<=3 only] Shown as item.
ceph.write_bytes (gauge): Per-pool write bytes. Shown as byte.
ceph.write_bytes_sec (gauge): Bytes per second being written. Shown as byte.
ceph.write_op_per_sec (gauge): Per-pool write operations per second. Shown as operation.

Note: If you are running Ceph Luminous or later, the ceph.osd.pct_used metric is not included.
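Many of these metrics roughly correspond to what Ceph reports through its own CLI, which can help when cross-checking values. For example:

# Cluster-wide and per-pool capacity usage
sudo ceph df detail
# Per-pool I/O rates
sudo ceph osd pool stats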

Events

The Ceph check does not include any events.

Service Checks

ceph.overall_status
Returns OK if your Ceph cluster status is HEALTH_OK, WARNING if it's HEALTH_WARN, CRITICAL otherwise.
Statuses: ok, warning, critical

ceph.osd_down
Returns OK if you have no down OSD. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_orphan
Returns OK if you have no orphan OSD. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_full
Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_nearfull
Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pool_full
Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pool_near_full
Returns OK if your pools are not near reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_availability
Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_degraded
Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_degraded_full
Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_damaged
Returns OK if there are no inconsistencies after data scrubbing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_not_scrubbed
Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_not_deep_scrubbed
Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.cache_pool_near_full
Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.too_few_pgs
Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.too_many_pgs
Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.object_unfound
Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.request_slow
Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.request_stuck
Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical
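All of these checks follow the same pattern: they mirror the health severities that Ceph itself reports. To compare against the cluster's own view, you can inspect the health directly. For example:

# Show overall cluster health and any active warnings
sudo ceph health detail
# The same information as JSON, the form monitoring tools typically parse
sudo ceph status --format json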

Troubleshooting

Need help? Contact Datadog support.

Further Reading
