Cassandra

Supported OS: Linux, Windows, Mac OS

Integration version: 3.0.0

Cassandra default dashboard

Overview

Get metrics from Cassandra in real time to:

  • Visualize and monitor Cassandra states.
  • Be notified about Cassandra failovers and events.

Setup

Installation

The Cassandra check is included in the Datadog Agent package, so you don’t need to install anything else on your Cassandra nodes. It’s recommended to use Oracle’s JDK for this integration.

Note: This check has a limit of 350 metrics per instance. The number of returned metrics is shown on the Agent status page. To select the metrics you are interested in, edit the configuration below; see the JMX documentation for detailed instructions on customizing the metrics to collect. If you need to monitor more metrics, contact Datadog support.

Configuration

Metric collection
  1. The default configuration of your cassandra.d/conf.yaml file activates the collection of your Cassandra metrics. See the sample cassandra.d/conf.yaml for all available configuration options.

  2. Restart the Agent.

Log collection

Available for Agent versions >6.0

For containerized environments, follow the instructions on the Kubernetes Log Collection or Docker Log Collection pages.

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Add this configuration block to your cassandra.d/conf.yaml file to start collecting your Cassandra logs:

      logs:
        - type: file
          path: /var/log/cassandra/*.log
          source: cassandra
          service: myapplication
          log_processing_rules:
             - type: multi_line
               name: log_start_with_date
               # pattern to match: DEBUG [ScheduledTasks:1] 2019-12-30
               pattern: '[A-Z]+ +\[[^\]]+\] +\d{4}-\d{2}-\d{2}'
    

    Change the path and service parameter values and configure them for your environment. See the sample cassandra.d/conf.yaml for all available configuration options.

    To make sure that stack traces are aggregated into a single log, the configuration above includes a multi-line processing rule (log_processing_rules).

  3. Restart the Agent.
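
The effect of the multi_line rule above can be sketched in Python. This is only an illustration of the grouping idea, not the Agent's implementation: the function name group_multiline and the sample log lines are hypothetical, and the sketch assumes the pattern is tested at the start of each line, with non-matching lines appended to the previous entry.

```python
import re

# Same pattern as in the log_processing_rules entry of the configuration block.
LOG_START = re.compile(r"[A-Z]+ +\[[^\]]+\] +\d{4}-\d{2}-\d{2}")


def group_multiline(lines):
    """Group raw lines into log entries (hypothetical helper).

    A line that matches LOG_START at its beginning opens a new entry;
    any other line is treated as a continuation of the previous entry.
    """
    entries = []
    for line in lines:
        if LOG_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += "\n" + line
    return entries


# Hypothetical sample: one plain log line, then an error with a stack trace.
sample = [
    "DEBUG [ScheduledTasks:1] 2019-12-30 10:00:00,000 Example.java:10 - tick",
    "ERROR [main] 2019-12-30 10:00:01,000 Example.java:20 - failed",
    "java.lang.RuntimeException: boom",
    "\tat com.example.Example.run(Example.java:20)",
]
entries = group_multiline(sample)
```

With this sample, the stack trace lines do not match the pattern, so they stay attached to the ERROR line instead of becoming separate logs.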

Validation

Run the Agent’s status subcommand and look for cassandra under the Checks section.

Data Collected

Metrics

cassandra.active_tasks
(gauge)
The number of tasks that the thread pool is actively executing.
Shown as task
cassandra.bloom_filter_false_ratio
(gauge)
The ratio of Bloom filter false positives to total checks.
Shown as fraction
cassandra.bytes_flushed.count
(gauge)
The amount of data that was flushed since (re)start.
Shown as byte
cassandra.cas_commit_latency.75th_percentile
(gauge)
The latency of paxos commit round - p75.
Shown as microsecond
cassandra.cas_commit_latency.95th_percentile
(gauge)
The latency of paxos commit round - p95.
Shown as microsecond
cassandra.cas_commit_latency.one_minute_rate
(gauge)
The number of paxos commit rounds per second.
Shown as operation
cassandra.cas_prepare_latency.75th_percentile
(gauge)
The latency of paxos prepare round - p75.
Shown as microsecond
cassandra.cas_prepare_latency.95th_percentile
(gauge)
The latency of paxos prepare round - p95.
Shown as microsecond
cassandra.cas_prepare_latency.one_minute_rate
(gauge)
The number of paxos prepare rounds per second.
Shown as operation
cassandra.cas_propose_latency.75th_percentile
(gauge)
The latency of paxos propose round - p75.
Shown as microsecond
cassandra.cas_propose_latency.95th_percentile
(gauge)
The latency of paxos propose round - p95.
Shown as microsecond
cassandra.cas_propose_latency.one_minute_rate
(gauge)
The number of paxos propose rounds per second.
Shown as operation
cassandra.col_update_time_delta_histogram.75th_percentile
(gauge)
The column update time delta - p75.
Shown as microsecond
cassandra.col_update_time_delta_histogram.95th_percentile
(gauge)
The column update time delta - p95.
Shown as microsecond
cassandra.col_update_time_delta_histogram.min
(gauge)
The column update time delta - min.
Shown as microsecond
cassandra.compaction_bytes_written.count
(gauge)
The amount of data that was compacted since (re)start.
Shown as byte
cassandra.compression_ratio
(gauge)
The compression ratio for all SSTables. Note: contrary to what the name suggests, a low value means high compression. The formula used is: 'size of the compressed SSTable / size of original'.
Shown as fraction
cassandra.currently_blocked_tasks
(gauge)
The number of currently blocked tasks for the thread pool.
Shown as task
cassandra.currently_blocked_tasks.count
(gauge)
The number of currently blocked tasks for the thread pool.
Shown as task
cassandra.db.droppable_tombstone_ratio
(gauge)
The estimate of the droppable tombstone ratio.
Shown as fraction
cassandra.dropped.one_minute_rate
(gauge)
The tasks dropped during execution for the thread pool.
Shown as thread
cassandra.exceptions.count
(gauge)
The number of exceptions thrown from 'Storage' metrics.
Shown as error
cassandra.key_cache_hit_rate
(gauge)
The key cache hit rate.
Shown as fraction
cassandra.latency.75th_percentile
(gauge)
The client request latency - p75.
Shown as microsecond
cassandra.latency.95th_percentile
(gauge)
The client request latency - p95.
Shown as microsecond
cassandra.latency.one_minute_rate
(gauge)
The number of client requests.
Shown as request
cassandra.live_disk_space_used.count
(gauge)
The disk space used by "live" SSTables (only counts in use files).
Shown as byte
cassandra.live_ss_table_count
(gauge)
Number of "live" (in use) SSTables.
Shown as file
cassandra.load.count
(gauge)
The disk space used by live data on a node.
Shown as byte
cassandra.max_partition_size
(gauge)
The size of the largest compacted partition.
Shown as byte
cassandra.max_row_size
(gauge)
The size of the largest compacted row.
Shown as byte
cassandra.mean_partition_size
(gauge)
The average size of compacted partitions.
Shown as byte
cassandra.mean_row_size
(gauge)
The average size of compacted rows.
Shown as byte
cassandra.net.down_endpoint_count
(gauge)
The number of unhealthy nodes in the cluster. This value represents each individual node's view of the cluster and thus should not be summed across reporting nodes.
Shown as node
cassandra.net.up_endpoint_count
(gauge)
The number of healthy nodes in the cluster. This value represents each individual node's view of the cluster and thus should not be summed across reporting nodes.
Shown as node
cassandra.pending_compactions
(gauge)
The number of pending compactions.
Shown as task
cassandra.pending_flushes.count
(gauge)
The number of pending flushes.
Shown as flush
cassandra.pending_tasks
(gauge)
The number of pending tasks for the thread pool.
Shown as task
cassandra.range_latency.75th_percentile
(gauge)
The local range request latency - p75.
Shown as microsecond
cassandra.range_latency.95th_percentile
(gauge)
The local range request latency - p95.
Shown as microsecond
cassandra.range_latency.one_minute_rate
(gauge)
The number of local range requests.
Shown as request
cassandra.read_latency.75th_percentile
(gauge)
The local read latency - p75.
Shown as microsecond
cassandra.read_latency.95th_percentile
(gauge)
The local read latency - p95.
Shown as microsecond
cassandra.read_latency.99th_percentile
(gauge)
The local read latency - p99.
Shown as microsecond
cassandra.read_latency.one_minute_rate
(gauge)
The number of local read requests.
Shown as read
cassandra.row_cache_hit.count
(gauge)
The number of row cache hits.
Shown as hit
cassandra.row_cache_hit_out_of_range.count
(gauge)
The number of row cache hits that do not satisfy the query filter and went to disk.
Shown as hit
cassandra.row_cache_miss.count
(gauge)
The number of table row cache misses.
Shown as miss
cassandra.snapshots_size
(gauge)
The disk space truly used by snapshots.
Shown as byte
cassandra.ss_tables_per_read_histogram.75th_percentile
(gauge)
The number of SSTable data files accessed per read - p75.
Shown as file
cassandra.ss_tables_per_read_histogram.95th_percentile
(gauge)
The number of SSTable data files accessed per read - p95.
Shown as file
cassandra.timeouts.count
(gauge)
Count of requests not acknowledged within configurable timeout window.
Shown as timeout
cassandra.timeouts.one_minute_rate
(gauge)
Recent timeout rate, as an exponentially weighted moving average over a one-minute interval.
Shown as timeout
cassandra.tombstone_scanned_histogram.75th_percentile
(gauge)
Number of tombstones scanned per read - p75.
Shown as record
cassandra.tombstone_scanned_histogram.95th_percentile
(gauge)
Number of tombstones scanned per read - p95.
Shown as record
cassandra.total_blocked_tasks
(gauge)
Total blocked tasks
Shown as task
cassandra.total_blocked_tasks.count
(count)
Total count of blocked tasks
Shown as task
cassandra.total_commit_log_size
(gauge)
The size used on disk by commit logs.
Shown as byte
cassandra.total_disk_space_used.count
(gauge)
Total disk space used by SSTables including obsolete ones waiting to be GC'd.
Shown as byte
cassandra.view_lock_acquire_time.75th_percentile
(gauge)
The time taken acquiring a partition lock for materialized view updates - p75.
Shown as microsecond
cassandra.view_lock_acquire_time.95th_percentile
(gauge)
The time taken acquiring a partition lock for materialized view updates - p95.
Shown as microsecond
cassandra.view_lock_acquire_time.one_minute_rate
(gauge)
The number of requests to acquire a partition lock for materialized view updates.
Shown as request
cassandra.view_read_time.75th_percentile
(gauge)
The time taken during the local read of a materialized view update - p75.
Shown as microsecond
cassandra.view_read_time.95th_percentile
(gauge)
The time taken during the local read of a materialized view update - p95.
Shown as microsecond
cassandra.view_read_time.one_minute_rate
(gauge)
The number of local reads for materialized view updates.
Shown as request
cassandra.waiting_on_free_memtable_space.75th_percentile
(gauge)
The time spent waiting for free memtable space either on- or off-heap - p75.
Shown as microsecond
cassandra.waiting_on_free_memtable_space.95th_percentile
(gauge)
The time spent waiting for free memtable space either on- or off-heap - p95.
Shown as microsecond
cassandra.write_latency.75th_percentile
(gauge)
The local write latency - p75.
Shown as microsecond
cassandra.write_latency.95th_percentile
(gauge)
The local write latency - p95.
Shown as microsecond
cassandra.write_latency.99th_percentile
(gauge)
The local write latency - p99.
Shown as microsecond
cassandra.write_latency.one_minute_rate
(gauge)
The number of local write requests.
Shown as write
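
Because cassandra.compression_ratio is easy to misread, here is a small Python sketch of the formula quoted in its description (size of the compressed SSTable / size of original). The function names and the byte counts are hypothetical and only illustrate how to interpret the value.

```python
def compression_ratio(compressed_bytes: int, original_bytes: int) -> float:
    """Ratio as described for cassandra.compression_ratio: compressed / original."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return compressed_bytes / original_bytes


def space_saved_fraction(ratio: float) -> float:
    """Fraction of the original size saved by compression (hypothetical helper)."""
    return 1.0 - ratio


# Hypothetical SSTable: 100 bytes of original data stored in 25 bytes on disk.
ratio = compression_ratio(25, 100)
saved = space_saved_fraction(ratio)
```

In this example the ratio is 0.25, which corresponds to 75% of the original size being saved: the lower the reported value, the stronger the compression.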

Events

The Cassandra check does not include any events.

Service Checks

cassandra.can_connect
Returns CRITICAL if the Agent is unable to connect to and collect metrics from the monitored Cassandra instance, WARNING if no metrics are collected, and OK otherwise.
Statuses: ok, critical, warning
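
The status rule described for cassandra.can_connect can be restated as a short Python sketch. The function name and its two parameters are hypothetical and are not part of the Agent's API; the sketch only mirrors the decision order given above.

```python
def can_connect_status(connected: bool, metrics_collected: int) -> str:
    """Mirror the documented rule for cassandra.can_connect (hypothetical helper).

    critical: the Agent cannot connect to the Cassandra instance
    warning:  the connection works but no metrics are collected
    ok:       otherwise
    """
    if not connected:
        return "critical"
    if metrics_collected == 0:
        return "warning"
    return "ok"


# Hypothetical inputs covering the three documented statuses.
statuses = [
    can_connect_status(False, 0),
    can_connect_status(True, 0),
    can_connect_status(True, 120),
]
```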

Troubleshooting

Need help? Contact Datadog support.

Further Reading

Cassandra Nodetool Integration

Cassandra default dashboard

Overview

This check collects metrics for your Cassandra cluster that are not available through the JMX integration. It uses the nodetool utility to collect them.

Setup

Installation

The Cassandra Nodetool check is included in the Datadog Agent package, so you don’t need to install anything else on your Cassandra nodes.

Configuration

Follow the instructions below to configure this check for an Agent running on a host. For containerized environments, see the Containerized section.

Host

  1. Edit the file cassandra_nodetool.d/conf.yaml in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample cassandra_nodetool.d/conf.yaml for all available configuration options:

    init_config:
    
    instances:
      ## @param keyspaces - list of string - required
      ## The list of keyspaces to monitor.
      ## An empty list results in no metrics being sent.
      #
      - keyspaces:
          - "<KEYSPACE_1>"
          - "<KEYSPACE_2>"
    
  2. Restart the Agent.

Log collection

Cassandra Nodetool logs are collected by the Cassandra integration. See the log collection instructions for Cassandra.

Containerized

For containerized environments, use the official Prometheus exporter in the pod, and then use Autodiscovery in the Agent to find the pod and query the endpoint.

Validation

Run the Agent’s status subcommand and look for cassandra_nodetool under the Checks section.

Data Collected

Metrics

cassandra.nodetool.status.load
(gauge)
Amount of file system data under the Cassandra data directory, excluding snapshot content
Shown as byte
cassandra.nodetool.status.owns
(gauge)
Percentage of the data owned by the node per datacenter times the replication factor
Shown as percent
cassandra.nodetool.status.replication_availability
(gauge)
Percentage of data available per keyspace times replication factor
Shown as percent
cassandra.nodetool.status.replication_factor
(gauge)
Replication factor per keyspace
cassandra.nodetool.status.status
(gauge)
Node status: up (1) or down (0)

Events

The Cassandra Nodetool check does not include any events.

Service Checks

cassandra.nodetool.node_up
The Agent sends this service check for each node of the monitored cluster. Returns CRITICAL if the node is down, otherwise OK.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.

Further Reading
