Velero

Supported OS: Linux, Windows, macOS

Integration version: 2.0.0

Agent Check: Velero

Overview

This check monitors Velero through the Datadog Agent. It collects data about Velero’s backup, restore and snapshot operations. This allows users to gain insight into the health, performance and reliability of their disaster recovery processes.

Setup

Installation

The Velero check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration
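
The check scrapes Velero's OpenMetrics endpoint, the same endpoint that the velero.openmetrics.health service check probes. The following is a minimal sketch of a velero.d/conf.yaml instance, not a definitive configuration. It assumes the Velero server exposes metrics at /metrics on port 8085 (Velero's default metrics address). <VELERO_HOST> is a placeholder, and the available options should be confirmed against the sample configuration shipped with the Agent:

```yaml
init_config:

instances:
    # Assumed endpoint: adjust the host and port to match your Velero deployment.
  - openmetrics_endpoint: http://<VELERO_HOST>:8085/metrics
```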

Metrics

velero.backup.amount
(gauge)
Current number of existing backups
velero.backup.attempt.count
(count)
Total number of attempted backups
velero.backup.deletion.attempt.count
(count)
Total number of attempted backup deletions
velero.backup.deletion.failure.count
(count)
Total number of failed backup deletions
velero.backup.deletion.success.count
(count)
Total number of successful backup deletions
velero.backup.duration.seconds.bucket
(count)
Histogram bucket for time taken to complete backup, in seconds
velero.backup.duration.seconds.count
(count)
Count aggregation for time taken to complete backup
velero.backup.duration.seconds.sum
(count)
Cumulative sum of time taken to complete backup, in seconds
Shown as second
velero.backup.failure.count
(count)
Total number of failed backups
velero.backup.items
(gauge)
Total number of items backed up
velero.backup.items.errors
(gauge)
Total number of errors encountered during backup
Shown as error
velero.backup.last_status
(gauge)
Last status of the backup. A value of 1 is success, 0 is failure
velero.backup.last_successful_timestamp
(gauge)
Last time a backup ran successfully, Unix timestamp in seconds
velero.backup.partial_failure.count
(count)
Total number of partially failed backups
velero.backup.success.count
(count)
Total number of successful backups
velero.backup.tarball_size_bytes
(gauge)
Size, in bytes, of a backup
Shown as byte
velero.backup.validation_failure.count
(count)
Total number of backups that failed validation
velero.backup.warning.count
(count)
Total number of backups with warnings
velero.csi_snapshot.attempt.count
(count)
Total number of attempted CSI volume snapshots
velero.csi_snapshot.failure.count
(count)
Total number of failed CSI volume snapshots
velero.csi_snapshot.success.count
(count)
Total number of successful CSI volume snapshots
velero.pod_volume.backup.dequeue.count
(count)
Total number of pod_volume_backup objects dequeued
velero.pod_volume.backup.enqueue.count
(count)
Total number of pod_volume_backup objects enqueued
velero.pod_volume.data.download.cancel.count
(count)
Total number of canceled snapshot downloads
velero.pod_volume.data.download.failure.count
(count)
Total number of failed snapshot downloads
velero.pod_volume.data.download.success.count
(count)
Total number of successful snapshot downloads
velero.pod_volume.data.upload.cancel.count
(count)
Total number of canceled snapshot uploads
velero.pod_volume.data.upload.failure.count
(count)
Total number of failed snapshot uploads
velero.pod_volume.data.upload.success.count
(count)
Total number of successful snapshot uploads
velero.pod_volume.operation_latency.seconds.bucket
(count)
Histogram bucket for time taken to complete pod volume operations, in seconds
velero.pod_volume.operation_latency.seconds.count
(count)
Count aggregation for time taken to complete pod volume operations
velero.pod_volume.operation_latency.seconds.gauge
(gauge)
Gauge metric indicating time taken, in seconds, to perform pod volume operations
Shown as second
velero.pod_volume.operation_latency.seconds.sum
(count)
Sum aggregation for time taken to complete pod volume operations, in seconds
Shown as second
velero.restore.amount
(gauge)
Current number of existing restores
velero.restore.attempt.count
(count)
Total number of attempted restores
velero.restore.failed.count
(count)
Total number of failed restores
velero.restore.partial_failure.count
(count)
Total number of partially failed restores
velero.restore.success.count
(count)
Total number of successful restores
velero.restore.validation_failed.count
(count)
Total number of restores that failed validation
velero.volume_snapshot.attempt.count
(count)
Total number of attempted volume snapshots
velero.volume_snapshot.failure.count
(count)
Total number of failed volume snapshots
velero.volume_snapshot.success.count
(count)
Total number of successful volume snapshots

Logs

The Velero integration can collect logs from the Velero pods.

To collect logs from Velero containers on a host:

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Uncomment and edit the logs configuration block in your velero.d/conf.yaml file. For example:

    logs:
      - type: docker
        source: velero
        service: velero
    

To collect logs from a Velero Kubernetes deployment:

  1. Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.

  2. Set Log Integrations as pod annotations. This can also be configured with a file, a ConfigMap, or a key-value store. For more information, see the configuration section of Kubernetes Log Collection.
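
As a sketch of step 2, the log configuration from the host example can be expressed as an Autodiscovery annotation on the Velero pod template. <CONTAINER_NAME> is a placeholder for the name of the Velero container in your deployment, and the source and service values mirror the host example rather than being required values:

```yaml
# Fragment of the Velero Deployment's pod template (spec.template).
metadata:
  annotations:
    # <CONTAINER_NAME> must match the name of the container emitting the logs.
    ad.datadoghq.com/<CONTAINER_NAME>.logs: '[{"source": "velero", "service": "velero"}]'
```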

Validation

Run the Agent’s status subcommand and look for velero under the Checks section of the output.

Data Collected

Metrics

This integration collects various Velero metrics, including:

  • Backup: Success/failure rates, durations, and data sizes.
  • Restore: Success/failure counts and validation failures.
  • Snapshot: CSI and volume snapshot attempts, successes, and failures.
  • Pod volume data: Upload/download success and failure rates. These are exposed by the node-agent pods.
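
Because the pod volume metrics come from the node-agent pods rather than the Velero server, those pods need to be scraped as well. One way to do this, sketched here under assumptions, is an Autodiscovery annotation on the node-agent pod template. The sketch assumes that the node-agent serves metrics at /metrics on port 8085 and uses <CONTAINER_NAME> as a placeholder for the node-agent container name. Verify both against your deployment:

```yaml
# Fragment of the node-agent DaemonSet's pod template (spec.template).
metadata:
  annotations:
    # %%host%% is resolved by the Agent to the pod's IP address at runtime.
    ad.datadoghq.com/<CONTAINER_NAME>.checks: |
      {
        "velero": {
          "init_config": {},
          "instances": [
            {"openmetrics_endpoint": "http://%%host%%:8085/metrics"}
          ]
        }
      }
```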

See metadata.csv for a list of metrics provided by this integration.

Events

The Velero integration does not include any events.

Service Checks

velero.openmetrics.health

Returns CRITICAL if the Agent is unable to connect to the Velero OpenMetrics endpoint. Otherwise, returns OK.

Statuses: ok, critical

Troubleshooting

Make sure that your Velero server is exposing metrics by checking that the feature is enabled in the deployment configuration:

# Settings for Velero's Prometheus metrics. Enabled by default.
metrics:
  enabled: true
  scrapeInterval: 30s
  scrapeTimeout: 10s

Need help? Contact Datadog support.
