Velero

Supported OS: Linux, Windows, macOS

Integration version: 2.0.0

Overview

This check monitors Velero through the Datadog Agent. It collects data about Velero’s backup, restore and snapshot operations. This allows users to gain insight into the health, performance and reliability of their disaster recovery processes.

Setup

Installation

The Velero check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

Metrics

Follow the instructions below to configure this check for an Agent running on a host.

  1. Edit the velero.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your Velero performance data. See the sample velero.d/conf.yaml for all available configuration options.

  2. Restart the Agent.
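
As a minimal sketch of step 1, a velero.d/conf.yaml for a host-based Agent might look like the following. It assumes the check reads Velero's Prometheus endpoint through an openmetrics_endpoint option and that Velero serves metrics on port 8085 at /metrics (Velero's usual default). <VELERO_HOST> is a placeholder. Confirm the option names against the sample velero.d/conf.yaml before relying on this.

```yaml
init_config:

instances:
    # Assumed option name and endpoint; <VELERO_HOST> is a placeholder
    # for the address where Velero exposes its Prometheus metrics.
  - openmetrics_endpoint: http://<VELERO_HOST>:8085/metrics
```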

For an Agent running in a containerized environment, see the Autodiscovery Integration Templates for guidance on configuring this integration.

Note that two types of pods must be queried to collect all metrics: velero and node-agent. Therefore, make sure to update the annotations of both the velero deployment and the node-agent daemonset.
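
The annotations on the two workloads could look roughly like the sketch below. It uses the Autodiscovery pod annotation format (ad.datadoghq.com/<CONTAINER_NAME>.checks) and makes several assumptions to verify against your cluster: the containers are named velero and node-agent, both expose metrics on port 8085 at /metrics, and the check accepts an openmetrics_endpoint option.

```yaml
# Pod template of the velero deployment (assumed container name: velero).
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/velero.checks: |
          {
            "velero": {
              "init_config": {},
              "instances": [
                {"openmetrics_endpoint": "http://%%host%%:8085/metrics"}
              ]
            }
          }
---
# Pod template of the node-agent daemonset (assumed container name: node-agent).
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/node-agent.checks: |
          {
            "velero": {
              "init_config": {},
              "instances": [
                {"openmetrics_endpoint": "http://%%host%%:8085/metrics"}
              ]
            }
          }
```

The %%host%% template variable is resolved by the Agent to the pod's IP address, so the same annotation works for every replica.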

Logs

The Velero integration can collect logs from the Velero pods.

To collect logs from Velero containers on a host:

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Uncomment and edit the logs configuration block in your velero.d/conf.yaml file. For example:

    logs:
      - type: docker
        source: velero
        service: velero
    

To collect logs from a Velero Kubernetes deployment:

  1. Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.

  2. Set Log Integrations as pod annotations. This can also be configured with a file, a ConfigMap, or a key-value store. For more information, see the configuration section of Kubernetes Log Collection.
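
As an illustration of step 2, a log annotation on the Velero pod template might look like this. The container name velero is an assumption, and the source and service values mirror the host example above.

```yaml
spec:
  template:
    metadata:
      annotations:
        # Assumed container name: velero. Adjust to match your pod spec.
        ad.datadoghq.com/velero.logs: '[{"source": "velero", "service": "velero"}]'
```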

Validation

Run the Agent’s status subcommand and look for velero under the Checks section.

Data Collected

Metrics

velero.backup.amount
(gauge)
Current number of existing backups
velero.backup.attempt.count
(count)
Total number of attempted backups
velero.backup.deletion.attempt.count
(count)
Total number of attempted backup deletions
velero.backup.deletion.failure.count
(count)
Total number of failed backup deletions
velero.backup.deletion.success.count
(count)
Total number of successful backup deletions
velero.backup.duration.seconds.bucket
(count)
Bucket for time taken to complete backup, in seconds
velero.backup.duration.seconds.count
(count)
Count aggregation for time taken to complete backup
velero.backup.duration.seconds.sum
(count)
Cumulative sum of time taken to complete backup, in seconds
Shown as second
velero.backup.failure.count
(count)
Total number of failed backups
velero.backup.items
(gauge)
Total number of items backed up
velero.backup.items.errors
(gauge)
Total number of errors encountered during backup
Shown as error
velero.backup.last_status
(gauge)
Last status of the backup. A value of 1 is success, 0 is failure
velero.backup.last_successful_timestamp
(gauge)
Last time a backup ran successfully, Unix timestamp in seconds
velero.backup.partial_failure.count
(count)
Total number of partially failed backups
velero.backup.success.count
(count)
Total number of successful backups
velero.backup.tarball_size_bytes
(gauge)
Size, in bytes, of a backup
Shown as byte
velero.backup.validation_failure.count
(count)
Total number of backups that failed validation
velero.backup.warning.count
(count)
Total number of backups that completed with warnings
velero.csi_snapshot.attempt.count
(count)
Total number of attempted CSI volume snapshots
velero.csi_snapshot.failure.count
(count)
Total number of failed CSI volume snapshots
velero.csi_snapshot.success.count
(count)
Total number of successful CSI volume snapshots
velero.pod_volume.backup.dequeue.count
(count)
Total number of podvolumebackup objects dequeued
velero.pod_volume.backup.enqueue.count
(count)
Total number of podvolumebackup objects enqueued
velero.pod_volume.data.download.cancel.count
(count)
Total number of canceled downloaded snapshots
velero.pod_volume.data.download.failure.count
(count)
Total number of failed downloaded snapshots
velero.pod_volume.data.download.success.count
(count)
Total number of successful downloaded snapshots
velero.pod_volume.data.upload.cancel.count
(count)
Total number of canceled uploaded snapshots
velero.pod_volume.data.upload.failure.count
(count)
Total number of failed uploaded snapshots
velero.pod_volume.data.upload.success.count
(count)
Total number of successful uploaded snapshots
velero.pod_volume.operation_latency.seconds.bucket
(count)
Histogram bucket for time taken to complete pod volume operations, in seconds
velero.pod_volume.operation_latency.seconds.count
(count)
Count aggregation for time taken to complete pod volume operations
velero.pod_volume.operation_latency.seconds.gauge
(gauge)
Gauge metric indicating time taken, in seconds, to perform pod volume operations
Shown as second
velero.pod_volume.operation_latency.seconds.sum
(count)
Sum aggregation for time taken to complete pod volume operations, in seconds
Shown as second
velero.restore.amount
(gauge)
Current number of existing restores
velero.restore.attempt.count
(count)
Total number of attempted restores
velero.restore.failed.count
(count)
Total number of failed restores
velero.restore.partial_failure.count
(count)
Total number of partially failed restores
velero.restore.success.count
(count)
Total number of successful restores
velero.restore.validation_failed.count
(count)
Total number of restores that failed validation
velero.volume_snapshot.attempt.count
(count)
Total number of attempted volume snapshots
velero.volume_snapshot.failure.count
(count)
Total number of failed volume snapshots
velero.volume_snapshot.success.count
(count)
Total number of successful volume snapshots

Events

The Velero integration does not include any events.

Service Checks

The Velero integration does not include any service checks.

Troubleshooting

Make sure that your Velero server is exposing metrics by checking that the feature is enabled in the deployment configuration:

    # Settings for Velero's Prometheus metrics. Enabled by default.
    metrics:
      enabled: true
      scrapeInterval: 30s
      scrapeTimeout: 10s

Need help? Contact Datadog support.
