Kubeflow

Supported OS: Linux, Windows, macOS

Integration version: 1.0.0

Overview

This check monitors Kubeflow through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The Kubeflow check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

  1. Edit the kubeflow.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your Kubeflow performance data. See the sample kubeflow.d/conf.yaml for all available configuration options.

  2. Restart the Agent.
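As a minimal sketch, the edited kubeflow.d/conf.yaml might look like the following. The endpoint value is an assumption for a host-based setup; point it at wherever your Kubeflow component exposes Prometheus-formatted metrics:

```yaml
init_config:

instances:
    # Hypothetical endpoint: adjust host and port to match where your
    # Kubeflow component serves Prometheus-formatted metrics.
  - openmetrics_endpoint: http://localhost:9090/metrics
```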

Metric collection

Make sure that Prometheus-formatted metrics are exposed for your Kubeflow component. For the Agent to start collecting metrics, the Kubeflow pods need to be annotated.

Kubeflow has metrics endpoints that can be accessed on port 9090.

To enable metrics exposure in Kubeflow through Prometheus, you might need to enable Prometheus service monitoring for the component in question.

You can use Kube-Prometheus-Stack or a custom Prometheus installation.

How to install Kube-Prometheus-Stack:
  1. Add the Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
  2. Install the chart:
helm install prometheus-stack prometheus-community/kube-prometheus-stack
  3. Expose the Prometheus service externally:
kubectl port-forward prometheus-stack 9090:9090

Set up ServiceMonitors for Kubeflow components:

You need to configure ServiceMonitors for Kubeflow components to expose their Prometheus metrics. If your Kubeflow component exposes Prometheus metrics by default, you only need to configure Prometheus to scrape these metrics.

The ServiceMonitor would look like this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <kubeflow-component>-monitor
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: <kubeflow-component-name>
  endpoints:
  - port: http
    path: /metrics

Replace <kubeflow-component> with pipelines, kserve, or katib, and <kubeflow-component-name> with ml-pipeline, kserve, or katib.
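For example, filling in the template above for Katib would give the following ServiceMonitor. This assumes the Katib service carries the app: katib label and serves metrics on a port named http; adjust both to match your deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: katib-monitor
  labels:
    # Must match the release label your Prometheus instance selects on.
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      # Assumed label on the Katib service; verify with `kubectl get svc --show-labels`.
      app: katib
  endpoints:
  - port: http
    path: /metrics
```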

Note: The listed metrics can only be collected if they are available (depending on the version). Some metrics are generated only when certain actions are performed.

The only parameter required for configuring the Kubeflow check is openmetrics_endpoint. This parameter should be set to the location where the Prometheus-formatted metrics are exposed. The default port is 9090. In containerized environments, %%host%% should be used for host autodetection.

apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "kubeflow": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9090/metrics"
            }
          ]
        }
      }      
    # (...)
spec:
  containers:
    - name: 'controller'
# (...)

Validation

Run the Agent’s status subcommand and look for kubeflow under the Checks section.
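For example, on the host running the Agent (the grep is just a convenience to narrow the output to this check):

```shell
datadog-agent status | grep -A 5 kubeflow
```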

Data Collected

Metrics

kubeflow.katib.controller.reconcile.count
(count)
Number of reconcile loops executed by the Katib controller
kubeflow.katib.controller.reconcile.duration.seconds.bucket
(count)
Duration of reconcile loops executed by the Katib controller (bucket)
kubeflow.katib.controller.reconcile.duration.seconds.count
(count)
Duration of reconcile loops executed by the Katib controller (count)
kubeflow.katib.controller.reconcile.duration.seconds.sum
(count)
Duration of reconcile loops executed by the Katib controller (sum)
Shown as second
kubeflow.katib.experiment.created.count
(count)
Total number of experiments created
kubeflow.katib.experiment.duration.seconds.bucket
(count)
Duration of experiments from start to completion (bucket)
kubeflow.katib.experiment.duration.seconds.count
(count)
Duration of experiments from start to completion (count)
kubeflow.katib.experiment.duration.seconds.sum
(count)
Duration of experiments from start to completion (sum)
Shown as second
kubeflow.katib.experiment.failed.count
(count)
Number of experiments that have failed
kubeflow.katib.experiment.running.total
(gauge)
Number of experiments currently running
kubeflow.katib.experiment.succeeded.count
(count)
Number of experiments that have successfully completed
kubeflow.katib.suggestion.created.count
(count)
Total number of suggestions made
kubeflow.katib.suggestion.duration.seconds.bucket
(count)
Duration of suggestion processes from start to completion (bucket)
kubeflow.katib.suggestion.duration.seconds.count
(count)
Duration of suggestion processes from start to completion (count)
kubeflow.katib.suggestion.duration.seconds.sum
(count)
Duration of suggestion processes from start to completion (sum)
Shown as second
kubeflow.katib.suggestion.failed.count
(count)
Number of suggestions that have failed
kubeflow.katib.suggestion.running.total
(gauge)
Number of suggestions currently being processed
kubeflow.katib.suggestion.succeeded.count
(count)
Number of suggestions that have successfully completed
kubeflow.katib.trial.created.count
(count)
Total number of trials created
kubeflow.katib.trial.duration.seconds.bucket
(count)
Duration of trials from start to completion (bucket)
kubeflow.katib.trial.duration.seconds.count
(count)
Duration of trials from start to completion (count)
kubeflow.katib.trial.duration.seconds.sum
(count)
Duration of trials from start to completion (sum)
Shown as second
kubeflow.katib.trial.failed.count
(count)
Number of trials that have failed
kubeflow.katib.trial.running.total
(gauge)
Number of trials currently running
kubeflow.katib.trial.succeeded.count
(count)
Number of trials that have successfully completed
kubeflow.kserve.inference.duration.seconds.bucket
(count)
Duration of inference requests (bucket)
kubeflow.kserve.inference.duration.seconds.count
(count)
Duration of inference requests (count)
kubeflow.kserve.inference.duration.seconds.sum
(count)
Duration of inference requests (sum)
Shown as second
kubeflow.kserve.inference.errors.count
(count)
Number of errors encountered during inference
kubeflow.kserve.inference.request.bytes.bucket
(count)
Size of inference request payloads (bucket)
kubeflow.kserve.inference.request.bytes.count
(count)
Size of inference request payloads (count)
kubeflow.kserve.inference.request.bytes.sum
(count)
Size of inference request payloads (sum)
Shown as byte
kubeflow.kserve.inference.response.bytes.bucket
(count)
Size of inference response payloads (bucket)
kubeflow.kserve.inference.response.bytes.count
(count)
Size of inference response payloads (count)
kubeflow.kserve.inference.response.bytes.sum
(count)
Size of inference response payloads (sum)
Shown as byte
kubeflow.kserve.inferences.count
(count)
Total number of inferences made
kubeflow.notebook.server.created.count
(count)
Total number of notebook servers created
kubeflow.notebook.server.failed.count
(count)
Number of notebook servers that have failed
kubeflow.notebook.server.reconcile.count
(count)
Number of reconcile loops executed by the notebook controller
kubeflow.notebook.server.reconcile.duration.seconds.bucket
(count)
Duration of reconcile loops executed by the notebook controller (bucket)
kubeflow.notebook.server.reconcile.duration.seconds.count
(count)
Duration of reconcile loops executed by the notebook controller (count)
kubeflow.notebook.server.reconcile.duration.seconds.sum
(count)
Duration of reconcile loops executed by the notebook controller (sum)
Shown as second
kubeflow.notebook.server.running.total
(gauge)
Number of notebook servers currently running
kubeflow.notebook.server.succeeded.count
(count)
Number of notebook servers that have successfully completed
kubeflow.pipeline.run.duration.seconds.bucket
(count)
Duration of pipeline runs (bucket)
kubeflow.pipeline.run.duration.seconds.count
(count)
Duration of pipeline runs (count)
kubeflow.pipeline.run.duration.seconds.sum
(count)
Duration of pipeline runs (sum)
Shown as second
kubeflow.pipeline.run.status
(gauge)
Status of pipeline runs

Events

The Kubeflow integration does not include any events.

Service Checks

kubeflow.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Kubeflow OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.
