Supported OS Linux Windows Mac OS

Integration version 1.1.0

Overview

This check monitors Kubeflow through the Datadog Agent.

Setup

This integration is currently released in Preview mode. Its availability is subject to change.

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery integration templates for guidance on applying these instructions.

Installation

The Kubeflow check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

  1. Edit the kubeflow.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your Kubeflow performance data. See the sample kubeflow.d/conf.yaml for all available configuration options.

  2. Restart the Agent.
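
As a rough illustration of step 1, a minimal kubeflow.d/conf.yaml for a host-based Agent might look like the sketch below. It sets only openmetrics_endpoint, the one parameter this page describes as required. The host localhost is an assumption for illustration; use the address where your Kubeflow component actually exposes its Prometheus-formatted metrics.

```yaml
# kubeflow.d/conf.yaml (sketch; "localhost" is an assumed host, not a documented default)
init_config:

instances:
    # Location where the Prometheus-formatted metrics are exposed.
    # 9090 is the default port mentioned on this page.
  - openmetrics_endpoint: http://localhost:9090/metrics
```

Any other options should be taken from the sample kubeflow.d/conf.yaml rather than from this sketch.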

Metric collection

Make sure that Prometheus-formatted metrics are exposed for your kubeflow component. For the Agent to start collecting metrics, the kubeflow pods must be annotated.

Kubeflow has metrics endpoints that can be accessed on port 9090.

To expose Kubeflow metrics through Prometheus, you might need to enable Prometheus service monitoring for the component in question.

You can use Kube-Prometheus-Stack or a custom Prometheus installation.

How to install Kube-Prometheus-Stack:
  1. Add the Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
  2. Install the chart:
helm install prometheus-stack prometheus-community/kube-prometheus-stack
  3. Expose the Prometheus service externally:
kubectl port-forward prometheus-stack 9090:9090

Configure ServiceMonitors for Kubeflow components:

You need to configure ServiceMonitors so that the Kubeflow components expose their Prometheus metrics. If your Kubeflow component exposes Prometheus metrics by default, you only need to configure Prometheus to scrape these metrics.

The ServiceMonitor would look like this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <kubeflow-component>-monitor
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: <kubeflow-component-name>
  endpoints:
  - port: http
    path: /metrics

Where <kubeflow-component> is to be replaced by pipelines, kserve, or katib, and <kubeflow-component-name> is to be replaced by ml-pipeline, kserve, or katib.
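
For instance, substituting the Katib values into the template above would give something like the following sketch. The label app: katib and the port name http are carried over from the template and are assumptions about your deployment; check them against the labels and port names on your actual Katib Service.

```yaml
# Sketch only: the generic ServiceMonitor template with the Katib values filled in.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: katib-monitor
  labels:
    release: prometheus-stack   # same name as the Helm release installed earlier
spec:
  selector:
    matchLabels:
      app: katib                # assumption: the Katib Service carries this label
  endpoints:
  - port: http                  # assumption: the Service names its metrics port "http"
    path: /metrics
```

The same pattern applies to the other components, for example pipelines-monitor with app: ml-pipeline.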

Note: The listed metrics can only be collected if they are available (depending on the version). Some metrics are generated only when certain actions are performed.

The only parameter required to configure the kubeflow check is openmetrics_endpoint. Set this parameter to the location where the Prometheus-formatted metrics are exposed. The default port is 9090. In containerized environments, use %%host%% for host autodetection.

apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "kubeflow": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9090/metrics"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'controller'
# (...)

Validation

Run the Agent's status subcommand and look for kubeflow under the Checks section.

Data Collected

Metrics

kubeflow.katib.controller.reconcile.count
(count)
Number of reconcile loops executed by the Katib controller
kubeflow.katib.controller.reconcile.duration.seconds.bucket
(count)
Duration of reconcile loops executed by the Katib controller(bucket)
kubeflow.katib.controller.reconcile.duration.seconds.count
(count)
Duration of reconcile loops executed by the Katib controller(count)
kubeflow.katib.controller.reconcile.duration.seconds.sum
(count)
Duration of reconcile loops executed by the Katib controller(sum)
Shown as second
kubeflow.katib.experiment.created.count
(count)
Total number of experiments created
kubeflow.katib.experiment.duration.seconds.bucket
(count)
Duration of experiments from start to completion(bucket)
kubeflow.katib.experiment.duration.seconds.count
(count)
Duration of experiments from start to completion(count)
kubeflow.katib.experiment.duration.seconds.sum
(count)
Duration of experiments from start to completion(sum)
Shown as second
kubeflow.katib.experiment.failed.count
(count)
Number of experiments that have failed
kubeflow.katib.experiment.running.total
(gauge)
Number of experiments currently running
kubeflow.katib.experiment.succeeded.count
(count)
Number of experiments that have successfully completed
kubeflow.katib.suggestion.created.count
(count)
Total number of suggestions made
kubeflow.katib.suggestion.duration.seconds.bucket
(count)
Duration of suggestion processes from start to completion(bucket)
kubeflow.katib.suggestion.duration.seconds.count
(count)
Duration of suggestion processes from start to completion(count)
kubeflow.katib.suggestion.duration.seconds.sum
(count)
Duration of suggestion processes from start to completion(sum)
Shown as second
kubeflow.katib.suggestion.failed.count
(count)
Number of suggestions that have failed
kubeflow.katib.suggestion.running.total
(gauge)
Number of suggestions currently being processed
kubeflow.katib.suggestion.succeeded.count
(count)
Number of suggestions that have successfully completed
kubeflow.katib.trial.created.count
(count)
Total number of trials created
kubeflow.katib.trial.duration.seconds.bucket
(count)
Duration of trials from start to completion(bucket)
kubeflow.katib.trial.duration.seconds.count
(count)
Duration of trials from start to completion(count)
kubeflow.katib.trial.duration.seconds.sum
(count)
Duration of trials from start to completion(sum)
Shown as second
kubeflow.katib.trial.failed.count
(count)
Number of trials that have failed
kubeflow.katib.trial.running.total
(gauge)
Number of trials currently running
kubeflow.katib.trial.succeeded.count
(count)
Number of trials that have successfully completed
kubeflow.kserve.inference.duration.seconds.bucket
(count)
Duration of inference requests(bucket)
kubeflow.kserve.inference.duration.seconds.count
(count)
Duration of inference requests(count)
kubeflow.kserve.inference.duration.seconds.sum
(count)
Duration of inference requests(sum)
Shown as second
kubeflow.kserve.inference.errors.count
(count)
Number of errors encountered during inference
kubeflow.kserve.inference.request.bytes.bucket
(count)
Size of inference request payloads(bucket)
kubeflow.kserve.inference.request.bytes.count
(count)
Size of inference request payloads(count)
kubeflow.kserve.inference.request.bytes.sum
(count)
Size of inference request payloads(sum)
Shown as byte
kubeflow.kserve.inference.response.bytes.bucket
(count)
Size of inference response payloads(bucket)
kubeflow.kserve.inference.response.bytes.count
(count)
Size of inference response payloads(count)
kubeflow.kserve.inference.response.bytes.sum
(count)
Size of inference response payloads(sum)
Shown as byte
kubeflow.kserve.inferences.count
(count)
Total number of inferences made
kubeflow.notebook.server.created.count
(count)
Total number of notebook servers created
kubeflow.notebook.server.failed.count
(count)
Number of notebook servers that have failed
kubeflow.notebook.server.reconcile.count
(count)
Number of reconcile loops executed by the notebook controller
kubeflow.notebook.server.reconcile.duration.seconds.bucket
(count)
Duration of reconcile loops executed by the notebook controller(bucket)
kubeflow.notebook.server.reconcile.duration.seconds.count
(count)
Duration of reconcile loops executed by the notebook controller(count)
kubeflow.notebook.server.reconcile.duration.seconds.sum
(count)
Duration of reconcile loops executed by the notebook controller(sum)
Shown as second
kubeflow.notebook.server.running.total
(gauge)
Number of notebook servers currently running
kubeflow.notebook.server.succeeded.count
(count)
Number of notebook servers that have successfully completed
kubeflow.pipeline.run.duration.seconds.bucket
(count)
Duration of pipeline runs(bucket)
kubeflow.pipeline.run.duration.seconds.count
(count)
Duration of pipeline runs(count)
kubeflow.pipeline.run.duration.seconds.sum
(count)
Duration of pipeline runs(sum)
Shown as second
kubeflow.pipeline.run.status
(gauge)
Status of pipeline runs

Events

The Kubeflow integration does not include any events.

Service Checks

kubeflow.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Kubeflow OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.
