This check monitors Kubeflow through the Datadog Agent.
Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
The Kubeflow check is included in the Datadog Agent package. No additional installation is needed on your server.
Edit the kubeflow.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your kubeflow performance data. See the sample kubeflow.d/conf.yaml for all available configuration options.
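For example, a minimal kubeflow.d/conf.yaml might look like the following sketch (the endpoint value is an assumption; point it at wherever your Kubeflow component exposes its Prometheus-formatted metrics):

init_config:

instances:
  # Assumed endpoint; adjust the host and port to match your deployment.
  - openmetrics_endpoint: http://localhost:9090/metrics

Restart the Agent after saving the file.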
Make sure that the Prometheus-formatted metrics are exposed for your kubeflow component.
For the Agent to start collecting metrics, the kubeflow pods need to be annotated.
Kubeflow has metrics endpoints that can be accessed on port 9090.
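As a quick sanity check, you can port-forward a component's metrics service and confirm the endpoint responds (a sketch; the namespace and service name here are hypothetical and depend on your deployment):

# Hypothetical namespace/service; substitute your component's metrics service.
kubectl port-forward -n kubeflow svc/<kubeflow-component-service> 9090:9090 &
# A healthy endpoint returns Prometheus-formatted metric lines.
curl -s http://localhost:9090/metrics | head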
To enable metrics exposure in Kubeflow through Prometheus, you might need to enable Prometheus service monitoring for the component in question.
You can use Kube-Prometheus-Stack or a custom Prometheus installation.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack
# The Prometheus service name follows the chart's naming convention;
# adjust it if you chose a different release name.
kubectl port-forward svc/prometheus-stack-kube-prom-prometheus 9090:9090
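With the port-forward in place, you can verify that Prometheus has registered its scrape targets through its standard HTTP API (a sketch):

# List active scrape targets; your ServiceMonitor targets should appear here.
curl -s http://localhost:9090/api/v1/targets | head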
You need to configure ServiceMonitors for Kubeflow components to expose their Prometheus metrics. If your Kubeflow component exposes Prometheus metrics by default, you only need to configure Prometheus to scrape them.
The ServiceMonitor would look like this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <kubeflow-component>-monitor
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: <kubeflow-component-name>
  endpoints:
    - port: http
      path: /metrics
Where <kubeflow-component> is to be replaced by pipelines, kserve, or katib, and <kubeflow-component-name> is to be replaced by ml-pipeline, kserve, or katib.
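For example, substituting the katib values into the template above gives a manifest you can apply directly (a sketch; verify that the label selector matches the labels actually set on your Katib service):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: katib-monitor
  labels:
    # Must match the serviceMonitorSelector of your Prometheus installation.
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      # Assumed label; check with: kubectl get svc -n kubeflow --show-labels
      app: katib
  endpoints:
    - port: http
      path: /metrics

Apply the manifest with kubectl apply -f and the file name you saved it under.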
Note: The listed metrics can only be collected if they are available (depending on the version). Some metrics are generated only when certain actions are performed.
The only parameter required for configuring the kubeflow check is openmetrics_endpoint. This parameter should be set to the location where the Prometheus-formatted metrics are exposed. The default port is 9090. In containerized environments, %%host%% should be used for host autodetection.
apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "kubeflow": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9090/metrics"
            }
          ]
        }
      }
# (...)
spec:
  containers:
    - name: 'controller'
# (...)
Run the Agent’s status subcommand and look for kubeflow under the Checks section.
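For example, on a Linux host install (a sketch; the exact invocation depends on your platform and install method):

# Filter the status output down to the kubeflow check entry.
sudo datadog-agent status | grep -A 10 kubeflow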
kubeflow.katib.controller.reconcile.count (count) | Number of reconcile loops executed by the Katib controller |
kubeflow.katib.controller.reconcile.duration.seconds.bucket (count) | Duration of reconcile loops executed by the Katib controller (bucket) |
kubeflow.katib.controller.reconcile.duration.seconds.count (count) | Duration of reconcile loops executed by the Katib controller (count) |
kubeflow.katib.controller.reconcile.duration.seconds.sum (count) | Duration of reconcile loops executed by the Katib controller (sum) Shown as second |
kubeflow.katib.experiment.created.count (count) | Total number of experiments created |
kubeflow.katib.experiment.duration.seconds.bucket (count) | Duration of experiments from start to completion (bucket) |
kubeflow.katib.experiment.duration.seconds.count (count) | Duration of experiments from start to completion (count) |
kubeflow.katib.experiment.duration.seconds.sum (count) | Duration of experiments from start to completion (sum) Shown as second |
kubeflow.katib.experiment.failed.count (count) | Number of experiments that have failed |
kubeflow.katib.experiment.running.total (gauge) | Number of experiments currently running |
kubeflow.katib.experiment.succeeded.count (count) | Number of experiments that have successfully completed |
kubeflow.katib.suggestion.created.count (count) | Total number of suggestions made |
kubeflow.katib.suggestion.duration.seconds.bucket (count) | Duration of suggestion processes from start to completion (bucket) |
kubeflow.katib.suggestion.duration.seconds.count (count) | Duration of suggestion processes from start to completion (count) |
kubeflow.katib.suggestion.duration.seconds.sum (count) | Duration of suggestion processes from start to completion (sum) Shown as second |
kubeflow.katib.suggestion.failed.count (count) | Number of suggestions that have failed |
kubeflow.katib.suggestion.running.total (gauge) | Number of suggestions currently being processed |
kubeflow.katib.suggestion.succeeded.count (count) | Number of suggestions that have successfully completed |
kubeflow.katib.trial.created.count (count) | Total number of trials created |
kubeflow.katib.trial.duration.seconds.bucket (count) | Duration of trials from start to completion (bucket) |
kubeflow.katib.trial.duration.seconds.count (count) | Duration of trials from start to completion (count) |
kubeflow.katib.trial.duration.seconds.sum (count) | Duration of trials from start to completion (sum) Shown as second |
kubeflow.katib.trial.failed.count (count) | Number of trials that have failed |
kubeflow.katib.trial.running.total (gauge) | Number of trials currently running |
kubeflow.katib.trial.succeeded.count (count) | Number of trials that have successfully completed |
kubeflow.kserve.inference.duration.seconds.bucket (count) | Duration of inference requests (bucket) |
kubeflow.kserve.inference.duration.seconds.count (count) | Duration of inference requests (count) |
kubeflow.kserve.inference.duration.seconds.sum (count) | Duration of inference requests (sum) Shown as second |
kubeflow.kserve.inference.errors.count (count) | Number of errors encountered during inference |
kubeflow.kserve.inference.request.bytes.bucket (count) | Size of inference request payloads (bucket) |
kubeflow.kserve.inference.request.bytes.count (count) | Size of inference request payloads (count) |
kubeflow.kserve.inference.request.bytes.sum (count) | Size of inference request payloads (sum) Shown as byte |
kubeflow.kserve.inference.response.bytes.bucket (count) | Size of inference response payloads (bucket) |
kubeflow.kserve.inference.response.bytes.count (count) | Size of inference response payloads (count) |
kubeflow.kserve.inference.response.bytes.sum (count) | Size of inference response payloads (sum) Shown as byte |
kubeflow.kserve.inferences.count (count) | Total number of inferences made |
kubeflow.notebook.server.created.count (count) | Total number of notebook servers created |
kubeflow.notebook.server.failed.count (count) | Number of notebook servers that have failed |
kubeflow.notebook.server.reconcile.count (count) | Number of reconcile loops executed by the notebook controller |
kubeflow.notebook.server.reconcile.duration.seconds.bucket (count) | Duration of reconcile loops executed by the notebook controller (bucket) |
kubeflow.notebook.server.reconcile.duration.seconds.count (count) | Duration of reconcile loops executed by the notebook controller (count) |
kubeflow.notebook.server.reconcile.duration.seconds.sum (count) | Duration of reconcile loops executed by the notebook controller (sum) Shown as second |
kubeflow.notebook.server.running.total (gauge) | Number of notebook servers currently running |
kubeflow.notebook.server.succeeded.count (count) | Number of notebook servers that have successfully completed |
kubeflow.pipeline.run.duration.seconds.bucket (count) | Duration of pipeline runs (bucket) |
kubeflow.pipeline.run.duration.seconds.count (count) | Duration of pipeline runs (count) |
kubeflow.pipeline.run.duration.seconds.sum (count) | Duration of pipeline runs (sum) Shown as second |
kubeflow.pipeline.run.status (gauge) | Status of pipeline runs |
The Kubeflow integration does not include any events.
kubeflow.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the Kubeflow OpenMetrics endpoint, otherwise returns OK.
Statuses: ok, critical
Need help? Contact Datadog support.