Federator.ai

Supported OS Linux

Overview

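ProphetStor Federator.ai is an AI-based solution designed to enhance compute resource management for Kubernetes and virtual machine (VM) clusters. With holistic observability of IT operations, including multi-tenant large language model (LLM) training, it allocates resources efficiently for mission-critical applications, namespaces, nodes, and clusters, so that KPIs are met with minimal resource consumption.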

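Federator.ai uses advanced machine learning algorithms to predict application workloads. Its main features are:

- AI-based workload prediction for containerized applications in Kubernetes clusters, as well as for VMs in VMware clusters, Amazon Web Services (AWS) Elastic Compute Cloud (EC2), Azure Virtual Machine, and Google Compute Engine
- Resource recommendations based on workload prediction, application, Kubernetes, and other related metrics
- Automatic provisioning of CPU and memory for generic Kubernetes application controllers and namespaces
- Autoscaling of Kubernetes application containers, Kafka consumer groups, and NGINX Ingress upstream services
- Multi-cloud cost analysis and recommendations based on workload predictions for Kubernetes clusters and VM clusters
- Actual cost and potential savings based on recommendations for clusters, Kubernetes applications, VMs, and Kubernetes namespaces
- Observability of multi-tenant LLM training and actionable resource optimization without compromising performance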

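Through APIs integrated with the Datadog Agent, ProphetStor Federator.ai provides full-stack observability, from application-level workloads, including LLM training, to cluster-level resource consumption. The integration creates a dynamic loop between real-time monitoring and predictive analytics that continuously improves resource management, optimizes cost, and keeps applications running efficiently. You can track and predict the resource usage of Kubernetes containers, namespaces, and cluster nodes, and get recommendations that prevent both costly over-provisioning and performance-impacting under-provisioning. Because it integrates easily into a CI/CD pipeline, Federator.ai can continuously optimize containers whenever they are deployed in a Kubernetes cluster. Using application workload predictions, Federator.ai auto-scales application containers at the right time and optimizes performance with the right number of container replicas through Kubernetes HPA or Datadog Watermark Pod Autoscaling (WPA).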
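
To make the scaling step concrete, the following is a minimal sketch of how a predicted workload could be turned into a replica count. It applies the ratio rule documented for the Kubernetes HPA (desired = ceil(current replicas x current metric / target metric)) to a predicted value instead of a measured one. That substitution, the function name, the replica bounds, and the sample numbers are assumptions for illustration; this page does not describe the algorithm Federator.ai actually uses.

```python
import math


def desired_replicas(current_replicas, predicted_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Return a replica count from a predicted per-replica metric value.

    Hypothetical helper: uses the HPA ratio rule with a predicted metric,
    then clamps the result to [min_replicas, max_replicas].
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * predicted_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas, predicted 90% average CPU utilization, 60% target
print(desired_replicas(4, 90.0, 60.0))
```

Clamping keeps a single outlier prediction from scaling the workload beyond a bound the operator has chosen.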

For more information about Federator.ai, see the ProphetStor Federator.ai Feature Demo and ProphetStor Federator.ai for Datadog videos.

ProphetStor Federator.ai Cluster Overview

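Cluster Resource Usage Predictions and Recommendations
- This table shows the maximum, minimum, and average values of the predicted CPU and memory workload, together with the CPU and memory resource usage recommended by Federator.ai, for cluster resource planning.
Cluster Node Resource Usage Predictions and Recommendations
- This table shows the maximum, minimum, and average values of the predicted CPU and memory workload, together with the CPU and memory resource usage recommended by Federator.ai, for node resource planning.
Node Current/Predicted Memory Usage (Daily)
- This graph shows the daily memory usage predicted by Federator.ai alongside the actual memory usage of the nodes.
Node Current/Predicted Memory Usage (Weekly)
- This graph shows the weekly memory usage predicted by Federator.ai alongside the actual memory usage of the nodes.
Node Current/Predicted Memory Usage (Monthly)
- This graph shows the monthly memory usage predicted by Federator.ai alongside the actual memory usage of the nodes.
Node Current/Predicted CPU Usage (Daily)
- This graph shows the daily CPU usage predicted by Federator.ai alongside the actual CPU usage of the nodes.
Node Current/Predicted CPU Usage (Weekly)
- This graph shows the weekly CPU usage predicted by Federator.ai alongside the actual CPU usage of the nodes.
Node Current/Predicted CPU Usage (Monthly)
- This graph shows the monthly CPU usage predicted by Federator.ai alongside the actual CPU usage of the nodes.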
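
The maximum, minimum, and average values shown in the tables above are simple summaries over a prediction window. As a minimal sketch, assuming the predictions arrive as a plain list of numbers (the function name and the sample values are hypothetical, and how Federator.ai computes its own summaries is not stated here):

```python
from statistics import mean


def summarize_forecast(samples):
    """Summarize a window of predicted values as max, min, and avg."""
    if not samples:
        raise ValueError("samples must not be empty")
    return {"max": max(samples), "min": min(samples), "avg": mean(samples)}


# Hypothetical predicted CPU usage (millicores) for one node
print(summarize_forecast([200, 400, 600]))
```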

ProphetStor Federator.ai Application Overview

Application Overview dashboard

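Workload Prediction for Next 24 Hours
- This table shows the maximum, minimum, and average values of the predicted CPU and memory workload for the next 24 hours, together with the CPU and memory resource usage recommended by Federator.ai, for controller resource planning.
Workload Prediction for Next 7 Days
- This table shows the maximum, minimum, and average values of the predicted CPU and memory workload for the next 7 days, together with the CPU and memory resource usage recommended by Federator.ai, for controller resource planning.
Workload Prediction for Next 30 Days
- This table shows the maximum, minimum, and average values of the predicted CPU and memory workload for the next 30 days, together with the CPU and memory resource usage recommended by Federator.ai, for controller resource planning.
Current/Predicted CPU Usage (Daily)
- This graph shows the daily CPU usage predicted by Federator.ai alongside the actual CPU usage of the controllers.
Current/Predicted CPU Usage (Weekly)
- This graph shows the weekly CPU usage predicted by Federator.ai alongside the actual CPU usage of the controllers.
Current/Predicted CPU Usage (Monthly)
- This graph shows the monthly CPU usage predicted by Federator.ai alongside the actual CPU usage of the controllers.
Current/Predicted Memory Usage (Daily)
- This graph shows the daily memory usage predicted by Federator.ai alongside the actual memory usage of the controllers.
Current/Predicted Memory Usage (Weekly)
- This graph shows the weekly memory usage predicted by Federator.ai alongside the actual memory usage of the controllers.
Current/Predicted Memory Usage (Monthly)
- This graph shows the monthly memory usage predicted by Federator.ai alongside the actual memory usage of the controllers.
Current/Desired/Recommended Replicas
- This graph shows the replicas recommended by Federator.ai alongside the desired and current replicas of the controllers.
Memory Usage/Request/Limit vs Recommended Memory Limit
- This graph shows the memory limit recommended by Federator.ai alongside the current memory usage, request, and limit of the controllers.
CPU Usage/Request/Limit vs Recommended CPU Limit
- This graph shows the CPU limit recommended by Federator.ai alongside the current CPU usage, request, and limit of the controllers.
CPU Usage/Utilization Limit
- This graph shows the CPU utilization of the controller and visualizes whether the utilization is above or below the limit.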
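
The over/under view in the last graph can be reproduced from raw usage samples. The sketch below is one way to do it, assuming utilization is usage divided by the CPU limit and that a sample counts as "over" when it exceeds a chosen threshold. The function names, the 80% threshold, and the sample values are hypothetical and not taken from Federator.ai.

```python
def utilization_percent(usage_millicores, limit_millicores):
    """CPU utilization as a percentage of the configured limit."""
    if limit_millicores <= 0:
        raise ValueError("limit_millicores must be positive")
    return 100.0 * usage_millicores / limit_millicores


def classify_against_limit(usage_series, limit_millicores, threshold_percent=80.0):
    """Label each usage sample 'over' or 'under' the utilization threshold."""
    return [
        "over" if utilization_percent(u, limit_millicores) > threshold_percent else "under"
        for u in usage_series
    ]


# Hypothetical usage samples (millicores) against a 1000m CPU limit
print(classify_against_limit([200, 450, 900], 1000))
```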

ProphetStor Federator.ai Kafka Overview

Dashboard overview

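Recommended Replicas vs Current/Desired Replicas
- This timeseries graph shows the replicas recommended by Federator.ai alongside the desired and current replicas in the system.
Production vs Consumption vs Production Prediction
- This timeseries graph shows the Kafka message production rate and consumption rate, and the production rate predicted by Federator.ai.
Kafka Consumer Lag
- This timeseries graph shows the sum of consumer lags across all partitions.
Consumer Queue Latency (msec)
- This timeseries graph shows the average latency of a message in the message queue before it is received by a consumer.
Deployment Memory Usage
- This timeseries graph shows the memory usage of consumers.
Deployment CPU Usage
- This timeseries graph shows the CPU usage of consumers.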
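
The consumer lag graph sums the lag over all partitions. As a minimal sketch, assuming per-partition lag is the broker's latest offset minus the consumer group's committed offset (the usual Kafka definition), the total can be computed as follows. The function name, the dictionary layout, and the offsets are hypothetical.

```python
def total_consumer_lag(broker_offsets, consumer_offsets):
    """Sum per-partition lag; both arguments map partition id -> offset.

    A partition with no committed consumer offset is treated as offset 0,
    and negative differences are clamped to 0.
    """
    return sum(
        max(broker_offsets[p] - consumer_offsets.get(p, 0), 0)
        for p in broker_offsets
    )


# Hypothetical offsets for a three-partition topic
broker = {0: 100, 1: 250, 2: 80}
consumer = {0: 90, 1: 200, 2: 80}
print(total_consumer_lag(broker, consumer))
```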

ProphetStor Federator.ai Multi-Cloud Cost Analysis Overview

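Current Cluster Cost and Current Cluster Configuration
- This table shows the current cost and the environment configuration of the cluster.
Recommended Cluster - AWS and Recommended Cluster Configuration - AWS
- This table shows the AWS instance configuration recommended by Federator.ai and the cost of the recommended AWS instances.
Recommended Cluster - Azure and Recommended Cluster Configuration - Azure
- This table shows the Azure instance configuration recommended by Federator.ai and the cost of the recommended Azure instances.
Recommended Cluster - GCP and Recommended Cluster Configuration - GCP
- This table shows the GCP instance configuration recommended by Federator.ai and the cost of the recommended GCP instances.
Namespace with Highest Cost ($/day)
- This graph shows the highest daily cost among the namespaces in the current cluster.
Namespace with Highest Predicted Cost ($/month)
- This graph shows the highest predicted monthly cost among the namespaces in the current cluster.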
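
The two namespace graphs are rankings by cost. A minimal sketch of that ranking, assuming the per-namespace costs are already available as a mapping (the function name, the namespace names, and the dollar amounts are hypothetical):

```python
def top_cost_namespaces(costs, n=3):
    """Return the n most expensive namespaces as (name, cost) pairs."""
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical daily cost in dollars per namespace
daily_cost = {"default": 1.2, "kube-system": 0.8, "payments": 4.5, "ml": 3.1}
print(top_cost_namespaces(daily_cost, n=2))
```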

Setup

Follow the instructions below to download and set up Federator.ai.

Installation

Log in to your OpenShift/Kubernetes cluster.

Install Federator.ai for OpenShift/Kubernetes with the following command:

```
$ curl https://raw.githubusercontent.com/containers-ai/prophetstor/master/deploy/federatorai-launcher.sh | bash
```

```
$ curl https://raw.githubusercontent.com/containers-ai/prophetstor/master/deploy/federatorai-launcher.sh | bash
...
Please enter Federator.ai version tag [default: latest]:latest
Please enter the path of Federator.ai directory [default: /opt]:

Downloading v4.5.1-b1562 tgz file ...
Done
Do you want to use a private repository URL? [default: n]:
Do you want to launch Federator.ai installation script? [default: y]:

Executing install.sh ...
Checking environment version...
...Passed
Enter the namespace you want to install Federator.ai [default: federatorai]:
.........
Downloading Federator.ai alamedascaler sample files ...
Done
========================================
Which storage type you would like to use? ephemeral or persistent?
[default: persistent]:
Specify log storage size [e.g., 2 for 2GB, default: 2]:
Specify AI engine storage size [e.g., 10 for 10GB, default: 10]:
Specify InfluxDB storage size [e.g., 100 for 100GB, default: 100]:
Specify storage class name: managed-nfs-storage
Do you want to expose dashboard and REST API services for external access? [default: y]:

----------------------------------------
install_namespace = federatorai
storage_type = persistent
log storage size = 2 GB
AI engine storage size = 10 GB
InfluxDB storage size = 100 GB
storage class name = managed-nfs-storage
expose service = y
----------------------------------------
Is the above information correct [default: y]:
Processing...

(snipped)
.........
All federatorai pods are ready.

========================================
You can now access GUI through https://<YOUR IP>:31012
Default login credential is admin/admin

Also, you can start to apply alamedascaler CR for the target you would like to monitor.
Review administration guide for further details.
========================================
========================================
You can now access Federatorai REST API through https://<YOUR IP>:31011
The default login credential is admin/admin
The REST API online document can be found in https://<YOUR IP>:31011/apis/v1/swagger/index.html
========================================

Install Federator.ai v4.5.1-b1562 successfully

Downloaded YAML files are located under /opt/federatorai/installation

Downloaded files are located under /opt/federatorai/repo/v4.5.1-b1562
```

Verify that the Federator.ai pods are running properly.
```
$ kubectl get pod -n federatorai
```
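
If you want to script this check, the sketch below parses the text printed by the command above and lists pods that are not fully ready. It assumes the default `kubectl get pod` column layout (NAME, READY, STATUS, ...); the function name and the pod names in the sample are hypothetical and are not real Federator.ai pod names.

```python
def pods_not_ready(kubectl_output):
    """Return names of pods that are not Running with all containers ready.

    Pods in the Completed state (finished jobs) are ignored.
    """
    not_ready = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        if status == "Completed":
            continue
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            not_ready.append(name)
    return not_ready


# Hypothetical output for illustration only
SAMPLE_OUTPUT = """\
NAME              READY   STATUS             RESTARTS   AGE
example-pod-a     1/1     Running            0          5m
example-pod-b     0/1     CrashLoopBackOff   3          5m
example-pod-c     1/2     Running            0          5m
example-job-d     0/1     Completed          0          5m
"""
print(pods_not_ready(SAMPLE_OUTPUT))
```

In practice the input could come from `subprocess.run(["kubectl", "get", "pod", "-n", "federatorai"], capture_output=True, text=True).stdout`.
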
Log in to the Federator.ai GUI. The URL and login credentials can be found in the output of the installation command above.

Configuration

Log in to Datadog with your account and get an API key and an application key for using the Datadog API.
Configure Federator.ai for the metrics data source per cluster.
- Launch the Federator.ai GUI and go to Configuration -> Clusters -> Click “Add Cluster”.
- Enter the API key and the application key.
See the Federator.ai - Installation and Configuration Guide and the User Guide for more details.
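
Before entering the keys in the GUI, it can help to confirm that the API key is accepted by Datadog. The sketch below builds a request for the Datadog API key validation endpoint (`GET /api/v1/validate` with the `DD-API-KEY` header). The helper names and the default site `datadoghq.com` are assumptions; use the site that matches your Datadog account.

```python
import json
import urllib.request


def build_validate_request(api_key, site="datadoghq.com"):
    """Build (but do not send) a request to the API key validation endpoint."""
    url = f"https://api.{site}/api/v1/validate"
    return urllib.request.Request(url, headers={"DD-API-KEY": api_key}, method="GET")


def api_key_is_valid(api_key, site="datadoghq.com"):
    """Send the request; a rejected key raises urllib.error.HTTPError."""
    request = build_validate_request(api_key, site)
    with urllib.request.urlopen(request, timeout=10) as response:
        return bool(json.load(response).get("valid"))
```

Keeping the request builder separate from the network call makes it possible to inspect the URL and headers without contacting Datadog.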

Data Collected

Metrics

federatorai.integration.status (gauge)	Integration status showing the health of Federator.ai.
federatorai.recommendation (gauge)	Recommended deployment/statefulset replicas.
federatorai.prediction.kafka (gauge)	Workload prediction for Kafka metrics.
federatorai.kafka.broker_offset_rate (gauge)	The delta of kafka.broker_offset timeseries in one minute.
federatorai.kafka.consumer_offset_rate (gauge)	The delta of kafka.consumer_offset timeseries in one minute.
federatorai.prediction.node (gauge)	Workload prediction for a Kubernetes node.
federatorai.prediction.node.avg (gauge)	The average value of workload predictions for a Kubernetes node over a prediction window.
federatorai.prediction.node.min (gauge)	The minimum value of workload predictions for a Kubernetes node over a prediction window.
federatorai.prediction.node.max (gauge)	The maximum value of workload predictions for a Kubernetes node over a prediction window.
federatorai.prediction.controller (gauge)	Workload prediction for a specific controller
federatorai.prediction.controller.avg (gauge)	The average value of workload predictions for a specific controller over a prediction window.
federatorai.prediction.controller.min (gauge)	The minimum value of workload predictions for a specific controller over a prediction window.
federatorai.prediction.controller.max (gauge)	The maximum value of workload predictions for a specific controller over a prediction window.
federatorai.prediction.nginx_ingress_controller_request_rate (gauge)	Workload prediction of request rate for the upstream service of Nginx ingress
federatorai.resource_planning.node (gauge)	Workload predictions for resource planning of a Kubernetes node.
federatorai.resource_planning.controller (gauge)	Workload predictions for resource planning of a Kubernetes controller.
federatorai.recommendation.instance (gauge)	Cost of a recommended cloud instance.
federatorai.cost_analysis.instance.cost (gauge)	Cost analysis for a cloud instance.
federatorai.cost_analysis.namespace.cost (gauge)	Cost analysis for a namespace in a Kubernetes cluster
federatorai.prediction.namespace.cost (gauge)	Cost prediction for a namespace in a Kubernetes cluster
federatorai.kubernetes.cpu.usage.total.controller (gauge)	The number of cores (in millicore) used by the Kubernetes controller.
federatorai.kubernetes.memory.usage.controller (gauge)	The memory usage (in bytes) of the Kubernetes controller.
federatorai.kubernetes.cpu.usage.total.node (gauge)	The number of cores (in millicore) used by the Kubernetes node.
federatorai.kubernetes.memory.usage.node (gauge)	The memory usage (in bytes) of the Kubernetes node.
federatorai.cost_analysis.resource_alloc_cost.cluster (gauge)	The cost per hour/per 6 hours/per day based on resource allocation of a Kubernetes cluster for daily/weekly/monthly cost analysis
federatorai.cost_analysis.resource_alloc_cost.node (gauge)	The cost per hour/per 6 hours/per day based on resource allocation of a Kubernetes node for daily/weekly/monthly cost analysis
federatorai.cost_analysis.resource_alloc_cost.namespace (gauge)	The cost per hour/per 6 hours/per day based on resource allocation of a Kubernetes namespace for daily/weekly/monthly cost analysis
federatorai.cost_analysis.resource_usage_cost.cluster (gauge)	The cost per hour/per 6 hours/per day based on resource usage of a Kubernetes cluster for daily/weekly/monthly cost analysis
federatorai.cost_analysis.resource_usage_cost.node (gauge)	The cost per hour/per 6 hours/per day based on resource usage of a Kubernetes node for daily/weekly/monthly cost analysis
federatorai.cost_analysis.resource_usage_cost.namespace (gauge)	The cost per hour/per 6 hours/per day based on resource usage of a Kubernetes namespace for daily/weekly/monthly cost analysis
federatorai.cost_analysis.cost_per_day.cluster (gauge)	The cost of the entire 24 hours based on resource allocation of a Kubernetes cluster
federatorai.cost_analysis.cost_per_day.node (gauge)	The cost of the entire 24 hours based on resource allocation of a Kubernetes node
federatorai.cost_analysis.cost_per_day.namespace (gauge)	The cost of the entire 24 hours based on resource allocation of a Kubernetes namespace
federatorai.cost_analysis.cost_per_week.cluster (gauge)	The cost of the entire 7 days based on resource allocation of a Kubernetes cluster
federatorai.cost_analysis.cost_per_week.node (gauge)	The cost of the entire 7 days based on resource allocation of a Kubernetes node
federatorai.cost_analysis.cost_per_week.namespace (gauge)	The cost of the entire 7 days based on resource allocation of a Kubernetes namespace
federatorai.cost_analysis.cost_per_month.cluster (gauge)	The cost of the entire 30 days based on resource allocation of a Kubernetes cluster
federatorai.cost_analysis.cost_per_month.node (gauge)	The cost of the entire 30 days based on resource allocation of a Kubernetes node
federatorai.cost_analysis.cost_per_month.namespace (gauge)	The cost of the entire 30 days based on resource allocation of a Kubernetes namespace
federatorai.cost_analysis.cost_efficiency_per_day.cluster (gauge)	The cost efficiency for the entire 24 hours based on resource allocation of a Kubernetes cluster
federatorai.cost_analysis.cost_efficiency_per_day.node (gauge)	The cost efficiency for the entire 24 hours based on resource allocation of a Kubernetes node
federatorai.cost_analysis.cost_efficiency_per_day.namespace (gauge)	The cost efficiency for the entire 24 hours based on resource allocation of a Kubernetes namespace
federatorai.cost_analysis.cost_efficiency_per_week.cluster (gauge)	The cost efficiency for the entire 7 days based on resource allocation of a Kubernetes cluster
federatorai.cost_analysis.cost_efficiency_per_week.node (gauge)	The cost efficiency for the entire 7 days based on resource allocation of a Kubernetes node
federatorai.cost_analysis.cost_efficiency_per_week.namespace (gauge)	The cost efficiency for the entire 7 days based on resource allocation of a Kubernetes namespace
federatorai.cost_analysis.cost_efficiency_per_month.cluster (gauge)	The cost efficiency for the entire 30 days based on resource allocation of a Kubernetes cluster
federatorai.cost_analysis.cost_efficiency_per_month.node (gauge)	The cost efficiency for the entire 30 days based on resource allocation of a Kubernetes node
federatorai.cost_analysis.cost_efficiency_per_month.namespace (gauge)	The cost efficiency for the entire 30 days based on resource allocation of a Kubernetes namespace
federatorai.recommendation.cost_analysis.cost_per_day.cluster (gauge)	The estimated cost of the entire 24 hours based on Federator.ai recommendation for a Kubernetes cluster
federatorai.recommendation.cost_analysis.cost_per_day.node (gauge)	The estimated cost of the entire 24 hours based on Federator.ai recommendation for a Kubernetes node
federatorai.recommendation.cost_analysis.cost_per_day.namespace (gauge)	The estimated cost of the entire 24 hours based on Federator.ai recommendation for a Kubernetes namespace
federatorai.recommendation.cost_analysis.cost_per_week.cluster (gauge)	The estimated cost of the entire 7 days based on Federator.ai recommendation for a Kubernetes cluster
federatorai.recommendation.cost_analysis.cost_per_week.node (gauge)	The estimated cost of the entire 7 days based on Federator.ai recommendation for a Kubernetes node
federatorai.recommendation.cost_analysis.cost_per_week.namespace (gauge)	The estimated cost of the entire 7 days based on Federator.ai recommendation for a Kubernetes namespace
federatorai.recommendation.cost_analysis.cost_per_month.cluster (gauge)	The estimated cost of the entire 30 days based on Federator.ai recommendation for a Kubernetes cluster
federatorai.recommendation.cost_analysis.cost_per_month.node (gauge)	The estimated cost of the entire 30 days based on Federator.ai recommendation for a Kubernetes node
federatorai.recommendation.cost_analysis.cost_per_month.namespace (gauge)	The estimated cost of the entire 30 days based on Federator.ai recommendation for a Kubernetes namespace
federatorai.recommendation.cost_analysis.cost_efficiency_per_day.cluster (gauge)	The cost efficiency for the entire 24 hours based on Federator.ai recommendation for a Kubernetes cluster
federatorai.recommendation.cost_analysis.cost_efficiency_per_day.namespace (gauge)	The cost efficiency for the entire 24 hours based on Federator.ai recommendation for a Kubernetes namespace
federatorai.recommendation.cost_analysis.cost_efficiency_per_week.cluster (gauge)	The cost efficiency for the entire 7 days based on Federator.ai recommendation for a Kubernetes cluster
federatorai.recommendation.cost_analysis.cost_efficiency_per_week.namespace (gauge)	The cost efficiency for the entire 7 days based on Federator.ai recommendation for a Kubernetes namespace
federatorai.recommendation.cost_analysis.cost_efficiency_per_month.cluster (gauge)	The cost efficiency for the entire 30 days based on Federator.ai recommendation for a Kubernetes cluster
federatorai.recommendation.cost_analysis.cost_efficiency_per_month.namespace (gauge)	The cost efficiency for the entire 30 days based on Federator.ai recommendation for a Kubernetes namespace
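
Any of the metrics above can be read back through the Datadog timeseries query endpoint (`GET /api/v1/query`, authenticated with the `DD-API-KEY` and `DD-APPLICATION-KEY` headers). The sketch below only builds the query URL. The helper name, the `avg` aggregator, the `*` scope, and the default site are assumptions, since the tags attached to these metrics are not listed on this page.

```python
import urllib.parse


def build_metric_query_url(metric, start, end, aggregator="avg", scope="*",
                           site="datadoghq.com"):
    """Build a /api/v1/query URL for one metric over [start, end] (Unix seconds)."""
    query = f"{aggregator}:{metric}{{{scope}}}"
    params = urllib.parse.urlencode(
        {"from": int(start), "to": int(end), "query": query}
    )
    return f"https://api.{site}/api/v1/query?{params}"


# One hour of node workload predictions
print(build_metric_query_url("federatorai.prediction.node", 1700000000, 1700003600))
```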

Service Checks

Federator.ai does not include any service checks.

Events

Federator.ai does not include any events.

Troubleshooting

Need help? Read the Federator.ai - Installation and Configuration Guide or contact Datadog support.