Google Kubernetes Engine, Cloud

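Overview

Google Kubernetes Engine (GKE) is a powerful cluster manager and orchestration system for running Docker containers.

Collecting metrics from Google Kubernetes Engine lets you:

  • Visualize the performance of your GKE containers and GKE control plane.
  • Correlate the performance of your GKE containers with your applications.

This integration comes with two separate preset dashboards:

  • The standard GKE dashboard presents the GKE and GKE control plane metrics collected from the Google integration.
  • The enhanced GKE dashboard presents metrics collected from Datadog's Agent-based Kubernetes integration alongside GKE control plane metrics collected from the Google integration.

The standard dashboard provides observability into GKE with only a simple configuration. The enhanced dashboard requires additional configuration steps, but provides more real-time Kubernetes metrics. It is usually the better starting point when cloning and customizing a dashboard to monitor workloads in production.

Unlike self-hosted Kubernetes clusters, the GKE control plane is managed by Google and is not accessible to the Datadog Agent running in the cluster. Observability into the GKE control plane therefore requires the Google integration, even if you primarily use the Datadog Agent to monitor your clusters.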

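Setup

Metric collection

Installation

  1. If you haven't already, set up the Google Cloud Platform integration first. No additional installation steps are needed for the standard metrics and preset dashboard.

  2. To populate the enhanced dashboard and enable APM tracing, logging, profiling, security, and other Datadog services, install the Datadog Agent on your GKE cluster.

  3. To populate the control plane metrics, you must enable GKE control plane metrics. Control plane metrics give you visibility into the operation of the Kubernetes control plane, which Google manages in GKE.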
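
As an illustration of step 3, the sketch below assembles a gcloud command line that turns on control plane metrics for an existing cluster. It is a minimal sketch under stated assumptions, not the documented procedure: the `--monitoring` and `--location` flags and the component names are taken from my understanding of the gcloud CLI and should be checked against the gcloud reference for your SDK version. The function name, cluster name, and location are hypothetical. The script only builds and prints the command; it does not run it.

```python
import shlex

# Component names assumed from the gcloud `--monitoring` flag; verify them
# against the gcloud reference for your SDK version before relying on them.
CONTROL_PLANE_COMPONENTS = ("API_SERVER", "SCHEDULER", "CONTROLLER_MANAGER")


def build_enable_control_plane_metrics_command(cluster, location):
    """Return the argv list for a gcloud call that enables control plane metrics."""
    if not cluster or not location:
        raise ValueError("cluster and location are required")
    # SYSTEM is listed first so the default system metrics stay enabled.
    components = ",".join(("SYSTEM",) + CONTROL_PLANE_COMPONENTS)
    return [
        "gcloud", "container", "clusters", "update", cluster,
        f"--location={location}",
        f"--monitoring={components}",
    ]


if __name__ == "__main__":
    # Hypothetical cluster name and location, for illustration only.
    argv = build_enable_control_plane_metrics_command("my-cluster", "us-central1")
    print(shlex.join(argv))
```

Returning an argv list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues, should you choose to execute it.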

Log collection

Google Kubernetes Engine logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Kubernetes Engine logs from Google Cloud Logging to the Pub/Sub topic:

  1. Go to the GCP Logs Explorer page and filter for Kubernetes and GKE logs.

  2. Click Create Sink and name the sink accordingly.

  3. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.

  4. Click Create and wait for the confirmation message to appear.
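
The filter from step 1 and the destination from step 3 can also be written out as strings, which helps when the sink is created from a script rather than the console. The sketch below is hedged: the list of monitored resource types is an assumption about which types carry Kubernetes and GKE logs in your project, the helper names are hypothetical, and the project and topic names are placeholders. The `pubsub.googleapis.com/projects/.../topics/...` form is the destination format Cloud Logging sinks use for Pub/Sub topics.

```python
# Monitored resource types assumed to cover Kubernetes and GKE logs;
# adjust the tuple to match what appears in your own Logs Explorer.
GKE_RESOURCE_TYPES = (
    "k8s_cluster",
    "k8s_node",
    "k8s_pod",
    "k8s_container",
    "gke_cluster",
)


def build_gke_log_filter(resource_types=GKE_RESOURCE_TYPES):
    """Build a Logging query that matches any of the given resource types."""
    if not resource_types:
        raise ValueError("at least one resource type is required")
    quoted = " OR ".join(f'"{name}"' for name in resource_types)
    return f"resource.type=({quoted})"


def build_pubsub_destination(project_id, topic):
    """Build the sink destination string for a Pub/Sub topic."""
    # The topic may live in a different project than the log source.
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic}"


if __name__ == "__main__":
    # Hypothetical project and topic names, for illustration only.
    print(build_gke_log_filter())
    print(build_pubsub_destination("my-logging-project", "export-logs-to-datadog"))
```

Paste the printed filter into the Logs Explorer query box to preview which entries the sink would export before creating it.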

Data Collected

Metrics

gcp.gke.container.accelerator.duty_cycle
(gauge)
Percent of time over the past sample period during which the accelerator was actively processing.
Shown as percent
gcp.gke.container.accelerator.memory_total
(gauge)
Total accelerator memory.
Shown as byte
gcp.gke.container.accelerator.memory_used
(gauge)
Total accelerator memory allocated.
Shown as byte
gcp.gke.container.accelerator.request
(gauge)
Number of accelerator devices requested by the container.
Shown as device
gcp.gke.container.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the container.
Shown as second
gcp.gke.container.cpu.limit_cores
(gauge)
CPU cores limit of the container.
Shown as core
gcp.gke.container.cpu.limit_utilization
(gauge)
Fraction of the CPU limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.cpu.request_cores
(gauge)
Number of CPU cores requested by the container.
Shown as core
gcp.gke.container.cpu.request_utilization
(gauge)
Fraction of the requested CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.container.ephemeral_storage.limit_bytes
(gauge)
Local ephemeral storage limit.
Shown as byte
gcp.gke.container.ephemeral_storage.request_bytes
(gauge)
Local ephemeral storage request.
Shown as byte
gcp.gke.container.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage usage.
Shown as byte
gcp.gke.container.memory.limit_bytes
(gauge)
Memory limit of the container.
Shown as byte
gcp.gke.container.memory.limit_utlization
(gauge)
Fraction of the memory limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.page_fault_count
(count)
Number of page faults, broken down by type.
Shown as fault
gcp.gke.container.memory.request_bytes
(gauge)
Memory request of the container.
Shown as byte
gcp.gke.container.memory.request_utilization
(gauge)
Fraction of the requested memory that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.used_bytes
(gauge)
Memory usage of the container.
Shown as byte
gcp.gke.container.restart_count
(count)
Number of times the container has restarted.
Shown as occurrence
gcp.gke.container.uptime
(gauge)
Time in seconds that the container has been running.
Shown as second
gcp.gke.node.cpu.allocatable_cores
(gauge)
Number of allocatable CPU cores on the node.
Shown as core
gcp.gke.node.cpu.allocatable_utilization
(gauge)
Fraction of the allocatable CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.node.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used on the node.
Shown as second
gcp.gke.node.cpu.total_cores
(gauge)
Total number of CPU cores on the node.
Shown as core
gcp.gke.node.ephemeral_storage.allocatable_bytes
(gauge)
Local ephemeral storage bytes allocatable on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.inodes_free
(gauge)
Free number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.inodes_total
(gauge)
Total number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.total_bytes
(gauge)
Total ephemeral storage bytes on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage bytes used by the node.
Shown as byte
gcp.gke.node.memory.allocatable_bytes
(gauge)
Memory bytes allocatable on the node.
Shown as byte
gcp.gke.node.memory.allocatable_utilization
(gauge)
Fraction of the allocatable memory that is currently in use on the instance.
Shown as fraction
gcp.gke.node.memory.total_bytes
(gauge)
Total number of bytes of memory on the node.
Shown as byte
gcp.gke.node.memory.used_bytes
(gauge)
Cumulative memory bytes used by the node.
Shown as byte
gcp.gke.node.network.received_bytes_count
(count)
Cumulative number of bytes received by the node over the network.
Shown as byte
gcp.gke.node.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the node over the network.
Shown as byte
gcp.gke.node.pid_limit
(gauge)
Max PID of OS on the node.
gcp.gke.node.pid_used
(gauge)
Number of running processes in the OS on the node.
gcp.gke.node_daemon.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the node level system daemon.
Shown as second
gcp.gke.node_daemon.memory.used_bytes
(gauge)
Memory usage by the system daemon.
Shown as byte
gcp.gke.pod.network.received_bytes_count
(count)
Cumulative number of bytes received by the pod over the network.
Shown as byte
gcp.gke.pod.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the pod over the network.
Shown as byte
gcp.gke.pod.volume.total_bytes
(gauge)
Total number of disk bytes available to the pod.
Shown as byte
gcp.gke.pod.volume.used_bytes
(gauge)
Number of disk bytes used by the pod.
Shown as byte
gcp.gke.pod.volume.utilization
(gauge)
Fraction of the volume that is currently being used by the instance.
Shown as fraction
gcp.gke.control_plane.apiserver.admission_controller_admission_duration_seconds
(gauge)
Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_step_admission_duration_seconds
(gauge)
Admission sub-step latency histogram in seconds, broken out for each operation and API resource and step type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_webhook_admission_duration_seconds
(gauge)
Admission webhook latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.current_inflight_requests
(gauge)
Maximal number of currently used inflight request limit of this apiserver per request kind.
Shown as request
gcp.gke.control_plane.apiserver.request_duration_seconds
(gauge)
Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
Shown as second
gcp.gke.control_plane.apiserver.request_total
(gauge)
Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
Shown as request
gcp.gke.control_plane.apiserver.response_sizes
(gauge)
Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
Shown as byte
gcp.gke.control_plane.apiserver.storage_objects
(gauge)
Number of stored objects at the time of last check split by kind.
Shown as object
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
Number of Node evictions that happened since current instance of NodeController started.
Shown as event
gcp.gke.control_plane.scheduler.pending_pods
(gauge)
Number of pending pods, by the queue type.
Shown as event
gcp.gke.control_plane.scheduler.pod_scheduling_duration_seconds
(gauge)
E2e latency for a pod being scheduled
Shown as second
gcp.gke.control_plane.scheduler.preemption_attempts_total
(count)
Total preemption attempts in the cluster till now
Shown as attempt
gcp.gke.control_plane.scheduler.preemption_victims
(gauge)
Number of selected preemption victims
Shown as event
gcp.gke.control_plane.scheduler.scheduling_attempt_duration_seconds
(gauge)
Scheduling attempt latency in seconds
Shown as second
gcp.gke.control_plane.scheduler.schedule_attempts_total
(gauge)
Number of attempts to schedule pods.
Shown as attempt
gcp.gke.control_plane.apiserver.aggregator_unavailable_apiservice
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_event_total
(gauge)
(Deprecated) Accumulated number of audit events generated and sent to the audit backend
Shown as event
gcp.gke.control_plane.apiserver.audit_level_total
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_requests_rejected_total
(gauge)
(Deprecated)
Shown as request
gcp.gke.control_plane.apiserver.client_certificate_expiration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.etcd_object_counts
(gauge)
(Deprecated) Number of stored objects split by kind.
Shown as object
gcp.gke.control_plane.apiserver.etcd_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.init_events_total
(gauge)
(Deprecated)
Shown as event
gcp.gke.control_plane.apiserver.longrunning_gauge
(gauge)
(Deprecated) Gauge of all active long-running apiserver requests.
Shown as request
gcp.gke.control_plane.apiserver.registered_watchers
(gauge)
(Deprecated) Number of currently registered watchers for a given resource.
Shown as object
gcp.gke.control_plane.apiserver.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.apiserver.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cloudprovider_gce_api_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cronjob_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by cronjob controller
gcp.gke.control_plane.controller_manager.daemon_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by daemon controller
gcp.gke.control_plane.controller_manager.deployment_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by deployment controller
gcp.gke.control_plane.controller_manager.endpoint_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by endpoint controller
gcp.gke.control_plane.controller_manager.gc_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by GC controller
gcp.gke.control_plane.controller_manager.job_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by job controller
gcp.gke.control_plane.controller_manager.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.namespace_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by namespace controller
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
(Deprecated) Count of node eviction events.
gcp.gke.control_plane.controller_manager.node_collector_unhealthy_nodes_in_zone
(gauge)
(Deprecated) Number of unhealthy nodes
gcp.gke.control_plane.controller_manager.node_collector_zone_health
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_collector_zone_size
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_ipam_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by IPAM controller
gcp.gke.control_plane.controller_manager.node_lifecycle_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by lifecycle controller
gcp.gke.control_plane.controller_manager.persistentvolume_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume protection controller
gcp.gke.control_plane.controller_manager.persistentvolumeclaim_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume claim protection controller
gcp.gke.control_plane.controller_manager.replicaset_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by ReplicaSet controller
gcp.gke.control_plane.controller_manager.replication_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by replication controller
gcp.gke.control_plane.controller_manager.route_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by route controller
gcp.gke.control_plane.controller_manager.service_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service controller
gcp.gke.control_plane.controller_manager.serviceaccount_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account controller
gcp.gke.control_plane.controller_manager.serviceaccount_tokens_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account tokens controller
gcp.gke.control_plane.controller_manager.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.binding_duration_seconds
(gauge)
(Deprecated) Binding latency in seconds.
Shown as second
gcp.gke.control_plane.scheduler.e2e_scheduling_duration_seconds
(gauge)
(Deprecated) Total e2e scheduling latency.
Shown as second
gcp.gke.control_plane.scheduler.framework_extension_point_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.scheduler.scheduling_algorithm_duration_seconds
(gauge)
(Deprecated) Total scheduling algorithm latency.
Shown as second
gcp.gke.control_plane.scheduler.scheduling_algorithm_preemption_evaluation_seconds
(gauge)
(Deprecated)
Shown as second
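
Several metrics in this list are reported as fractions (for example `gcp.gke.container.memory.limit_utilization`), next to the raw values they relate to (`gcp.gke.container.memory.used_bytes` and `gcp.gke.container.memory.limit_bytes`). The sketch below shows that relationship, assuming the fraction is simply usage divided by the limit, as the descriptions suggest. Google's exact computation may differ, and the helper names are hypothetical.

```python
def utilization_fraction(used_bytes, limit_bytes):
    """Return used/limit as a fraction, or None when no limit is set."""
    if limit_bytes <= 0:
        # A container without a limit has no meaningful limit utilization.
        return None
    return used_bytes / limit_bytes


def as_percent(fraction):
    """Convert a fraction such as 0.5 into a percentage rounded to one decimal."""
    return round(fraction * 100, 1)


if __name__ == "__main__":
    # Illustrative values only: 512 MiB used against a 1 GiB limit.
    fraction = utilization_fraction(512 * 1024**2, 1024**3)
    print(fraction, as_percent(fraction))
```

The same arithmetic applies to the CPU and volume utilization metrics, with cores or volume bytes in place of memory bytes.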

Events

The Google Kubernetes Engine integration does not include any events.

Service Checks

The Google Kubernetes Engine integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.
