Google Kubernetes Engine, Cloud

Overview

Google Kubernetes Engine (GKE) is a powerful cluster manager and orchestration system for running your Docker containers.

Get metrics from Google Kubernetes Engine to:

  • Visualize the performance of your GKE containers and the GKE control plane.
  • Correlate the performance of your GKE containers with your applications.

This integration comes with two separate preconfigured dashboards:

  • The standard GKE dashboard presents GKE and GKE control plane metrics collected from the Google integration.
  • The enhanced GKE dashboard presents metrics from the Datadog Agent-based Kubernetes integration alongside the GKE control plane metrics collected from the Google integration.

The standard dashboard provides observability into GKE with minimal configuration. The enhanced dashboard requires additional configuration steps, but it provides more real-time Kubernetes metrics and is often a better starting point when cloning and customizing a dashboard for monitoring production workloads.

Unlike self-hosted Kubernetes clusters, the GKE control plane is managed by Google and is not accessible to a Datadog Agent running in the cluster. Observability into the GKE control plane therefore requires the Google integration, even if you primarily use the Datadog Agent to monitor your clusters.

Setup

Metric collection

Installation

  1. If you haven't already, set up the Google Cloud Platform integration. No other installation steps are required for the standard metrics and the preconfigured dashboard.

  2. To populate the enhanced dashboard and enable APM tracing, logging, profiling, security, and other Datadog services, install the Datadog Agent on your GKE cluster.

  3. To populate the control plane metrics, you must enable GKE control plane metrics. Control plane metrics give you visibility into the operation of the Kubernetes control plane, which Google manages in GKE.
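
As a rough illustration of step 3, the sketch below assembles a `gcloud` command that turns on control plane metrics for an existing cluster. The `--monitoring` component names and the `--location` flag reflect my understanding of the gcloud CLI rather than anything stated on this page, so check them against Google's current documentation before running the result. The function name, cluster name, and location are hypothetical placeholders.

```python
import shlex

# Monitoring components to request. SYSTEM is included on the assumption that
# system metrics should stay enabled alongside the control plane components.
CONTROL_PLANE_COMPONENTS = ["SYSTEM", "API_SERVER", "SCHEDULER", "CONTROLLER_MANAGER"]


def control_plane_metrics_command(cluster: str, location: str) -> list[str]:
    """Build the gcloud argument list that enables control plane metrics."""
    if not cluster or not location:
        raise ValueError("cluster and location are required")
    return [
        "gcloud", "container", "clusters", "update", cluster,
        f"--location={location}",
        "--monitoring=" + ",".join(CONTROL_PLANE_COMPONENTS),
    ]


if __name__ == "__main__":
    # "my-cluster" and "us-central1" are placeholders, not values from this page.
    print(shlex.join(control_plane_metrics_command("my-cluster", "us-central1")))
```

The function only builds the command; it does not run it, which lets you review the command before passing it to a shell or to `subprocess.run`.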

Log collection

Google Kubernetes Engine logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Kubernetes Engine logs from Google Cloud Logging to the Pub/Sub topic:

  1. Go to the GCP Logs Explorer page and filter for Kubernetes and GKE logs.

  2. Click Create sink and name the sink accordingly.

  3. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub topic created for that purpose. Note: The Pub/Sub topic can be located in a different project.

    [Image: Export Google Cloud Pub/Sub logs to Pub/Sub]

  4. Click Create and wait for the confirmation message to appear.
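
The two values these steps ask for, a log filter and a Pub/Sub destination, can be sketched as plain strings. The snippet below is a minimal illustration, assuming the filter should match Cloud Logging's Kubernetes and GKE monitored resource types; the exact list of types is my assumption and may need trimming for your clusters. The helper names, project ID, and topic name are hypothetical.

```python
# Monitored resource types that Cloud Logging uses for Kubernetes and GKE logs.
# This selection is an assumption; trim or extend it for your own clusters.
GKE_RESOURCE_TYPES = ("k8s_container", "k8s_pod", "k8s_node", "k8s_cluster", "gke_cluster")


def gke_log_filter(resource_types=GKE_RESOURCE_TYPES) -> str:
    """Return a Logging query that matches any of the given resource types."""
    if not resource_types:
        raise ValueError("at least one resource type is required")
    quoted = " OR ".join(f'"{t}"' for t in resource_types)
    return f"resource.type=({quoted})"


def pubsub_sink_destination(project_id: str, topic_id: str) -> str:
    """Return the sink destination string for a Pub/Sub topic."""
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


if __name__ == "__main__":
    # "my-project" and "export-logs-to-datadog" are placeholder names.
    print(gke_log_filter())
    print(pubsub_sink_destination("my-project", "export-logs-to-datadog"))
```

Because the destination string carries its own project ID, it can point at a topic in a different project from the one that owns the sink, which matches the note in step 3.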

Data collected

Metrics

gcp.gke.container.accelerator.duty_cycle
(gauge)
Percent of time over the past sample period during which the accelerator was actively processing.
Shown as percent
gcp.gke.container.accelerator.memory_total
(gauge)
Total accelerator memory.
Shown as byte
gcp.gke.container.accelerator.memory_used
(gauge)
Total accelerator memory allocated.
Shown as byte
gcp.gke.container.accelerator.request
(gauge)
Number of accelerator devices requested by the container.
Shown as device
gcp.gke.container.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the container.
Shown as second
gcp.gke.container.cpu.limit_cores
(gauge)
CPU cores limit of the container.
Shown as core
gcp.gke.container.cpu.limit_utilization
(gauge)
Fraction of the CPU limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.cpu.request_cores
(gauge)
Number of CPU cores requested by the container.
Shown as core
gcp.gke.container.cpu.request_utilization
(gauge)
Fraction of the requested CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.container.ephemeral_storage.limit_bytes
(gauge)
Local ephemeral storage limit.
Shown as byte
gcp.gke.container.ephemeral_storage.request_bytes
(gauge)
Local ephemeral storage request.
Shown as byte
gcp.gke.container.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage usage.
Shown as byte
gcp.gke.container.memory.limit_bytes
(gauge)
Memory limit of the container.
Shown as byte
gcp.gke.container.memory.limit_utilization
(gauge)
Fraction of the memory limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.page_fault_count
(count)
Number of page faults, broken down by type.
Shown as fault
gcp.gke.container.memory.request_bytes
(gauge)
Memory request of the container.
Shown as byte
gcp.gke.container.memory.request_utilization
(gauge)
Fraction of the requested memory that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.used_bytes
(gauge)
Memory usage of the container.
Shown as byte
gcp.gke.container.restart_count
(count)
Number of times the container has restarted.
Shown as occurrence
gcp.gke.container.uptime
(gauge)
Time in seconds that the container has been running.
Shown as second
gcp.gke.node.cpu.allocatable_cores
(gauge)
Number of allocatable CPU cores on the node.
Shown as core
gcp.gke.node.cpu.allocatable_utilization
(gauge)
Fraction of the allocatable CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.node.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used on the node.
Shown as second
gcp.gke.node.cpu.total_cores
(gauge)
Total number of CPU cores on the node.
Shown as core
gcp.gke.node.ephemeral_storage.allocatable_bytes
(gauge)
Local ephemeral storage bytes allocatable on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.inodes_free
(gauge)
Free number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.inodes_total
(gauge)
Total number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.total_bytes
(gauge)
Total ephemeral storage bytes on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage bytes used by the node.
Shown as byte
gcp.gke.node.memory.allocatable_bytes
(gauge)
Memory bytes allocatable on the node.
Shown as byte
gcp.gke.node.memory.allocatable_utilization
(gauge)
Fraction of the allocatable memory that is currently in use on the instance.
Shown as fraction
gcp.gke.node.memory.total_bytes
(gauge)
Total number of bytes of memory on the node.
Shown as byte
gcp.gke.node.memory.used_bytes
(gauge)
Cumulative memory bytes used by the node.
Shown as byte
gcp.gke.node.network.received_bytes_count
(count)
Cumulative number of bytes received by the node over the network.
Shown as byte
gcp.gke.node.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the node over the network.
Shown as byte
gcp.gke.node.pid_limit
(gauge)
Max PID of OS on the node.
gcp.gke.node.pid_used
(gauge)
Number of running processes in the OS on the node.
gcp.gke.node_daemon.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the node level system daemon.
Shown as second
gcp.gke.node_daemon.memory.used_bytes
(gauge)
Memory usage by the system daemon.
Shown as byte
gcp.gke.pod.network.received_bytes_count
(count)
Cumulative number of bytes received by the pod over the network.
Shown as byte
gcp.gke.pod.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the pod over the network.
Shown as byte
gcp.gke.pod.volume.total_bytes
(gauge)
Total number of disk bytes available to the pod.
Shown as byte
gcp.gke.pod.volume.used_bytes
(gauge)
Number of disk bytes used by the pod.
Shown as byte
gcp.gke.pod.volume.utilization
(gauge)
Fraction of the volume that is currently being used by the instance.
Shown as fraction
gcp.gke.control_plane.apiserver.admission_controller_admission_duration_seconds
(gauge)
Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_step_admission_duration_seconds
(gauge)
Admission sub-step latency histogram in seconds, broken out for each operation and API resource and step type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_webhook_admission_duration_seconds
(gauge)
Admission webhook latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.current_inflight_requests
(gauge)
Maximal number of currently used inflight request limit of this apiserver per request kind.
Shown as request
gcp.gke.control_plane.apiserver.request_duration_seconds
(gauge)
Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
Shown as second
gcp.gke.control_plane.apiserver.request_total
(gauge)
Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
Shown as request
gcp.gke.control_plane.apiserver.response_sizes
(gauge)
Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
Shown as byte
gcp.gke.control_plane.apiserver.storage_objects
(gauge)
Number of stored objects at the time of last check split by kind.
Shown as object
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
Number of Node evictions that happened since current instance of NodeController started.
Shown as event
gcp.gke.control_plane.scheduler.pending_pods
(gauge)
Number of pending pods, by the queue type.
Shown as event
gcp.gke.control_plane.scheduler.pod_scheduling_duration_seconds
(gauge)
End-to-end latency for a pod being scheduled.
Shown as second
gcp.gke.control_plane.scheduler.preemption_attempts_total
(count)
Total preemption attempts in the cluster till now
Shown as attempt
gcp.gke.control_plane.scheduler.preemption_victims
(gauge)
Number of selected preemption victims
Shown as event
gcp.gke.control_plane.scheduler.scheduling_attempt_duration_seconds
(gauge)
Scheduling attempt latency in seconds
Shown as second
gcp.gke.control_plane.scheduler.schedule_attempts_total
(gauge)
Number of attempts to schedule pods.
Shown as attempt
gcp.gke.control_plane.apiserver.aggregator_unavailable_apiservice
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_event_total
(gauge)
(Deprecated) Accumulated number of audit events generated and sent to the audit backend.
Shown as event
gcp.gke.control_plane.apiserver.audit_level_total
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_requests_rejected_total
(gauge)
(Deprecated)
Shown as request
gcp.gke.control_plane.apiserver.client_certificate_expiration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.etcd_object_counts
(gauge)
(Deprecated) Number of stored objects split by kind.
Shown as object
gcp.gke.control_plane.apiserver.etcd_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.init_events_total
(gauge)
(Deprecated)
Shown as event
gcp.gke.control_plane.apiserver.longrunning_gauge
(gauge)
(Deprecated) Gauge of all active long-running apiserver requests.
Shown as request
gcp.gke.control_plane.apiserver.registered_watchers
(gauge)
(Deprecated) Number of currently registered watchers for a given resource.
Shown as object
gcp.gke.control_plane.apiserver.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.apiserver.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cloudprovider_gce_api_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cronjob_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by cronjob controller
gcp.gke.control_plane.controller_manager.daemon_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by daemon controller
gcp.gke.control_plane.controller_manager.deployment_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by deployment controller
gcp.gke.control_plane.controller_manager.endpoint_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by endpoint controller
gcp.gke.control_plane.controller_manager.gc_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by GC controller
gcp.gke.control_plane.controller_manager.job_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by job controller
gcp.gke.control_plane.controller_manager.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.namespace_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by namespace controller
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
(Deprecated) Count of node eviction events.
gcp.gke.control_plane.controller_manager.node_collector_unhealthy_nodes_in_zone
(gauge)
(Deprecated) Number of unhealthy nodes
gcp.gke.control_plane.controller_manager.node_collector_zone_health
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_collector_zone_size
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_ipam_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by IPAM controller
gcp.gke.control_plane.controller_manager.node_lifecycle_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by lifecycle controller
gcp.gke.control_plane.controller_manager.persistentvolume_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume protection controller
gcp.gke.control_plane.controller_manager.persistentvolumeclaim_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume claim protection controller
gcp.gke.control_plane.controller_manager.replicaset_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by ReplicaSet controller
gcp.gke.control_plane.controller_manager.replication_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by replication controller
gcp.gke.control_plane.controller_manager.route_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by route controller
gcp.gke.control_plane.controller_manager.service_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service controller
gcp.gke.control_plane.controller_manager.serviceaccount_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account controller
gcp.gke.control_plane.controller_manager.serviceaccount_tokens_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account tokens controller
gcp.gke.control_plane.controller_manager.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.binding_duration_seconds
(gauge)
(Deprecated) Binding latency in seconds.
Shown as second
gcp.gke.control_plane.scheduler.e2e_scheduling_duration_seconds
(gauge)
(Deprecated) Total e2e scheduling latency.
Shown as second
gcp.gke.control_plane.scheduler.framework_extension_point_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.scheduler.scheduling_algorithm_duration_seconds
(gauge)
(Deprecated) Total scheduling algorithm latency.
Shown as second
gcp.gke.control_plane.scheduler.scheduling_algorithm_preemption_evaluation_seconds
(gauge)
(Deprecated)
Shown as second
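
Several of the metrics above are reported as fractions of a limit or request. The sketch below shows one way such a fraction relates to a pair of gauges, assuming that a value like `gcp.gke.container.memory.limit_utilization` corresponds to `used_bytes` divided by `limit_bytes`. That relationship is my reading of the descriptions, not a formula stated on this page. The function names and the 0.9 threshold are hypothetical.

```python
from typing import Optional


def utilization(used: float, limit: float) -> Optional[float]:
    """Return used/limit as a fraction, or None when no limit is set."""
    # Treating a non-positive limit as "no limit configured" is an assumption.
    if limit <= 0:
        return None
    return used / limit


def near_limit(used: float, limit: float, threshold: float = 0.9) -> bool:
    """Return True when usage is at or above the given fraction of the limit."""
    fraction = utilization(used, limit)
    return fraction is not None and fraction >= threshold


if __name__ == "__main__":
    # 512 MiB used against a 1 GiB limit.
    print(utilization(512 * 1024**2, 1024**3))
    print(near_limit(950, 1000))
```

Returning `None` instead of raising on a missing limit keeps containers without a configured limit out of threshold checks rather than failing on them.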

Events

The Google Kubernetes Engine integration does not include any events.

Service checks

The Google Kubernetes Engine integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.
