Overview

This page provides troubleshooting for the Datadog Cluster Agent’s Admission Controller.

Common problems

Update pre-existing pods

Admission Controller responds to the creation of new pods within your Kubernetes cluster: at pod creation, the Cluster Agent receives a request from Kubernetes and responds with the details of what changes (if any) to make to the pod.

Therefore, Admission Controller does not mutate existing pods within your cluster. If you recently enabled the Admission Controller or made other environmental changes, delete your existing pod and let Kubernetes recreate it. This ensures that Admission Controller updates your pod.

Labels and annotations

The Cluster Agent responds to labels and annotations on the created pod—not the workload (Deployment, DaemonSet, CronJob, etc.) that created that pod. Ensure that your pod template references this accordingly.

For example, the following template sets the label for APM configuration and the annotation for library injection:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  #(...)  
  template:
    metadata:
      labels:
        admission.datadoghq.com/enabled: "true"
      annotations:
        admission.datadoghq.com/<LANGUAGE>-lib.version: <VERSION>
    spec:
      containers:
      #(...)

Application pods are not created

Admission Controller’s injection mode (socket, hostip, service) is set by the configuration of your Cluster Agent. For example, if you have socket mode enabled in your Agent, Admission Controller also uses socket mode.

If you are using GKE Autopilot or OpenShift, you need to use a specific injection mode.

GKE Autopilot

GKE Autopilot restricts the use of any volumes with a hostPath. Therefore, if Admission Controller uses socket mode, the Pods are blocked from scheduling by the GKE Warden.

Enabling GKE Autopilot mode in the Helm chart disables the socket mode to prevent this from ocurring. To enable APM, enable the port and use the hostip or service method instead. The Admission Controller will default to hostip to match.

datadog:
  apm:
    portEnabled: true
  #(...)

providers:
  gke:
    autopilot: true

Refer to the Kubernetes Distributions for more configuration details regarding Autopilot.

OpenShift

OpenShift has SecurityContextConstraints (SCCs) that are required to deploy pods with extra permissions, such as a volume with a hostPath. Datadog components are deployed with SCCs to allow activity specific to Datadog pods, but Datadog does not create SCCs for other pods. The Admission Controller might add the socket based configuration to your application pods, causing them to fail to deploy.

If you are using OpenShift, use hostip mode. The following configuration enables hostip mode by disabling the socket options:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    apm:
      enabled: true
      hostPortConfig:
        enabled: true
      unixDomainSocketConfig:
        enabled: false
    dogstatsd:
      hostPortConfig:
        enabled: true
      unixDomainSocketConfig:
        enabled: false

Alternatively, you can set features.admissionController.agentCommunicationMode to hostip or service directly.

datadog:
  apm:
    portEnabled: true
    socketEnabled: false

Alternatively, you can set clusterAgent.admissionController.configMode to hostip or service directly.

Refer to Kubernetes Distributions for more configuration details regarding OpenShift.

View Admission Controller status

The Cluster Agent’s status output provides information to verify that it has created the datadog-webhook for the MutatingWebhookConfiguration and has a valid certificate.

Run the following command:

% kubectl exec -it <Cluster Agent Pod> -- agent status

Your output resembles the following:

...
Admission Controller
====================
  
    Webhooks info
    -------------
      MutatingWebhookConfigurations name: datadog-webhook
      Created at: 2023-09-25T22:32:07Z
      ---------
        Name: datadog.webhook.auto.instrumentation
        CA bundle digest: f24b6c0c40feaad2
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-admission-controller - Port: 443 - Path: /injectlib
      ---------
        Name: datadog.webhook.config
        CA bundle digest: f24b6c0c40feaad2
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-admission-controller - Port: 443 - Path: /injectconfig
      ---------
        Name: datadog.webhook.tags
        CA bundle digest: f24b6c0c40feaad2
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-admission-controller - Port: 443 - Path: /injecttags
  
    Secret info
    -----------
    Secret name: webhook-certificate
    Secret namespace: default
    Created at: 2023-09-25T22:32:07Z
    CA bundle digest: f24b6c0c40feaad2
    Duration before certificate expiration: 8643h34m2.557676864s
...

This output is relative to the Cluster Agent deployed in the default namespace. The Service and Secret should match the namespace used.

View Admission Controller logs

Debug logs help validate that you have set up Admission Controller properly. Enable debug logs with the following configuration:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
    site: <DATADOG_SITE>
    logLevel: debug
datadog:
  logLevel: debug

Validate datadog-webhook

Example logs:

<TIMESTAMP> | CLUSTER | INFO | (pkg/clusteragent/admission/controllers/secret/controller.go:73 in Run) | Starting secrets controller for default/webhook-certificate
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/webhook/controller_base.go:148 in enqueue) | Adding object with key default/webhook-certificate to the queue
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/secret/controller.go:140 in enqueue) | Adding object with key default/webhook-certificate to the queue
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/webhook/controller_base.go:148 in enqueue) | Adding object with key datadog-webhook to the queue
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/util.go:47 in func1) | Sync done for informer admissionregistration.k8s.io/v1/mutatingwebhookconfigurations in 101.116625ms, last resource version: 152728
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/webhook/controller_v1.go:140 in reconcile) | The Webhook datadog-webhook was found, updating it
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/secret/controller.go:211 in reconcile) | The certificate is up-to-date, doing nothing. Duration before expiration: 8558h17m27.909792831s
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/secret/controller.go:174 in processNextWorkItem) | Secret default/webhook-certificate reconciled successfully
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/webhook/controller_base.go:176 in processNextWorkItem) | Webhook datadog-webhook reconciled successfully

If do not see that the datadog-webhook webhook has been reconciled successfully, ensure that you have correctly enabled Admission Controller according to the configuration instructions.

Validate injection

Example logs:

<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/secret/controller.go:140 in enqueue) | Adding object with key default/webhook-certificate to the queue
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/secret/controller.go:211 in reconcile) | The certificate is up-to-date, doing nothing. Duration before expiration: 8558h12m28.007769373s
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/controllers/secret/controller.go:174 in processNextWorkItem) | Secret default/webhook-certificate reconciled successfully
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/mutate/common.go:74 in injectEnv) | Injecting env var 'DD_TRACE_AGENT_URL' into pod with generate name example-pod-123456789-
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/mutate/common.go:74 in injectEnv) | Injecting env var 'DD_DOGSTATSD_URL' into pod with generate name example-pod-123456789-
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/mutate/common.go:74 in injectEnv) | Injecting env var 'DD_ENTITY_ID' into pod with generate name example-pod-123456789-
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/mutate/common.go:74 in injectEnv) | Injecting env var 'DD_SERVICE' into pod with generate name example-pod-123456789-
<TIMESTAMP> | CLUSTER | DEBUG | (pkg/clusteragent/admission/mutate/auto_instrumentation.go:336 in injectLibInitContainer) | Injecting init container named "datadog-lib-python-init" with image "gcr.io/datadoghq/dd-lib-python-init:v1.18.0" into pod with generate name example-pod-123456789-

If you see errors with the injection for a given pod, contact Datadog support with your Datadog configuration and your pod configuration.

If you do not see the injection attempts for any pod, verify your mutateUnlabelled settings and ensure your pod labels match up with the expected values. If these match up, your problem is likely with the networking between the control plane, webhook, and service. See Networking for further information.

Networking

Network policies

Kubernetes Network Policies help you control different ingress (inbound) and egress (outbound) flows of traffic to your pods.

If you are using network policies, Datadog recommends creating corresponding policies for the Cluster Agent to ensure connectivity to the pod over this port. You can do this with the following configuration:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    #(...)
    networkPolicy:
      create: true
      flavor: kubernetes
datadog:
  #(...)
  networkPolicy:
    create: true
    flavor: kubernetes

Set flavor to kubernetes to create a NetworkPolicy resource.

Alternatively, for Cilium-based environments, set flavor to cilium to create a CiliumNetworkPolicy resource.

Network troubleshooting for Kubernetes distributions

When a pod is created, the Kubernetes cluster sends a request from the control plane, to datadog-webhook, through the service, and finally to the Cluster Agent pod. This request requires inbound connectivity from the control plane to the node that the Cluster Agent is on, over its Admission Controller port (8000). After this request is resolved, the Cluster Agent mutates your pod to configure the network connection for the Datadog tracer.

Depending on your Kubernetes distribution, this may have some additional requirements for your security rules and Admission Controller settings.

Amazon Elastic Kubernetes Service (EKS)

In an EKS cluster, you can deploy the Cluster Agent pod on any of your Linux-based nodes by default. These nodes and their EC2 instances need a security group with the following inbound rule:

  • Protocol: TCP
  • Port range: 8000, or a range that covers 8000
  • Source: The ID of either the cluster security group, or one of your cluster’s additional security groups. You can find these IDs in the EKS console, under the Networking tab for your EKS cluster.

This security group rule allows the control plane to access the node and the downstream Cluster Agent over port 8000.

If you have multiple managed node groups, each with distinct security groups, add this inbound rule to each security group.

Control plane logging

To validate your networking configuration, enable EKS control plane logging for the API server. You can view these logs in the CloudWatch console.

Then, delete one of your pods to re-trigger a request through Admission Controller. When the request fails, you can view logs that resemble the following:

W0908 <TIMESTAMP> 10 dispatcher.go:202] Failed calling webhook, failing open datadog.webhook.auto.instrumentation: failed calling webhook "datadog.webhook.auto.instrumentation": failed to call webhook: Post "https://datadog-cluster-agent-admission-controller.default.svc:443/injectlib?timeout=10s": context deadline exceeded
E0908 <TIMESTAMP> 10 dispatcher.go:206] failed calling webhook "datadog.webhook.auto.instrumentation": failed to call webhook: Post "https://datadog-cluster-agent-admission-controller.default.svc:443/injectlib?timeout=10s": context deadline exceeded

These failures are relative to a Cluster Agent deployed in the default namespace; the DNS name adjusts relative to the namespace used.

You may also see failures for the other Admission Controller webhooks, such as datadog.webhook.tags and datadodg.webhook.config.

Note: EKS often generates two log streams within the CloudWatch log group for the cluster. Be sure to check both for these types of logs.

Azure Kubernetes Service (AKS)

To use admission controller webhooks on AKS, use the following configuration:

kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  #(...)
  override:
    clusterAgent:
      containers:
        cluster-agent:
          env:
            - name: DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS
              value: "true"
datadog:
  #(...)

providers:
  aks:
    enabled: true

The providers.aks.enabled option sets the environment variable DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS="true".

Google Kubernetes Engine (GKE)

If you are using a GKE private cluster, you need to adjust your firewall rules to allow inbound access from the control plane to port 8000.

Add a firewall rule to allow ingress over TCP on port 8000.

You can also edit an existing rule. By default, the network for your cluster has a firewall rule named gke-<CLUSTER_NAME>-master. Ensure that this rule’s source filters include your cluster control plane’s CIDR block. Edit this rule to allow access over protocol tcp on port 8000.

For more information, see Adding firewall rules for specific use cases in the GKE documentation.

Rancher

If you are using Rancher with an EKS cluster or a private GKE cluster, additional configuration is required. For more information, see Rancher Webhook - Common Issues in the Rancher documentation.

Note: Since Datadog’s Admission Controller’s webhook operates similarly to the Rancher webhook, Datadog needs access to port 8000 instead of Rancher’s 9443.

Rancher and EKS

To use Rancher in an EKS cluster, deploy the Cluster Agent pod with the following configuration:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  #(...)
  override:
    clusterAgent:
      hostNetwork: true
datadog:
  #(...)

clusterAgent:
  useHostNetwork: true

You must also add a security group inbound rule, as described in the Amazon EKS section on this page.

Rancher and GKE

To use Rancher in a private GKE cluster, edit your firewall rules to allow inbound access over TCP on port 8000. See the GKE section on this page.

Further Reading

PREVIEWING: esther/docs-9478-fix-split-after-example