Overview
This section documents distribution-specific requirements and provides a good base configuration for each major Kubernetes distribution.
These configurations can then be customized to add any Datadog feature.
AWS Elastic Kubernetes Service (EKS)
No specific configuration is required.
If you are using AWS Bottlerocket OS on your nodes, add the following to enable container monitoring (containerd check):
In an EKS cluster, you can install the Operator using Helm or as an EKS add-on.
The configuration below is meant to work with either setup (Helm or EKS add-on) when the Agent is installed in the same namespace as the Datadog Operator.
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  features:
    admissionController:
      enabled: false
    externalMetricsServer:
      enabled: false
      useDatadogMetrics: false
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    criSocketPath: /run/dockershim.sock
  override:
    clusterAgent:
      image:
        name: gcr.io/datadoghq/cluster-agent:latest
Custom datadog-values.yaml:
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  criSocketPath: /run/dockershim.sock
  env:
    - name: DD_AUTOCONFIG_INCLUDE_FEATURES
      value: "containerd"
Azure Kubernetes Service (AKS)
AKS requires a specific configuration for the Kubelet integration because of how AKS sets up its SSL certificates. Additionally, the optional Admission Controller feature requires a specific configuration to prevent an error when reconciling the webhook.
DatadogAgent Kubernetes Resource:
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  features:
    admissionController:
      enabled: true
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    kubelet:
      tlsVerify: false
  override:
    clusterAgent:
      containers:
        cluster-agent:
          env:
            - name: DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS
              value: "true"
Custom datadog-values.yaml:
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  # Required as of Agent 7.35. See Kubelet Certificate note below.
  kubelet:
    tlsVerify: false
providers:
  aks:
    enabled: true
The providers.aks.enabled option sets the necessary environment variable DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS="true" for you.
The kubelet.tlsVerify=false option sets the environment variable DD_KUBELET_TLS_VERIFY=false for you, which deactivates verification of the Kubelet server certificate.
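For illustration only, the two environment variables named above could also be set by hand. The sketch below assumes the Helm chart's datadog.env list (used in the EKS example in this document) and a clusterAgent.env list, which is an assumption about your chart version. It is intended to mirror the dedicated options, not to replace them.

```yaml
# Hedged sketch: setting the environment variables directly instead of using
# datadog.kubelet.tlsVerify and providers.aks.enabled.
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  env:
    # What kubelet.tlsVerify=false sets for you
    - name: DD_KUBELET_TLS_VERIFY
      value: "false"
clusterAgent:
  # Assumption: clusterAgent.env is available in your chart version
  env:
    # What providers.aks.enabled=true sets for you
    - name: DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS
      value: "true"
```

The dedicated options are generally preferable, since they keep working if the underlying variable names change between releases.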
AKS Kubelet certificate
There is a known issue with the format of the AKS Kubelet certificate in older node image versions. As of Agent 7.35, you must use tlsVerify: false on these images because their certificates do not contain a valid Subject Alternative Name (SAN).
If all the nodes within your AKS cluster are using a supported node image version, you can use Kubelet TLS verification. Your node image version must be at or above the versions listed for the 2022-10-30 release. You must also update your Kubelet configuration to use the node name for the address and to map in the custom certificate path.
DatadogAgent Kubernetes Resource:
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  features:
    admissionController:
      enabled: true
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    kubelet:
      host:
        fieldRef:
          fieldPath: spec.nodeName
      hostCAPath: /etc/kubernetes/certs/kubeletserver.crt
  override:
    clusterAgent:
      containers:
        cluster-agent:
          env:
            - name: DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS
              value: "true"
Custom datadog-values.yaml:
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  # Requires supported node image version
  kubelet:
    host:
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    hostCAPath: /etc/kubernetes/certs/kubeletserver.crt
providers:
  aks:
    enabled: true
Using spec.nodeName keeps TLS verification enabled. However, in some AKS clusters, DNS resolution for spec.nodeName does not work inside Pods. This has been reported on all AKS Windows nodes, as well as on Linux nodes when the cluster is set up in a Virtual Network using custom DNS. In this case, use the first AKS configuration provided: remove any settings for the Kubelet host path (which defaults to status.hostIP) and use tlsVerify: false. This setting is required. Do NOT set the Kubelet host path and tlsVerify: false in the same configuration.
Google Kubernetes Engine (GKE)
GKE can be configured in two modes of operation:
- Standard: You manage the cluster’s underlying infrastructure, giving you node configuration flexibility.
- Autopilot: GKE provisions and manages the cluster’s underlying infrastructure, including nodes and node pools, giving you an optimized cluster with a hands-off experience.
Depending on the operation mode of your cluster, the Datadog Agent needs to be configured differently.
Standard
Since Agent 7.26, no specific configuration is required for GKE (whether you run Docker or containerd).
Note: When using COS (Container Optimized OS), the eBPF-based OOM Kill and TCP Queue Length checks are supported starting from version 3.0.1 of the Helm chart. To enable these checks, set datadog.systemProbe.enableDefaultKernelHeadersPaths to false.
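As a minimal sketch of the note above, the values below show where that setting sits in a custom datadog-values.yaml. The enableOOMKill and enableTCPQueueLength keys are assumptions about the chart options that turn on the two checks; confirm them against the values reference for your chart version.

```yaml
# Hedged sketch for GKE Standard nodes running COS (Helm chart 3.0.1 or later).
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  systemProbe:
    # Setting named in the note above
    enableDefaultKernelHeadersPaths: false
    # Assumed chart options that enable the eBPF-based checks
    enableOOMKill: true
    enableTCPQueueLength: true
```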
Autopilot
GKE Autopilot requires some configuration, shown below.
Datadog recommends that you specify resource limits for the Agent container. Autopilot sets a relatively low default limit (50m CPU, 100Mi memory), which may cause the Agent container to be OOM-killed quickly, depending on your environment. If applicable, also specify resource limits for the Trace Agent and Process Agent containers. Additionally, you may wish to create a priority class for the Agent to ensure it is scheduled.
Note: Network Performance Monitoring is not supported for GKE Autopilot.
Custom datadog-values.yaml:
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  # The site of the Datadog intake to send Agent data to (example: `us3.datadoghq.com`)
  # Default value is `datadoghq.com` (the US1 site)
  # Documentation: https://docs.datadoghq.com/getting_started/site/
  site: <DATADOG_SITE>
agents:
  containers:
    agent:
      # resources for the Agent container
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
    traceAgent:
      # resources for the Trace Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
    processAgent:
      # resources for the Process Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
  priorityClassCreate: true
providers:
  gke:
    autopilot: true
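The configuration above sets only resource requests. If you also want explicit limits, as recommended earlier in this section, a sketch along the following lines may help. The limit values are illustrative assumptions (here they simply match the requests) and should be tuned to your environment.

```yaml
# Hedged sketch: explicit limits for the Agent container on GKE Autopilot.
agents:
  containers:
    agent:
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        # Hypothetical values, not a recommendation
        limits:
          cpu: 200m
          memory: 256Mi
```

The same resources.limits structure can be repeated under traceAgent and processAgent if those containers are enabled.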
Spot pods and compute classes
Using Spot Pods in GKE Autopilot clusters introduces taints to the corresponding Spot GKE nodes. When using Spot Pods, additional configuration is required to provide the Agent DaemonSet with a matching toleration.
agents:
  #(...)
  # agents.tolerations -- Allow the DaemonSet to schedule on tainted nodes (requires Kubernetes >= 1.6)
  tolerations:
    - effect: NoSchedule
      key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
Similarly, when using GKE Autopilot compute classes to run workloads that have specific hardware requirements, take note of the taints that GKE Autopilot applies to these nodes and add matching tolerations to the Agent DaemonSet. You can match the tolerations on your corresponding pods. For example, for the Scale-Out compute class, use a toleration like:
agents:
  #(...)
  # agents.tolerations -- Allow the DaemonSet to schedule on tainted nodes (requires Kubernetes >= 1.6)
  tolerations:
    - effect: NoSchedule
      key: cloud.google.com/compute-class
      operator: Equal
      value: Scale-Out
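To show where the toleration's key and value come from, here is a hedged sketch of a workload that requests the Scale-Out compute class. It assumes the class is selected with a nodeSelector on the same cloud.google.com/compute-class label. The Pod name and image are hypothetical placeholders.

```yaml
# Hedged sketch: a workload Pod that requests the Scale-Out compute class.
apiVersion: v1
kind: Pod
metadata:
  name: scale-out-example  # hypothetical name
spec:
  nodeSelector:
    # Assumption: same key and value as the Agent toleration above
    cloud.google.com/compute-class: Scale-Out
  containers:
    - name: app
      image: <YOUR_IMAGE>
```

If your workloads use a different compute class, the Agent toleration's value would change to match it.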
Red Hat OpenShift
OpenShift comes with hardened security by default with SELinux and SecurityContextConstraints (SCC). As a result, it requires some specific configurations:
- Elevated SCC access for the Node Agent and Cluster Agent.
- Kubelet API certificates may not always be signed by the cluster CA.
- Tolerations are required to schedule the Node Agent on master and infra nodes.
- The cluster name should be set, as it cannot be retrieved automatically from the cloud provider.
- (Optional) Set hostNetwork: true in the Node Agent to allow the Agent to make requests to cloud provider metadata services (IMDS).
This core configuration supports OpenShift 3.11 and OpenShift 4, but it works best with OpenShift 4.
Log collection and APM also have slightly different requirements.
The use of Unix Domain Socket (UDS) for APM and DogStatsD can work in OpenShift. However, Datadog does not recommend this, as it requires additional privileged permissions and SCC access to both your Datadog Agent pod and your application pod. Without these, your application pod can fail to deploy. Datadog recommends disabling the UDS option to avoid this, allowing the Admission Controller to inject the appropriate TCP/IP setting or Service setting for APM connectivity.
When using the Datadog Operator in OpenShift, Datadog recommends that you use the Operator Lifecycle Manager to deploy the Datadog Operator from OperatorHub in your OpenShift Cluster web console. Refer to the Operator install steps. The configuration below works with that setup, which creates the ClusterRole and ClusterRoleBinding based access to the SCC for the specified ServiceAccount datadog-agent-scc. This DatadogAgent configuration should be deployed in the same namespace as the Datadog Operator.
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
  namespace: openshift-operators # set as the same namespace where the Datadog Operator was deployed
spec:
  features:
    logCollection:
      enabled: true
      containerCollectAll: true
    apm:
      enabled: true
      hostPortConfig:
        enabled: true
      unixDomainSocketConfig:
        enabled: false
    dogstatsd:
      unixDomainSocketConfig:
        enabled: false
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    clusterName: <CLUSTER_NAME>
    kubelet:
      tlsVerify: false
  override:
    clusterAgent:
      serviceAccountName: datadog-agent-scc
    nodeAgent:
      serviceAccountName: datadog-agent-scc
      hostNetwork: true
      securityContext:
        runAsUser: 0
        seLinuxOptions:
          level: s0
          role: system_r
          type: spc_t
          user: system_u
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/infra
          operator: Exists
          effect: NoSchedule
Note: The nodeAgent.securityContext.seLinuxOptions override is necessary for log collection when deploying with the Operator. If log collection is not enabled, you can omit this override.
The configuration below creates custom SCCs for the Agent and Cluster Agent Service Accounts.
Custom datadog-values.yaml:
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
  apm:
    portEnabled: true
    socketEnabled: false
agents:
  podSecurity:
    securityContextConstraints:
      create: true
  useHostNetwork: true
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/infra
      operator: Exists
clusterAgent:
  podSecurity:
    securityContextConstraints:
      create: true
Rancher
Rancher installations are similar to vanilla Kubernetes installations, requiring only some minor configuration:
- Tolerations are required to schedule the Node Agent on controlplane and etcd nodes.
- The cluster name should be set as it cannot be retrieved automatically from the cloud provider.
DatadogAgent Kubernetes Resource:
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  features:
    logCollection:
      enabled: false
    liveProcessCollection:
      enabled: false
    liveContainerCollection:
      enabled: true
    apm:
      enabled: false
    cspm:
      enabled: false
    cws:
      enabled: false
    npm:
      enabled: false
    admissionController:
      enabled: false
    externalMetricsServer:
      enabled: false
      useDatadogMetrics: false
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    clusterName: <CLUSTER_NAME>
    kubelet:
      tlsVerify: false
  override:
    clusterAgent:
      image:
        name: gcr.io/datadoghq/cluster-agent:latest
    nodeAgent:
      image:
        name: gcr.io/datadoghq/agent:latest
      tolerations:
        - key: node-role.kubernetes.io/controlplane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/etcd
          operator: Exists
          effect: NoExecute
Custom datadog-values.yaml:
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
agents:
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      operator: Exists
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      operator: Exists
Oracle Container Engine for Kubernetes (OKE)
No specific configuration is required.
vSphere Tanzu Kubernetes Grid (TKG)
TKG requires some small configuration changes, shown below. For example, setting a toleration is required for the controller to schedule the Node Agent on the master nodes.
DatadogAgent Kubernetes Resource:
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  features:
    eventCollection:
      collectKubernetesEvents: true
    kubeStateMetricsCore:
      enabled: true
  global:
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: app-key
    kubelet:
      tlsVerify: false
  override:
    nodeAgent:
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
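The resource above reads its credentials from a Kubernetes Secret named datadog-secret, using the keys api-key and app-key. A minimal sketch of such a Secret follows. It assumes the Secret lives in the same namespace as the DatadogAgent resource, and the namespace placeholder is hypothetical.

```yaml
# Hedged sketch: the Secret referenced by apiSecret and appSecret above.
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
  namespace: <NAMESPACE>  # assumption: same namespace as the DatadogAgent resource
type: Opaque
stringData:
  api-key: <DATADOG_API_KEY>
  app-key: <DATADOG_APP_KEY>
```

Using stringData lets you supply the keys as plain text; Kubernetes stores them base64-encoded under data.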
Custom datadog-values.yaml:
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  kubelet:
    # Set tlsVerify to false since the Kubelet certificates are self-signed
    tlsVerify: false
  # Disable the `kube-state-metrics` dependency chart installation.
  kubeStateMetricsEnabled: false
  # Enable the new `kubernetes_state_core` check.
  kubeStateMetricsCore:
    enabled: true
# Add a toleration so that the agent can be scheduled on the control plane nodes.
agents:
  tolerations:
    - key: node-role.kubernetes.io/master
      effect: NoSchedule