Network Device Monitoring with the Cluster Agent

For Kubernetes environments, the Datadog Cluster Agent (DCA) can be configured to use the Network Device Monitoring (NDM) auto-discovery logic as a source of cluster checks.

Agent auto-discovery combined with the DCA is scalable. It can monitor a large number of devices.

Setup

Installation

  1. Ensure the DCA is installed.

  2. Set up the DCA with NDM auto-discovery using Datadog helm-chart by adding the Datadog Helm repository:

    helm repo add datadog https://helm.datadoghq.com
    helm repo update
    
  3. Then, install datadog-monitoring and set your Datadog API key.

    helm install datadog-monitoring --set datadog.apiKey=<YOUR_DD_API_KEY> -f cluster-agent-values.yaml datadog/datadog
    

Configuration

Below is an example of the cluster-agent-values.yaml:

cluster-agent-values.yaml

datadog:
  ## @param apiKey - string - required
  ## Set this to your Datadog API key before the Agent runs.
  ## ref: https://app.datadoghq.com/account/settings/agent/latest?platform=kubernetes
  #
  apiKey: <DATADOG_API_KEY>

  ## @param clusterName - string - optional
  ## Set a unique cluster name to allow scoping hosts and Cluster Checks easily
  ## The name must be unique and must be dot-separated tokens where a token can be up to 40 characters with the following restrictions:
  ## * Lowercase letters, numbers, and hyphens only.
  ## * Must start with a letter.
  ## * Must end with a number or a letter.
  ## Compared to the rules of GKE, dots are allowed whereas they are not allowed on GKE:
  ## https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters#Cluster.FIELDS.name
  #
  clusterName: my-snmp-cluster

  ## @param clusterChecks - object - required
  ## Enable the Cluster Checks feature on both the cluster-agents and the daemonset
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/clusterchecks/
  ## Autodiscovery via Kube Service annotations is automatically enabled
  #
  clusterChecks:
    enabled: true

  ## @param tags  - list of key:value elements - optional
  ## List of tags to attach to every metric, event and service check collected by this Agent.
  ##
  ## Learn more about tagging: https://docs.datadoghq.com/tagging/
  #
  tags:
    - 'env:test-snmp-cluster-agent'

## @param clusterAgent - object - required
## This is the Datadog Cluster Agent implementation that handles cluster-wide
## metrics more cleanly, separates concerns for better rbac, and implements
## the external metrics API so you can autoscale HPAs based on datadog metrics
## ref: https://docs.datadoghq.com/agent/kubernetes/cluster/
#
clusterAgent:
  ## @param enabled - boolean - required
  ## Set this to true to enable Datadog Cluster Agent
  #
  enabled: true

  ## @param confd - list of objects - optional
  ## Provide additional cluster check configurations
  ## Each key will become a file in /conf.d
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/
  #
  confd:
     # Static checks
     http_check.yaml: |-
       cluster_check: true
       instances:
         - name: 'Check Example Site1'
           url: http://example.net
         - name: 'Check Example Site2'
           url: http://example.net
         - name: 'Check Example Site3'
           url: http://example.net       
     # Autodiscovery template needed for `snmp_listener` to create instance configs
     snmp.yaml: |-
      cluster_check: true
      ad_identifiers:
        - snmp
      init_config:
      instances:
        -
          ## @param ip_address - string - optional
          ## The IP address of the device to monitor.
          #
          ip_address: "%%host%%"

          ## @param port - integer - optional - default: 161
          ## Default SNMP port.
          #
          port: "%%port%%"

          ## @param snmp_version - integer - optional - default: 2
          ## If you are using SNMP v1 set snmp_version to 1 (required)
          ## If you are using SNMP v3 set snmp_version to 3 (required)
          #
          snmp_version: "%%extra_version%%"

          ## @param timeout - integer - optional - default: 5
          ## Amount of second before timing out.
          #
          timeout: "%%extra_timeout%%"

          ## @param retries - integer - optional - default: 5
          ## Amount of retries before failure.
          #
          retries: "%%extra_retries%%"

          ## @param community_string - string - optional
          ## Only useful for SNMP v1 & v2.
          #
          community_string: "%%extra_community%%"

          ## @param user - string - optional
          ## USERNAME to connect to your SNMP devices.
          #
          user: "%%extra_user%%"

          ## @param authKey - string - optional
          ## Authentication key to use with your Authentication type.
          #
          authKey: "%%extra_auth_key%%"

          ## @param authProtocol - string - optional
          ## Authentication type to use when connecting to your SNMP devices.
          ## It can be one of: MD5, SHA, SHA224, SHA256, SHA384, SHA512.
          ## Default to MD5 when `authKey` is specified.
          #
          authProtocol: "%%extra_auth_protocol%%"

          ## @param privKey - string - optional
          ## Privacy type key to use with your Privacy type.
          #
          privKey: "%%extra_priv_key%%"

          ## @param privProtocol - string - optional
          ## Privacy type to use when connecting to your SNMP devices.
          ## It can be one of: DES, 3DES, AES, AES192, AES256, AES192C, AES256C.
          ## Default to DES when `privKey` is specified.
          #
          privProtocol: "%%extra_priv_protocol%%"

          ## @param context_engine_id - string - optional
          ## ID of your context engine; typically unneeded.
          ## (optional SNMP v3-only parameter)
          #
          context_engine_id: "%%extra_context_engine_id%%"

          ## @param context_name - string - optional
          ## Name of your context (optional SNMP v3-only parameter).
          #
          context_name: "%%extra_context_name%%"

          ## @param tags - list of key:value element - optional
          ## List of tags to attach to every metric, event and service check emitted by this integration.
          ##
          ## Learn more about tagging: https://docs.datadoghq.com/tagging/
          #
          tags:
            # The autodiscovery subnet the device is part of.
            # Used by Agent autodiscovery to pass subnet name.
            - "autodiscovery_subnet:%%extra_autodiscovery_subnet%%"

          ## @param extra_tags - string - optional
          ## Comma separated tags to attach to every metric, event and service check emitted by this integration.
          ## Example:
          ##  extra_tags: "tag1:val1,tag2:val2"
          #
          extra_tags: "%%extra_tags%%"

          ## @param oid_batch_size - integer - optional - default: 60
          ## The number of OIDs handled by each batch. Increasing this number improves performance but
          ## uses more resources.
          #
          oid_batch_size: "%%extra_oid_batch_size%%"      


  ## @param datadog-cluster.yaml - object - optional
  ## Specify custom contents for the datadog cluster agent config (datadog-cluster.yaml).
  #
  datadog_cluster_yaml:

    # See here for all `network_devices.autodiscovery` configs: https://github.com/DataDog/datadog-agent/blob/master/pkg/config/config_template.yaml
    autodiscovery:
      workers: 2
      discovery_interval: 10
      configs:
        - network: 192.168.1.16/29
          version: 2
          port: 1161
          community: cisco_icm
        - network: 192.168.1.16/29
          version: 2
          port: 1161
          community: public

Further Reading

Additional helpful documentation, links, and articles:

PREVIEWING: may/unit-testing