In the DC/OS web UI, click on the Universe tab. Find the datadog package and click the Install button.
Click the Advanced Installation button.
Enter your Datadog API Key in the first field.
In the Instances field, enter the number of slave nodes in your cluster. You can determine the number of nodes by clicking the Nodes tab on the left side of the DC/OS web UI.
Click Review and Install, then Install.
Marathon
If you are not using DC/OS, use the Marathon web UI or post the following JSON to the API URL to define the Datadog Agent. You must replace <YOUR_DATADOG_API_KEY> with your API key and set the number of instances to the number of slave nodes in your cluster. You may also need to update the Docker image to a more recent tag. You can find the latest on Docker Hub.
Unless you want to configure a custom mesos_slave.d/conf.yaml (for example, to set disable_ssl_validation: true), you don’t need to do anything after installing the Agent.
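As a rough sketch of such a custom file, the fragment below sets disable_ssl_validation: true for a single instance. Only that option comes from the text above. The init_config and instances layout follows the usual Agent check convention, and the url value is a placeholder assumption (5051 is the default Mesos agent port), so compare it with the example file shipped with your Agent version before relying on it.

```yaml
# mesos_slave.d/conf.yaml (illustrative sketch, not a complete reference)
init_config:

instances:
    # Placeholder endpoint (assumption): replace with your Mesos slave's address.
  - url: https://localhost:5051
    # Skip TLS certificate validation, for example with self-signed certificates.
    disable_ssl_validation: true
```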
Log collection
Log collection is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
logs_enabled: true
Add this configuration block to your mesos_slave.d/conf.yaml file to start collecting your Mesos logs:
logs:
  - type: file
    path: /var/log/mesos/*
    source: mesos
Change the path parameter value based on your environment, or use the default Docker stdout instead.
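For the Docker stdout alternative, a minimal sketch might look like the following. It assumes the Agent's Docker log type (type: docker) and reuses source: mesos from the file-based block. No path is set because the logs are read from the container's stdout and stderr rather than from files.

```yaml
logs:
    # Read container stdout/stderr through Docker instead of tailing files.
  - type: docker
    source: mesos
```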
Validation
DC/OS
The Datadog Agent should be listed under the Services tab in the DC/OS web UI. In Datadog, search for mesos.slave in the Metrics Explorer.
Marathon
If you are not using DC/OS, datadog-agent should appear in the Marathon list of running applications with a healthy status. In Datadog, search for mesos.slave in the Metrics Explorer.
Data Collected
Metrics
mesos.slave.cpus_percent (gauge)
Percentage of allocated CPUs (shown as percent)
mesos.slave.cpus_total (gauge)
Number of CPUs
mesos.slave.cpus_used (gauge)
Number of allocated CPUs
mesos.slave.disk_percent (gauge)
Percentage of allocated disk space (shown as percent)
mesos.slave.disk_total (gauge)
Disk space (shown as mebibyte)
mesos.slave.disk_used (gauge)
Allocated disk space (shown as mebibyte)
mesos.slave.executors_registering (gauge)
Number of executors registering
mesos.slave.executors_running (gauge)
Number of executors running
mesos.slave.executors_terminated (gauge)
Number of terminated executors
mesos.slave.executors_terminating (gauge)
Number of terminating executors
mesos.slave.frameworks_active (gauge)
Number of active frameworks
mesos.slave.gpus_percent (gauge)
Percentage of allocated GPUs (shown as percent)
mesos.slave.gpus_total (gauge)
Number of GPUs
mesos.slave.gpus_used (gauge)
Number of allocated GPUs
mesos.slave.invalid_framework_messages (gauge)
Number of invalid framework messages (shown as message)
mesos.slave.invalid_status_updates (gauge)
Number of invalid status updates
mesos.slave.mem_percent (gauge)
Percentage of allocated memory (shown as percent)
mesos.slave.mem_total (gauge)
Total memory (shown as mebibyte)
mesos.slave.mem_used (gauge)
Allocated memory (shown as mebibyte)
mesos.slave.recovery_errors (gauge)
Number of errors encountered during slave recovery (shown as error)
mesos.slave.tasks_failed (count)
Number of failed tasks (shown as task)
mesos.slave.tasks_finished (count)
Number of finished tasks (shown as task)
mesos.slave.tasks_killed (count)
Number of killed tasks (shown as task)
mesos.slave.tasks_lost (count)
Number of lost tasks (shown as task)
mesos.slave.tasks_running (gauge)
Number of running tasks (shown as task)
mesos.slave.tasks_staging (gauge)
Number of staging tasks (shown as task)
mesos.slave.tasks_starting (gauge)
Number of starting tasks (shown as task)
mesos.slave.valid_framework_messages (gauge)
Number of valid framework messages (shown as message)
mesos.slave.valid_status_updates (gauge)
Number of valid status updates
mesos.state.task.cpu (gauge)
Task CPU
mesos.state.task.disk (gauge)
Task disk (shown as mebibyte)
mesos.state.task.mem (gauge)
Task memory (shown as mebibyte)
mesos.stats.registered (gauge)
Whether this slave is registered with a master
mesos.stats.system.cpus_total (gauge)
Number of CPUs available
mesos.stats.system.load_15min (gauge)
Load average for the past 15 minutes
mesos.stats.system.load_1min (gauge)
Load average for the past minute
mesos.stats.system.load_5min (gauge)
Load average for the past 5 minutes
mesos.stats.system.mem_free_bytes (gauge)
Free memory (shown as byte)
mesos.stats.system.mem_total_bytes (gauge)
Total memory (shown as byte)
mesos.stats.uptime_secs (gauge)
Slave uptime
Events
The Mesos-slave check does not include any events.
Service Checks
mesos_slave.can_connect
Returns CRITICAL if the Agent cannot connect to the Mesos slave metrics endpoint, otherwise OK.