init_config:

instances:
    # the API endpoint of your Marathon master; required
  - url: "https://<SERVER>:<PORT>"

    # if your Marathon master requires ACS auth
    # acs_url: https://<SERVER>:<PORT>

    # the username for Marathon API or ACS token authentication
    username: "<USERNAME>"

    # the password for Marathon API or ACS token authentication
    password: "<PASSWORD>"
The function of username and password depends on whether or not you configure acs_url. If you do, the Agent uses them to request an authentication token from ACS, which it then uses to authenticate to the Marathon API. Otherwise, the Agent uses username and password to directly authenticate to the Marathon API.
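For example, here is a minimal sketch of an instance configured for ACS token authentication; the <ACS_SERVER>:<ACS_PORT> placeholder is an assumption, so substitute your own ACS endpoint:

instances:
  - url: "https://<SERVER>:<PORT>"
    # with acs_url set, username and password are exchanged for an ACS token
    # before the Agent calls the Marathon API
    acs_url: "https://<ACS_SERVER>:<ACS_PORT>"
    username: "<USERNAME>"
    password: "<PASSWORD>"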
Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
logs_enabled: true
Because Marathon uses logback, you can specify a custom log format. With Datadog, two formats are supported out of the box: the default one provided by Marathon and the Datadog recommended format. Add a file appender to your configuration as in the following example and replace $PATTERN$ with your selected format:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook"/>
  <appender name="stdout" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>[%date] %-5level %message \(%logger:%thread\)%n</pattern>
    </encoder>
  </appender>
  <appender name="async" class="ch.qos.logback.classic.AsyncAppender">
    <appender-ref ref="stdout"/>
    <queueSize>1024</queueSize>
  </appender>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>/var/log/marathon.log</file>
    <append>true</append>
    <!-- set immediateFlush to false for much higher logging throughput -->
    <immediateFlush>true</immediateFlush>
    <encoder>
      <pattern>$PATTERN$</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="async"/>
    <appender-ref ref="FILE"/>
  </root>
</configuration>
Add this configuration block to your marathon.d/conf.yaml file to start collecting your Marathon logs:
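As a minimal sketch, assuming Marathon writes to the /var/log/marathon.log file used by the FILE appender above (adjust the path, source, and service values to match your setup):

logs:
    # tail the Marathon log file written by the file appender
  - type: file
    path: /var/log/marathon.log
    # source selects the log processing pipeline; service tags the emitting service
    source: marathon
    service: marathon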
Metrics
marathon.backoffFactor (gauge)
Backoff time multiplication factor for each consecutive failed task launch; tagged by app_id and version
marathon.backoffSeconds (gauge)
Task backoff period; tagged by app_id and version. Shown as second
marathon.cpus (gauge)
Configured CPUs for each instance of a given application
marathon.deployments (gauge)
Number of running or pending deployments
marathon.disk (gauge)
Configured disk space for each instance of a given application. Shown as mebibyte
marathon.instances (gauge)
Number of instances of a given application; tagged by app_id and version
marathon.mem (gauge)
Configured memory for each instance of a given application; tagged by app_id and version. Shown as mebibyte
marathon.queue.count (gauge)
Number of instances left to launch. Shown as task
marathon.queue.delay (gauge)
Wait before the next launch attempt. Shown as second
marathon.queue.offers.processed (gauge)
The number of processed offers for this launch attempt. Shown as task
marathon.queue.offers.reject.last (gauge)
Summary of unused offers for all last offers. Shown as task
marathon.queue.offers.reject.launch (gauge)
Summary of unused offers for the launch attempt. Shown as task
marathon.queue.offers.unused (gauge)
The number of unused offers for this launch attempt. Shown as task
marathon.queue.size (gauge)
Number of app offer queues. Shown as task
marathon.taskRateLimit (gauge)
The task rate limit for a given application; tagged by app_id and version
marathon.tasksHealthy (gauge)
Number of healthy tasks for a given application; tagged by app_id and version. Shown as task
marathon.tasksRunning (gauge)
Number of tasks running for a given application; tagged by app_id and version. Shown as task
marathon.tasksStaged (gauge)
Number of tasks staged for a given application; tagged by app_id and version. Shown as task
marathon.tasksUnhealthy (gauge)
Number of unhealthy tasks for a given application; tagged by app_id and version. Shown as task
Events
The Marathon check does not include any events.
Service Checks
marathon.can_connect
Returns CRITICAL if the Agent cannot connect to the Marathon API endpoint or if no instances of any application are running, WARN if no applications are detected, and OK otherwise. Additional information about the response status at the time of collection is included in the check message.
Statuses: ok, warning, critical