Create a Dashboard to track and correlate APM metrics

4 minutes to complete

Datadog APM lets you build dashboards around your business priorities and the metrics that matter to you. Widgets on these dashboards can track traditional infrastructure, log, and custom metrics, such as host memory usage, alongside critical APM metrics based on throughput, latency, and error rate, so that you can correlate them. You can also track the latency experienced by your top customers or on your largest transactions, and monitor the throughput of your main web server ahead of major events like Black Friday.

This guide walks you through adding trace metrics to a dashboard, correlating them with infrastructure metrics, and exporting an Analytics query. It covers three ways of adding widgets to a dashboard:

  • Copying an existing APM graph (Steps 1, 2, and 3)
  • Creating it manually (Steps 4 and 5)
  • Exporting an Analytics query (Step 7)

  1. Open the Service Catalog and choose the web-store service.

  2. Find the Total Requests Graph and click on the export button on the top right to choose Export to Dashboard. Click New Timeboard.

  3. Click on View Dashboard in the success message.

    In the new dashboard, the Hit/error count on service graph for the web-store service is now available. It shows the total throughput of this service as well as its total number of errors.

    Note: You can click on the pencil icon to edit this graph and see exactly which metrics are being used.

  4. Click on the Add graph placeholder tile in the dashboard space, then drag a Timeseries to this space.

    This is the dashboard widget edit screen. It lets you create any type of visualization from all of the metrics available to you. See the Timeseries widget documentation to learn more.

  5. Click on the system.cpu.user box and choose the metric and parameters relevant to you. In this example:

    | Parameter | Value | Description |
    |-----------|-------|-------------|
    | metric | trace.rack.requests.errors | The Ruby Rack total set of erroneous requests. |
    | from | service:web-store | The main service in this example stack. It is a Ruby service, and all the information in the chart will come from it. |
    | sum by | http.status_code | Breaks down the chart by HTTP status code. |

    This specific breakdown is just one example of the many you can choose. Note that any metric that starts with trace. contains APM information. See the APM metric documentation to learn more.
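The three fields in the table above correspond to a single metric query string. As a minimal sketch, assuming the usual Datadog query layout of `aggregator:metric{scope} by {tags}`, the helper below (a hypothetical function written for this guide, not part of any Datadog library) assembles that string from the same values:

```python
def build_metric_query(aggregator, metric, scope, group_by=()):
    """Assemble a metric query string from the graph editor's fields.

    Hypothetical helper for illustration; it only formats a string.
    """
    # Scope goes inside braces: aggregator:metric{scope}
    query = f"{aggregator}:{metric}{{{scope}}}"
    # Optional grouping is appended as: by {tag1,tag2}
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return query


# Values taken from the table in step 5.
errors_by_status = build_metric_query(
    aggregator="sum",
    metric="trace.rack.requests.errors",
    scope="service:web-store",
    group_by=["http.status_code"],
)
print(errors_by_status)
```

Clicking the pencil icon on the widget lets you compare the query the editor actually generated with this string.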

  6. Drag another timeseries to the placeholder tile.

    In this example, two different types of metrics are added to one graph: a trace.* metric and a runtime.* metric. Combined, these metrics let you correlate request behavior with code runtime performance. Specifically, the latency of a service is displayed next to its thread count, because latency spikes might be associated with an increase in the thread count:

    1. First, add the trace.rack.request.duration.by.service.99p metric to the widget:

      | Parameter | Value | Description |
      |-----------|-------|-------------|
      | metric | trace.rack.request.duration.by.service.99p | The 99th percentile of latency of requests in our service. |
      | from | service:web-store | The main service in this example stack. It is a Ruby service, and all the information in the chart will come from it. |

    2. Then click on Graph additional: Metrics to add another metric to the chart:

      | Parameter | Value | Description |
      |-----------|-------|-------------|
      | metric | runtime.ruby.thread_count | Thread count taken from the Ruby runtime metrics. |
      | from | service:web-store | The main service in this example stack. It is a Ruby service, and all the information in the chart will come from it. |

    This setup can show whether a spike in latency coincides with a spike in the Ruby thread count, which points to a likely cause of the latency and helps you resolve it faster.
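For readers who manage dashboards as JSON rather than through the UI, the two-metric graph from step 6 could look roughly like the widget definition built below. This is a sketch under assumptions: the `avg` aggregator and the widget title are illustrative choices not specified in this guide, and the exact JSON the editor produces for your dashboard may differ, so compare it with the widget's own JSON view before relying on it.

```python
import json

SERVICE_SCOPE = "service:web-store"


def timeseries_widget(title, queries):
    """Build a timeseries widget definition with one line per query.

    Illustrative helper; it returns a plain dict and calls no API.
    """
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": q, "display_type": "line"} for q in queries],
        }
    }


widget = timeseries_widget(
    "web-store p99 latency vs. Ruby thread count",  # assumed title
    [
        # "avg" is an assumed aggregator for both metrics.
        f"avg:trace.rack.request.duration.by.service.99p{{{SERVICE_SCOPE}}}",
        f"avg:runtime.ruby.thread_count{{{SERVICE_SCOPE}}}",
    ],
)
print(json.dumps(widget, indent=2))
```

Keeping both queries in one widget, rather than two side-by-side graphs, puts the series on a shared time axis, which is what makes a coinciding spike easy to spot.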

  7. Go to Analytics.

    This example shows how to query latency across the example application, break it down by merchant on the platform, and view the top 10 merchants with the highest latency. From the Analytics screen, export the graph to the dashboard and view it there.
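To make the shape of that Analytics query concrete, the sketch below reproduces the same aggregation in plain Python: group by merchant, average the latency, and keep the highest entries. The field names (`merchant`, `duration_ms`) and the sample records are hypothetical and exist only to illustrate the computation; Analytics runs the equivalent query over your real traces.

```python
from collections import defaultdict


def top_groups_by_avg(records, group_key, value_key, limit=10):
    """Return the `limit` groups with the highest average value, descending."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for record in records:
        bucket = totals[record[group_key]]
        bucket[0] += record[value_key]
        bucket[1] += 1
    averages = {group: total / count for group, (total, count) in totals.items()}
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical sample data, not taken from a real application.
sample_spans = [
    {"merchant": "merchant-a", "duration_ms": 120.0},
    {"merchant": "merchant-b", "duration_ms": 300.0},
    {"merchant": "merchant-a", "duration_ms": 80.0},
    {"merchant": "merchant-c", "duration_ms": 50.0},
]

for merchant, avg_ms in top_groups_by_avg(sample_spans, "merchant", "duration_ms", limit=2):
    print(f"{merchant}: {avg_ms:.1f} ms")
```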

  8. Return to your dashboard.

    The dashboard now contains multiple widgets that provide observability into the example application from both a technical and a business perspective. This is only the start of what you can do: add infrastructure metrics, use multiple types of visualizations, and add calculations and projections.

    With the dashboard you can also explore related events.

  9. Click on the Search Events or Logs button and search for a relevant event explorer. Note: This example uses Ansible; your event explorer might be different.

    Here, alongside the dashboard, you can see recent events from Datadog or from external services like Ansible and Chef, such as deployments, task completions, or monitor alerts. You can then correlate these events with what is happening to the metrics set up in the dashboard.

    Finally, make sure to use template variables. These are a set of values that dynamically control the widgets on a dashboard, and any user can change them without editing the widgets themselves. For more information, see the Template Variable documentation.

  10. Click on Add Variable in the header. Choose the tag that the variable will control, and configure its name, default value, or available values.

    In this example, a template variable for Region is added to see how the dashboard behaves across us-east1 and europe-west-4, our two primary areas of operation.

    Add Variable popover showing field options to add variable name and variable tags

    You can now add this template variable to each of the graphs:

    Add dynamic template variables to your query, this example shows '$RG' to dynamically scope to the region template variable

    When you select template variable values, all values update in the applicable widgets of the dashboard.
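Conceptually, a template variable is a placeholder in the query that is swapped for the selected tag value before the widget is evaluated. The sketch below mimics that substitution for the `$RG` variable from this example. It is an illustration of the idea rather than how Datadog implements it, and the `region` tag key and the query itself are assumptions made for the example.

```python
def apply_template_variables(query, variables):
    """Replace $NAME placeholders in a query with their selected values.

    Illustrative only; `variables` maps a name such as "RG" to a
    "tag:value" string.
    """
    # Substitute longer names first so that "$RG" is not replaced inside a
    # longer placeholder such as "$RGN" when both are defined.
    for name in sorted(variables, key=len, reverse=True):
        query = query.replace(f"${name}", variables[name])
    return query


# Assumed query: p99 latency for web-store, scoped by the $RG variable.
template_query = "avg:trace.rack.request.duration.by.service.99p{service:web-store,$RG}"

for region in ("us-east1", "europe-west-4"):
    # "region" is an assumed tag key for the Region variable.
    print(apply_template_variables(template_query, {"RG": f"region:{region}"}))
```

Because every widget references the same placeholder, changing the selected value once rescopes all of them together.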

    Be sure to explore all the metrics available to you and take full advantage of Datadog's three pillars of observability. You can turn this basic dashboard into a one-stop shop for monitoring and observability in your organization.
