Overview

Universal Service Monitoring discovers services using popular container tags (such as app, short_image, and kube_deployment) and generates entries in the Service Catalog for those services.

You can access request, error, and duration metrics in Datadog for both inbound and outbound traffic on all services discovered with Universal Service Monitoring. These service health metrics are useful for creating alerts, tracking deployments, and getting started with service level objectives (SLOs) so you can get broad visibility into all services running on your infrastructure.

Universal Service Monitoring SLOs for BITSBOUTIQUE

This guide describes how to search for USM metrics such as universal.http.* and use them in your monitors, SLOs, and dashboards.

USM metrics vs APM metrics

Metric Name | Units | Type | Description
universal.http.client | Seconds | Distribution | Outbound request latency, counts, errors, and rates.
universal.http.client.hits | Hits | Count | Total number of outbound requests and errors.
universal.http.client.apdex | Score | Gauge | The Apdex score of outbound requests for this service.
universal.http.server | Seconds | Distribution | Inbound request latency, counts, errors, and rates.
universal.http.server.hits | Hits | Count | Total number of inbound requests and errors.
universal.http.server.apdex | Score | Gauge | The Apdex score for this web service.

Unlike APM metrics, errors are available under the error:true tag instead of as a separate metric.

Note: The .hits metrics have all of your infrastructure tags and are the recommended way to query request and error counts. You can also add second primary tags to all USM metrics.
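As a rough illustration of how the error:true tag replaces a separate error metric, the following Python sketch builds the scope for a .hits metric and derives an error rate from two counts. The helper names, the service name web-store, and the sample counts are hypothetical, and the strings omit any aggregation prefix your query editor may require.

```python
def hits_query(direction, tags=None, errors_only=False):
    """Build a metric-and-scope string for a USM .hits metric.

    direction: "client" (outbound) or "server" (inbound).
    tags: optional dict of tag filters, for example {"service": "web-store"}.
    errors_only: when True, adds the error:true tag instead of switching metrics.
    """
    if direction not in ("client", "server"):
        raise ValueError("direction must be 'client' or 'server'")
    scope = dict(tags or {})
    if errors_only:
        scope["error"] = "true"
    scope_str = ",".join(f"{key}:{value}" for key, value in scope.items()) or "*"
    return f"universal.http.{direction}.hits{{{scope_str}}}"


def error_rate(total_hits, error_hits):
    """Return the fraction of requests that were errors (0.0 when there is no traffic)."""
    return error_hits / total_hits if total_hits else 0.0


# Hypothetical service and environment tags.
tags = {"service": "web-store", "env": "prod"}
print(hits_query("server", tags))
print(hits_query("server", tags, errors_only=True))
print(error_rate(total_hits=2000, error_hits=50))
```

Both strings use the same metric name; only the scope changes, which is the practical difference from the APM approach of querying a dedicated errors metric.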

Metric syntax

The USM metric query syntax differs from the APM metric query syntax, which uses trace.*. In USM, request counts, error counts, and latency percentiles are all queried from a single distribution metric name rather than from separate metrics.

For example:

APM | USM
trace.universal.http.client.hits{*} | count:universal.http.client{*}
trace.universal.http.client.errors | count:universal.http.client{error:true}
trace.universal.http.client.hits.by_http_status | count:universal.http.client{*} by http_status_family
pXX:trace.universal.http.client{*} | pXX:universal.http.client{*}
trace.universal.http.client.apdex{*} | universal.http.client.apdex{*}

The same translations apply for the universal.http.server operation that captures inbound traffic. For more information about distribution metrics, see DDSketch-based Metrics in APM.
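To make the translation concrete for either operation, here is a small Python sketch that reproduces the five rows of the comparison for client or server. The function name usm_equivalents is hypothetical; it only formats the strings listed in this guide and does not call any Datadog API.

```python
def usm_equivalents(operation):
    """Return the APM-to-USM query translations for 'client' or 'server'."""
    if operation not in ("client", "server"):
        raise ValueError("operation must be 'client' or 'server'")
    apm = f"trace.universal.http.{operation}"
    usm = f"universal.http.{operation}"
    return {
        # Request count: a separate .hits metric in APM, count: on the distribution in USM.
        f"{apm}.hits{{*}}": f"count:{usm}{{*}}",
        # Errors: a separate metric in APM, the error:true tag in USM.
        f"{apm}.errors": f"count:{usm}{{error:true}}",
        # Breakdown by status: grouped by the http_status_family tag in USM.
        f"{apm}.hits.by_http_status": f"count:{usm}{{*}} by http_status_family",
        # Latency percentiles (pXX is a placeholder such as p95).
        f"pXX:{apm}{{*}}": f"pXX:{usm}{{*}}",
        # Apdex keeps its own gauge metric.
        f"{apm}.apdex{{*}}": f"{usm}.apdex{{*}}",
    }


for apm_query, usm_query in usm_equivalents("server").items():
    print(f"{apm_query}  ->  {usm_query}")
```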

Usage

Navigate to Infrastructure > Universal Service Monitoring, filter by Universal Service Monitoring telemetry type, and click on a service. The Performance tab displays service-level graphs on hits, latency, requests, errors, and more. You can also access these metrics when creating a monitor or an SLO, or by looking at a dashboard in the Service Catalog.

Create a monitor

You can create an APM Monitor to trigger an alert when a USM metric such as universal.http.client either crosses a threshold or deviates from an expected pattern.

  1. Navigate to Monitors > New Monitor and click APM.
  2. Select APM Metrics and define a service or resource’s env and any other primary tags. Select a service or resource to monitor and define the time interval over which the monitor evaluates the query.
  3. Select Threshold Alert and select a USM metric such as Requests per Second for the monitor to trigger on. Then, define if the value should be above or below the alert and warning thresholds. Enter a value for the alert threshold, and optionally, for the warning threshold.
  4. The notification section contains a prepopulated message for the monitor. Customize the alert name and message and define the permissions for this monitor.
  5. Click Create.

Universal Service Monitoring Monitor for BITSBOUTIQUE

For more information, see the APM Monitor documentation.
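If you manage monitors as code, the threshold alert described above could be expressed as a request body along the following lines. This is a sketch under stated assumptions: it builds a metric-style "query alert" payload on universal.http.server.hits rather than the APM monitor created in the UI, the exact query string is an assumption you should verify in the monitor editor, and the service name, thresholds, and helper name are hypothetical. The code only prints the payload; sending it to the Monitors API is left out.

```python
import json


def build_error_monitor(service, env, critical, warning=None, window="last_5m"):
    """Build a monitor payload that alerts on inbound USM error counts.

    Assumption: the .hits count metric is queried with sum and as_count().
    """
    scope = f"service:{service},env:{env},error:true"
    query = (
        f"sum({window}):sum:universal.http.server.hits{{{scope}}}.as_count()"
        f" > {critical}"
    )
    thresholds = {"critical": critical}
    if warning is not None:
        thresholds["warning"] = warning
    return {
        "name": f"[USM] Inbound HTTP errors on {service} ({env})",
        "type": "query alert",
        "query": query,
        "message": f"Inbound error count for {service} exceeded {critical} in {window}.",
        "tags": [f"service:{service}", f"env:{env}"],
        "options": {"thresholds": thresholds},
    }


# Hypothetical service and thresholds.
payload = build_error_monitor("web-store", "prod", critical=50, warning=25)
print(json.dumps(payload, indent=2))
```

The critical threshold appears both in the query and in the options so that the two stay consistent when the value changes.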

Create an SLO

You can create an SLO on a per-service basis to ensure you are meeting objectives set by USM metrics and improving availability over time. To cover a large number of services, Datadog recommends creating SLOs programmatically.

To create an SLO from the Service Catalog:

  1. Navigate to the Reliability tab of the Service Catalog.
  2. Under the SLOs column, hover over a service and click + Create Availability SLO or + Create Latency SLO.

Setting up a Universal Service Monitoring SLO for BITSBOUTIQUE

Optionally, to create an SLO manually using USM metrics:

  1. Navigate to Service Management > SLOs and click New SLO.

  2. Select Metric Based and create two queries in the Good events (numerator) section:

    • Query A: Enter a USM metric such as universal.http.server, filter to a specific service by adding primary service and env tags in the from field, and select count in the as field.
    • Query B: Enter a USM metric such as universal.http.server, filter to a specific service by adding primary service and env tags, in addition to an error:true tag in the from field, and select count in the as field.
  3. Click + Add Formula and enter a-b.

  4. In the Total events (denominator) section, enter a USM metric such as universal.http.server, filter to a specific service by adding primary service and env tags in the from field, and select count in the as field.

  5. Click + New Target to create a target threshold with the following settings:

    • The time window is 7 Days, the target threshold is 95%, and the warning threshold is 99.5%. Datadog recommends setting the same target threshold across all time windows.
  6. Enter a name and description for this SLO. Set primary env and service tags, in addition to the team tag.

  7. Click Save and Set Alert.

Setting up a Universal Service Monitoring SLO for BITSBOUTIQUE

For more information, see the Service Level Objectives documentation.
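The manual steps above amount to an availability ratio of (total requests minus errors) over total requests. The following Python sketch shows that arithmetic and a metric-based SLO payload with the 7-day window, 95% target, and 99.5% warning threshold from this guide. The helper names, the service, env, and team values, and the use of universal.http.server.hits with sum and as_count() are assumptions for illustration; check the generated queries in the SLO editor before relying on them.

```python
import json


def availability(total_requests, error_requests):
    """Return the percentage of good requests, or None when there is no traffic."""
    if total_requests == 0:
        return None
    return 100.0 * (total_requests - error_requests) / total_requests


def build_availability_slo(service, env, team, target=95.0, warning=99.5, timeframe="7d"):
    """Build a metric-based SLO payload: good events = all requests minus errors."""
    metric = "universal.http.server.hits"  # assumption: count metric for inbound traffic
    scope = f"service:{service},env:{env}"
    total = f"sum:{metric}{{{scope}}}.as_count()"
    errors = f"sum:{metric}{{{scope},error:true}}.as_count()"
    return {
        "type": "metric",
        "name": f"{service} availability ({env})",
        "description": f"Share of inbound HTTP requests to {service} without errors.",
        "tags": [f"service:{service}", f"env:{env}", f"team:{team}"],
        "thresholds": [{"timeframe": timeframe, "target": target, "warning": warning}],
        "query": {"numerator": f"{total} - {errors}", "denominator": total},
    }


# Hypothetical counts: 10,000 requests with 120 errors over the window.
print(availability(10000, 120))
print(json.dumps(build_availability_slo("web-store", "prod", "checkout"), indent=2))
```

Generating the payload from a function makes it straightforward to loop over a list of services, which is the main reason to create SLOs programmatically rather than one at a time in the UI.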

Access a defined dashboard

The Service Catalog identifies dashboards defined in your service definition file and lists them on the Dashboards tab. Click Manage Dashboards to access and edit the service definition directly in GitHub.

Manage Dashboards button in the Dashboards tab of a service in the Service Catalog

For more information, see the Dashboards documentation.
