Universal Service Monitoring discovers services using popular container tags (such as app, short_image, and kube_deployment) and generates entries in the Service Catalog for those services.
You can access request, error, and duration metrics in Datadog for both inbound and outbound traffic on all services discovered with Universal Service Monitoring. These service health metrics are useful for creating alerts, tracking deployments, and getting started with service level objectives (SLOs) so you can get broad visibility into all services running on your infrastructure.
This guide describes how to search for USM metrics such as universal.http.* and use them in your monitors, SLOs, and dashboards.
USM metrics vs APM metrics
| Metric Name | Units | Type | Description |
| universal.http.client | Seconds | Distribution | Outbound request latency, counts, errors, and rates. |
| universal.http.client.hits | Hits | Count | Total number of outbound requests and errors. |
| universal.http.client.apdex | Score | Gauge | The Apdex score of outbound requests for this service. |
| universal.http.server | Seconds | Distribution | Inbound request latency, counts, errors, and rates. |
| universal.http.server.hits | Hits | Count | Total number of inbound requests and errors. |
| universal.http.server.apdex | Score | Gauge | The Apdex score for this web service. |
Unlike APM metrics, errors are available under the error:true tag instead of as a separate metric.
Note: The .hits metrics have all of your infrastructure tags and are the recommended way to query request and error counts. You can also add second primary tags to all USM metrics.
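Because errors are exposed as a tag rather than a separate metric, an error rate is the ratio of two queries on the same .hits metric: one filtered by error:true and one unfiltered. The sketch below shows that arithmetic only. The function name and the sample values are hypothetical and do not come from a real account.

```python
def error_rate(total_hits: float, error_hits: float) -> float:
    """Return the fraction of requests that failed, or 0.0 when there was no traffic."""
    if total_hits <= 0:
        return 0.0
    return error_hits / total_hits


# Hypothetical counts, standing in for the results of two queries:
#   universal.http.server.hits{service:web-store}
#   universal.http.server.hits{service:web-store,error:true}
total = 1200
errors = 30

print(f"{error_rate(total, errors):.2%}")  # prints 2.50%
```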
Metric syntax
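The USM metric query syntax differs from the APM metric query syntax, which uses trace.*. USM metrics fall under a single distribution metric name, so the aggregation prefix (such as count or pXX) and tags (such as error:true) select what APM exposes as separate metrics.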
For example:
| APM | USM |
| trace.universal.http.client.hits{*} | count:universal.http.client{*} |
| trace.universal.http.client.errors | count:universal.http.client{error:true} |
| trace.universal.http.client.hits.by_http_status | count:universal.http.client{*} by http_status_family |
| pXX:trace.universal.http.client{*} | pXX:universal.http.client{*} |
| trace.universal.http.client.apdex{*} | universal.http.client.apdex{*} |
The same translations apply for the universal.http.server operation that captures inbound traffic. For more information about distribution metrics, see DDSketch-based Metrics in APM.
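The pattern in the table (aggregation prefix, metric name, tag scope, optional grouping) can be sketched as a small string builder. This is an illustration of the notation used above, not a Datadog library function: usm_query and its parameters are hypothetical names.

```python
from typing import Mapping, Optional, Sequence


def usm_query(
    operation: str,
    aggregation: str = "count",
    filters: Optional[Mapping[str, str]] = None,
    group_by: Optional[Sequence[str]] = None,
) -> str:
    """Build a USM query string in the notation of the table above.

    operation is "client" (outbound) or "server" (inbound).
    """
    if operation not in ("client", "server"):
        raise ValueError("operation must be 'client' or 'server'")
    # An empty filter set becomes the wildcard scope {*}.
    scope = ",".join(f"{key}:{value}" for key, value in filters.items()) if filters else "*"
    query = f"{aggregation}:universal.http.{operation}{{{scope}}}"
    if group_by:
        # Mirrors the table's shorthand; the metrics query language
        # wraps group-by tags in braces, for example: by {http_status_family}.
        query += " by " + ",".join(group_by)
    return query


print(usm_query("client"))                                    # count:universal.http.client{*}
print(usm_query("client", filters={"error": "true"}))         # count:universal.http.client{error:true}
print(usm_query("server", aggregation="p95"))                 # p95:universal.http.server{*}
print(usm_query("client", group_by=["http_status_family"]))   # count:universal.http.client{*} by http_status_family
```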
Usage
Navigate to Infrastructure > Universal Service Monitoring, filter by Universal Service Monitoring telemetry type, and click on a service. The Performance tab displays service-level graphs on hits, latency, requests, errors, and more. You can also access these metrics when creating a monitor or an SLO, or by looking at a dashboard in the Service Catalog.
Create a monitor
You can create an APM Monitor to trigger an alert when a USM metric such as universal.http.client either crosses a threshold or deviates from an expected pattern.
Select APM Metrics and define a service or resource’s env and any other primary tags. Select a service or resource to monitor and define the time interval over which the monitor evaluates the query.
Select Threshold Alert and select a USM metric such as Requests per Second for the monitor to trigger on. Then, define whether the value should be above or below the alert and warning thresholds. Enter a value for the alert threshold and, optionally, for the warning threshold.
The notification section contains a prepopulated message for the monitor. Customize the alert name and message and define the permissions for this monitor.
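The steps above use the monitor creation page. For comparison, a similar threshold alert could be described as a metric monitor definition built in code. The sketch below only assembles the request body as a dictionary and does not call any API. It assumes the monitor API's name, type, query, message, tags, and options.thresholds fields, and it assumes the .hits count metric accepts the usual sum and as_count query form. The function name, service name, and threshold values are hypothetical.

```python
import json
from typing import Optional


def build_error_monitor(
    service: str,
    env: str,
    critical: int,
    warning: Optional[int] = None,
    window: str = "last_5m",
) -> dict:
    """Assemble a threshold monitor body for inbound USM errors on one service."""
    scope = f"service:{service},env:{env},error:true"
    # Assumed query form for a count metric: sum over the evaluation window.
    query = f"sum({window}):sum:universal.http.server.hits{{{scope}}}.as_count() > {critical}"
    thresholds = {"critical": critical}
    if warning is not None:
        thresholds["warning"] = warning
    return {
        "name": f"High inbound error count on {service}",
        "type": "query alert",
        "query": query,
        "message": f"Inbound HTTP errors on {service} ({env}) exceeded {critical} in {window}.",
        "tags": [f"service:{service}", f"env:{env}"],
        "options": {"thresholds": thresholds},
    }


# Hypothetical service and thresholds, for illustration only.
monitor = build_error_monitor("web-store", "prod", critical=50, warning=25)
print(json.dumps(monitor, indent=2))
```

The warning threshold is optional here, as it is in the UI flow described above.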
Create an SLO
You can create an SLO on a per-service basis to ensure you are meeting objectives set by USM metrics and improving availability over time. Datadog recommends creating SLOs programmatically when you need to cover many services.
Select Metric Based and create two queries in the Good events (numerator) section:
Query A: Enter a USM metric such as universal.http.server, filter to a specific service by adding primary service and env tags in the from field, and select count in the as field.
Query B: Enter a USM metric such as universal.http.server, filter to a specific service by adding primary service and env tags, in addition to an error:true tag in the from field, and select count in the as field.
Click + Add Formula and enter a-b.
In the Total events (denominator) section, enter a USM metric such as universal.http.server, filter to a specific service by adding primary service and env tags in the from field, and select count in the as field.
Click + New Target to create a target threshold with the following settings:
The time window is 7 Days, the target threshold is 95%, and the warning threshold is 99.5%. Datadog recommends setting the same target threshold across all time windows.
Enter a name and description for this SLO. Set primary env and service tags, in addition to the team tag.
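Since creating SLOs programmatically is recommended for covering many services, the steps above can be sketched as a function that builds one metric-based SLO definition per service. The sketch only assembles dictionaries and does not send them anywhere. It assumes the SLO API's type, query.numerator, query.denominator, thresholds, and tags fields, and it assumes that selecting count in the as field corresponds to appending .as_count() to a count: query. The function name, service names, and team names are hypothetical.

```python
def build_usm_slo(
    service: str,
    env: str,
    team: str,
    target: float = 95.0,
    warning: float = 99.5,
    timeframe: str = "7d",
) -> dict:
    """Assemble a metric-based SLO body: good events = all requests minus errors."""
    scope = f"service:{service},env:{env}"
    # Total events (denominator): every inbound request for the service.
    total = f"count:universal.http.server{{{scope}}}.as_count()"
    # Query B from the steps above: the same scope plus error:true.
    errors = f"count:universal.http.server{{{scope},error:true}}.as_count()"
    return {
        "name": f"{service} inbound request success rate",
        "description": f"Share of inbound HTTP requests to {service} without errors.",
        "type": "metric",
        # Good events (numerator) is the formula a - b.
        "query": {"numerator": f"{total} - {errors}", "denominator": total},
        "thresholds": [{"timeframe": timeframe, "target": target, "warning": warning}],
        "tags": [f"service:{service}", f"env:{env}", f"team:{team}"],
    }


# Hypothetical service-to-team mapping, for illustration only.
services = {"web-store": "checkout", "ad-server": "ads"}
slos = [build_usm_slo(name, "prod", team) for name, team in services.items()]
for slo in slos:
    print(slo["name"], "->", slo["query"]["numerator"])
```

The defaults mirror the settings listed above: a 7-day window, a 95% target, and a 99.5% warning threshold.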
Access a defined dashboard
The Service Catalog identifies dashboards defined in your service definition file and lists them on the Dashboards tab. Click Manage Dashboards to access and edit the service definition directly in GitHub.