Getting started

  1. Navigate to the SLO Manage page.

  2. Start thinking from the perspective of your user:

    • How are your users interacting with your application?
    • What is their journey through the application?
    • Which parts of your infrastructure do these journeys interact with?
    • What are they expecting from your systems and what are they hoping to accomplish?

Select the relevant SLI

STEP 1

Response/Request

Type of SLI     Description
Availability    Could the server respond to the request successfully?
Latency         How long did it take for the server to respond to the request?
Throughput      How many requests can be handled?

Storage

Type of SLI     Description
Availability    Can the data be accessed on demand?
Latency         How long does it take to read or write data?
Durability      Is the data still there when it is needed?

Pipeline

Type of SLI     Description
Correctness     Was the right data returned?
Freshness       How long does it take for new data or processed results to appear?

STEP 2

Best practices for choosing an SLO Type

  • Whenever possible, use metric-based SLOs. It’s best practice to have SLOs where the error budget reflects the number of bad events you have left before you breach your SLO. Metric-based SLO calculations are also volume-weighted: each event counts equally, so periods with more events contribute more to the SLI.
  • If, instead, you want an SLO that tracks uptime and uses a time-based SLI calculation, use time slice SLOs. Unlike monitor-based SLOs, time slice SLOs don’t require you to maintain an underlying monitor for your SLO.
  • Finally, consider monitor-based SLOs for use cases that are not covered by time slice SLOs, which include SLOs based on non-metric monitors or multiple monitors.

For a detailed comparison of the SLO types, see the SLO Type Comparison guide.

Do you require an SLI calculation that is time-based or count-based?

The following SLO types are available in Datadog:

Metric-based SLOs

Example: 99% of requests should complete in less than 250 ms over a 30-day window.

  • Count-based SLI calculation
  • SLI is calculated as the sum of good events divided by the sum of total events
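
The count-based calculation above can be sketched in a few lines of Python. This is a minimal illustration, not Datadog's implementation: the function names and the event totals are hypothetical, chosen to match the 99% / 250 ms example.

```python
def count_based_sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events divided by total events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100.0 * good_events / total_events


def remaining_bad_events(good_events: int, total_events: int, target_pct: float) -> float:
    """Return how many more bad events fit in the error budget.

    A negative result means the SLO is already breached.
    """
    allowed_bad = total_events * (1 - target_pct / 100.0)
    actual_bad = total_events - good_events
    return allowed_bad - actual_bad


# Hypothetical 30-day totals: 1,000,000 requests, 995,000 of them under 250 ms.
sli = count_based_sli(995_000, 1_000_000)
budget_left = remaining_bad_events(995_000, 1_000_000, target_pct=99.0)
print(f"SLI: {sli:.2f}%")
print(f"Bad events left in budget: {budget_left:.0f}")
```

With these assumed totals the SLI is 99.5%, and roughly 5,000 more slow requests would fit in the budget before the 99% target is breached.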

Monitor-based SLOs

Example: the latency of all user requests should be less than 250 ms 99% of the time in any 30-day window.

  • Time-based SLI calculation
  • SLI is calculated from the underlying monitor’s uptime
  • You can select a single monitor, multiple monitors (up to 20), or a single multi-alert monitor with groups

If you need to create a new monitor, go to the Monitor create page.

Time Slice SLOs

Example: the latency of all user requests should be less than 250 ms 99% of the time in any 30-day window.

  • Time-based SLI calculation
  • SLI is calculated from your custom uptime definition, which you express as a metric query
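
To make the time-based calculation concrete, here is a rough Python sketch. It assumes one latency value per equal-length time slice and treats a slice as good when that value is under the threshold. The function name, the sample values, and the slice granularity are all hypothetical; the actual uptime definition is whatever metric query you configure.

```python
def time_slice_sli(latencies_ms: list[float], threshold_ms: float = 250.0) -> float:
    """Return the percentage of time slices whose latency is below the threshold.

    Each element of latencies_ms stands for one equal-length time slice.
    """
    if not latencies_ms:
        raise ValueError("at least one time slice is required")
    good_slices = sum(1 for value in latencies_ms if value < threshold_ms)
    return 100.0 * good_slices / len(latencies_ms)


# Hypothetical latency per slice, in milliseconds.
samples = [120.0, 180.0, 310.0, 90.0, 240.0, 260.0, 200.0, 150.0, 130.0, 170.0]
uptime_pct = time_slice_sli(samples)
print(f"Uptime: {uptime_pct:.1f}%")
```

In this sketch every slice counts equally regardless of how many requests it contained, which is the key difference from the count-based calculation used by metric-based SLOs.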

Implement your SLIs

You can implement your SLIs using any of the following data sources:

  1. Custom metrics (for example, counters)
  2. Integration metrics (for example, load balancer or HTTP request metrics)
  3. Datadog APM (for example, errors and latency on services and resources)
  4. Datadog Logs (for example, metrics generated from logs that count a particular occurrence)

Set your target objective and time window

  1. Select your target: 99%, 99.5%, 99.9%, 99.95%, or any other target value that makes sense for your requirements.
  2. Select your time window: a rolling 7, 30, or 90 days.
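
When weighing a target against a time window, it can help to translate the pair into allowed downtime. The sketch below does this arithmetic for a time-based SLO. The helper name `error_budget` is hypothetical, and the calculation is simply the window length multiplied by (100% minus the target).

```python
from datetime import timedelta


def error_budget(target_pct: float, window_days: int) -> timedelta:
    """Return the allowed downtime for a time-based SLO over a rolling window."""
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100 (exclusive)")
    window = timedelta(days=window_days)
    return window * (1 - target_pct / 100.0)


# Compare the targets and windows listed above.
for target in (99.0, 99.5, 99.9, 99.95):
    for days in (7, 30, 90):
        minutes = error_budget(target, days).total_seconds() / 60
        print(f"{target}% over {days} days: {minutes:.1f} minutes of budget")
```

For example, a 99.9% target over 30 days leaves about 43.2 minutes of budget, while 99% over 7 days leaves about 100.8 minutes.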

Name, describe, and tag your SLOs

  1. Name your SLO.
  2. Add a description: describe what the SLO is tracking and why it is important for your end user experience. You can also add links to dashboards for reference.
  3. Add tags: tagging by team and service is a common practice.

Use tags to search for your SLOs from the SLO list view.
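
As a small illustration of why consistent team and service tags pay off, the following Python sketch filters a local list of SLO records by tag. Everything here is hypothetical: the `SloRecord` class, the sample names, and the `key:value` tag strings are illustrative only and do not represent Datadog's API or data model.

```python
from dataclasses import dataclass, field


@dataclass
class SloRecord:
    """Hypothetical local record of an SLO's name, description, and tags."""
    name: str
    description: str
    tags: list[str] = field(default_factory=list)


def find_by_tags(slos: list[SloRecord], required_tags: list[str]) -> list[SloRecord]:
    """Return the SLOs that carry every one of the required tags."""
    wanted = set(required_tags)
    return [slo for slo in slos if wanted.issubset(slo.tags)]


slos = [
    SloRecord("Checkout latency", "99% of checkout requests under 250 ms",
              ["team:payments", "service:checkout"]),
    SloRecord("Search availability", "Search responds successfully",
              ["team:discovery", "service:search"]),
    SloRecord("Refund latency", "Refund requests complete quickly",
              ["team:payments", "service:refunds"]),
]

payments_slos = find_by_tags(slos, ["team:payments"])
print([slo.name for slo in payments_slos])
```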
