LLM Observability Experiments is in Preview.
[Image: LLM Observability Experiment view. Heading: "Comparing 12 experiments across 9 fields". A line graph charts the accuracy, correctness, duration, estimated cost, and other metrics of various experiments.]

LLM Observability Experiments supports the entire lifecycle of building LLM applications and agents. It helps you understand how changes to prompts, models, providers, or system architecture affect performance. With this feature, you can:

  • Create and version datasets
  • Run and manage experiments
  • Compare results to evaluate impact

There are two ways to use Experiments: through the Python SDK or through the HTTP API.

Explore Experiments with Jupyter notebooks

You can use the Jupyter notebooks in the LLM Observability Experiments repository to learn more about Experiments.

Usage: Python SDK

Installation

Install Datadog’s LLM Observability Python SDK:

export DD_FAST_BUILD=1
pip install git+https://github.com/DataDog/dd-trace-py.git@llm-experiments

If you see errors regarding the Rust toolchain, ensure that Rust is installed. Instructions are provided in the error message.

Setup

Environment variables

Specify the following environment variables in your application startup command:

| Variable | Description |
| --- | --- |
| DD_API_KEY | Your Datadog API key |
| DD_APP_KEY | Your Datadog application key |
| DD_SITE | Your Datadog site. Defaults to datadoghq.com. |
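For example, these can be exported in the shell before launching your application (placeholder values shown):

```shell
# Placeholder values -- replace with your own credentials.
export DD_API_KEY="your-api-key"
export DD_APP_KEY="your-application-key"
# Optional: omit to use the default site, datadoghq.com.
export DD_SITE="datadoghq.com"
```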

Project initialization

Call init() to define the project where you want to write your experiments.

import ddtrace.llmobs.experimentation as dne

dne.init(project_name="example")

Dataset class

A dataset is a collection of inputs and expected outputs. You can construct datasets from production data, from staging data, or manually. You can also push and retrieve datasets from Datadog.

| Parameter | Type | Description |
| --- | --- | --- |
| name (required) | string | Name of the dataset |
| data (required) | List[Dict[str, Union[str, Dict[str, Any]]]] | List of dictionaries. The key is a string. The value can be a string or a dictionary. |
| description | string | Description of the dataset |

The dictionaries should all have the same schema and contain the following keys:

  • input: String or dictionary of input data
  • expected_output (optional): String or dictionary of expected output data

Returns

Instance of Dataset

Example

import ddtrace.llmobs.experimentation as dne

dne.init(project_name="example")

dataset = dne.Dataset(
    name="capitals-of-the-world",
    data=[
        {"input": "What is the capital of China?", "expected_output": "Beijing"},
        {
            "input": "Which city serves as the capital of South Africa?",
            "expected_output": "Pretoria",
        },
        {
            "input": "What is the capital of Switzerland?",
            "expected_output": "Bern",
        },
        {
            "input": "Name the capital city of a country that starts with 'Z'."  # Open-ended question
        }
    ],
)
Dataset.pull(name: str) -> Dataset
| Parameter | Type | Description |
| --- | --- | --- |
| name (required) | string | Name of the dataset to retrieve from Datadog |

Returns

Instance of Dataset

Example

import ddtrace.llmobs.experimentation as dne

dne.init(project_name="example")

dataset = dne.Dataset.pull("capitals-of-the-world")
Dataset.from_csv(
    filepath: str,
    name: str,
    description: str = "",
    delimiter: str = ",",
    input_columns: List[str] = None,
    expected_output_columns: List[str] = None,
) -> Dataset
| Parameter | Type | Description |
| --- | --- | --- |
| filepath (required) | string | Local path to the CSV file |
| name (required) | string | Name of the dataset |
| description | string | Description of the dataset |
| input_columns (required) | List[str] | List of column names to use as input data |
| expected_output_columns (required) | List[str] | List of column names to use as expected output data |
| metadata_columns | List[str] | List of column names to include as metadata |
| delimiter | string | Delimiter character for CSV files. Defaults to ,. |

The CSV file must have a header row so that input and expected output columns can be mapped.

Returns

Instance of Dataset

Example

data.csv

question,answer,difficulty,category
What is 2+2?,4,easy,math
What is the capital of France?,Paris,medium,geography
import ddtrace.llmobs.experimentation as dne

dne.init(project_name="example")


dataset = dne.Dataset.from_csv(
    filepath="data.csv",
    name="my_dataset",
    input_columns=["question", "category", "difficulty"],
    expected_output_columns=["answer"],
)
Dataset.push(overwrite: bool = None, new_version: bool = None)

| Parameter | Type | Description |
| --- | --- | --- |
| overwrite | boolean | If True, overwrites the dataset rows of an existing version. |
| new_version | boolean | If True, creates a new version of the dataset in Datadog. Defaults to True. This flag is useful for creating a new version with entirely new data. |

Example

import ddtrace.llmobs.experimentation as dne

dne.init(project_name="example")

dataset = dne.Dataset(...)

dataset.push()
Dataset.as_dataframe(multiindex: bool = True) -> pd.DataFrame

| Parameter | Type | Description |
| --- | --- | --- |
| multiindex | boolean | If True, expands nested dictionaries into MultiIndex columns. Defaults to True. |

Returns

Instance of pandas DataFrame
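To picture what multiindex=True does, the sketch below (plain Python, not the SDK implementation) shows how nested record fields map to (section, field) column pairs:

```python
# Illustration only: nested dataset records become MultiIndex-style
# (section, field) column tuples when multiindex=True.
records = [
    {"input": {"question": "What is 2+2?"}, "expected_output": {"answer": "4"}},
    {"input": {"question": "Capital of France?"}, "expected_output": {"answer": "Paris"}},
]

def flatten(record):
    """Expand one nested record into {(section, field): value} pairs."""
    flat = {}
    for section, fields in record.items():
        for field, value in fields.items():
            flat[(section, field)] = value
    return flat

columns = sorted({key for r in records for key in flatten(r)})
# columns -> [('expected_output', 'answer'), ('input', 'question')]
```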

Experiment class

An experiment is a collection of traces that tests the behavior of an LLM feature or LLM application against a dataset. The input data comes from the dataset, and the outputs are the final generations of the feature or application that is being tested. The Experiment class manages the execution and evaluation of LLM tasks on datasets.

Experiment(
    name: str,
    task: Callable,
    dataset: Dataset,
    evaluators: List[Callable],
    tags: List[str] = [],
    description: str = "",
    metadata: Dict[str, Any] = {},
    config: Optional[Dict[str, Any]] = None
)
| Parameter | Type | Description |
| --- | --- | --- |
| name (required) | string | Name of the experiment |
| task (required) | function | Function decorated with @task that processes each dataset record |
| dataset (required) | Dataset | Dataset to run the experiment against |
| evaluators | function[] | List of functions decorated with @evaluator that run against all outputs in the results |
| tags | string[] | Optional list of tags for organizing experiments |
| description | string | Description of the experiment |
| metadata | Dict[str, Any] | Additional metadata about the experiment |
| config | Dict[str, Any] | A key-value collection used inside the task to determine its behavior |

Returns

Instance of Experiment

Experiment.run(jobs: int = 10, raise_errors: bool = False, sample_size: int = None) -> ExperimentResults
| Parameter | Type | Description |
| --- | --- | --- |
| jobs | int | Number of worker threads used to run the task concurrently. Defaults to 10. |
| raise_errors | boolean | If True, stops execution as soon as the task raises its first exception. If False, every exception is handled and the experiment runs to completion. |
| sample_size | int | Number of rows used for the experiment. You can combine sample_size with raise_errors to test your task before running a long experiment. |

Returns

Instance of ExperimentResults

Example

# To test, run the first 10 rows and check for errors
results = experiment.run(raise_errors=True, sample_size=10)

# If the results look good, run the full experiment
results = experiment.run()
Experiment.run_evaluations(evaluators: Optional[List[Callable]] = None, raise_errors: bool = False) -> ExperimentResults
| Parameter | Type | Description |
| --- | --- | --- |
| evaluators | function[] | List of functions decorated with @evaluator that run against all outputs in the results. |
| raise_errors | boolean | If True, stops execution as soon as an evaluator raises its first exception. If False, every exception is handled and evaluation runs to completion. |

Returns

Instance of ExperimentResults

ExperimentResults class

Contains and manages the results of an experiment run.

ExperimentResults.as_dataframe(multiindex: bool = True) -> pd.DataFrame
| Parameter | Type | Description |
| --- | --- | --- |
| multiindex | boolean | If True, expands nested dictionaries into MultiIndex columns. Defaults to True. |

Returns

Instance of pandas DataFrame

Decorators

Decorators are required to define the task functions and evaluator functions that an experiment uses.

A function decorated with @task processes each dataset record and receives the following arguments:

| Parameter | Type | Description |
| --- | --- | --- |
| input (required) | Dict[str, Any] | Dataset input field used for your business logic |
| config | Dict[str, Any] | Modifies the behavior of the task (prompts, models, etc.) |
from typing import Any, Dict, Optional

import ddtrace.llmobs.experimentation as dne

dne.init(project_name="example")

@dne.task
def process(input: Dict[str, Any], config: Optional[Dict[str, Any]] = None) -> Any:
    # Your business logic
    ...
A function decorated with @evaluator scores a task's output and receives the following arguments:

| Parameter | Type | Description |
| --- | --- | --- |
| input (required) | Any | Dataset input field used for your evaluation logic |
| output (required) | Any | Task output field used for your evaluation logic |
| expected_output (required) | Any | Dataset expected_output field used for your evaluation logic |
from typing import Any

import ddtrace.llmobs.experimentation as dne

dne.init(project_name="example")

@dne.evaluator
def evaluate(input: Any, output: Any, expected_output: Any) -> Any:
    # Your evaluation logic
    ...
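For instance, an exact-match evaluator (a hypothetical example, not part of the SDK) could compare the task output against the dataset's expected output. The function body below is plain Python so it stays self-contained; in real use it would be decorated with @dne.evaluator as shown above:

```python
# Hypothetical evaluator body: True when the task output matches the
# expected output, ignoring case and surrounding whitespace.
def exact_match(input, output, expected_output):
    if expected_output is None:
        return None  # open-ended records have no ground truth
    return str(output).strip().lower() == str(expected_output).strip().lower()

exact_match("What is the capital of Switzerland?", " Bern ", "Bern")  # True
```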

Usage: LLM Observability Experiments API

Postman quickstart

Datadog highly recommends importing the Experiments Postman collection into Postman. Postman’s View documentation feature can help you better understand this API.

Request format

| Field | Type | Description |
| --- | --- | --- |
| data | Object: Data | The request body is nested within a top-level data field. |

Example: Creating a project

{
  "data": {
    "type": "projects",
    "attributes": {
        "name": "Project example",
        "description": "Description example"
    }
  }
}
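As a sketch, the envelope above can be built and serialized in Python before sending it with your HTTP client of choice. The endpoint URL and client call are omitted here; DD-API-KEY and DD-APPLICATION-KEY are Datadog's standard API authentication headers, with placeholder values:

```python
import json

# Build the JSON:API-style envelope for creating a project.
payload = {
    "data": {
        "type": "projects",
        "attributes": {
            "name": "Project example",
            "description": "Description example",
        },
    }
}
body = json.dumps(payload)

# Standard Datadog API authentication headers; values are placeholders.
headers = {
    "DD-API-KEY": "your-api-key",
    "DD-APPLICATION-KEY": "your-application-key",
    "Content-Type": "application/json",
}
```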

Response format

| Field | Type | Description |
| --- | --- | --- |
| data | Object: Data | The response body of an experimentation API is nested within a top-level data field. |
| meta | Object: Page | Pagination attributes. |

Example: Retrieving projects

{
    "data": [
        {
            "id": "4ac5b6b2-dcdb-40a9-ab29-f98463f73b4z",
            "type": "projects",
            "attributes": {
                "created_at": "2025-02-19T18:53:03.157337Z",
                "description": "Description example",
                "name": "Project example",
                "updated_at": "2025-02-19T18:53:03.157337Z"
            }
        }
    ],
    "meta": {
        "after": ""
    }
}

Object: Data

| Field | Type | Description |
| --- | --- | --- |
| id | string | The ID of an experimentation entity. Note: Set your ID field reference at this level. |
| type | string | Identifies the kind of resource an object represents. For example: projects, experiments, datasets, etc. |
| attributes | json | Contains all the resource's data except for the ID. |

Object: Page

| Field | Type | Description |
| --- | --- | --- |
| after | string | The cursor to use to get the next results, if any. Provide the page[cursor] query parameter in your request to get the next results. |
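The cursor flow can be sketched as a loop: request a page, read meta.after, and pass it back as page[cursor] until it comes back empty. The fetch_page function below is a stand-in for your HTTP call, not a real client:

```python
# Simulated paginated responses, mimicking the API's data/meta envelope.
PAGES = [
    {"data": [{"id": "a"}, {"id": "b"}], "meta": {"after": "cursor-1"}},
    {"data": [{"id": "c"}], "meta": {"after": ""}},  # empty cursor: last page
]

def fetch_page(cursor=None):
    """Stand-in for an HTTP GET with ?page[cursor]=<cursor>."""
    return PAGES[0] if cursor is None else PAGES[1]

def list_all():
    items, cursor = [], None
    while True:
        page = fetch_page(cursor)
        items.extend(page["data"])
        cursor = page["meta"].get("after")
        if not cursor:  # no cursor means no further pages
            return items

ids = [item["id"] for item in list_all()]
# ids -> ['a', 'b', 'c']
```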

Projects API

Request type: projects

List all projects, sorted by creation date. The most recently-created projects are first.

Query parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filter[name] | string | The name of a project to search for. |
| filter[id] | string | The ID of a project to search for. |
| page[cursor] | string | List results with a cursor provided in the previous query. |
| page[limit] | int | Limits the number of results. |

Response

| Field | Type | Description |
| --- | --- | --- |
| within Data | []Project | List of projects. |

Object: Project

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique project ID. Set at the top level id field within the Data object. |
| ml_app | string | ML app name. |
| name | string | Unique project name. |
| description | string | Project description. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Create a project. If there is an existing project with the same name, the API returns the existing project unmodified.

Request

| Field | Type | Description |
| --- | --- | --- |
| name (required) | string | Unique project name. |
| ml_app | string | ML app name. |
| description | string | Project description. |

Response

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique ID for the project. Set at the top level id field within the Data object. |
| ml_app | string | ML app name. |
| name | string | Unique project name. |
| description | string | Project description. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Partially update a project object. Specify the fields to update in the payload.

Request

| Field | Type | Description |
| --- | --- | --- |
| name | string | Unique project name. |
| ml_app | string | ML app name. |
| description | string | Project description. |

Response

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique ID for the project. Set at the top level id field within the Data object. |
| ml_app | string | ML app name. |
| name | string | Unique project name. |
| description | string | Project description. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Batch delete operation.

Request

| Field | Type | Description |
| --- | --- | --- |
| project_ids (required) | []string | List of project IDs to delete. |
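Assuming the same data envelope as the other requests on this page, a batch delete body might look like the following (the ID is a placeholder):

```python
import json

# Sketch of a batch-delete request body; the envelope shape is assumed
# to match the other requests on this page, and the ID is a placeholder.
payload = {
    "data": {
        "type": "projects",
        "attributes": {
            "project_ids": ["00000000-0000-0000-0000-000000000000"],
        },
    }
}
body = json.dumps(payload)
```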

Response

200 - OK

Datasets API

Request type: datasets

List all datasets, sorted by creation date. The most recently-created datasets are first.

Query parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filter[name] | string | The name of a dataset to search for. |
| filter[id] | string | The ID of a dataset to search for. |
| page[cursor] | string | List results with a cursor provided in the previous query. |
| page[limit] | int | Limits the number of results. |

Response

| Field | Type | Description |
| --- | --- | --- |
| within Data | []Dataset | List of datasets. |

Object: Dataset

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique dataset ID. Set at the top level id field within the Data object. |
| name | string | Unique dataset name. |
| description | string | Dataset description. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Create a dataset. If there is an existing dataset with the same name, the API returns the existing dataset unmodified.

Request

| Field | Type | Description |
| --- | --- | --- |
| name (required) | string | Unique dataset name. |
| description | string | Dataset description. |
| metadata | json | Arbitrary user-defined metadata. |

Response

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique ID for the dataset. Set at the top level id field within the Data object. |
| name | string | Unique dataset name. |
| description | string | Dataset description. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Partially update a dataset object. Specify the fields to update in the payload.

Request

| Field | Type | Description |
| --- | --- | --- |
| name | string | Unique dataset name. |
| description | string | Dataset description. |
| metadata | json | Arbitrary user-defined metadata. |

Response

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique ID for the dataset. Set at the top level id field within the Data object. |
| name | string | Unique dataset name. |
| description | string | Dataset description. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Batch delete operation.

Request

| Field | Type | Description |
| --- | --- | --- |
| dataset_ids (required) | []string | List of dataset IDs to delete. |

Response

200 - OK

List all dataset records, sorted by creation date. The most recently-created records are first.

Query parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filter[version] | string | List results for a given dataset version. |
| page[cursor] | string | List results with a cursor provided in the previous query. |
| page[limit] | int | Limits the number of results. |

Response

| Field | Type | Description |
| --- | --- | --- |
| within Data | []Record | List of dataset records. |

Object: Record

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique record ID. |
| dataset_id | string | Unique dataset ID. |
| input | any valid JSON type (string, int, object, etc.) | Data that serves as the starting point for an experiment. |
| expected_output | any valid JSON type (string, int, object, etc.) | Expected output. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Append records to a given dataset.

Request

| Field | Type | Description |
| --- | --- | --- |
| records (required) | []RecordReq | List of records to create. |

Object: RecordReq

| Field | Type | Description |
| --- | --- | --- |
| input (required) | any valid JSON type (string, int, object, etc.) | Data that serves as the starting point for an experiment. |
| expected_output | any valid JSON type (string, int, object, etc.) | Expected output. |
| metadata | json | Arbitrary user-defined metadata. |
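Assuming the same data envelope as the other requests on this page, an append-records body built from the RecordReq fields might look like:

```python
import json

# Sketch of an append-records request body; the envelope shape is assumed
# to match the other requests on this page.
payload = {
    "data": {
        "type": "datasets",
        "attributes": {
            "records": [
                {
                    "input": "What is the capital of China?",
                    "expected_output": "Beijing",
                    "metadata": {"difficulty": "easy"},
                }
            ],
        },
    }
}
body = json.dumps(payload)
```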

Response

| Field | Type | Description |
| --- | --- | --- |
| records | []Record | List of created records. |

Partially update a dataset record object. Specify the fields to update in the payload.

Request

| Field | Type | Description |
| --- | --- | --- |
| input | any valid JSON type (string, int, object, etc.) | Data that serves as the starting point for an experiment. |
| expected_output | any valid JSON type (string, int, object, etc.) | Expected output. |
| metadata | json | Arbitrary user-defined metadata. |

Response

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique record ID. |
| dataset_id | string | Unique dataset ID. |
| input | any valid JSON type (string, int, object, etc.) | Data that serves as the starting point for an experiment. |
| expected_output | any valid JSON type (string, int, object, etc.) | Expected output. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Batch delete operation.

Request

| Field | Type | Description |
| --- | --- | --- |
| record_ids (required) | []string | List of dataset record IDs to delete. |

Response

200 - OK

Experiments API

Request type: experiments

List all experiments, sorted by creation date. The most recently-created experiments are first.

Query parameters

| Parameter | Type | Description |
| --- | --- | --- |
| filter[project_id] (required if dataset not provided) | string | The ID of a project to retrieve experiments for. |
| filter[dataset_id] | string | The ID of a dataset to retrieve experiments for. |
| filter[id] | string | The ID(s) of an experiment to search for. To query for multiple experiments, use ?filter[id]=<>&filter[id]=<>. |
| filter[name] | string | The name of an experiment to search for. |
| page[cursor] | string | List results with a cursor provided in the previous query. |
| page[limit] | int | Limits the number of results. |

Response

| Field | Type | Description |
| --- | --- | --- |
| within Data | []Experiment | List of experiments. |

Object: Experiment

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique experiment ID. Set at the top level id field within the Data object. |
| project_id | string | Unique project ID. |
| dataset_id | string | Unique dataset ID. |
| name | string | Unique experiment name. |
| description | string | Experiment description. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Create an experiment. If there is an existing experiment with the same name, the API returns the existing experiment unmodified.

Request

| Field | Type | Description |
| --- | --- | --- |
| project_id (required) | string | Unique project ID. |
| dataset_id (required) | string | Unique dataset ID. |
| dataset_version | int | Dataset version. |
| name (required) | string | Unique experiment name. |
| description | string | Experiment description. |
| metadata | json | Arbitrary user-defined metadata. |
| ensure_unique | bool | If true, Datadog generates a new experiment with a unique name in the case of a conflict. Datadog recommends you set this field to true. |
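Assuming the same data envelope as the other requests on this page, a create-experiment body might look like the following (IDs are placeholders):

```python
import json

# Sketch of a create-experiment request body; IDs are placeholders.
payload = {
    "data": {
        "type": "experiments",
        "attributes": {
            "project_id": "<PROJECT_ID>",
            "dataset_id": "<DATASET_ID>",
            "dataset_version": 1,
            "name": "capitals-experiment",
            "description": "Prompt v2 against the capitals dataset",
            "ensure_unique": True,  # avoid name conflicts, as recommended
        },
    }
}
body = json.dumps(payload)
```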

Response

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique experiment ID. Set at the top level id field within the Data object. |
| project_id | string | Unique project ID. |
| dataset_id | string | Unique dataset ID. |
| name | string | Unique experiment name. |
| description | string | Experiment description. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Partially update an experiment object. Specify the fields to update in the payload.

Request

| Field | Type | Description |
| --- | --- | --- |
| dataset_id | string | Unique dataset ID. |
| name | string | Unique experiment name. |
| description | string | Experiment description. |
| metadata | json | Arbitrary user-defined metadata. |

Response

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique experiment ID. Set at the top level id field within the Data object. |
| project_id | string | Unique project ID. |
| dataset_id | string | Unique dataset ID. |
| name | string | Unique experiment name. |
| description | string | Experiment description. |
| metadata | json | Arbitrary user-defined metadata. |
| created_at | timestamp | Timestamp representing when the resource was created. |
| updated_at | timestamp | Timestamp representing when the resource was last updated. |

Batch delete operation.

Request

| Field | Type | Description |
| --- | --- | --- |
| experiment_ids (required) | []string | List of experiment IDs to delete. |

Response

200 - OK

Handles the ingestion of experiment spans and their associated evaluation metrics.

Request

| Field | Type | Description |
| --- | --- | --- |
| tags | []string | Key-value pairs of strings. |
| spans (required) | []Span | Spans that represent an evaluation. |
| metrics | []EvalMetric | Generated evaluation metrics. |

Response

202 - Accepted

Object: Span

| Field | Type | Description |
| --- | --- | --- |
| span_id (required) | string | Unique span ID. |
| trace_id | string | Trace ID. Only needed if tracing. |
| start_ns (required) | uint64 | The span's start time in nanoseconds. |
| duration (required) | uint64 | The span's duration in nanoseconds. |
| dataset_record_id | string | The dataset record referenced. |
| meta (required) | Meta | The core content of the span. |

Object: Meta

| Field | Type | Description |
| --- | --- | --- |
| error | Error | Captures errors. |
| input (required) | any valid JSON type (string, int, object, etc.) | Input value to an operation. |
| output (required) | any valid JSON type (string, int, object, etc.) | Output value of an operation. |
| expected_output | any valid JSON type (string, int, object, etc.) | Expected output value. |
| metadata | json | Arbitrary user-defined metadata. |

Object: EvalMetric

| Field | Type | Description |
| --- | --- | --- |
| span_id (required) | string | Unique span ID to join on. |
| trace_id | string | Trace ID. Only needed if tracing. |
| error | Error | Captures errors. |
| metric_type (required) | enum | Defines the metric type. Accepted values: categorical, score. |
| timestamp_ms (required) | uint64 | Timestamp at which the evaluation occurred. |
| label (required) | string | Label for the metric. |
| categorical_value (required, if type is categorical) | string | Category value of the metric. |
| score_value (required, if type is score) | float64 | Score value of the metric. |

Object: Error

| Field | Type | Description |
| --- | --- | --- |
| Message | string | Error message. |
| Stack | string | Error stack. |
| Type | string | Error type. For example, http. |
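Putting the Span, Meta, and EvalMetric objects together, a minimal ingestion body might look like the following sketch (the data envelope and type value are assumed to follow the request format above; IDs and values are placeholders):

```python
import json
import time
import uuid

# Illustrative span and metric IDs/timestamps; not real data.
span_id = uuid.uuid4().hex[:16]
now_ns = time.time_ns()

payload = {
    "data": {
        "type": "experiments",  # assumed to match the request type above
        "attributes": {
            "tags": ["team:llm-obs"],
            "spans": [
                {
                    "span_id": span_id,
                    "start_ns": now_ns,
                    "duration": 1_200_000_000,  # 1.2 s, in nanoseconds
                    "meta": {
                        "input": "What is the capital of China?",
                        "output": "Beijing",
                        "expected_output": "Beijing",
                    },
                }
            ],
            "metrics": [
                {
                    "span_id": span_id,  # joins the metric to its span
                    "metric_type": "score",
                    "timestamp_ms": now_ns // 1_000_000,
                    "label": "accuracy",
                    "score_value": 1.0,
                }
            ],
        },
    }
}
body = json.dumps(payload)
```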