Use CI jobs failure analysis to identify root causes in failed jobs

Overview

This guide explains how to use CI jobs failure analysis to determine the most common root causes of failed CI jobs, which can help improve the experience of working with your CI pipelines.

Understanding CI jobs failure analysis

CI Visibility uses an LLM to generate enhanced error messages and to categorize them with a domain and subdomain, based on the relevant logs collected from every failed CI job.

Failed CI jobs with LLM-generated errors

How does CI Visibility identify the relevant logs of a CI job?

CI Visibility treats a log line as relevant when it does not appear in the logs collected from previous successful executions of the same CI job. Log relevancy is computed only for logs from failed CI jobs.

To check whether a log line was considered relevant, filter by the @relevant:true tag in the Log Explorer.
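The relevance rule above can be sketched as a set-membership check. This is only an illustration of the idea, not CI Visibility's implementation: the function name relevant_lines and the exact-match comparison are assumptions, and the real system may normalize timestamps or other variable content before comparing lines.

```python
from typing import Iterable, List


def relevant_lines(failed_log: Iterable[str],
                   successful_logs: Iterable[Iterable[str]]) -> List[str]:
    """Return lines of a failed job that never appeared in earlier successful runs.

    Hypothetical helper: lines are compared by exact match after stripping
    surrounding whitespace, which is an assumption made for this sketch.
    """
    seen = set()
    for log in successful_logs:
        seen.update(line.strip() for line in log)
    return [line for line in failed_log if line.strip() not in seen]


# Made-up sample logs for the same CI job.
previous_ok = [
    ["Cloning repository", "Running tests", "All tests passed"],
    ["Cloning repository", "Running tests", "All tests passed"],
]
failed = ["Cloning repository", "Running tests", "Cannot connect to docker daemon"]

print(relevant_lines(failed, previous_ok))
```

In this sample, only the docker daemon line is reported, because the other two lines also appear in the successful executions.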

What information does the LLM use as input?

If a failed CI job has relevant logs, the LLM uses the last 100 relevant log lines as input. Otherwise, CI Visibility sends the last 100 log lines of the job.
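The selection rule can be expressed in a few lines. The sketch below assumes the job's logs are already available as ordered lists; the names select_llm_input and MAX_INPUT_LINES are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence

MAX_INPUT_LINES = 100  # limit stated in the text above


def select_llm_input(all_lines: Sequence[str],
                     relevant: Sequence[str]) -> List[str]:
    """Pick the log lines to use as input for a failed job.

    Prefer the last 100 relevant lines; fall back to the last 100 lines
    of the whole log when no line was marked as relevant.
    """
    source = relevant if relevant else all_lines
    return list(source[-MAX_INPUT_LINES:])


# Made-up log of 250 lines with no relevant lines: the fallback applies.
log = [f"line {i}" for i in range(250)]
print(len(select_llm_input(log, [])))
```

When a job has fewer than 100 candidate lines, the slice returns all of them.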

Each log line is pre-scanned to redact any potentially sensitive information before being used.

The LLM can classify errors with similar messages into distinct yet related subdomains. For example, the error message Cannot connect to docker daemon is usually categorized under domain:platform and subdomain:network, but the LLM may sometimes classify it under subdomain:infrastructure instead.

Domains and Subdomains

Errors are categorized with a domain and subdomain:

Domains

| Domain | Description |
|---|---|
| code | Failures caused by the code being built and tested in the CI pipeline. They should be fixed by the developer that modified the code. |
| platform | Failures caused by reasons external to the code being built and tested. These failures can come from the CI provider, the underlying infrastructure, or external dependencies. They are not related to the developer code changes and should often be fixed by the team owning the whole CI system. |
| unknown | Used when the logs do not reveal a clear root cause of job failure. |

Subdomains

Each domain has its own set of subdomains.

Subdomains of the code domain:

| Subdomain | Cause | Examples |
|---|---|---|
| build | Compilation or build errors. | Compilation error in processor_test.go:28:50 |
| test | Test failures. | 7 failed tests. Error: Can't find http.request.headers.x-amzn-trace-id in span's meta. |
| quality | Format or linting failures. | Detected differences in files after running 'go fmt'. To fix, run 'go fmt' on the affected files and commit the changes. |
| security | Security violations. | Security violation: Use of weak SHA1 hash for security. Consider usedforsecurity=False. |

Subdomains of the platform domain:

| Subdomain | Cause | Examples |
|---|---|---|
| assembly | Errors in artifacts generation or assembly errors during a script execution. | Artifact generation failed due to rejected file 'domains/backend/cart-shopping-proto/mod.info' that exists in the repository. |
| deployment | Errors during deployments, or related to deployments configurations. | Subprocess command returned non-zero exit status 1 during deployment config generation. |
| infrastructure | Errors related to the infrastructure on which the job was executed. | Invalid docker image reference format for tag 'registry.gitlab.com/cart-shopping/infrastructure/backend-deploy-image:AE/create-kubectl-image'. |
| network | Errors on connectivity with other dependencies. | Connection refused when accessing localhost:8080. |
| credentials | Errors on authentication; missing or wrong credentials. | Failed to get image auth for docker.elastic.co. No credentials found. Unable to pull image 'docker.elastic.co/elasticsearch/elasticsearch:7.17.24'. |
| dependencies | Errors on installing or updating dependencies required to execute the job. | Package 'systemd-container' cannot be installed. Depends on 'libsystemd-shared' v255.4-1ubuntu8.4 but v255.4-1ubuntu8.5 is to be installed. |
| git | Errors executing git commands. | Automatic merge failed due to conflicts between branches 'cart-shopping-new-feature' and 'staging'. |
| checks | Errors on required fulfillment of checks during the CI job execution. | Release note not found during changelog validation |
| setup | Errors on setting up the CI job. | Execution failed during the TLS setup or client dialing process. |
| script | Syntactic errors in the script in the CI job. | No tests ran due to file or directory not found. |

Subdomains of the unknown domain:

| Subdomain | Description | Example |
|---|---|---|
| unknown | Error could not be categorized. | Job failed with exit code 1. View full logs or trace. |
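For scripts that post-process these categories, the taxonomy can be held in a small lookup structure. The sketch below assumes that the three subdomain tables above correspond, in order, to the code, platform, and unknown domains; the names TAXONOMY and is_valid_category are hypothetical.

```python
from typing import Dict, Set

# Domain -> subdomains, transcribed from the tables above.
# Assumption: the tables map, in order, to code, platform, and unknown.
TAXONOMY: Dict[str, Set[str]] = {
    "code": {"build", "test", "quality", "security"},
    "platform": {
        "assembly", "deployment", "infrastructure", "network", "credentials",
        "dependencies", "git", "checks", "setup", "script",
    },
    "unknown": {"unknown"},
}


def is_valid_category(domain: str, subdomain: str) -> bool:
    """Check that a subdomain belongs to the given domain."""
    return subdomain in TAXONOMY.get(domain, set())


print(is_valid_category("platform", "network"))
print(is_valid_category("code", "network"))
```

A check like this can flag unexpected domain and subdomain combinations before they reach a dashboard or report.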

Supported CI providers

CI jobs failure analysis is available for the following CI providers:

Note: You must enable CI job logs collection, and the logs must be indexed. To set up CI job logs collection, select your CI provider in Pipeline Visibility and follow the instructions to collect job logs.

If you are interested in CI jobs failure analysis but your CI provider is not supported yet, fill out this form.

Identify the most recurrent errors in your CI pipelines

Using the CI Health page

CI Health provides a high-level overview of the health and performance of your CI pipelines. It helps DevOps and engineering teams monitor CI jobs, detect failures, and optimize build performance.

On this page, you can see a breakdown of the errors in your CI pipelines split by error domain. Click on a CI pipeline, and check the Breakdown column in the Failed Executions section.

CI Job Failure analysis breakdown in CI Health

Using facets

Use the facets @error.message, @error.domain, and @error.subdomain to identify the most recurrent errors in your CI pipelines. Using those facets, you can create custom dashboards and notebooks.
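Outside the UI, the same grouping can be applied to job data you have already retrieved. The sketch below assumes a hypothetical list of failed-job records, each a dictionary keyed by the facet names without the leading @; both the record shape and the function name most_recurrent_errors are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def most_recurrent_errors(jobs: Iterable[Dict[str, str]],
                          top: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Count failed jobs per (domain, subdomain) pair and return the top entries."""
    counts = Counter(
        (job["error.domain"], job["error.subdomain"]) for job in jobs
    )
    return counts.most_common(top)


# Made-up sample records.
failed_jobs = [
    {"error.domain": "platform", "error.subdomain": "network"},
    {"error.domain": "code", "error.subdomain": "test"},
    {"error.domain": "platform", "error.subdomain": "network"},
    {"error.domain": "platform", "error.subdomain": "git"},
    {"error.domain": "code", "error.subdomain": "test"},
    {"error.domain": "platform", "error.subdomain": "network"},
]

for (domain, subdomain), count in most_recurrent_errors(failed_jobs):
    print(f"{domain}/{subdomain}: {count}")
```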

Failed CI Jobs filtered by error.domain and error.subdomain

These facets are available only when the query includes ci_level:job. If CI jobs failure analysis can't be computed (for example, if you are not using a supported CI provider), these facets contain the error information coming from the CI provider.
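A query that combines ci_level:job with the error facets can be assembled as a plain string. The helper below is a hypothetical convenience function, and it assumes that space-separated terms are combined with an implicit AND in the search bar.

```python
def build_job_query(domain: str = "", subdomain: str = "") -> str:
    """Build a search query for failed-job facets.

    ci_level:job is always included because the error facets
    are only available at the job level.
    """
    parts = ["ci_level:job"]
    if domain:
        parts.append(f"@error.domain:{domain}")
    if subdomain:
        parts.append(f"@error.subdomain:{subdomain}")
    return " ".join(parts)


print(build_job_query("platform", "network"))
# ci_level:job @error.domain:platform @error.subdomain:network
```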

Using the dashboard template

You can import the CI Visibility - CI Jobs Failure Analysis dashboard template:

  1. Open the civisibility-ci-jobs-failure-analysis-dashboard.json dashboard template and copy its contents to the clipboard.
  2. Create a New Dashboard in Datadog.
  3. Paste the copied content into the new dashboard.
  4. Save the dashboard.
CI jobs failure analysis dashboard
