Install the OpenLineage provider for both Airflow schedulers and Airflow workers by adding the following to your requirements.txt file, or wherever your Airflow dependencies are managed:
For Airflow 2.7 or later:
apache-airflow-providers-openlineage
For Airflow 2.5 & 2.6:
openlineage-airflow
Configure the OpenLineage provider. The simplest option is to set the following environment variables and make them available to the pods where your Airflow schedulers and Airflow workers run:
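For example, a minimal configuration could look like the sketch below, assuming the OpenLineage client's default HTTP transport configured through OPENLINEAGE_URL and OPENLINEAGE_API_KEY. The intake endpoint, API key, and environment name are placeholders to fill from your Datadog account and your own naming:
#!/bin/sh
# Placeholder values: replace with your Datadog intake endpoint, API key,
# and a name that identifies this Airflow environment.
export OPENLINEAGE_URL=<DD_DATA_OBSERVABILITY_INTAKE>
export OPENLINEAGE_API_KEY=<DD_API_KEY>
export AIRFLOW__OPENLINEAGE__NAMESPACE=<AIRFLOW_ENV_NAME>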
If you’re using Airflow v2.7 or v2.8, also add these two environment variables along with the previous ones. This works around an OpenLineage config issue that was fixed in apache-airflow-providers-openlineage v1.7, while Airflow v2.7 and v2.8 ship with earlier versions of the provider.
#!/bin/sh
# Required for Airflow v2.7 & v2.8 only
export AIRFLOW__OPENLINEAGE__CONFIG_PATH=""
export AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""
Check the official configuration-openlineage documentation for other supported configurations of the OpenLineage provider.
Trigger an update of your Airflow pods and wait for the rollout to finish.
Set OPENLINEAGE_CLIENT_LOGGING to DEBUG, along with the other environment variables set previously, so that the OpenLineage client and its child modules log at DEBUG level. This can be useful for troubleshooting while configuring the OpenLineage provider.
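For example, the following single export, added alongside the variables above, enables the verbose client logging described in this step:
export OPENLINEAGE_CLIENT_LOGGING=DEBUG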
If you’re using Airflow v2.7 or v2.8, also add these two environment variables to the startup script. This works around an OpenLineage config issue that was fixed in apache-airflow-providers-openlineage v1.7, while Airflow v2.7 and v2.8 ship with earlier versions of the provider.
#!/bin/sh
# Required for Airflow v2.7 & v2.8 only
export AIRFLOW__OPENLINEAGE__CONFIG_PATH=""
export AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""
Check the official configuration-openlineage documentation for other supported configurations of the OpenLineage provider.
Deploy your updated requirements.txt and Amazon MWAA startup script to the Amazon S3 folder configured for your Amazon MWAA environment.
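For example, using the AWS CLI; the bucket name and object keys below are placeholders for the locations configured on your environment, and depending on your setup you may also need to update the environment so it picks up the new file versions:
# Upload the updated files to the S3 bucket backing your MWAA environment
# (bucket name and object keys are placeholders).
aws s3 cp requirements.txt s3://<your-mwaa-bucket>/requirements.txt
aws s3 cp startup.sh s3://<your-mwaa-bucket>/startup.sh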
Optionally, set up Log Collection to correlate task logs with DAG run executions in DJM:
Ensure the execution role configured for your Amazon MWAA environment has the right permissions on the requirements.txt file and the Amazon MWAA startup script. This is required if you manage your own execution role and it’s the first time you’re adding these supporting files. See the official Amazon MWAA execution role guide for details if needed.
Set OPENLINEAGE_CLIENT_LOGGING to DEBUG in the Amazon MWAA startup script so that the OpenLineage client and its child modules log at DEBUG level. This can be useful for troubleshooting while configuring the OpenLineage provider.
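Putting these steps together, a minimal MWAA startup script could look like the following sketch. The intake endpoint, API key, and environment name are placeholders that depend on your Datadog account and deployment, and the last two exports are only needed on Airflow v2.7 and v2.8:
#!/bin/sh
# Placeholder values: replace with your Datadog intake endpoint, API key,
# and a name that identifies this Airflow environment.
export OPENLINEAGE_URL=<DD_DATA_OBSERVABILITY_INTAKE>
export OPENLINEAGE_API_KEY=<DD_API_KEY>
export AIRFLOW__OPENLINEAGE__NAMESPACE=<AIRFLOW_ENV_NAME>
export OPENLINEAGE_CLIENT_LOGGING=DEBUG
# Required for Airflow v2.7 & v2.8 only
export AIRFLOW__OPENLINEAGE__CONFIG_PATH=""
export AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""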
To set up the OpenLineage provider, define the following environment variables. You can configure these variables in your Astronomer deployment using either of the following methods:
From the Astro UI: Navigate to your deployment settings and add the environment variables directly.
In the Dockerfile: Define the environment variables in your Dockerfile to ensure they are included during the build process.
Set AIRFLOW__OPENLINEAGE__NAMESPACE to a unique name for your Airflow deployment. This allows Datadog to logically separate this deployment’s jobs from those of other Airflow deployments.
Set OPENLINEAGE_CLIENT_LOGGING to DEBUG so that the OpenLineage client and its child modules log at DEBUG level. This can be useful for troubleshooting while configuring the OpenLineage provider. An example of both variables is shown after this list.
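As an illustration, the two variables could be set as follows; the namespace value is an example only. In a Dockerfile, each line maps to an ENV instruction, and in the Astro UI you add the same key/value pairs as environment variables:
# Example namespace; choose a name that uniquely identifies this Airflow deployment.
export AIRFLOW__OPENLINEAGE__NAMESPACE="marketing-airflow"
export OPENLINEAGE_CLIENT_LOGGING="DEBUG"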
You can troubleshoot Airflow tasks that run Spark jobs more efficiently by connecting the Spark job’s run information and telemetry to the corresponding Airflow task.
To see the link between an Airflow task and the Spark application it submitted, follow these steps:
Turn off lazy loading of Airflow plugins by setting the lazy_load_plugins config to False in your airflow.cfg, or by exporting the following environment variable where your Airflow schedulers and Airflow workers run:
export AIRFLOW__CORE__LAZY_LOAD_PLUGINS='False'
Update your Airflow job’s DAG file by adding the following Spark configurations to the SparkSubmitOperator that submits your Spark application:
Once you have redeployed your Airflow environment with the updated lazy_load_plugins config and the updated DAG file, and your Airflow DAG has been re-run, go to the Data Jobs Monitoring page. Find your latest Airflow job run; the Airflow job run trace contains a span link to the trace of the launched Spark application. This makes it possible to debug issues in Airflow or Spark all in one place.