Amazon Managed Workflows for Apache Airflow — Configuration: Understanding Amazon MWAA’s Configuration Options
For anyone new to Amazon Managed Workflows for Apache Airflow (Amazon MWAA), especially those used to managing their own Apache Airflow platform, Amazon MWAA’s configuration might appear to be a bit of a black box at first. This brief post explores Amazon MWAA’s configuration: how to inspect it and how to modify it. We will use Airflow DAGs to review an MWAA environment’s airflow.cfg file, environment variables, and Python packages.
Apache Airflow is a popular open-source platform designed to schedule and monitor workflows. According to Wikipedia, Airflow was created at Airbnb in 2014 to manage the company’s increasingly complex workflows. From the beginning, the project was made open source, becoming an Apache Incubator project in 2016 and a top-level Apache Software Foundation project in 2019.
With the announcement of Amazon MWAA in November 2020, AWS customers can now focus on developing workflow automation while leaving the management of Airflow to AWS. Amazon MWAA can be used as an alternative to AWS Step Functions for workflow automation on AWS.
The Amazon MWAA service is available through the AWS Management Console, as well as through the Amazon MWAA API using the latest versions of the AWS SDK and AWS CLI. For more information on Amazon MWAA, read my last post, Running Spark Jobs on Amazon EMR with Apache Airflow.
The DAGs referenced in this post are available on GitHub. Use the following git clone command to download a copy of this post’s GitHub repository to your local environment.
git clone --branch main --single-branch --depth 1 --no-tags \
Environment variables are an essential part of an MWAA environment’s configuration. There are various ways to examine the environment variables. You could use Airflow’s BashOperator to simply call the command, env, or the PythonOperator to call a Python iterator function, as shown below. A sample DAG, dags/get_env_vars.py, is included in the project.
The DAG’s PythonOperator will iterate over the MWAA environment’s environment variables and output them to the task’s log. Below is a snippet of an example task’s log.
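As a minimal sketch of the kind of iterator function described above (not necessarily the code in dags/get_env_vars.py), the callable below prints every environment variable in sorted order. The function names format_env_vars and print_env_vars are hypothetical; only the standard library is used, so the sketch runs without Airflow installed.

```python
import os


def format_env_vars(environ=None):
    """Return the environment as sorted 'KEY=value' lines.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to exercise outside of Airflow.
    """
    source = os.environ if environ is None else environ
    return [f"{key}={value}" for key, value in sorted(source.items())]


def print_env_vars():
    """Hypothetical python_callable: anything printed ends up in the task log."""
    for line in format_env_vars():
        print(line)
```

In a DAG, print_env_vars would be passed as the python_callable argument of a PythonOperator. Keep in mind that printing every variable can write sensitive values into the task log.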
Airflow Configuration File
According to Airflow, the airflow.cfg file contains Airflow’s configuration. You can edit it to change any of the settings. The first time you run Apache Airflow, it creates an airflow.cfg configuration file in your AIRFLOW_HOME directory and attaches the configurations to your environment as environment variables.
Amazon MWAA doesn’t expose the airflow.cfg in the Apache Airflow UI of an environment. Although you can’t access it directly through the UI, you can still view the contents of the airflow.cfg file. The configuration file is located in your AIRFLOW_HOME directory (~/airflow by default).
There are multiple ways to examine your MWAA environment’s airflow.cfg file. You could use Airflow’s PythonOperator to call a Python function that reads the contents of the file, as shown below. The function uses the AIRFLOW_HOME environment variable to locate and read the airflow.cfg. A sample DAG, dags/get_airflow_cfg.py, is included in the project.
The DAG’s task will read the MWAA environment’s airflow.cfg file and output it to the task’s log. Below is a snippet of an example task’s log.
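The file-reading approach can be sketched as follows. This is an illustration under stated assumptions, not necessarily the code in dags/get_airflow_cfg.py: the function names are hypothetical, and the fallback to ~/airflow simply mirrors the default location mentioned earlier.

```python
import os
from pathlib import Path


def read_airflow_cfg():
    """Locate airflow.cfg through AIRFLOW_HOME and return its text."""
    # Assumption: fall back to ~/airflow when AIRFLOW_HOME is not set.
    airflow_home = os.environ.get("AIRFLOW_HOME", str(Path.home() / "airflow"))
    cfg_path = Path(airflow_home) / "airflow.cfg"
    return cfg_path.read_text()


def print_airflow_cfg():
    """Hypothetical python_callable: print the file so it lands in the task log."""
    print(read_airflow_cfg())
```

As with the environment variables, the printed configuration may contain sensitive values, so it is worth checking who can read the task logs.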
Customizing Airflow Configurations
While AWS doesn’t expose the airflow.cfg in the Apache Airflow UI of your environment, you can change the default Apache Airflow configuration options directly within the Amazon MWAA console and continue using all other settings in airflow.cfg. The configuration options changed in the Amazon MWAA console are translated into environment variables.
To customize the Apache Airflow configuration, change the default options directly on the Amazon MWAA console. Select Edit, add or modify configuration options and values in the Airflow configuration options menu, then select Save. For example, we can change Airflow’s default UI timezone (core.default_ui_timezone).
Once the MWAA environment is updated, which may take several minutes, view your changes by re-running the DAG, dags/get_env_vars.py. Note the new configuration item on both lines 2 and 6 of the log snippet shown below. The configuration item appears on its own (AIRFLOW__CORE__DEFAULT_UI_TIMEZONE), as well as part of the AIRFLOW_CONFIG_SECRETS dictionary environment variable.
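Airflow’s documented convention for configuration environment variables is AIRFLOW__{SECTION}__{KEY}, with a double underscore on each side of the section name. The helper below is a small sketch of that mapping; option_to_env_var is a hypothetical name, and the sketch assumes the option is written as section.key.

```python
def option_to_env_var(option):
    """Map a 'section.key' option name to Airflow's environment variable name."""
    section, _, key = option.partition(".")
    if not section or not key:
        raise ValueError(f"expected 'section.key', got {option!r}")
    # Double underscores separate the prefix, the section, and the key.
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


print(option_to_env_var("core.default_ui_timezone"))
# AIRFLOW__CORE__DEFAULT_UI_TIMEZONE
```

This makes it easy to predict which variable to look for in the task log after changing an option.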
Using the MWAA API
We can also make configuration changes using the MWAA API. For example, to change the default Airflow UI timezone, call the MWAA API’s update-environment command using the AWS CLI. Include the --airflow-configuration-options parameter, passing the core.default_ui_timezone key/value pair as a JSON blob.
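The same API call can also be made from Python. The sketch below swaps the AWS CLI for boto3’s MWAA client and its update_environment method; it assumes boto3 is installed and AWS credentials are configured. The function names, the environment name, and the timezone value in the usage comment are hypothetical placeholders.

```python
def build_update_request(environment_name, options):
    """Build keyword arguments for the MWAA UpdateEnvironment call."""
    # Option values are sent as strings keyed by 'section.key'.
    return {
        "Name": environment_name,
        "AirflowConfigurationOptions": {
            key: str(value) for key, value in options.items()
        },
    }


def update_environment(environment_name, options):
    """Send the update; needs boto3 and valid AWS credentials."""
    # Imported lazily so the request builder works without boto3 installed.
    import boto3

    client = boto3.client("mwaa")
    return client.update_environment(
        **build_update_request(environment_name, options)
    )


# Hypothetical usage (environment name and timezone are placeholders):
# update_environment(
#     "my-mwaa-environment",
#     {"core.default_ui_timezone": "America/New_York"},
# )
```

Keeping the request builder separate from the network call makes the payload easy to inspect before anything is sent to AWS.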
To review an environment’s configuration, use the
get-environment command in combination with
Below, we see an example of the output.
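For readers who prefer Python over the CLI, a comparable lookup can be sketched with boto3’s get_environment method, which returns the environment description under an Environment key. The function names are hypothetical, and running the second function assumes boto3 is installed and AWS credentials are configured.

```python
def extract_configuration_options(response):
    """Pull the Airflow configuration options out of a GetEnvironment response."""
    environment = response.get("Environment", {})
    return dict(environment.get("AirflowConfigurationOptions", {}))


def get_configuration_options(environment_name):
    """Fetch the environment and return only its configuration options."""
    # Imported lazily so the parsing helper works without boto3 installed.
    import boto3

    client = boto3.client("mwaa")
    response = client.get_environment(Name=environment_name)
    return extract_configuration_options(response)
```

The helper returns an empty dictionary when no custom options have been set, so callers do not need to special-case a missing key.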
Airflow is written in Python, and workflows are created via Python scripts. Python packages are a crucial part of an MWAA environment’s configuration. According to the documentation, an ‘extra package’ is a Python subpackage that is not included in the Apache Airflow base installed on your MWAA environment. As part of setting up an MWAA environment, you can specify the location of the requirements.txt file in the Airflow S3 bucket. Extra packages are installed using the requirements.txt file.
There are several ways to check your MWAA environment’s installed Python packages and versions. You could use Airflow’s BashOperator to call the command, python3 -m pip list. A sample DAG, dags/get_py_pkgs.py, is included in the project.
The DAG’s task will output a list of all Python packages and package versions to the task’s log. Below is a snippet of an example task’s log.
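As an alternative to shelling out to pip, the package list can also be gathered in pure Python and called from a PythonOperator. This sketch uses the standard library’s importlib.metadata module rather than the BashOperator approach described above, and it assumes Python 3.8 or later; the function name is hypothetical.

```python
from importlib import metadata


def list_installed_packages():
    """Return sorted 'name==version' strings for installed distributions."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages.add(f"{name}=={dist.version}")
    # Sort case-insensitively, similar to how pip orders its listing.
    return sorted(packages, key=str.lower)


if __name__ == "__main__":
    for package in list_installed_packages():
        print(package)
```

The name==version format matches what a requirements.txt file uses, which makes it straightforward to compare the output against the packages you asked MWAA to install.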
Understanding your Amazon MWAA environment’s airflow.cfg file, environment variables, and Python packages is important for proper Airflow platform management. In this brief post, we learned more about Amazon MWAA’s configuration: how to inspect it using DAGs and how to modify it through the Amazon MWAA console.