DevOps for DataOps: Building a CI/CD Pipeline for Apache Airflow DAGs
Posted by Gary A. Stafford in Analytics, AWS, Big Data, Build Automation, Cloud, Continuous Delivery, DevOps, Python, Software Development, Technology Consulting on December 14, 2021
Build an effective CI/CD pipeline to test and deploy your Apache Airflow DAGs to Amazon MWAA using GitHub Actions
Introduction
In this post, we will learn how to use GitHub Actions to build an effective CI/CD workflow for our Apache Airflow DAGs. We will use the DevOps concepts of Continuous Integration and Continuous Delivery to automate the testing and deployment of Airflow DAGs to Amazon Managed Workflows for Apache Airflow (Amazon MWAA) on AWS.

Technologies
Apache Airflow
According to the documentation, Apache Airflow is an open-source platform to author, schedule, and monitor workflows programmatically. With Airflow, you author workflows as Directed Acyclic Graphs (DAGs) of tasks written in Python.
Amazon Managed Workflows for Apache Airflow
According to AWS, Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a highly available, secure, and fully-managed workflow orchestration for Apache Airflow. MWAA automatically scales its workflow execution capacity to meet your needs and is integrated with AWS security services to help provide fast and secure access to data.

GitHub Actions
According to GitHub, GitHub Actions makes it easy to automate software workflows with CI/CD. GitHub Actions allow you to build, test, and deploy code right from GitHub. GitHub Actions are workflows triggered by GitHub events like push, issue creation, or a new release. You can leverage GitHub Actions prebuilt and maintained by the community.

If you are new to GitHub Actions, I recommend my previous post, Continuous Integration and Deployment of Docker Images using GitHub Actions.
Terminology
DataOps
According to Wikipedia, DataOps is an automated, process-oriented methodology used by analytic and data teams to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new approach to data analytics.
DataOps applies to the entire data lifecycle from data preparation to reporting and recognizes the interconnected nature of the data analytics team and IT operations. DataOps incorporates the Agile methodology to shorten the software development life cycle (SDLC) of analytics development.
DevOps
According to Wikipedia, DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality.
DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality. -Wikipedia
Fail Fast
According to Wikipedia, a fail-fast system is one that immediately reports any condition that is likely to indicate a failure. Using the DevOps concept of fail fast, we build steps into our workflows to uncover errors sooner in the SDLC. We shift testing as far to the left as possible (referring to the pipeline of steps moving from left to right) and test at multiple points along the way.
Source Code
All source code for this demonstration, including the GitHub Actions, Pytest unit tests, and Git Hooks, is open-sourced and located on GitHub.
Architecture
The diagram below represents the architecture for a recent blog post and video demonstration, Lakehouse Automation on AWS with Apache Airflow. The post and video show how to programmatically load and upload data from Amazon Redshift to an Amazon S3-based data lake using Apache Airflow.
In this post, we will review how the DAGs from the previous post were developed, tested, and deployed to MWAA using a variety of progressively more effective CI/CD workflows. The workflows demonstrated could also be easily applied to other Airflow resources in addition to DAGs, such as SQL scripts, configuration and data files, Python requirement files, and plugins.
Workflows
No DevOps
Below we see a minimally viable workflow for loading DAGs into Amazon MWAA, which does not use the principles of CI/CD. Changes are made in the local Airflow developer’s environment. The modified DAGs are copied directly to the Amazon S3 bucket and are then automatically synced with Amazon MWAA, barring any errors. Those changes are also (hopefully) pushed back to the centralized version control or source code management (SCM) system, which is GitHub in this post.

There are at least two significant issues with this error-prone workflow. First, the DAGs can easily fall out of sync between the Amazon S3 bucket and GitHub. Copying or syncing the DAGs to S3 and pushing the DAGs to GitHub are two independent steps. A developer might continue making changes and pushing DAGs to S3 without pushing to GitHub or vice versa.
Secondly, the DevOps concept of fail-fast is missing. The first time you know your DAG contains errors is likely when it is synced to MWAA and throws an Import Error. By then, the DAG has already been copied to S3, synced to MWAA, and possibly pushed to GitHub, which other developers could then pull.

GitHub Actions
A significant step up from the previous workflow is using GitHub Actions to test and deploy your code after pushing it to GitHub. Although in this workflow, code is still ‘pushed straight to Trunk’ (the main branch in GitHub) and risks other developers in a collaborative environment pulling potentially erroneous code, you have far less chance of DAG errors making it to MWAA.

Using GitHub Actions, you also eliminate human error that could result in the changes to DAGs not being synced to Amazon S3. Lastly, using this workflow improves security by eliminating the need to provide direct access to the Airflow Amazon S3 bucket to Airflow Developers.
Types of Tests
The first GitHub Action, test_dags.yml, is triggered on a push to the dags directory in the main branch of the repository. It is also triggered whenever a pull request is made for the main branch. The first GitHub Action runs a battery of tests, including checking Python dependencies, code style, code quality, DAG import errors, and unit tests. The tests catch issues with DAGs before they are synced to S3 by a second GitHub Action.
name: Test DAGs
on:
  push:
    paths:
      - 'dags/**'
  pull_request:
    branches:
      - main
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.7'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements/requirements.txt
          pip check
      - name: Lint with Flake8
        run: |
          pip install flake8
          flake8 --ignore E501 dags --benchmark -v
      - name: Confirm Black code compliance (psf/black)
        run: |
          pip install pytest-black
          pytest dags --black -v
      - name: Test with Pytest
        run: |
          pip install pytest
          cd tests || exit
          pytest tests.py -v

Python Dependencies
The first test installs the modules listed in the requirements.txt file used locally to develop the application. This test is designed to uncover any missing or conflicting modules.
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements/requirements.txt
    pip check
It is essential to develop your DAGs against the same version of Python and with the same versions of the Python modules used in your Airflow environment. You can use the BashOperator to run shell commands to obtain the versions of Python and the modules installed in your Airflow environment:
python3 --version; python3 -m pip list
A snippet of log output from DAG showing Python version and Python modules available in MWAA 2.0.2:

The latest stable release of Airflow is currently version 2.2.2, released 2021-11-15. However, as of December 2021, Amazon’s latest version of MWAA 2.x is version 2.0.2, released 2021-04-19. MWAA 2.0.2 currently runs Python3 version 3.7.10.
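One way to catch an interpreter mismatch early is to check it in code before running the rest of the test suite. The following is a minimal sketch, assuming the Python 3.7 runtime quoted above for MWAA 2.0.2; the constant and function names are hypothetical and are not part of the project's repository.

```python
import sys

# Assumption: the target MWAA environment runs Python 3.7, as noted above.
MWAA_PYTHON = (3, 7)


def python_matches(expected, actual=None):
    """Return True when the major.minor version of `actual` equals `expected`.

    `actual` defaults to the interpreter running this code.
    """
    if actual is None:
        actual = sys.version_info
    return tuple(actual[:2]) == tuple(expected)
```

A pytest test could simply assert python_matches(MWAA_PYTHON), so that a developer on a newer local interpreter sees the mismatch before any DAG-level test runs.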

Flake8
Known as ‘your tool for style guide enforcement,’ Flake8 is described as the modular source code checker. It is a command-line utility for enforcing style consistency across Python projects. Flake8 is a wrapper around PyFlakes, pycodestyle, and Ned Batchelder’s McCabe script. The module, pycodestyle, is a tool to check your Python code against some of the style conventions in PEP 8.
Flake8 is highly configurable, with options to ignore specific rules if not required by your development team. For example, in this demonstration, I intentionally ignored rule E501 (‘line too long’), which by default flags any line longer than 79 characters.
- name: Lint with Flake8
  run: |
    pip install flake8
    flake8 --ignore E501 dags --benchmark -v
Black
Black is known as ‘the uncompromising code formatter.’ Python code formatted using Black (referred to as Blackened code) looks the same regardless of the project you’re reading. Formatting becomes transparent, allowing teams to focus on the content instead. Black makes code review faster by producing the smallest diffs possible, assuming all developers are using black to format their code.
The Airflow DAGs in this GitHub repository are automatically formatted with black using a pre-commit Git Hook before being committed and pushed to GitHub. The test confirms black code compliance.
- name: Confirm Black code compliance (psf/black)
  run: |
    pip install pytest-black
    pytest dags --black -v
Pytest
The pytest framework describes itself as a mature, fully-featured Python testing tool that helps you write better programs. The Pytest framework makes it easy to write small tests yet scales to support complex functional testing for applications and libraries.
The GitHub Action in the GitHub project, test_dags.yml, calls the tests.py file, also contained in the project.
- name: Test with Pytest
  run: |
    pip install pytest
    cd tests || exit
    pytest tests.py -v
The tests.py file contains several pytest unit tests. The tests are based on my project requirements; your tests will vary. These tests confirm that all DAGs:
- Do not contain DAG Import Errors (test catches 75% of my errors);
- Follow specific file naming conventions;
- Include a description and an owner other than ‘airflow’;
- Contain required project tags;
- Do not send emails (my projects use SNS or Slack for notifications);
- Do not retry more than three times;
import os
import sys

import pytest
from airflow.models import DagBag

sys.path.append(os.path.join(os.path.dirname(__file__), "../dags"))
sys.path.append(os.path.join(os.path.dirname(__file__), "../dags/utilities"))

# Airflow variables called from DAGs under test are stubbed out
os.environ["AIRFLOW_VAR_DATA_LAKE_BUCKET"] = "test_bucket"
os.environ["AIRFLOW_VAR_ATHENA_QUERY_RESULTS"] = "SELECT 1;"
os.environ["AIRFLOW_VAR_SNS_TOPIC"] = "test_topic"
os.environ["AIRFLOW_VAR_REDSHIFT_UNLOAD_IAM_ROLE"] = "test_role_1"
os.environ["AIRFLOW_VAR_GLUE_CRAWLER_IAM_ROLE"] = "test_role_2"


@pytest.fixture(params=["../dags/"])
def dag_bag(request):
    return DagBag(dag_folder=request.param, include_examples=False)


def test_no_import_errors(dag_bag):
    assert not dag_bag.import_errors


def test_requires_tags(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tags


def test_requires_specific_tag(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        try:
            assert dag.tags.index("data lake demo") >= 0
        except ValueError:
            assert dag.tags.index("redshift demo") >= 0


def test_desc_len_greater_than_fifteen(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert len(dag.description) > 15


def test_owner_len_greater_than_five(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert len(dag.owner) > 5


def test_owner_not_airflow(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert str.lower(dag.owner) != "airflow"


def test_no_emails_on_retry(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert not dag.default_args["email_on_retry"]


def test_no_emails_on_failure(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert not dag.default_args["email_on_failure"]


def test_three_or_less_retries(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args["retries"] <= 3


def test_dag_id_contains_prefix(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert str.lower(dag_id).find("__") != -1


def test_dag_id_requires_specific_prefix(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert str.lower(dag_id).startswith("data_lake__") \
            or str.lower(dag_id).startswith("redshift_demo__")
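The tests above enforce the naming convention on DAG IDs. A similar check can be applied to the DAG file names themselves using only the standard library. The following is a rough sketch, assuming file names are expected to carry the same two prefixes as the DAG IDs; that assumption, the constant, and the function name are illustrative and do not come from the project.

```python
import os

# Assumption: file names mirror the DAG ID prefixes tested above.
ALLOWED_PREFIXES = ("data_lake__", "redshift_demo__")


def misnamed_dag_files(dag_folder, prefixes=ALLOWED_PREFIXES):
    """Return a sorted list of .py files in dag_folder lacking an allowed prefix."""
    bad = []
    for name in os.listdir(dag_folder):
        # Skip sub-directories, non-Python files, and package markers.
        if not name.endswith(".py") or name == "__init__.py":
            continue
        # str.startswith accepts a tuple and matches any of its entries.
        if not name.lower().startswith(prefixes):
            bad.append(name)
    return sorted(bad)
```

In a pytest module, a test could then assert that misnamed_dag_files("../dags/") returns an empty list.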
If you are building custom Airflow Operators, additional unit, functional, and integration tests are recommended.
Fork and Pull
We can improve on the practice of pushing directly to Trunk by implementing one of two collaborative development models, recommended by GitHub:
- The Shared repository model: uses ‘topic’ branches, which are reviewed, approved, and merged into the main branch.
- Fork and pull model: a repo is forked, changes are made, a pull request is created, the request is reviewed, and if approved, merged into the main branch.
In the fork and pull model, we create a fork of the DAG repository where we make our changes. We then commit and push those changes back to the forked repository. When ready, we create a pull request. If the pull request is approved and passes all the tests, it is manually or automatically merged into the main branch. DAGs are then synced to S3 and, eventually, to MWAA. I usually prefer to trigger merges manually once all tests have passed.
The fork and pull model greatly reduces the chance that bad code is merged to the main branch before passing all tests.

Syncing DAGs to S3
The second GitHub Action in the GitHub project, sync_dags.yml, is triggered when the previous Action, test_dags.yml, completes successfully, or in the case of the fork and pull method, when the merge to the main branch is successful.
name: Sync DAGs
on:
  workflow_run:
    workflows:
      - 'Test DAGs'
    types:
      - completed
  pull_request:
    types:
      - closed
jobs:
  deploy:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - uses: actions/checkout@master
      - uses: jakejarvis/s3-sync-action@master
        env:
          AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: 'us-east-1'
          SOURCE_DIR: 'dags'
          DEST_DIR: 'dags'
The GitHub Action, sync_dags.yml, requires three GitHub encrypted secrets, created in advance and associated with the GitHub repository. According to GitHub, secrets are encrypted environment variables you create in an organization, repository, or repository environment. Encrypted secrets allow you to store sensitive information, such as access tokens, in your repository. The secrets that you create are available to use in GitHub Actions workflows.

The DAGs are synced to Amazon S3 and, eventually, automatically synced to MWAA.

Local Testing and Git Hooks
To further improve your CI/CD workflows, you should consider using Git Hooks. Using Git Hooks, we can ensure code is tested locally before committing and pushing changes to GitHub. Testing locally allows us to fail faster, catching errors during development instead of once code is pushed to GitHub.

According to the documentation, Git has a way to fire off custom scripts when certain important actions occur. There are two types of hooks: client-side and server-side. Client-side hooks are triggered by operations such as committing and merging, while server-side hooks run on network operations such as receiving pushed commits.
You can use these hooks for all sorts of reasons. I often use a client-side pre-commit hook to format DAGs using black. Using a client-side pre-push Git Hook, we will ensure that tests are run before pushing the DAGs to GitHub. According to Git, the pre-push hook runs when the git push command is executed after the remote refs have been updated but before any objects have been transferred. You can use it to validate a set of ref updates before a push occurs. A non-zero exit code will abort the push. The tests could instead be run as part of the pre-commit hook if they are not too time-consuming.
To use the pre-push hook, create the following file within the local repository, .git/hooks/pre-push:
#!/bin/sh
# do nothing if there are no commits to push
if [ -z "$(git log @{u}..)" ]; then
    exit 0
fi
sh ./run_tests_locally.sh
Then, run the following chmod command to make the hook executable:
chmod 755 .git/hooks/pre-push
The pre-push hook runs the shell script, run_tests_locally.sh. The script executes nearly identical tests, locally, as the GitHub Action, test_dags.yml, does remotely on GitHub:
#!/bin/sh
echo "Starting Flake8 test..."
flake8 --ignore E501 dags --benchmark || exit 1
echo "Starting Black test..."
python3 -m pytest --cache-clear
python3 -m pytest dags/ --black -v || exit 1
echo "Starting Pytest tests..."
cd tests || exit
python3 -m pytest tests.py -v || exit 1
echo "All tests completed successfully! 🥳"
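For teams that prefer to keep their tooling in Python, the same fail-fast sequence can be sketched with the standard library's subprocess module. This is an illustrative alternative and not a script from the project; the function name and the (label, command, working directory) tuple layout are assumptions.

```python
import subprocess
import sys


def run_checks(checks):
    """Run (label, command, cwd) checks in order and stop at the first failure.

    Returns 0 when every command succeeds, otherwise the failing exit code,
    which a Git hook can pass straight to sys.exit() to abort the push.
    """
    for label, command, cwd in checks:
        print(f"Starting {label}...")
        result = subprocess.run(command, cwd=cwd)
        if result.returncode != 0:
            print(f"{label} failed with exit code {result.returncode}")
            return result.returncode
    print("All checks completed successfully!")
    return 0


# Commands mirror run_tests_locally.sh; a cwd of None means the current directory.
CHECKS = [
    ("Flake8 test", ["flake8", "--ignore", "E501", "dags", "--benchmark"], None),
    ("Black test", [sys.executable, "-m", "pytest", "dags/", "--black", "-v"], None),
    ("Pytest tests", [sys.executable, "-m", "pytest", "tests.py", "-v"], "tests"),
]
```

A hook script would end with sys.exit(run_checks(CHECKS)). As in the shell version, any non-zero exit code stops the remaining checks from running.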
References
Here are some additional references for testing and deploying Airflow DAGs and the use of GitHub Actions:
- Astronomer: Testing Airflow DAGs (documentation)
- Astronomer: Testing Airflow to Bullet Proof Your Code (YouTube video)
- GitHub: Building and testing Python (documentation)
- Manning: Chapter 9 of Data Pipelines with Apache Airflow
This blog represents my own viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
Amazon Managed Workflows for Apache Airflow — Configuration: Understanding Amazon MWAA’s Configuration Options
Posted by Gary A. Stafford in Software Development on December 29, 2020
Introduction
For anyone new to Amazon Managed Workflows for Apache Airflow (Amazon MWAA), especially those used to managing their own Apache Airflow platform, Amazon MWAA’s configuration might appear to be a bit of a black box at first. This brief post will explore Amazon MWAA’s configuration: how to inspect it and how to modify it. We will use Airflow DAGs to review an MWAA environment’s airflow.cfg file, environment variables, and Python packages.
Amazon MWAA
Apache Airflow is a popular open-source platform designed to schedule and monitor workflows. According to Wikipedia, Airflow was created at Airbnb in 2014 to manage the company’s increasingly complex workflows. From the beginning, the project was made open source, becoming an Apache Incubator project in 2016 and a top-level Apache Software Foundation project in 2019.
With the announcement of Amazon MWAA in November 2020, AWS customers can now focus on developing workflow automation while leaving the management of Airflow to AWS. Amazon MWAA can be used as an alternative to AWS Step Functions for workflow automation on AWS.
The Amazon MWAA service is available using the AWS Management Console, as well as the Amazon MWAA API using the latest versions of the AWS SDK and AWS CLI. For more information on Amazon MWAA, read my last post, Running Spark Jobs on Amazon EMR with Apache Airflow.

Source Code
The DAGs referenced in this post are available on GitHub. Using this git clone command, download a copy of this post’s GitHub repository to your local environment.
git clone --branch main --single-branch --depth 1 --no-tags \
https://github.com/garystafford/aws-airflow-demo.git
Accessing Configuration
Environment Variables
Environment variables are an essential part of an MWAA environment’s configuration. There are various ways to examine the environment variables. You could use Airflow’s BashOperator to simply call the command, env, or the PythonOperator to call a Python iterator function, as shown below. A sample DAG, dags/get_env_vars.py, is included in the project.
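The callable behind such a PythonOperator can be very small. The following is a minimal sketch of an iterator function of this kind, not the code from dags/get_env_vars.py; the function name is hypothetical, and in a DAG it would be supplied as the operator's python_callable.

```python
import os


def print_env_vars():
    """Print every environment variable, sorted by name, and return the names.

    Anything printed by a task's callable ends up in that task's log.
    """
    names = sorted(os.environ)
    for name in names:
        print(f"{name}={os.environ[name]}")
    return names
```

Because environment variables may hold credentials or connection strings, it is worth considering who can read the task logs before running a task like this.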
The DAG’s PythonOperator will iterate over the MWAA environment’s environment variables and output them to the task’s log. Below is a snippet of an example task’s log.
Airflow Configuration File
According to Airflow, the airflow.cfg file contains Airflow’s configuration. You can edit it to change any of the settings. The first time you run Apache Airflow, it creates an airflow.cfg configuration file in your AIRFLOW_HOME directory and attaches the configurations to your environment as environment variables.
Amazon MWAA doesn’t expose the airflow.cfg in the Apache Airflow UI of an environment. Although you can’t access it directly, you can view the airflow.cfg file. The configuration file is located in your AIRFLOW_HOME directory, /usr/local/airflow (~/airflow by default).
There are multiple ways to examine your MWAA environment’s airflow.cfg file. You could use Airflow’s PythonOperator to call a Python function that reads the contents of the file, as shown below. The function uses the AIRFLOW_HOME environment variable to locate and read the airflow.cfg. A sample DAG, dags/get_airflow_cfg.py, is included in the project.
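A function of the kind described can be sketched in a few lines. This is an illustration rather than the code in dags/get_airflow_cfg.py; the function name and the fallback to ~/airflow when AIRFLOW_HOME is unset are assumptions.

```python
import os


def read_airflow_cfg(default_home="~/airflow"):
    """Return the text of the airflow.cfg file found under AIRFLOW_HOME."""
    # Fall back to Airflow's documented default home if the variable is unset.
    home = os.environ.get("AIRFLOW_HOME", os.path.expanduser(default_home))
    cfg_path = os.path.join(home, "airflow.cfg")
    with open(cfg_path) as cfg_file:
        return cfg_file.read()
```

When used as a PythonOperator callable, printing the returned text writes the configuration to the task's log.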
The DAG’s task will read the MWAA environment’s airflow.cfg file and output it to the task’s log. Below is a snippet of an example task’s log.
Customizing Airflow Configurations
While AWS doesn’t expose the airflow.cfg in the Apache Airflow UI of your environment, you can change the default Apache Airflow configuration options directly within the Amazon MWAA console and continue using all other settings in airflow.cfg. The configuration options changed in the Amazon MWAA console are translated into environment variables.
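Airflow reads configuration overrides from environment variables named AIRFLOW__{SECTION}__{KEY}, with double underscores around the section name. The helper below is a small sketch of that naming rule, useful for working out which variable to look for after changing an option; the function name is hypothetical.

```python
def config_option_to_env_var(option):
    """Map a 'section.key' option to its AIRFLOW__SECTION__KEY variable name."""
    # Split on the first dot only, in case the key itself contains dots.
    section, key = option.split(".", 1)
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# config_option_to_env_var("core.default_ui_timezone")
# returns "AIRFLOW__CORE__DEFAULT_UI_TIMEZONE"
```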
To customize the Apache Airflow configuration, change the default options directly on the Amazon MWAA console. Select Edit, add or modify configuration options and values in the Airflow configuration options menu, then select Save. For example, we can change Airflow’s default timezone (core.default_ui_timezone) to America/New_York.

Once the MWAA environment is updated, which may take several minutes, view your changes by re-running the DAG, dags/get_env_vars.py. Note the new configuration item on both lines 2 and 6 of the log snippet shown below. The configuration item appears on its own (AIRFLOW__CORE__DEFAULT_UI_TIMEZONE), as well as part of the AIRFLOW_CONFIG_SECRETS dictionary environment variable.
Using the MWAA API
We can also make configuration changes using the MWAA API. For example, to change the default Airflow UI timezone, call the MWAA API’s update-environment command using the AWS CLI. Include the --airflow-configuration-options parameter, passing the core.default_ui_timezone key/value pair as a JSON blob.
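The same change can be made from Python with the AWS SDK. The following is a sketch that assumes a boto3 MWAA client, created with boto3.client("mwaa"), is passed in; the function names and the environment name used in the usage note are hypothetical.

```python
def build_timezone_update(environment_name, timezone):
    """Build the keyword arguments for an MWAA UpdateEnvironment request."""
    return {
        "Name": environment_name,
        "AirflowConfigurationOptions": {"core.default_ui_timezone": timezone},
    }


def update_timezone(mwaa_client, environment_name, timezone):
    """Send the update using a boto3 MWAA client supplied by the caller."""
    request = build_timezone_update(environment_name, timezone)
    return mwaa_client.update_environment(**request)
```

For example, update_timezone(boto3.client("mwaa"), "MyAirflowEnvironment", "America/New_York") would submit the request. Keeping the request builder separate from the call makes the payload easy to inspect or test without touching AWS.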
To review an environment’s configuration, use the get-environment command in combination with jq.
Below, we see an example of the output.
Python Packages
Airflow is written in Python, and workflows are created via Python scripts. Python packages are a crucial part of an MWAA environment’s configuration. According to the documentation, an ‘extra package’ is a Python subpackage that is not included in the Apache Airflow base, installed on your MWAA environment. As part of setting up an MWAA environment, you can specify the location of the requirements.txt file in the Airflow S3 bucket. Extra packages are installed using the requirements.txt file.

There are several ways to check your MWAA environment’s installed Python packages and versions. You could use Airflow’s BashOperator to call the command, python3 -m pip list. A sample DAG, dags/get_py_pkgs.py, is included in the project.
The DAG’s task will output a list of all Python packages and package versions to the task’s log. Below is a snippet of an example task’s log.
Conclusion
Understanding your Amazon MWAA environment’s airflow.cfg file, environment variables, and Python packages is important for proper Airflow platform management. In this brief post, we learned more about Amazon MWAA’s configuration: how to inspect it using DAGs and how to modify it through the Amazon MWAA console.
Running Spark Jobs on Amazon EMR with Apache Airflow: Using the new Amazon Managed Workflows for Apache Airflow (Amazon MWAA) Service on AWS
Posted by Gary A. Stafford in AWS, Bash Scripting, Big Data, Build Automation, Cloud, DevOps, Python, Software Development on December 24, 2020
Introduction
In the first post of this series, we explored several ways to run PySpark applications on Amazon EMR using AWS services, including AWS CloudFormation, AWS Step Functions, and the AWS SDK for Python. This second post in the series will examine running Spark jobs on Amazon EMR using the recently announced Amazon Managed Workflows for Apache Airflow (Amazon MWAA) service.
Amazon EMR
According to AWS, Amazon Elastic MapReduce (Amazon EMR) is a cloud-based big data platform for processing vast amounts of data using common open-source tools such as Apache Spark, Hive, HBase, Flink, Hudi, Zeppelin, Jupyter, and Presto. Using Amazon EMR, data analysts, engineers, and scientists are free to explore, process, and visualize data. EMR takes care of provisioning, configuring, and tuning the underlying compute clusters, allowing you to focus on running analytics.

Users interact with EMR in a variety of ways, depending on their specific requirements. For example, you might create a transient EMR cluster, execute a series of data analytics jobs using Spark, Hive, or Presto, and immediately terminate the cluster upon job completion. You only pay for the time the cluster is up and running. Alternatively, for time-critical workloads or continuously high volumes of jobs, you could choose to create one or more persistent, highly available EMR clusters. These clusters automatically scale compute resources horizontally, including the use of EC2 Spot instances, to meet processing demands, maximizing performance and cost-efficiency.
AWS currently offers 5.x and 6.x versions of Amazon EMR. Each major and minor release of Amazon EMR offers incremental versions of nearly 25 different, popular open-source big-data applications to choose from, which Amazon EMR will install and configure when the cluster is created. The latest Amazon EMR releases are Amazon EMR Release 6.2.0 and Amazon EMR Release 5.32.0.
Amazon MWAA
Apache Airflow is a popular open-source platform designed to schedule and monitor workflows. According to Wikipedia, Airflow was created at Airbnb in 2014 to manage the company’s increasingly complex workflows. From the beginning, the project was made open source, becoming an Apache Incubator project in 2016 and a top-level Apache Software Foundation project (TLP) in 2019.
Many organizations build, manage, and maintain Apache Airflow on AWS using compute services such as Amazon EC2 or Amazon EKS. With the announcement of Amazon Managed Workflows for Apache Airflow (Amazon MWAA) in November 2020, AWS customers can now focus on developing workflow automation, while leaving the management of Airflow to AWS. Amazon MWAA can be used as an alternative to AWS Step Functions for workflow automation on AWS.

Apache recently announced the release of Airflow 2.0.0 on December 17, 2020. The latest 1.x version of Airflow is 1.10.14, released December 12, 2020. However, at the time of this post, Amazon MWAA was running Airflow 1.10.12, released August 25, 2020. Ensure that when you are developing workflows for Amazon MWAA, you are using the correct Apache Airflow 1.10.12 documentation.
The Amazon MWAA service is available using the AWS Management Console, as well as the Amazon MWAA API using the latest versions of the AWS SDK and AWS CLI.
Airflow has a mechanism that allows you to expand its functionality and integrate with other systems. Given its integration capabilities, Airflow has extensive support for AWS, including Amazon EMR, Amazon S3, AWS Batch, Amazon Redshift, Amazon DynamoDB, AWS Lambda, Amazon Kinesis, and Amazon SageMaker. Outside of support for Amazon S3, most AWS integrations can be found in the Hooks, Secrets, Sensors, and Operators of the Airflow codebase’s contrib section.
Getting Started
Source Code
Using this git clone command, download a copy of this post’s GitHub repository to your local environment.
git clone --branch main --single-branch --depth 1 --no-tags \
https://github.com/garystafford/aws-airflow-demo.git
Preliminary Steps
This post assumes the reader has completed the demonstration in the previous post, Running PySpark Applications on Amazon EMR: Methods for Interacting with PySpark on Amazon Elastic MapReduce. This post will re-use many of the last post’s AWS resources, including the EMR VPC, Subnets, Security Groups, AWS Glue Data Catalog, Amazon S3 buckets, EMR Roles, EC2 key pair, AWS Systems Manager Parameter Store parameters, PySpark applications, and Kaggle datasets.
Configuring Amazon MWAA
The easiest way to create a new MWAA Environment is through the AWS Management Console. I strongly suggest that you review the pricing for Amazon MWAA before continuing. The service can be quite costly to operate, even when idle, with the smallest Environment class potentially running into the hundreds of dollars per month.

Using the Console, create a new Amazon MWAA Environment. The Amazon MWAA interface will walk you through the creation process. Note the current ‘Airflow version’, 1.10.12.

Amazon MWAA requires an Amazon S3 bucket to store Airflow assets. Create a new Amazon S3 bucket. According to the documentation, the bucket must start with the prefix airflow-. You must also enable Bucket Versioning on the bucket. Specify a dags folder within the bucket to store Airflow’s Directed Acyclic Graphs (DAGs). You can leave the next two options blank since we have no additional Airflow plugins or additional Python packages to install.

With Amazon MWAA, your data is secure by default as workloads run within their own Amazon Virtual Private Cloud (Amazon VPC). As part of the MWAA Environment creation process, you are given the option to have AWS create an MWAA VPC CloudFormation stack.

For this demonstration, choose to have MWAA create a new VPC and associated networking resources.

The MWAA CloudFormation stack contains approximately 22 AWS resources, including a VPC, a pair of public and private subnets, route tables, an Internet Gateway, two NAT Gateways, and associated Elastic IPs (EIP). See the MWAA documentation for more details.


As part of the Amazon MWAA Networking configuration, you must decide if you want web access to Airflow to be public or private. The details of the network configuration can be found in the MWAA documentation. I am choosing public webserver access for this demonstration, but the recommended choice is private for greater security. With the public option, AWS still requires IAM authentication to sign in to the AWS Management Console in order to access the Airflow UI.
You must select an existing VPC Security Group or have MWAA create a new one. For this demonstration, choose to have MWAA create a Security Group for you.
Lastly, select an appropriately-sized Environment class for Airflow based on the scale of your needs. The mw1.small class will be sufficient for this demonstration.

Finally, for Permissions, you must select an existing Airflow execution service role or create a new role. For this demonstration, create a new Airflow service role. We will later add additional permissions.

Airflow Execution Role
As part of this demonstration, we will be using Airflow to run Spark jobs on EMR (EMR Steps). To allow Airflow to interact with EMR, we must increase the new Airflow execution role’s default permissions. Additional permissions include allowing the new Airflow role to pass the EMR roles using iam:PassRole. For this demonstration, we will include the two default EMR Service and JobFlow roles, EMR_DefaultRole and EMR_EC2_DefaultRole. We will also include the corresponding custom EMR roles created in the previous post, EMR_DemoRole and EMR_EC2_DemoRole. For this demonstration, the Airflow service role also requires three specific EMR permissions as shown below. Later in the post, Airflow will also read files from S3, which requires s3:GetObject permission.
Create a new policy by importing the project’s JSON file, iam_policy/airflow_emr_policy.json
, and attach the new policy to the Airflow service role. Be sure to update the AWS Account ID in the file with your own Account ID.
The Airflow service role, created by MWAA, is shown below with the new policy attached.
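As a rough sketch of what such a policy might contain, the Python below assembles a policy document from the permissions described above. The four role names, iam:PassRole, and s3:GetObject come from the post; the three EMR actions, the `work_bucket` parameter, and the resource scoping are assumptions made for illustration and may differ from the project's actual airflow_emr_policy.json file.

```python
import json

# Role names taken from the post: the two default EMR roles and the two custom roles.
EMR_ROLES = [
    "EMR_DefaultRole",
    "EMR_EC2_DefaultRole",
    "EMR_DemoRole",
    "EMR_EC2_DemoRole",
]

# Assumption: the post does not name the three EMR actions in its text.
# These are a plausible minimum for creating a cluster and tracking steps.
EMR_ACTIONS = [
    "elasticmapreduce:RunJobFlow",
    "elasticmapreduce:AddJobFlowSteps",
    "elasticmapreduce:DescribeStep",
]


def build_airflow_emr_policy(account_id: str, work_bucket: str) -> dict:
    """Return an IAM policy document granting the permissions described above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": EMR_ACTIONS, "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": [
                    f"arn:aws:iam::{account_id}:role/{role}" for role in EMR_ROLES
                ],
            },
            {
                # Assumption: read access is scoped to a single bucket.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{work_bucket}/*",
            },
        ],
    }


# Placeholder account ID and bucket name, in the style used elsewhere in the post.
print(json.dumps(
    build_airflow_emr_policy("123412341234", "emr-demo-work-123412341234-us-east-1"),
    indent=2,
))
```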

Final Architecture
Below is the final high-level architecture for the post’s demonstration. The diagram shows the approximate route of a DAG Run request, in red. The diagram includes an optional S3 Gateway VPC endpoint, not detailed in the post, but recommended for additional security. According to AWS, a VPC endpoint enables you to privately connect your VPC to supported AWS services and VPC endpoint services powered by AWS PrivateLink without requiring an internet gateway. In this case, the endpoint provides a private connection between the MWAA VPC and Amazon S3. It is also possible to create an EMR Interface VPC Endpoint to securely route traffic directly to EMR from MWAA, instead of connecting over the Internet.

Amazon MWAA Environment
The new MWAA Environment will include a link to the Airflow UI.

Airflow UI
Using the supplied link, you should be able to access the Airflow UI using your web browser.

Our First DAG
The Amazon MWAA documentation includes an example DAG, which contains one of several sample programs, SparkPi, which comes with Spark. I have created a similar DAG that is included in the GitHub project, dags/emr_steps_demo.py. The DAG will create a minimally-sized single-node EMR cluster with no Core or Task nodes. The DAG will then use that cluster to submit the calculate_pi job to Spark. Once the job is complete, the DAG will terminate the EMR cluster.
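To make the shape of that DAG's inputs more concrete, here is a minimal sketch of the two dictionaries such a DAG would typically hand to the EMR operators: a cluster definition with a single Master instance group and one step that runs the SparkPi example. The step mirrors the SparkPi example shipped with Airflow. The cluster name, release label, and instance type are placeholders I have assumed, not values from the project.

```python
# Assumed cluster definition: one Master node, no Core or Task instance groups.
# The name, release label, and instance type are illustrative placeholders.
JOB_FLOW_OVERRIDES = {
    "Name": "demo-cluster-airflow",
    "ReleaseLabel": "emr-6.2.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# A single EMR Step that runs the SparkPi sample program bundled with Spark.
SPARK_STEPS = [
    {
        "Name": "calculate_pi",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["/usr/lib/spark/bin/run-example", "SparkPi", "10"],
        },
    }
]


def node_count(job_flow: dict) -> int:
    """Total number of EC2 instances the cluster definition asks for."""
    groups = job_flow["Instances"]["InstanceGroups"]
    return sum(group["InstanceCount"] for group in groups)


print(node_count(JOB_FLOW_OVERRIDES), [step["Name"] for step in SPARK_STEPS])
```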
Upload the DAG to the Airflow S3 bucket’s dags directory. Substitute your Airflow S3 bucket name in the AWS CLI command below, then run it from the project’s root.
aws s3 cp dags/spark_pi_example.py \
s3://<your_airflow_bucket_name>/dags/
The DAG, spark_pi_example, should automatically appear in the Airflow UI. Click on ‘Trigger DAG’ to create a new EMR cluster and start the Spark job.

The DAG has no optional configuration to input as JSON. Select ‘Trigger’ to submit the job, as shown below.

The DAG should complete all three tasks successfully, as shown in the DAG’s ‘Graph View’ tab below.

Switching to the EMR Console, you should see the single-node EMR cluster being created.

On the ‘Steps’ tab, you should see that the ‘calculate_pi’ Spark job has been submitted and is waiting for the cluster to be ready to be run.

Triggering DAGs Programmatically
The Amazon MWAA service is available using the AWS Management Console, as well as the Amazon MWAA API using the latest versions of the AWS SDK and AWS CLI. To automate the DAG Run, we could use the AWS CLI and invoke the Airflow CLI via an endpoint on the Apache Airflow Webserver. The Amazon MWAA documentation and Airflow’s CLI documentation explain how.
Below is an example of triggering the spark_pi_example DAG programmatically using Airflow’s trigger_dag CLI command. You will need to replace the WEB_SERVER_HOSTNAME variable with your own Airflow Web Server’s hostname. The ENVIROMENT_NAME variable assumes only one MWAA environment is returned by jq.
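For readers who prefer to see the request itself, the following is a minimal Python sketch of the same idea, not the post's own script. It assumes you have already obtained a CLI token (for example, with `aws mwaa create-cli-token`) and that the web server accepts the Airflow CLI command as a plain-text POST to its `/aws_mwaa/cli` path. The function names, hostname, and token value are hypothetical.

```python
import json
import urllib.request


def single_environment(list_environments_output: str) -> str:
    """Return the only MWAA environment name, failing loudly if there are several.

    Expects the JSON printed by `aws mwaa list-environments`.
    """
    names = json.loads(list_environments_output)["Environments"]
    if len(names) != 1:
        raise ValueError(f"expected exactly one MWAA environment, found {len(names)}")
    return names[0]


def build_cli_request(
    web_server_hostname: str, cli_token: str, command: str
) -> urllib.request.Request:
    """Build (but do not send) the POST that hands an Airflow CLI command to MWAA."""
    return urllib.request.Request(
        url=f"https://{web_server_hostname}/aws_mwaa/cli",
        data=command.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {cli_token}",
            "Content-Type": "text/plain",
        },
        method="POST",
    )


# Placeholder hostname and token; nothing is sent over the network here.
request = build_cli_request(
    "example.airflow.amazonaws.com", "<cli_token>", "trigger_dag spark_pi_example"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit the command. Keeping construction separate from sending makes the hostname and command easy to inspect first.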
Analytics Job with Airflow
Next, we will submit an actual analytics job to EMR. If you recall from the previous post, we had four different analytics PySpark applications, which performed analyses on the three Kaggle datasets. For the next DAG, we will run a Spark job that executes the bakery_sales_ssm.py PySpark application. This job should already exist in the processed data S3 bucket.
The DAG, dags/bakery_sales.py, creates an EMR cluster identical to the EMR cluster created with the run_job_flow.py Python script in the previous post. All EMR configuration options available when using AWS Step Functions are available with Airflow’s airflow.contrib.operators and airflow.contrib.sensors packages for EMR.
Airflow leverages Jinja Templating and provides the pipeline author with a set of built-in parameters and macros. The Bakery Sales DAG contains eleven Jinja template variables. Seven variables will be configured in the Airflow UI by importing a JSON file into the ‘Admin’ ⇨ ‘Variables’ tab. These template variables are prefixed with var.value in the DAG. The other three variables will be passed as a DAG Run configuration as a JSON blob, similar to the previous DAG example. These template variables are prefixed with dag_run.conf.
Import Variables into Airflow UI
First, to import the required variables, change the values in the project’s airflow_variables/admin_variables_bakery.json file. You will need to update the values for bootstrap_bucket, emr_ec2_key_pair, logs_bucket, and work_bucket. The three S3 buckets should all exist from the previous post.
Next, import the variables file from the ‘Admin’ ⇨ ‘Variables’ tab of the Airflow UI.
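If you would rather script the edit than change the file by hand, a small helper along these lines could apply your values before you import the file. This is only a sketch: it assumes the variables file is a flat JSON object of key/value pairs, which is the format Airflow's variable import expects, and the function names and example values are hypothetical.

```python
import json

# The four keys the post says must be updated for your environment.
REQUIRED_KEYS = ("bootstrap_bucket", "emr_ec2_key_pair", "logs_bucket", "work_bucket")


def update_variables(variables: dict, overrides: dict) -> dict:
    """Return a copy of the Airflow variables with environment-specific values applied."""
    missing = [key for key in REQUIRED_KEYS if key not in overrides]
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    updated = dict(variables)
    updated.update(overrides)
    return updated


def rewrite_variables_file(path: str, overrides: dict) -> None:
    """Load the JSON variables file, apply the overrides, and write it back."""
    with open(path, encoding="utf-8") as handle:
        variables = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(update_variables(variables, overrides), handle, indent=2)


# Hypothetical values; substitute your own bucket and key pair names.
example = update_variables(
    {"work_bucket": "change-me"},
    {
        "bootstrap_bucket": "emr-demo-bootstrap-123412341234-us-east-1",
        "emr_ec2_key_pair": "my-emr-key-pair",
        "logs_bucket": "emr-demo-logs-123412341234-us-east-1",
        "work_bucket": "emr-demo-work-123412341234-us-east-1",
    },
)
print(json.dumps(example, indent=2))
```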

Upload the DAG, dags/bakery_sales.py, to the Airflow S3 bucket, similar to the first DAG.
aws s3 cp dags/bakery_sales.py \
s3://<your_airflow_bucket_name>/dags/
The second DAG, bakery_sales, should automatically appear in the Airflow UI. Click on ‘Trigger DAG’ to create a new EMR cluster and start the Spark job.

Input the three required parameters in the ‘Trigger DAG’ interface, used to pass the DAG Run configuration, and select ‘Trigger’. A sample of the JSON blob can be found in the project, airflow_variables/dag_run.conf_bakery.json.
{ "airflow_email": "analytics_team@example.com", "email_on_failure": false, "email_on_retry": false }
This is just for demonstration purposes. To send and receive emails, you will need to configure Airflow.

Switching to the EMR Console, you should see the ‘Bakery Sales’ Spark job in the ‘Steps’ tab.

Multi-Step DAG
In our last example, we will use a single DAG to run four Spark jobs in parallel. The Spark job arguments (EmrAddStepsOperator steps parameter) will be loaded from an external JSON file residing in Amazon S3, instead of defined in the DAG, as in the previous two DAG examples. Additionally, the EMR cluster specifications (EmrCreateJobFlowOperator job_flow_overrides parameter) will also be loaded from an external JSON file. Using this method, we decouple the EMR provisioning and job details from the DAG. DataOps or DevOps Engineers might manage the EMR cluster specifications as code, while Data Analysts manage the Spark job arguments, separately. A third team might manage the DAG itself.
We still maintain the variables in the JSON files. The DAG will read the JSON file-based configuration into the tasks as JSON blobs, then replace the Jinja template variables (expressions) in the DAG with variable values defined in Airflow or input as parameters when the DAG is triggered.
Below we see a snippet of two of the four Spark submit-job job definitions (steps), which have been moved to a separate JSON file, emr_steps/emr_steps.json.
Below are the EMR cluster specifications (job_flow_overrides), which have been moved to a separate JSON file, job_flow_overrides/job_flow_overrides.json.
Decoupling the configurations reduces the DAG from well over 200 lines of code to less than 75 lines. Note lines 56 and 63 of the DAG below. Instead of referencing a local object variable, the parameters now reference the function, get_objects(key, bucket_name), which loads the JSON.
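One way such a loader could be written is sketched below. The function name and its first two parameters follow the post; the body, the optional `s3_client` parameter, and the `FakeS3` stand-in are my own assumptions, added so the sketch can be exercised without AWS access. The project's DAG may load the files differently, for example through an Airflow hook.

```python
import io
import json


def get_objects(key, bucket_name, s3_client=None):
    """Load a JSON document from S3 and return it as Python objects.

    s3_client is a hypothetical extra parameter: any object exposing boto3's
    get_object(Bucket=..., Key=...) call. When omitted, a real client is created.
    """
    if s3_client is None:
        import boto3  # imported lazily so the sketch also runs without boto3 installed

        s3_client = boto3.client("s3")
    response = s3_client.get_object(Bucket=bucket_name, Key=key)
    return json.loads(response["Body"].read().decode("utf-8"))


class FakeS3:
    """In-memory stand-in for the S3 client, for local experimentation only."""

    def __init__(self, objects):
        self.objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}


# Hypothetical bucket contents holding one abbreviated step definition.
fake = FakeS3({("work-bucket", "emr_steps/emr_steps.json"): b'[{"Name": "Bakery Sales"}]'})
print(get_objects("emr_steps/emr_steps.json", "work-bucket", s3_client=fake))
```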
This time, we need to upload three files to S3, the DAG to the Airflow S3 bucket, and the two JSON files to the EMR Work S3 bucket. Change the bucket names to match your environment, then run the three AWS CLI commands shown below.
aws s3 cp emr_steps/emr_steps.json \
s3://emr-demo-work-123412341234-us-east-1/emr_steps/
aws s3 cp job_flow_overrides/job_flow_overrides.json \
s3://emr-demo-work-123412341234-us-east-1/job_flow_overrides/
aws s3 cp dags/multiple_steps.py \
s3://airflow-123412341234-us-east-1/dags/
The third DAG, multiple_steps, should automatically appear in the Airflow UI. Click on ‘Trigger DAG’ to create a new EMR cluster and start the four Spark jobs. The three required input parameters in the ‘Trigger DAG’ interface are identical to the previous bakery_sales DAG. A sample of that JSON blob can be found in the project at airflow_variables/dag_run.conf_bakery.json.

Below we see that the EMR cluster has completed the four Spark jobs (EMR Steps) and has auto-terminated. Note that all four jobs were started at the exact same time. If you recall from the previous post, this is possible because we preset the ‘Concurrency’ level to 5.

Triggering DAGs Programmatically
AWS CLI
Similar to the previous example, below we can trigger the multiple_steps DAG programmatically using Airflow’s trigger_dag CLI command. Note the addition of the --conf named argument, which passes the configuration, containing three key/value pairs, to the trigger command as a JSON blob.
AWS SDK
Airflow DAGs can also be triggered using the AWS SDK. For example, with boto3 for Python, we could use a script similar to the following to remotely trigger a DAG.
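A minimal sketch of such a script follows; it is an illustration under stated assumptions rather than the post's original code. It uses boto3's `create_cli_token` call to obtain a token and hostname, posts the trigger_dag command with its --conf JSON, and decodes the base64-encoded stdout and stderr fields of the response. Quoting the JSON with `shlex.quote` assumes the endpoint tokenizes the command the way a POSIX shell would. The helper names are hypothetical.

```python
import base64
import json
import shlex
import urllib.request


def build_trigger_command(dag_id: str, conf: dict) -> str:
    """Compose the Airflow CLI command, passing the DAG Run configuration as JSON."""
    conf_json = json.dumps(conf, separators=(",", ":"))
    # Assumption: shell-style quoting keeps the JSON intact as a single argument.
    return f"trigger_dag {dag_id} --conf {shlex.quote(conf_json)}"


def decode_cli_response(body: bytes) -> dict:
    """Decode the base64-encoded stdout and stderr fields of the CLI response."""
    payload = json.loads(body)
    return {
        name: base64.b64decode(payload[name]).decode("utf-8")
        for name in ("stdout", "stderr")
    }


def trigger_dag(environment_name: str, dag_id: str, conf: dict) -> dict:
    """Request a CLI token for the environment, then submit the trigger command."""
    import boto3  # imported lazily; only needed when actually calling AWS

    token = boto3.client("mwaa").create_cli_token(Name=environment_name)
    request = urllib.request.Request(
        url=f"https://{token['WebServerHostname']}/aws_mwaa/cli",
        data=build_trigger_command(dag_id, conf).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token['CliToken']}",
            "Content-Type": "text/plain",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return decode_cli_response(response.read())


# Only the command is built here; trigger_dag() needs AWS credentials to run.
print(build_trigger_command(
    "multiple_steps",
    {
        "airflow_email": "analytics_team@example.com",
        "email_on_failure": False,
        "email_on_retry": False,
    },
))
```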
Cleaning Up
Once you are done with the MWAA Environment, be sure to delete it as soon as possible to avoid additional costs. Also, delete the MWAA-VPC CloudFormation stack. Its resources, such as the two NAT Gateways, will otherwise continue to generate costs.
aws mwaa delete-environment --name <your_mwaa_environment_name>
aws cloudformation delete-stack --stack-name MWAA-VPC
Conclusion
In this second post in the series, we explored using the newly released Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run PySpark applications on Amazon Elastic MapReduce (Amazon EMR). In future posts, we will explore the use of Jupyter and Zeppelin notebooks for data science, scientific computing, and machine learning on EMR.
If you are interested in learning more about configuring Amazon MWAA and Airflow, see my recent post, Amazon Managed Workflows for Apache Airflow — Configuration: Understanding Amazon MWAA’s Configuration Options.
This blog represents my own viewpoints and not of my employer, Amazon Web Services. All product names, logos, and brands are the property of their respective owners.
Provision and Deploy a Consul Cluster on AWS, using Terraform, Docker, and Jenkins
Posted by Gary A. Stafford in AWS, Build Automation, Cloud, Continuous Delivery, DevOps, Software Development on March 20, 2017
Introduction
Modern DevOps tools, such as HashiCorp’s Packer and Terraform, make it easier to provision and manage complex cloud architecture. Utilizing a CI/CD server, such as Jenkins, to securely automate the use of these DevOps tools, ensures quick and consistent results.
In a recent post, Distributed Service Configuration with Consul, Spring Cloud, and Docker, we built a Consul cluster using Docker swarm mode, to host distributed configurations for a Spring Boot application. The cluster was built locally with VirtualBox. This architecture is fine for development and testing, but not for use in Production.
In this post, we will deploy a highly available three-node Consul cluster to AWS. We will use Terraform to provision a set of EC2 instances and accompanying infrastructure. The instances will be built from a hybrid AMI containing the new Docker Community Edition (CE). In a recent post, Baking AWS AMI with new Docker CE Using Packer, we provisioned an Ubuntu AMI with Docker CE, using Packer. We will deploy Docker containers to each EC2 host, containing an instance of Consul server.
All source code can be found on GitHub.
Jenkins
I have chosen Jenkins to automate all of the post’s build, provisioning, and deployment tasks. However, none of the code is written specifically to Jenkins; you may run all of it from the command line.
For this post, I have built four projects in Jenkins, as follows:
- Provision Docker CE AMI: Builds Ubuntu AMI with Docker CE, using Packer
- Provision Consul Infra AWS: Provisions Consul infrastructure on AWS, using Terraform
- Deploy Consul Cluster AWS: Deploys Consul to AWS, using Docker
- Destroy Consul Infra AWS: Destroys Consul infrastructure on AWS, using Terraform
We will primarily be using the ‘Provision Consul Infra AWS’, ‘Deploy Consul Cluster AWS’, and ‘Destroy Consul Infra AWS’ Jenkins projects in this post. The fourth Jenkins project, ‘Provision Docker CE AMI’, automates the steps found in the recent post, Baking AWS AMI with new Docker CE Using Packer, to build the AMI used to provision the EC2 instances in this post.
Terraform
Using Terraform, we will provision EC2 instances in three different Availability Zones within the US East 1 (N. Virginia) Region. Using Terraform’s Amazon Web Services (AWS) provider, we will create the following AWS resources:
- (1) Virtual Private Cloud (VPC)
- (1) Internet Gateway
- (1) Key Pair
- (3) Elastic Compute Cloud (EC2) Instances
- (2) Security Groups
- (3) Subnets
- (1) Route
- (3) Route Tables
- (3) Route Table Associations
The final AWS architecture should resemble the following:
Production Ready AWS
Although we have provisioned a fairly complete VPC for this post, it is far from being ready for Production. I have created two security groups, limiting the ingress and egress to the cluster. However, to further productionize the environment would require additional security hardening. At a minimum, you should consider adding public/private subnets, NAT gateways, network access control list rules (network ACLs), and the use of HTTPS for secure communications.
In production, applications would communicate with Consul through local Consul clients. Consul clients would take part in the LAN gossip pool from different subnets, Availability Zones, Regions, or VPCs using VPC peering. Communications would be tightly controlled by IAM, VPC, subnet, IP address, and port.
Also, you would not have direct access to the Consul UI through a publicly exposed IP or DNS address. Access to the UI would be removed altogether or locked down to specific IP addresses, and access restricted to secure communication channels.
Consul
We will achieve high availability (HA) by clustering three Consul server nodes across the three Elastic Compute Cloud (EC2) instances. In this minimally sized, three-node cluster of Consul servers, we are protected from the loss of one Consul server node, one EC2 instance, or one Availability Zone (AZ). The cluster will still maintain a quorum of two nodes. An additional level of HA that Consul supports, multiple datacenters (multiple AWS Regions), is not demonstrated in this post.
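The quorum arithmetic behind that claim is simple majority voting: a cluster of n servers needs more than half of them available, so it tolerates the loss of the remainder. The short sketch below works through a few cluster sizes; the function names are my own.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of a cluster with the given number of server nodes."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many server nodes can be lost while a majority remains."""
    return servers - quorum_size(servers)


# Print servers, quorum, and tolerated failures for a few common cluster sizes.
for servers in (1, 3, 5):
    print(servers, quorum_size(servers), failure_tolerance(servers))
```

For the three-node cluster in this post, the quorum is two and one failure is tolerated. A fourth server would raise the quorum to three without tolerating any additional failures, which is why odd cluster sizes are the usual choice.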
Docker
Having Docker CE already installed on each EC2 instance allows us to execute remote Docker commands over SSH from Jenkins. These commands will deploy and configure a Consul server node, within a Docker container, on each EC2 instance. The containers are built from HashiCorp’s latest Consul Docker image pulled from Docker Hub.
Getting Started
Preliminary Steps
If you have built infrastructure on AWS with Terraform, these steps should be familiar to you:
- First, you will need an AMI with Docker. I suggest reading Baking AWS AMI with new Docker CE Using Packer.
- You will need an AWS IAM User with the proper access to create the required infrastructure. For this post, I created a separate Jenkins IAM User with PowerUser level access.
- You will need to have an RSA public-private key pair, which can be used to SSH into the EC2 instances and install Consul.
- Ensure you have your AWS credentials set. I usually source mine from a .env file, as environment variables. Jenkins can securely manage credentials, using secret text or files.
- Fork and/or clone the Consul cluster project from GitHub.
- Change the aws_key_name and public_key_path variable values to your own RSA key, in the variables.tf file.
- Change the aws_amis_base variable values to your own AMI ID (see step 1).
- If you do not want to use the US East 1 Region and its AZs, modify the variables.tf, network.tf, and instances.tf files.
- Disable Terraform’s remote state or modify the resource to match your remote state configuration, in the main.tf file. I am using an Amazon S3 bucket to store my Terraform remote state.
Building an AMI with Docker
If you have not built an Amazon Machine Image (AMI) for use in this post already, you can do so using the scripts provided in the previous post’s GitHub repository. To automate the AMI build task, I built the ‘Provision Docker CE AMI’ Jenkins project. Identical to the other three Jenkins projects in this post, this project has three main tasks, which include: 1) SCM: clone the Packer AMI GitHub project, 2) Bindings: set up the AWS credentials, and 3) Build: run Packer.
The SCM and Bindings tasks are identical to the other projects (see below for details), except for the use of a different GitHub repository. The project’s Build step, which runs the packer_build_ami.sh script, looks as follows:
The resulting AMI ID will need to be manually placed in Terraform’s variables.tf file, before provisioning the AWS infrastructure with Terraform. The new AMI ID will be displayed in Jenkins’ build output.
Provisioning with Terraform
Based on the modifications you made in the Preliminary Steps, execute the terraform validate command to confirm your changes. Then, run the terraform plan command to review the plan. Assuming there were no errors, finally, run the terraform apply command to provision the AWS infrastructure components.
In Jenkins, I have created the ‘Provision Consul Infra AWS’ project. This project has three tasks, which include: 1) SCM: clone the GitHub project, 2) Bindings: set up the AWS credentials, and 3) Build: run Terraform. Those tasks look as follows:
You will obviously need to use your modified GitHub project, incorporating the configuration changes detailed above, as the SCM source for Jenkins.
You will also need to configure your AWS credentials.
The provision_infra.sh script provisions the AWS infrastructure using Terraform. The script also updates Terraform’s remote state. Remember to update the remote state configuration in the script to match your personal settings.
cd tf_env_aws/
terraform remote config \
  -backend=s3 \
  -backend-config="bucket=your_bucket" \
  -backend-config="key=terraform_consul.tfstate" \
  -backend-config="region=your_region"
terraform plan
terraform apply
The Jenkins build output should look similar to the following:
Although the build only takes about 90 seconds to complete, the EC2 instances could take a few extra minutes to complete their Status Checks and be completely ready. The final results in the AWS EC2 Management Console should look as follows:
Note each EC2 instance is running in a different US East 1 Availability Zone.
Installing Consul
Once the AWS infrastructure is running and the EC2 instances have completed their Status Checks successfully, we are ready to deploy Consul. In Jenkins, I have created the ‘Deploy Consul Cluster AWS’ project. This project has three tasks, which include: 1) SCM: clone the GitHub project, 2) Bindings: set up the AWS credentials, and 3) Build: run an SSH remote Docker command on each EC2 instance to deploy Consul. The SCM and Bindings tasks are identical to the project above. The project’s Build step looks as follows:
First, the delete_containers.sh script deletes any previous instances of Consul containers. This is helpful if you need to re-deploy Consul. Next, the deploy_consul.sh script executes a series of SSH remote Docker commands to install and configure Consul on each EC2 instance.
# Advertised Consul IP
export ec2_server1_private_ip=$(aws ec2 describe-instances \
  --filters Name='tag:Name,Values=tf-instance-consul-server-1' \
  --output text --query 'Reservations[*].Instances[*].PrivateIpAddress')
echo "consul-server-1 private ip: ${ec2_server1_private_ip}"
# Deploy Consul Server 1
ec2_public_ip=$(aws ec2 describe-instances \
  --filters Name='tag:Name,Values=tf-instance-consul-server-1' \
  --output text --query 'Reservations[*].Instances[*].PublicIpAddress')
consul_server="consul-server-1"
ssh -oStrictHostKeyChecking=no -T \
  -i ~/.ssh/consul_aws_rsa \
  ubuntu@${ec2_public_ip} << EOSSH
docker run -d \
  --net=host \
  --hostname ${consul_server} \
  --name ${consul_server} \
  --env "SERVICE_IGNORE=true" \
  --env "CONSUL_CLIENT_INTERFACE=eth0" \
  --env "CONSUL_BIND_INTERFACE=eth0" \
  --volume /home/ubuntu/consul/data:/consul/data \
  --publish 8500:8500 \
  consul:latest \
  consul agent -server -ui -client=0.0.0.0 \
  -bootstrap-expect=3 \
  -advertise='{{ GetInterfaceIP "eth0" }}' \
  -data-dir="/consul/data"
sleep 5
docker logs consul-server-1
docker exec -i consul-server-1 consul members
EOSSH
# Deploy Consul Server 2
ec2_public_ip=$(aws ec2 describe-instances \
  --filters Name='tag:Name,Values=tf-instance-consul-server-2' \
  --output text --query 'Reservations[*].Instances[*].PublicIpAddress')
consul_server="consul-server-2"
ssh -oStrictHostKeyChecking=no -T \
  -i ~/.ssh/consul_aws_rsa \
  ubuntu@${ec2_public_ip} << EOSSH
docker run -d \
  --net=host \
  --hostname ${consul_server} \
  --name ${consul_server} \
  --env "SERVICE_IGNORE=true" \
  --env "CONSUL_CLIENT_INTERFACE=eth0" \
  --env "CONSUL_BIND_INTERFACE=eth0" \
  --volume /home/ubuntu/consul/data:/consul/data \
  --publish 8500:8500 \
  consul:latest \
  consul agent -server -ui -client=0.0.0.0 \
  -advertise='{{ GetInterfaceIP "eth0" }}' \
  -retry-join="${ec2_server1_private_ip}" \
  -data-dir="/consul/data"
sleep 5
docker logs consul-server-2
docker exec -i consul-server-2 consul members
EOSSH
# Deploy Consul Server 3
ec2_public_ip=$(aws ec2 describe-instances \
  --filters Name='tag:Name,Values=tf-instance-consul-server-3' \
  --output text --query 'Reservations[*].Instances[*].PublicIpAddress')
consul_server="consul-server-3"
ssh -oStrictHostKeyChecking=no -T \
  -i ~/.ssh/consul_aws_rsa \
  ubuntu@${ec2_public_ip} << EOSSH
docker run -d \
  --net=host \
  --hostname ${consul_server} \
  --name ${consul_server} \
  --env "SERVICE_IGNORE=true" \
  --env "CONSUL_CLIENT_INTERFACE=eth0" \
  --env "CONSUL_BIND_INTERFACE=eth0" \
  --volume /home/ubuntu/consul/data:/consul/data \
  --publish 8500:8500 \
  consul:latest \
  consul agent -server -ui -client=0.0.0.0 \
  -advertise='{{ GetInterfaceIP "eth0" }}' \
  -retry-join="${ec2_server1_private_ip}" \
  -data-dir="/consul/data"
sleep 5
docker logs consul-server-3
docker exec -i consul-server-3 consul members
EOSSH
# Output Consul Web UI URL
ec2_public_ip=$(aws ec2 describe-instances \
  --filters Name='tag:Name,Values=tf-instance-consul-server-1' \
  --output text --query 'Reservations[*].Instances[*].PublicIpAddress')
echo " "
echo "*** Consul UI: http://${ec2_public_ip}:8500/ui/ ***"
The entire Jenkins build process only takes about 30 seconds. Afterward, the output from a successful Jenkins build should show that all three Consul server instances are running, have formed a quorum, and have elected a Leader.
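The three deployment blocks differ only in the agent arguments: the first server is started with -bootstrap-expect, while the other two join it with -retry-join. As a compact way to see that difference, the sketch below generates the agent arguments for each server. The function name and the example private IP are hypothetical, and this is an illustration rather than a replacement for the script.

```python
def consul_agent_args(server_number: int, first_server_private_ip: str,
                      expected_servers: int = 3) -> list:
    """Return the `consul agent` arguments used for the given server in the script."""
    args = ["consul", "agent", "-server", "-ui", "-client=0.0.0.0"]
    if server_number == 1:
        # Only the first server declares how many servers to wait for.
        args.append(f"-bootstrap-expect={expected_servers}")
    args.append('-advertise={{ GetInterfaceIP "eth0" }}')
    if server_number > 1:
        # Later servers join the cluster through the first server's private IP.
        args.append(f"-retry-join={first_server_private_ip}")
    args.append("-data-dir=/consul/data")
    return args


# 10.0.1.10 is a placeholder for the first server's private IP address.
for number in (1, 2, 3):
    print(number, " ".join(consul_agent_args(number, "10.0.1.10")))
```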
Persisting State
The Consul Docker image exposes VOLUME /consul/data, which is a path where Consul will place its persisted state. Using Terraform’s remote-exec provisioner, we create a directory on each EC2 instance, at /home/ubuntu/consul/config. The docker run command, as shown in the deployment script above, bind-mounts the container’s /consul/data path to the EC2 host’s /home/ubuntu/consul/data directory.
According to Consul, the Consul server container instance will ‘store the client information plus snapshots and data related to the consensus algorithm and other state, like Consul’s key/value store and catalog’ in the /consul/data directory. That container directory is now bind-mounted to the EC2 host, as demonstrated below.
Accessing Consul
Following a successful deployment, you should be able to use the public URL, displayed in the build output of the ‘Deploy Consul Cluster AWS’ project, to access the Consul UI. Clicking on the Nodes tab in the UI, you should see all three Consul server instances, one per EC2 instance, running and healthy.
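If you need to reconstruct that URL outside of Jenkins, the lookup the script performs can be sketched in Python against the JSON that `aws ec2 describe-instances` returns. The function names and the sample address (taken from a documentation-only IP range) are assumptions for illustration.

```python
def public_ips(describe_instances_output: dict) -> list:
    """Collect every public IP, like the query Reservations[*].Instances[*].PublicIpAddress."""
    return [
        instance["PublicIpAddress"]
        for reservation in describe_instances_output["Reservations"]
        for instance in reservation["Instances"]
        if "PublicIpAddress" in instance
    ]


def consul_ui_url(public_ip: str) -> str:
    """URL of the Consul UI on the port published by the docker run command."""
    return f"http://{public_ip}:8500/ui/"


# Abbreviated, made-up response for the instance tagged tf-instance-consul-server-1.
sample = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0123456789abcdef0", "PublicIpAddress": "203.0.113.10"}]}
    ]
}
print(consul_ui_url(public_ips(sample)[0]))
```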
Destroying Infrastructure
When you are finished with the post, you may want to remove the running infrastructure, so you don’t continue to get billed by Amazon. The ‘Destroy Consul Infra AWS’ project destroys all the AWS infrastructure, provisioned as part of this post, in about 60 seconds. The project’s SCM and Bindings tasks are identical to both previous projects. The Build step calls the destroy_infra.sh script, which is included in the GitHub project. The script executes the terraform destroy -force command. It will delete all running infrastructure components associated with the post and update Terraform’s remote state.
Conclusion
This post has demonstrated how modern DevOps tooling, such as HashiCorp’s Packer and Terraform, makes it easy to build, provision, and manage complex cloud architecture. Using a CI/CD server, such as Jenkins, to securely automate the use of these tools ensures quick and consistent results.
All opinions in this post are my own and not necessarily the views of my current employer or their clients.
Baking AWS AMI with new Docker CE Using Packer
Posted by Gary A. Stafford in AWS, Build Automation, Cloud, DevOps on March 6, 2017
Introduction
On March 2 (less than a week ago as of this post), Docker announced the release of Docker Enterprise Edition (EE), a new version of the Docker platform optimized for business-critical deployments. As part of the release, Docker also renamed the free Docker products to Docker Community Edition (CE). Both Docker EE and CE are adopting a new time-based versioning scheme. The initial release of Docker CE and EE, the 17.03 release, is the first to use the new scheme.
Along with the release, Docker delivered excellent documentation on installing, configuring, and troubleshooting the new Docker EE and CE. In this post, I will demonstrate how to partially bake an existing Amazon Machine Image (Amazon AMI) with the new Docker CE, preparing it as a base for the creation of Amazon Elastic Compute Cloud (Amazon EC2) compute instances.
Adding Docker and similar tooling to an AMI is referred to as partially baking an AMI; the result is often called a hybrid AMI. According to AWS, ‘hybrid AMIs provide a subset of the software needed to produce a fully functional instance, falling in between the fully baked and JeOS (just enough operating system) options on the AMI design spectrum.’
Installing Docker CE on an AWS AMI should not be confused with Docker’s also recently announced Docker Community Edition (CE) for AWS. Docker for AWS offers multiple CloudFormation templates for Docker EE and CE. According to Docker, Docker for AWS ‘provides a Docker-native solution that avoids operational complexity and adding unneeded additional APIs to the Docker stack.’
Base AMI
Docker provides detailed directions for installing Docker CE and EE onto several major Linux distributions. For this post, we will choose a widely used Linux distro, Ubuntu. According to Docker, currently Docker CE and EE can be installed on three popular Ubuntu releases:
- Yakkety 16.10
- Xenial 16.04 (LTS)
- Trusty 14.04 (LTS)
To provision a small EC2 instance in Amazon’s US East (N. Virginia) Region, I will choose Ubuntu 16.04.2 LTS Xenial Xerus. According to Canonical’s Amazon EC2 AMI Locator website, a Xenial 16.04 LTS AMI is available, ami-09b3691f, for US East 1, as a t2.micro EC2 instance type.
Packer
HashiCorp Packer will be used to partially bake the base Ubuntu Xenial 16.04 AMI with Docker CE 17.03. HashiCorp describes Packer as ‘a tool for creating machine and container images for multiple platforms from a single source configuration.’ The JSON-format Packer file is as follows:
{
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "us_east_1_ami": "ami-09b3691f",
    "name": "aws-docker-ce-base",
    "us_east_1_name": "ubuntu-xenial-docker-ce-base",
    "ssh_username": "ubuntu"
  },
  "builders": [
    {
      "name": "{{user `us_east_1_name`}}",
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "us-east-1",
      "vpc_id": "",
      "subnet_id": "",
      "source_ami": "{{user `us_east_1_ami`}}",
      "instance_type": "t2.micro",
      "ssh_username": "{{user `ssh_username`}}",
      "ssh_timeout": "10m",
      "ami_name": "{{user `us_east_1_name`}} {{timestamp}}",
      "ami_description": "{{user `us_east_1_name`}} AMI",
      "run_tags": {
        "ami-create": "{{user `us_east_1_name`}}"
      },
      "tags": {
        "ami": "{{user `us_east_1_name`}}"
      },
      "ssh_private_ip": false,
      "associate_public_ip_address": true
    }
  ],
  "provisioners": [
    {
      "type": "file",
      "source": "bootstrap_docker_ce.sh",
      "destination": "/tmp/bootstrap_docker_ce.sh"
    },
    {
      "type": "file",
      "source": "cleanup.sh",
      "destination": "/tmp/cleanup.sh"
    },
    {
      "type": "shell",
      "execute_command": "echo 'packer' | sudo -S sh -c '{{ .Vars }} {{ .Path }}'",
      "inline": [
        "whoami",
        "cd /tmp",
        "chmod +x bootstrap_docker_ce.sh",
        "chmod +x cleanup.sh",
        "ls -alh /tmp",
        "./bootstrap_docker_ce.sh",
        "sleep 10",
        "./cleanup.sh"
      ]
    }
  ]
}
The Packer file uses Packer’s amazon-ebs builder type. This builder is used to create Amazon AMIs backed by Amazon Elastic Block Store (EBS) volumes, for use in EC2.
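To make the flow of user variables in the template a little more concrete, the toy sketch below substitutes {{user `...`}} references with values from the variables block. It is purely an illustration of the idea and is not how Packer's own template engine is implemented; the function name and the regular expression are my own, and other functions such as {{timestamp}} are deliberately left untouched.

```python
import re

# Matches references of the form {{user `variable_name`}}.
USER_VARIABLE = re.compile(r"\{\{\s*user\s+`([^`]+)`\s*\}\}")


def resolve_user_variables(value: str, variables: dict) -> str:
    """Replace each {{user `name`}} reference with its value from the variables block."""
    return USER_VARIABLE.sub(lambda match: variables[match.group(1)], value)


# A subset of the template's variables block.
variables = {
    "us_east_1_ami": "ami-09b3691f",
    "us_east_1_name": "ubuntu-xenial-docker-ce-base",
}

# The builder's ami_description and ami_name fields, as written in the template.
print(resolve_user_variables("{{user `us_east_1_name`}} AMI", variables))
print(resolve_user_variables("{{user `us_east_1_name`}} {{timestamp}}", variables))
```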
Bootstrap Script
To install Docker CE on the AMI, the Packer file executes a bootstrap shell script. The bootstrap script and subsequent cleanup script are executed using Packer’s remote shell provisioner. The bootstrap script is as follows:
#!/bin/sh
sudo apt-get remove docker docker-engine
sudo apt-get install \
  apt-transport-https \
  ca-certificates \
  curl \
  software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) \
  stable"
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y docker-ce
sudo groupadd docker
sudo usermod -aG docker ubuntu
sudo systemctl enable docker
This script closely follows directions provided by Docker, for installing Docker CE on Ubuntu. After removing any previous copies of Docker, the script installs Docker CE. To ensure sudo is not required to execute Docker commands on any EC2 instance provisioned from the resulting AMI, the script adds the ubuntu user to the docker group.
The bootstrap script also uses systemd to enable the Docker daemon. Starting with Ubuntu 15.04, Systemd System and Service Manager is used by default instead of the previous init system, Upstart. Systemd ensures Docker will start on boot.
Cleaning Up
It is good practice to clean up after baking an AMI. I have included a basic cleanup script, as follows:
#!/bin/sh

set -e

echo 'Cleaning up after bootstrapping...'
sudo apt-get -y autoremove
sudo apt-get -y clean
sudo rm -rf /tmp/*
cat /dev/null > ~/.bash_history
history -c
exit
Partially Baking
Before running Packer to build the Docker CE AMI, I set both my AWS access key and AWS secret access key. The Packer file expects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
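Since the Packer file reads both credentials from the environment, a small pre-flight check can catch a missing variable before the build starts. This is a sketch under stated assumptions: require_env is a hypothetical helper, not something Packer or the original post provides.

```shell
# Hypothetical pre-flight helper (not from the original post): fail fast when
# a variable the Packer file reads with {{env `...`}} is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps this POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be used as require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY && packer build ubuntu_docker_ce_ami.json, so the build only starts when both variables are present. The helper reports every missing name rather than stopping at the first one.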
Running the packer build ubuntu_docker_ce_ami.json command builds the AMI. The abridged output should look similar to the following:
$ packer build docker_ami.json
ubuntu-xenial-docker-ce-base output will be in this color.
==> ubuntu-xenial-docker-ce-base: Prevalidating AMI Name...
    ubuntu-xenial-docker-ce-base: Found Image ID: ami-09b3691f
==> ubuntu-xenial-docker-ce-base: Creating temporary keypair: packer_58bc7a49-9e66-7f76-ce8e-391a67d94987
==> ubuntu-xenial-docker-ce-base: Creating temporary security group for this instance...
==> ubuntu-xenial-docker-ce-base: Authorizing access to port 22 the temporary security group...
==> ubuntu-xenial-docker-ce-base: Launching a source AWS instance...
    ubuntu-xenial-docker-ce-base: Instance ID: i-0ca883ecba0c28baf
==> ubuntu-xenial-docker-ce-base: Waiting for instance (i-0ca883ecba0c28baf) to become ready...
==> ubuntu-xenial-docker-ce-base: Adding tags to source instance
==> ubuntu-xenial-docker-ce-base: Waiting for SSH to become available...
==> ubuntu-xenial-docker-ce-base: Connected to SSH!
==> ubuntu-xenial-docker-ce-base: Uploading bootstrap_docker_ce.sh => /tmp/bootstrap_docker_ce.sh
==> ubuntu-xenial-docker-ce-base: Uploading cleanup.sh => /tmp/cleanup.sh
==> ubuntu-xenial-docker-ce-base: Provisioning with shell script: /var/folders/kf/637b0qns7xb0wh9p8c4q0r_40000gn/T/packer-shell189662158
...
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: E: Unable to locate package docker-engine
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: ca-certificates is already the newest version (20160104ubuntu1).
    ubuntu-xenial-docker-ce-base: apt-transport-https is already the newest version (1.2.19).
    ubuntu-xenial-docker-ce-base: curl is already the newest version (7.47.0-1ubuntu2.2).
    ubuntu-xenial-docker-ce-base: software-properties-common is already the newest version (0.96.20.5).
    ubuntu-xenial-docker-ce-base: 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
    ubuntu-xenial-docker-ce-base: OK
    ubuntu-xenial-docker-ce-base: pub 4096R/0EBFCD88 2017-02-22
    ubuntu-xenial-docker-ce-base: Key fingerprint = 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
    ubuntu-xenial-docker-ce-base: uid Docker Release (CE deb)
    ubuntu-xenial-docker-ce-base: sub 4096R/F273FCD8 2017-02-22
    ubuntu-xenial-docker-ce-base:
    ubuntu-xenial-docker-ce-base: Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial InRelease
    ubuntu-xenial-docker-ce-base: Get:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial-updates InRelease [102 kB]
...
    ubuntu-xenial-docker-ce-base: Get:27 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [89.5 kB]
    ubuntu-xenial-docker-ce-base: Fetched 10.6 MB in 2s (4,065 kB/s)
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: Calculating upgrade...
    ubuntu-xenial-docker-ce-base: 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: The following additional packages will be installed:
    ubuntu-xenial-docker-ce-base: aufs-tools cgroupfs-mount libltdl7
    ubuntu-xenial-docker-ce-base: Suggested packages:
    ubuntu-xenial-docker-ce-base: mountall
    ubuntu-xenial-docker-ce-base: The following NEW packages will be installed:
    ubuntu-xenial-docker-ce-base: aufs-tools cgroupfs-mount docker-ce libltdl7
    ubuntu-xenial-docker-ce-base: 0 upgraded, 4 newly installed, 0 to remove and 0 not upgraded.
    ubuntu-xenial-docker-ce-base: Need to get 19.4 MB of archives.
    ubuntu-xenial-docker-ce-base: After this operation, 89.4 MB of additional disk space will be used.
    ubuntu-xenial-docker-ce-base: Get:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial/universe amd64 aufs-tools amd64 1:3.2+20130722-1.1ubuntu1 [92.9 kB]
...
    ubuntu-xenial-docker-ce-base: Get:4 https://download.docker.com/linux/ubuntu xenial/stable amd64 docker-ce amd64 17.03.0~ce-0~ubuntu-xenial [19.3 MB]
    ubuntu-xenial-docker-ce-base: debconf: unable to initialize frontend: Dialog
    ubuntu-xenial-docker-ce-base: debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
    ubuntu-xenial-docker-ce-base: debconf: falling back to frontend: Readline
    ubuntu-xenial-docker-ce-base: debconf: unable to initialize frontend: Readline
    ubuntu-xenial-docker-ce-base: debconf: (This frontend requires a controlling tty.)
    ubuntu-xenial-docker-ce-base: debconf: falling back to frontend: Teletype
    ubuntu-xenial-docker-ce-base: dpkg-preconfigure: unable to re-open stdin:
    ubuntu-xenial-docker-ce-base: Fetched 19.4 MB in 1s (17.8 MB/s)
    ubuntu-xenial-docker-ce-base: Selecting previously unselected package aufs-tools.
    ubuntu-xenial-docker-ce-base: (Reading database ... 53844 files and directories currently installed.)
    ubuntu-xenial-docker-ce-base: Preparing to unpack .../aufs-tools_1%3a3.2+20130722-1.1ubuntu1_amd64.deb ...
    ubuntu-xenial-docker-ce-base: Unpacking aufs-tools (1:3.2+20130722-1.1ubuntu1) ...
...
    ubuntu-xenial-docker-ce-base: Setting up docker-ce (17.03.0~ce-0~ubuntu-xenial) ...
    ubuntu-xenial-docker-ce-base: Processing triggers for libc-bin (2.23-0ubuntu5) ...
    ubuntu-xenial-docker-ce-base: Processing triggers for systemd (229-4ubuntu16) ...
    ubuntu-xenial-docker-ce-base: Processing triggers for ureadahead (0.100.0-19) ...
    ubuntu-xenial-docker-ce-base: groupadd: group 'docker' already exists
    ubuntu-xenial-docker-ce-base: Synchronizing state of docker.service with SysV init with /lib/systemd/systemd-sysv-install...
    ubuntu-xenial-docker-ce-base: Executing /lib/systemd/systemd-sysv-install enable docker
    ubuntu-xenial-docker-ce-base: Cleanup...
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
==> ubuntu-xenial-docker-ce-base: Stopping the source instance...
==> ubuntu-xenial-docker-ce-base: Waiting for the instance to stop...
==> ubuntu-xenial-docker-ce-base: Creating the AMI: ubuntu-xenial-docker-ce-base 1288227081
    ubuntu-xenial-docker-ce-base: AMI: ami-e9ca6eff
==> ubuntu-xenial-docker-ce-base: Waiting for AMI to become ready...
==> ubuntu-xenial-docker-ce-base: Modifying attributes on AMI (ami-e9ca6eff)...
    ubuntu-xenial-docker-ce-base: Modifying: description
==> ubuntu-xenial-docker-ce-base: Modifying attributes on snapshot (snap-058a26c0250ee3217)...
==> ubuntu-xenial-docker-ce-base: Adding tags to AMI (ami-e9ca6eff)...
==> ubuntu-xenial-docker-ce-base: Tagging snapshot: snap-043a16c0154ee3217
==> ubuntu-xenial-docker-ce-base: Creating AMI tags
==> ubuntu-xenial-docker-ce-base: Creating snapshot tags
==> ubuntu-xenial-docker-ce-base: Terminating the source AWS instance...
==> ubuntu-xenial-docker-ce-base: Cleaning up any extra volumes...
==> ubuntu-xenial-docker-ce-base: No volumes to clean up, skipping
==> ubuntu-xenial-docker-ce-base: Deleting temporary security group...
==> ubuntu-xenial-docker-ce-base: Deleting temporary keypair...
Build 'ubuntu-xenial-docker-ce-base' finished.
==> Builds finished. The artifacts of successful builds are:
--> ubuntu-xenial-docker-ce-base: AMIs were created:
us-east-1: ami-e9ca6eff
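When the build is scripted, the ID of the new AMI can be pulled out of the build output. The sketch below assumes the output ends with the artifact line shown above, so the last AMI ID in the output is the newly created one; extract_ami_id is a hypothetical helper name, and it relies on grep's -o option (available in GNU and BSD grep).

```shell
# Hypothetical helper: print the last AMI ID found in Packer's output on stdin.
# Assumes the final AMI ID in the output is the newly created image.
extract_ami_id() {
  # Require at least 8 hex digits so tag names such as "ami-create" do not match.
  grep -Eo 'ami-[0-9a-f]{8,}' | tail -n 1
}
```

For example, after saving the output with packer build ubuntu_docker_ce_ami.json > build.log, running extract_ami_id < build.log prints the ID. If the build fails, the last ID in the log may be the source AMI, so Packer's exit status should be checked first.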
Results
The result is an Ubuntu 16.04 AMI in US East 1 with Docker CE 17.03 installed. To confirm the new AMI is now available, I will use the AWS CLI to examine the resulting AMI:
aws ec2 describe-images \
  --filters Name=tag-key,Values=ami Name=tag-value,Values=ubuntu-xenial-docker-ce-base \
  --query 'Images[*].{ID:ImageId}'
Resulting output:
{
  "Images": [
    {
      "VirtualizationType": "hvm",
      "Name": "ubuntu-xenial-docker-ce-base 1488747081",
      "Tags": [
        {
          "Value": "ubuntu-xenial-docker-ce-base",
          "Key": "ami"
        }
      ],
      "Hypervisor": "xen",
      "SriovNetSupport": "simple",
      "ImageId": "ami-e9ca6eff",
      "State": "available",
      "BlockDeviceMappings": [
        {
          "DeviceName": "/dev/sda1",
          "Ebs": {
            "DeleteOnTermination": true,
            "SnapshotId": "snap-048a16c0250ee3227",
            "VolumeSize": 8,
            "VolumeType": "gp2",
            "Encrypted": false
          }
        },
        {
          "DeviceName": "/dev/sdb",
          "VirtualName": "ephemeral0"
        },
        {
          "DeviceName": "/dev/sdc",
          "VirtualName": "ephemeral1"
        }
      ],
      "Architecture": "x86_64",
      "ImageLocation": "931066906971/ubuntu-xenial-docker-ce-base 1488747081",
      "RootDeviceType": "ebs",
      "OwnerId": "931066906971",
      "RootDeviceName": "/dev/sda1",
      "CreationDate": "2017-03-05T20:53:41.000Z",
      "Public": false,
      "ImageType": "machine",
      "Description": "ubuntu-xenial-docker-ce-base AMI"
    }
  ]
}
Finally, here is the new AMI as seen in the AWS EC2 Management Console:
Terraform
To confirm Docker CE is installed and running, I can provision a new EC2 instance, using HashiCorp Terraform. This post is too short to detail all the Terraform code required to stand up a complete environment. I’ve included the complete code in the GitHub repo for this post. Note, the Terraform code is only used for testing. No security, including the use of properly configured security groups, public/private subnets, and a NAT server, is configured.
Below is a greatly abridged version of the Terraform code I used to provision a new EC2 instance, using Terraform’s aws_instance resource. The resulting EC2 instance should have Docker CE available.
# test-docker-ce instance
resource "aws_instance" "test-docker-ce" {
  connection {
    user        = "ubuntu"
    private_key = "${file("~/.ssh/test-docker-ce")}"
    timeout     = "${connection_timeout}"
  }

  ami                    = "ami-e9ca6eff"
  instance_type          = "t2.nano"
  availability_zone      = "us-east-1a"
  count                  = "1"
  key_name               = "${aws_key_pair.auth.id}"
  vpc_security_group_ids = ["${aws_security_group.test-docker-ce.id}"]
  subnet_id              = "${aws_subnet.test-docker-ce.id}"

  tags {
    Owner       = "Gary A. Stafford"
    Terraform   = true
    Environment = "test-docker-ce"
    Name        = "tf-instance-test-docker-ce"
  }
}
By using the AWS CLI, once again, we can confirm the new EC2 instance was built using the correct AMI:
aws ec2 describe-instances \
  --filters Name='tag:Name,Values=tf-instance-test-docker-ce' \
  --output text --query 'Reservations[*].Instances[*].ImageId'
Resulting output looks good:
ami-e9ca6eff
Finally, here is the new EC2 as seen in the AWS EC2 Management Console:
SSHing into the new EC2 instance, I should observe that the operating system is Ubuntu 16.04.2 LTS and that Docker version 17.03.0-ce is installed and running:
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-64-generic x86_64)
 * Documentation: https://help.ubuntu.com
 * Management: https://landscape.canonical.com
 * Support: https://ubuntu.com/advantage
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
0 packages can be updated.
0 updates are security updates.
Last login: Sun Mar 5 22:06:01 2017 from
ubuntu@ip-:~$ docker --version
Docker version 17.03.0-ce, build 3a232c8
Conclusion
Docker EE and CE represent a significant step forward in expanding Docker’s enterprise-grade toolkit. Replacing or installing Docker EE or CE on your AWS AMIs is easy, using Docker’s guide along with HashiCorp Packer.
All source code for this post can be found on GitHub.
All opinions in this post are my own and not necessarily the views of my current employer or their clients.
Scaffolding Modern Web Applications
Posted by Gary A. Stafford in Client-Side Development, Java Development, Mobile HTML Development, Software Development on November 29, 2015
Scaffold Three Full-Stack Modern Web Application Examples using Yeoman with npm, Bower, and Grunt
Introduction
The capabilities of modern web applications have quickly matched and surpassed those of most traditional desktop applications. However, with the increase in capabilities comes an increase in the architectural complexity of web applications. To help deal with this complexity, developers have benefited from a plethora of popular support libraries, frameworks, APIs, and similar tooling. Examples of these include AngularJS, React.js, Play!, Node.js, Express, npm, Yeoman, Bower, Grunt, and Gulp.
With so many choices, selecting an optimal toolset to construct a modern web application can be overwhelming. In this post, we will examine three distinct modern web application architectures, predominantly JavaScript-based. The post will discuss how to select and install the required tools, and how to use those tools to scaffold the applications. By the end of the post, we will have three working web applications, ready for further development.
Generators
The post’s examples use Yeoman generators. What is a generator? According to Wikipedia, Yeoman’s generator concept was inspired by Ruby on Rails. Using a generator’s blueprint, Yeoman scaffolds an application’s project directory and file structure, and installs required vendor libraries and dependencies. Yeoman generators usually run interactively, guiding the developer through a series of configuration questions. The configuration choices determine the project’s physical structure and components installed by Yeoman.
Generators are inherently opinionated. They dictate a particular application architecture they feel represents current best practices. However, good generators also allow developers to select from a range of architectural choices to meet the requirements of a developer’s environment. For example, a generator might allow Grunt or Gulp for task automation, or allow either Less or Sass for styling the UI. Similar to npm, RubyGems, Bower, Docker Hub, and Puppet Forge, Yeoman provides a searchable public repository. This allows developers to choose from a variety of generators to meet their specific needs.
Preparing the Development Environment
The examples in this post were built on Mac OS X. However, all the tools discussed in the post are available on the three major platforms, Linux, Mac, and Windows. Installation and configuration will vary.
An IDE is not required to scaffold the post’s application examples. However, for further development of the applications, I strongly recommend JetBrains’ WebStorm. According to Slant, WebStorm is a popular, highly rated IDE for building modern web applications. WebStorm is available on all three major platforms. A paid license is required, but well worth the reasonable investment based on the IDE’s rich feature set. WebStorm is integrated with many popular JavaScript frameworks. Additionally, there are hundreds of plug-ins available to extend WebStorm’s functionality.
Each of the post’s examples varies, architecturally. However, each also shares several common components, which we will install. They include:
npm
We will use npm (aka Node Package Manager), a leading server-side package manager, to manage the application’s server-side JavaScript dependencies.
Node.js
We will use Node.js, a JavaScript runtime, to power the JavaScript-based web application examples.
Bower
Similar to npm, Bower is a popular client-side package manager. We will use Bower to manage the application’s client-side JavaScript dependencies.
Yeoman
We will use Yeoman, a leading web application scaffolding tool, to quickly build the frameworks for the example applications based on best practices and tooling.
Grunt
We will use Grunt, a leading JavaScript task runner, to automate common tasks such as minification, compilation, unit-testing, linting, and packaging of applications for deployment. At least two of the three examples also offer Gulp as an alternative.
Express
We will use Express, a Node.js web application framework that provides a robust set of features for web applications development, to support our example applications.
MongoDB
We will use MongoDB, a leading open-source NoSQL document database, for all three examples. For two of the examples, you can easily substitute alternate databases, such as MySQL, when configuring the application with Yeoman. The choice of database is of secondary importance in this post.
First install Node.js, which comes packaged with npm. Then, use npm to install Bower, Yeoman, and Grunt. Make sure you run the command to fix the permissions for npm. If permissions are set correctly, you should not have to use sudo with your npm commands.
The global mode option (-g) installs packages globally. Packages are usually installed globally only if they are used as command line tools, as with Bower, Yeoman, and Grunt.
The easiest way to install Node.js and npm on OS X, is with Homebrew:
# install homebrew and confirm
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew update
brew doctor
export PATH="/usr/local/bin:$PATH"
brew --version
# install node and npm with brew and confirm
brew install node
node --version; npm --version
Alternatively, install Node and npm by downloading the package installer for Mac OS X (x64) from nodejs.org. If you are not currently a Homebrew user, I suggest this route. At the time of this post, you have the choice of Node.js version v4.2.3 LTS or v5.1.1 Stable.
Fix the potential permission problem with npm, so sudo is not required:
sudo chown -R `whoami` $(npm config get prefix)
Then, install Yeoman, Bower, and Grunt, globally:
npm install -g bower yo grunt-cli
bower --version; yo --version; grunt --version
Lastly, install MongoDB. Similar to Node, we can use Homebrew, or download and install MongoDB from the MongoDB website. Review this page for detailed installation and configuration instructions. To install MongoDB with Homebrew, we would issue the following commands:
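brew update
# the Homebrew formula is named mongodb; it installs the mongod server binary
brew install mongodb
mongod --version
# create the default data directory and make it writable by the current user
sudo mkdir -p /data/db
sudo chown -R `whoami` /data/db/
mongod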
Example #1: Server-Centric Express Application
For our first example, we will scaffold a server-centric JavaScript web application using Pete Cooper’s Express Generator (v2.9.2). According to the project’s GitHub site, the Express Generator is ‘An Expressjs generator for Yeoman, based on the express command line tool.’ I suggest reading the project’s documentation before continuing; it describes the generator’s functionality in greater detail.
To install, download the Express Generator with npm, and install with Yeoman, as follows:
npm install -g generator-express
yo express
As part of the Express Generator’s configuration process, Yeoman will ask a series of configuration questions. The Express generator offers several choices for scaffolding the application. For this example, we will choose the following options: MVC, Marko, Sass, MongoDB, and Grunt.
For those not as familiar with developing full-stack JavaScript applications, some of the generator’s choices may be unfamiliar, such as view engines, CSS preprocessors, and build tools. For this example, we will select Marko, a highly regarded JavaScript templating engine (aka view engine), for the first application. You can compare different engines on Slant. For CSS preprocessors, you can also refer to Slant for a comparison of leading candidates. We will choose Sass.
Lastly, for a build tool (aka task runner) we will choose Grunt. Grunt and Gulp are the two most popular choices. Either is a proven tool for automation tasks such as minification, compilation, unit-testing, linting, and packaging applications for deployment.
As shown below, Yeoman creates a series of files and directories and installs JavaScript libraries with npm and Bower. Choices are based on best practices, as prescribed by the generator.
npm
Yeoman uses npm and Bower to install the generator’s required packages. Based on our five configuration choices for the Express Generator, npm installed over 225 packages in the project’s local node_modules directory. This includes primary and secondary npm package dependencies. For example, Marko, one of our choices, has 24 dependencies of its own. In turn, each of those packages may have more dependencies. You quickly see why npm, and other similar package managers, are invaluable to building and managing a modern web application. The npm dependencies are declared in the package.json file, in the project’s root directory.

Partial List of npm Packages Installed
We will still need to install a few more items. We chose Sass as an Express generator option. Sass requires Ruby, which comes preinstalled on Mac OS X. If you wish, you can upgrade your pre-installed version of Ruby with Homebrew, but it is not required. Sass is installed with RubyGems, a package manager for Ruby. To automate the Sass-related tasks with Grunt, we also need to install the Grunt plug-in for Sass, grunt-contrib-sass, using npm:
# optional install updated version of Ruby and confirm
brew install ruby
ruby --version
# install grunt-contrib-sass
npm install grunt-contrib-sass --save-dev
# install sass and confirm
gem install sass
gem list sass
The Express Generator’s tests are written in Mocha. Mocha is an asynchronous JavaScript test framework running on Node.js. The website suggests installing Mocha globally with npm. Mocha can be run from the command line:
npm install -g mocha
mocha --version
Up and Running
Simply running the grunt command will start the default Generator-Express MVC application. While in development, I prefer to run Grunt with the debug option (grunt -d) and/or with the verbose option (grunt -v or grunt -dv). These options offer enhanced feedback, especially on which tasks are run by Grunt. Review the terminal output to make sure the application started properly.
To confirm the Express application started correctly, in a second terminal window, curl the application using curl -I localhost:3000. Easier yet, point your web browser to localhost:3000. You should see the following default web page.
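Because Grunt takes a moment to start the application, a single curl call may run before the server is listening. A minimal sketch of a polling check follows; wait_for_http is a hypothetical helper, and the retry count and delay are arbitrary defaults rather than values from the generators.

```shell
# Hypothetical helper: poll a URL with HEAD requests until it responds.
# Usage: wait_for_http URL [TRIES] [DELAY_SECONDS]
wait_for_http() {
  url=$1
  tries=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: silent, -f: treat HTTP errors as failures, -I: HEAD request
    if curl -sfI --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For the Express example, wait_for_http http://localhost:3000 && echo "application is up" would wait up to roughly 30 seconds. The same helper works for any other local port an application listens on.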
Example #2: Full-Stack MEAN Application
In our second example, we will scaffold a true full-stack JavaScript web application using Tyler Henkel’s popular AngularJS Full-Stack Generator for Yeoman (v3.0.2). According to the project’s GitHub site, the generator is a ‘Yeoman generator for creating MEAN stack applications, using MongoDB, Express, AngularJS, and Node – lets you quickly set up a project following best practices.’ As with the earlier example, I suggest you read the project’s documentation before continuing.
To install the AngularJS Full-Stack Generator, download the generator with npm and install with Yeoman:
npm install -g generator-angular-fullstack
mkdir demo-web-app-2; cd $_
yo angular-fullstack meanapp
Similar to the Express example, Yeoman will ask a series of configuration questions. We will choose the following options: Jade, Less, ngRoute, Bootstrap, UI Bootstrap, and MongoDB with Mongoose. AngularJS, Express, and Grunt are installed by the generator, automatically. For the sake of brevity, we will not include other available options, including Babel for ES6, OAuth authentication, or socket.io.

AngularJS Full-Stack Generator Config Options
PhantomJS
After generating the AngularJS Full-Stack project, I received errors regarding PhantomJS. According to several sources, this is not uncommon. The AngularJS Full-Stack project uses PhantomJS as the default browser for Karma, the popular test runner, designed by the AngularJS team. Although npm installed PhantomJS locally, as part of the project, Karma complained about missing the path to the PhantomJS binary. To eliminate the issue, I installed PhantomJS globally with npm. I then manually added the PhantomJS binary path to the $PATH environment variable:
npm install -g phantomjs
sudo vi ~/.bash_profile
# add PHANTOMJS_BIN environment variable (next two lines)
export PHANTOMJS_BIN=/usr/local/lib/node_modules/phantomjs/bin
# your file/line may vary
export PATH=${PHANTOMJS_BIN}:$PATH
phantomjs --version
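One drawback of prepending to PATH in a profile file is that the entry is added again each time the file is sourced. A minimal sketch of an idempotent alternative follows; path_prepend is a hypothetical helper, not part of PhantomJS or the generator.

```shell
# Hypothetical helper: prepend a directory to PATH only if it is not already listed.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;            # already present, leave PATH unchanged
    *) PATH="$1:$PATH" ;;
  esac
}
```

In ~/.bash_profile, path_prepend "$PHANTOMJS_BIN" followed by export PATH would replace the plain export PATH line above. Wrapping PATH in colons before matching means only whole entries count, so a directory that merely shares a prefix with an existing entry is still added.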
To test Karma with PhantomJS, run the grunt test command. This should result in error-free output, similar to the following.
Client/Server Architecture
Similar to the previous example, Yeoman creates a series of files and directories, and installs JavaScript packages on the server and client sides with npm and Bower.
Both the Express and AngularJS examples share several common files and directories. However, one major difference between the two is the client/server oriented directory structure of the AngularJS generator. Unlike the Express example, the AngularJS example has both a client and a server directory. The server-side of the application (aka back-end) is driven primarily by Express and Node. Mongoose serves as an interface between our application’s domain model and MongoDB, on the server-side. Also, on the server-side, Jade is used for HTML templating. The client-side of the application (aka front-end) is driven primarily by AngularJS. Twitter’s Bootstrap and UI Bootstrap offer a responsive web interface for our example application.
Up and Running
Running the grunt serve command will eventually start the default AngularJS Full-Stack application, after running a series of pre-defined Grunt tasks.

AngularJS Full-Stack App Starting with Grunt
Review the terminal output to make sure the application started properly. You may see some warnings, suggesting the installation of several dependencies globally. You may also see warnings about dependency versions being outdated. Outdated versions are one of the challenges with generators that are not continually refreshed and tested with the latest package dependencies. Warnings shouldn’t prevent the application from starting; only errors should.

AngularJS Full-Stack App Started with Grunt
To confirm the application started, in a second terminal window, curl the application using curl -I localhost:9000. Easier yet, point your web browser at localhost:9000. The default web page for the AngularJS Full-Stack web application is much more elaborate than the previous Express example. This is thanks to the Bootstrap and AngularJS client-side components.

AngularJS Full-Stack App Running in Browser
Additional Generator Features
The AngularJS Full-Stack generator is capable of generating more than just the default application project. The AngularJS Full-Stack generator contains a set of generators. Beyond generating the basic application framework, you may use the generator to create boilerplate code for AngularJS and Node.js components for endpoints, services, routes, models, controllers, directives, and filters. The generators also provide the ability to prepare your application for deployment to OpenStack and Heroku.
The best place to review available options for the generators is on the GitHub sites. You can display a high-level list of the generator’s features using the yo --help command. Below are the three generators used in this post.
Below, is an example of generating additional application components using the AngularJS Full-Stack generator. First scaffold a server-side Express RESTful API endpoint, called ‘user’. The single command generates a server-side directory structure and several boilerplate files, including an Express model, controller, and router, and Mocha tests.

List of Installed Yeoman Generators
Next, generate a client-side AngularJS service, which connects to the server-side Express RESTful ‘user’ endpoint above. The command creates a boilerplate AngularJS service and Mocha test. Lastly, create an AngularJS route. This generator command creates a boilerplate AngularJS route and controller, Mocha test, Jade view template, and Less file.

AngularJS Full-Stack Component Generators
Example #3: Java Hipster Application
In the third and last example, we will scaffold another full-stack web application. However, this time, we will use a generator that relies on Java EE as the primary development platform on the server-side, as opposed to JavaScript. JavaScript will be relegated largely to the client-side.
Again, we will use a Yeoman generator, JHipster, built by Julien Dubois and team, to scaffold the application (v2.25.0). According to the project’s GitHub site, JHipster uses a robust server-side Java EE stack with Spring Boot and Maven. JHipster’s mobile-first front-end is enabled with AngularJS and Bootstrap. Being the most complex of the three examples, it’s important to review the project documentation.
JHipster offers three ways to install the application: 1) locally, 2) in a Docker container, or 3) in a Vagrant VM. We will install the application framework locally so as not to introduce additional complexity. To install the generator, download the JHipster generator with npm, and install with Yeoman:
npm install -g generator-jhipster
mkdir demo-web-app-3; cd $_
yo jhipster
Again, Yeoman will ask a series of configuration questions on behalf of JHipster. For this example, we will choose the following options: token-based authentication, MongoDB, Maven, Grunt, LibSass (Sass), and Gatling for testing. AngularJS and Bootstrap are installed automatically. We have chosen not to include other configuration options in this example, such as Angular Translate, WebSockets, and clustered HTTP sessions.
Once Yeoman finishes scaffolding the application, you should see the following output.
Maven Project Structure
The file and directory structure of JHipster is very different from the previous two examples. The first two examples’ project structures are typical of a JavaScript project. In contrast, the JHipster example’s structure is more typical of a Maven-based Java project. In the JHipster project, the client-side JavaScript files are in the /src/main/webapp/ directory. The presence of the webapp directory is based, in part, on the project’s reliance on the Spring MVC web framework. Additionally, npm has loaded required server-side JavaScript packages into the node_modules directory, in the project’s root directory.
Up and Running
Running the mvn command will start the default JHipster application. The URL for the JHipster application is included in the terminal output.
To confirm the application has started, curl the application in a second terminal window, using curl -I localhost:8080. Easier yet, point your web browser to localhost:8080. Again, thanks to Bootstrap and AngularJS, the application presents a rich client UI.
Conclusion
The post’s examples represent a narrow sampling of available modern web application stacks, which can be easily scaffolded with generators. The JavaScript space continues to evolve rapidly. Even within the realm of JavaScript-based solutions, we didn’t examine several other popular frameworks, such as Meteor, Facebook’s ReactJS, Ember, Backbone, and Polymer. They are all worth exploring, along with the hundreds of popular supporting frameworks, libraries, and APIs.
Useful Links
- Post: JavaScript Frameworks: The Best 10 for Modern Web Apps (link)
- Post: Best Web Frameworks (link)
- Post: State Of Web Development 2014 (link)
- YouTube: Modern Front-end Engineering (link)
- YouTube: WebStorm – Things You Probably Didn’t Know (link)
- Website: MEAN Stack (link)
- Website: Full Stack Python (link)
- Post: Best Full Stack Web Framework (link)
Installing Foreman and Puppet Agent on Multiple VMs Using Vagrant and VirtualBox
Posted by Gary A. Stafford in DevOps, Enterprise Software Development, Software Development on January 18, 2015
Automatically install and configure Foreman, the open source infrastructure lifecycle management tool, and multiple Puppet Agent VMs using Vagrant and VirtualBox.
Introduction
In the last post, Installing Puppet Master and Agents on Multiple VM Using Vagrant and VirtualBox, we installed Puppet Master/Agent on VirtualBox VMs using Vagrant. Puppet Master is an excellent tool, but lacks the ease-of-use of Puppet Enterprise or Foreman. In this post, we will build an almost identical environment, substituting Foreman for Puppet Master.
According to Foreman’s website, “Foreman is an open source project that helps system administrators manage servers throughout their lifecycle, from provisioning and configuration to orchestration and monitoring. Using Puppet or Chef and Foreman’s smart proxy architecture, you can easily automate repetitive tasks, quickly deploy applications, and proactively manage change, both on-premise with VMs and bare-metal or in the cloud.”
Combined with Puppet Labs’ Open Source Puppet, Foreman is an effective solution to manage infrastructure and system configuration. Again, according to Foreman’s website, the Foreman installer is a collection of Puppet modules that installs everything required for a full working Foreman setup. The installer uses native OS packaging and adds necessary configuration for the complete installation. By default, the Foreman installer will configure:
- Apache HTTP with SSL (using a Puppet-signed certificate)
- Foreman running under mod_passenger
- Smart Proxy configured for Puppet, TFTP and SSL
- Puppet master running under mod_passenger
- Puppet agent configured
- TFTP server (under xinetd on Red Hat platforms)
For the average Systems Engineer or Software Developer, installing and configuring Foreman, Puppet Master, Apache, Puppet Agent, and the other associated software packages listed above is daunting. If the installation doesn’t work properly, you are stuck troubleshooting, or trying to remove and reinstall some or all of the components.
A better solution is to automate the installation of Foreman into a Docker container, or on to a VM using Vagrant. Automating the installation process guarantees accuracy and consistency. The Vagrant VirtualBox VM can be snapshotted, moved to another host, or simply destroyed and recreated, if needed.
All code for this post is available on GitHub. However, it has been updated as of 8/23/2015. Changes were required to fix compatibility issues with the latest versions of Puppet 4.x and Foreman. Additionally, the version of CentOS on all VMs was updated from 6.6 to 7.1 and the version of Foreman was updated from 1.7 to 1.9.
The Post’s Example
In this post, we will use Vagrant and VirtualBox to create three VMs. The VMs in this post will be built from a standard CentOS 6.5 x64 base Vagrant Box, located on Atlas. We will use a single JSON-format configuration file to automatically build all three VMs with Vagrant. As part of the provisioning process, using Vagrant’s shell provisioner, we will execute a bootstrap shell script. The script will install Foreman and its associated software on the first VM, and Puppet Agent on the two remaining VMs (aka Puppet ‘agent nodes’ or Foreman ‘hosts’).
Foreman does have the ability to provision on bare-metal infrastructure and public or private clouds. However, this example simulates an environment where you have existing nodes you want to manage with Foreman.
The Foreman bootstrap script will also download several Puppet modules. To test Foreman once the provisioning is complete, import those modules’ classes into Foreman and assign the classes to the hosts. The hosts will fetch and apply the configurations. You can then test for the installed instances of those modules’ components on the puppet agent hosts.
Vagrant
To begin the process, we will use the JSON-format configuration file to create the three VMs, using Vagrant and VirtualBox.
{
  "nodes": {
    "theforeman.example.com": {
      ":ip": "192.168.35.5",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-foreman.sh"
    },
    "agent01.example.com": {
      ":ip": "192.168.35.10",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    },
    "agent02.example.com": {
      ":ip": "192.168.35.20",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    }
  }
}
The Vagrantfile uses the JSON-format configuration file to provision the three VMs, using a single ‘vagrant up’ command. That’s it, less than 30 lines of actual code in the Vagrantfile to create as many VMs as you want. For this post’s example, we will not need to add any VirtualBox port mappings. However, that can also be done from the JSON configuration file (see the README.md for more directions).
If you have not used the CentOS Vagrant Box, it will take a few minutes the first time for Vagrant to download it to the local Vagrant Box repository.
# -*- mode: ruby -*-
# vi: set ft=ruby :

# Builds single Foreman server and
# multiple Puppet Agent Nodes using JSON config file
# Gary A. Stafford - 01/15/2015

# read vm and chef configurations from JSON files
nodes_config = (JSON.parse(File.read("nodes.json")))['nodes']

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "chef/centos-6.5"

  nodes_config.each do |node|
    node_name   = node[0] # name of node
    node_values = node[1] # content of node

    config.vm.define node_name do |config|
      # configures all forwarding ports in JSON array
      ports = node_values['ports']
      ports.each do |port|
        config.vm.network :forwarded_port,
          host:  port[':host'],
          guest: port[':guest'],
          id:    port[':id']
      end

      config.vm.hostname = node_name
      config.vm.network :private_network, ip: node_values[':ip']

      config.vm.provider :virtualbox do |vb|
        vb.customize ["modifyvm", :id, "--memory", node_values[':memory']]
        vb.customize ["modifyvm", :id, "--name", node_name]
      end

      config.vm.provision :shell, :path => node_values[':bootstrap']
    end
  end
end
Once provisioned, the three VMs, also called ‘Machines’ by Vagrant, should appear in Oracle VM VirtualBox Manager.
The name of each VM, referenced in Vagrant commands, is the parent node name in the JSON configuration file (node_name), such as ‘vagrant ssh theforeman.example.com’.
Bootstrapping Foreman
As part of the Vagrant provisioning process (‘vagrant up’ command), a bootstrap script is executed on the VMs (shown below). This script will do almost all of the installation and configuration work. Below is the script for bootstrapping the Foreman VM.
#!/bin/sh

# Run on VM to bootstrap Foreman server
# Gary A. Stafford - 01/15/2015

if ps aux | grep "/usr/share/foreman" | grep -v grep 2> /dev/null
then
    echo "Foreman appears to all already be installed. Exiting..."
else
    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.35.5 theforeman.example.com theforeman" | sudo tee --append /etc/hosts 2> /dev/null

    # Update system first
    sudo yum update -y

    # Install Foreman for CentOS 6
    sudo rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm && \
    sudo yum -y install epel-release http://yum.theforeman.org/releases/1.7/el6/x86_64/foreman-release.rpm && \
    sudo yum -y install foreman-installer && \
    sudo foreman-installer

    # First run the Puppet agent on the Foreman host which will send the first Puppet report to Foreman,
    # automatically creating the host in Foreman's database
    sudo puppet agent --test --waitforcert=60

    # Install some optional puppet modules on Foreman server to get started...
    sudo puppet module install -i /etc/puppet/environments/production/modules puppetlabs-ntp
    sudo puppet module install -i /etc/puppet/environments/production/modules puppetlabs-git
    sudo puppet module install -i /etc/puppet/environments/production/modules puppetlabs-docker
fi
Bootstrapping Puppet Agent Nodes
Below is the script for bootstrapping the puppet agent nodes. The agent node bootstrap script is executed as part of the Vagrant provisioning process.
#!/bin/sh

# Run on VM to bootstrap Puppet Agent nodes
# Gary A. Stafford - 01/15/2015

if ps aux | grep "puppet agent" | grep -v grep 2> /dev/null
then
    echo "Puppet Agent is already installed. Moving on..."
else
    # Update system first
    sudo yum update -y

    # Install Puppet for CentOS 6
    sudo rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm && \
    sudo yum -y install puppet

    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.35.5 theforeman.example.com theforeman" | sudo tee --append /etc/hosts 2> /dev/null

    # Add agent section to /etc/puppet/puppet.conf (sets run interval to 120 seconds)
    echo "" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null && \
    echo " server = theforeman.example.com" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null && \
    echo " runinterval = 120" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null

    sudo service puppet stop
    sudo service puppet start

    sudo puppet resource service puppet ensure=running enable=true
    sudo puppet agent --enable
fi
Now that Foreman is running, use the command, ‘vagrant ssh agent01.example.com’, to ssh into the first puppet agent node. Run the command below.
sudo puppet agent --test --waitforcert=60
The command above manually starts Puppet’s Certificate Signing Request (CSR) process, to generate the certificates and security credentials (private and public keys) issued by Puppet’s built-in certificate authority (CA). Each puppet agent node must first have its certificate signed by Foreman. According to Puppet’s website, “Before puppet agent nodes can retrieve their configuration catalogs, they need a signed certificate from the local Puppet certificate authority (CA). When using Puppet’s built-in CA (that is, not using an external CA), agents will submit a certificate signing request (CSR) to the CA Puppet Master (Foreman) and will retrieve a signed certificate once one is available.”
Open the Foreman browser-based interface, running at https://theforeman.example.com. Proceed to the ‘Infrastructure’ -> ‘Smart Proxies’ tab. Sign the certificate(s) from the agent nodes (shown below). The agent node will wait for Foreman to sign the certificate, before continuing with the initial configuration.
Once the certificate signing process is complete, each host retrieves its client configuration from Foreman and applies it.
That’s it, you should now have one host running Foreman and two puppet agent nodes.
Testing Foreman
To test Foreman, import the classes from the Puppet modules installed with the Foreman bootstrap script.
Next, apply ntp, git, and Docker classes to both agent nodes (aka, Foreman ‘hosts’), as well as the Foreman node, itself.
Every two minutes, the two agent node hosts should fetch their latest configuration from Foreman and apply it. In a few minutes, check the times reported in the ‘Last report’ column on the ‘All Hosts’ tab. If the times are two minutes or less, Foreman and Puppet Agent are working. Note we changed the runinterval to 120 seconds (‘120s’) in the bootstrap script to speed up the Puppet Agent updates for the sake of the demo. The normal default interval is 30 minutes. I recommend changing the agent nodes’ runinterval back to 30 minutes (‘30m’) on the hosts once everything is working, to avoid unnecessary use of resources.
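As a rough sketch of that last step, the snippet below rewrites the runinterval line in a puppet.conf-style file. The helper name set_runinterval is hypothetical, and the sketch assumes the setting was appended to /etc/puppet/puppet.conf as in the agent bootstrap script above, and that the caller has permission to write the file.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the runinterval setting in a puppet.conf-style file.
# Usage: set_runinterval <conf-file> <value>   (value must not contain '/')
set_runinterval() {
    conf="$1"
    value="$2"
    tmp="$(mktemp)" || return 1
    # Replace any existing runinterval line, preserving its leading whitespace;
    # all other lines are copied through unchanged
    sed "s/^\([[:space:]]*\)runinterval[[:space:]]*=.*/\1runinterval = ${value}/" "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

On an agent node, this might be called as root with set_runinterval /etc/puppet/puppet.conf 30m, followed by restarting the puppet service so the agent picks up the new interval.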
Finally, to verify that the configuration was successfully applied to the hosts, check that ntp, git, and Docker are now installed on the hosts.
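One way to script that check is sketched below. The helper name check_installed is hypothetical, and the binary names ntpd, git, and docker in the example call are assumptions about what the three modules install on these hosts.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named commands are available on PATH.
# Returns 0 if all were found, 1 if any is missing.
check_installed() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" > /dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Example call on an agent node (binary names are assumptions):
#   check_installed ntpd git docker
```

Run it after ssh-ing into each host, for example with ‘vagrant ssh agent01.example.com’.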
Helpful Links
All the source code for this project is on GitHub.
Foreman:
http://theforeman.org
Atlas – Discover Vagrant Boxes:
https://atlas.hashicorp.com/boxes/search
Learning Puppet – Basic Agent/Master Puppet:
https://docs.puppetlabs.com/learning/agent_master_basic.html
Puppet Glossary (of terms):
https://docs.puppetlabs.com/references/glossary.html
Installing Puppet Master and Agents on Multiple VM Using Vagrant and VirtualBox
Posted by Gary A. Stafford in Bash Scripting, Build Automation, DevOps, Enterprise Software Development, Software Development on December 14, 2014
Automatically provision multiple VMs with Vagrant and VirtualBox. Automatically install, configure, and test Puppet Master and Puppet Agents on those VMs.
Introduction
Note this post and accompanying source code were updated on 12/16/2014 to v0.2.1. The update contains several changes to improve and simplify the install process.
Puppet Labs’ Open Source Puppet Agent/Master architecture is an effective solution to manage infrastructure and system configuration. However, for the average System Engineer or Software Developer, installing and configuring Puppet Master and Puppet Agent can be challenging. If the installation doesn’t work properly, the engineer’s stuck troubleshooting, or trying to remove and re-install Puppet.
A better solution is to automate the installation of Puppet Master and Puppet Agent on Virtual Machines (VMs). Automating the installation process guarantees accuracy and consistency. Installing Puppet on VMs means the VMs can be snapshotted, cloned, or simply destroyed and recreated, if needed.
In this post, we will use Vagrant and VirtualBox to create three VMs. The VMs will be built from an Ubuntu 14.04.1 LTS (Trusty Tahr) Vagrant Box, previously on Vagrant Cloud, now on Atlas. We will use a single JSON-format configuration file to build all three VMs, automatically. As part of the Vagrant provisioning process, we will run a bootstrap shell script to install Puppet Master on the first VM (Puppet Master server) and Puppet Agent on the two remaining VMs (agent nodes).
Lastly, to test our Puppet installations, we will use Puppet to install some basic Puppet modules, including ntp and git on the server, and ntp, git, Docker and Fig, on the agent nodes.
All the source code for this project is on GitHub.
Vagrant
To begin the process, we will use the JSON-format configuration file to create the three VMs, using Vagrant and VirtualBox.
{
  "nodes": {
    "puppet.example.com": {
      ":ip": "192.168.32.5",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-master.sh"
    },
    "node01.example.com": {
      ":ip": "192.168.32.10",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    },
    "node02.example.com": {
      ":ip": "192.168.32.20",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    }
  }
}
The Vagrantfile uses the JSON-format configuration file to provision the three VMs, using a single ‘vagrant up’ command. That’s it, less than 30 lines of actual code in the Vagrantfile to create as many VMs as we need. For this post’s example, we will not need to add any port mappings, which can be done from the JSON configuration file (see the README.md for more directions). The Vagrant Box we are using already has the correct ports opened.
If you have not previously used the Ubuntu Vagrant Box, it will take a few minutes the first time for Vagrant to download it to the local Vagrant Box repository.
# vi: set ft=ruby :

# Builds Puppet Master and multiple Puppet Agent Nodes using JSON config file
# Author: Gary A. Stafford

# read vm and chef configurations from JSON files
nodes_config = (JSON.parse(File.read("nodes.json")))['nodes']

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "ubuntu/trusty64"

  nodes_config.each do |node|
    node_name   = node[0] # name of node
    node_values = node[1] # content of node

    config.vm.define node_name do |config|
      # configures all forwarding ports in JSON array
      ports = node_values['ports']
      ports.each do |port|
        config.vm.network :forwarded_port,
          host:  port[':host'],
          guest: port[':guest'],
          id:    port[':id']
      end

      config.vm.hostname = node_name
      config.vm.network :private_network, ip: node_values[':ip']

      config.vm.provider :virtualbox do |vb|
        vb.customize ["modifyvm", :id, "--memory", node_values[':memory']]
        vb.customize ["modifyvm", :id, "--name", node_name]
      end

      config.vm.provision :shell, :path => node_values[':bootstrap']
    end
  end
end
Once provisioned, the three VMs, also referred to as ‘Machines’ by Vagrant, should appear, as shown below, in Oracle VM VirtualBox Manager.
The name of each VM, referenced in Vagrant commands, is the parent node name in the JSON configuration file (node_name), such as ‘vagrant ssh puppet.example.com’.
Bootstrapping Puppet Master Server
As part of the Vagrant provisioning process, a bootstrap script is executed on each of the VMs (script shown below). This script will do 98% of the required work for us. There is one for the Puppet Master server VM, and one for each agent node.
#!/bin/sh

# Run on VM to bootstrap Puppet Master server

if ps aux | grep "puppet master" | grep -v grep 2> /dev/null
then
    echo "Puppet Master is already installed. Exiting..."
else
    # Install Puppet Master
    wget https://apt.puppetlabs.com/puppetlabs-release-trusty.deb && \
    sudo dpkg -i puppetlabs-release-trusty.deb && \
    sudo apt-get update -yq && sudo apt-get upgrade -yq && \
    sudo apt-get install -yq puppetmaster

    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "# Host config for Puppet Master and Agent Nodes" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.5 puppet.example.com puppet" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.10 node01.example.com node01" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.20 node02.example.com node02" | sudo tee --append /etc/hosts 2> /dev/null

    # Add optional alternate DNS names to /etc/puppet/puppet.conf
    sudo sed -i 's/.*\[main\].*/&\ndns_alt_names = puppet,puppet.example.com/' /etc/puppet/puppet.conf

    # Install some initial puppet modules on Puppet Master server
    sudo puppet module install puppetlabs-ntp
    sudo puppet module install garethr-docker
    sudo puppet module install puppetlabs-git
    sudo puppet module install puppetlabs-vcsrepo
    sudo puppet module install garystafford-fig

    # symlink manifest from Vagrant synced folder location
    ln -s /vagrant/site.pp /etc/puppet/manifests/site.pp
fi
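The bootstrap script relies on a single process check to avoid appending the /etc/hosts entries twice. As a hedged alternative, a per-line guard could make each append safe to repeat on its own. The helper name append_once is hypothetical and is not part of the post’s source code; the sketch assumes it runs with permission to write the target file.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
# Usage: append_once <line> <file>
append_once() {
    line="$1"
    file="$2"
    # -x: match the whole line, -F: treat the pattern as a fixed string, -q: quiet
    grep -qxF -- "$line" "$file" 2> /dev/null || printf '%s\n' "$line" >> "$file"
}

# Example call, run as root (entry taken from the bootstrap script):
#   append_once "192.168.32.5 puppet.example.com puppet" /etc/hosts
```

With a guard like this, re-running the provisioner would leave existing entries untouched instead of duplicating them.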
There are a few last commands we need to run ourselves, from within the VMs. Once the provisioning process is complete, ‘vagrant ssh puppet.example.com’ into the newly provisioned Puppet Master server. Below are the commands we need to run within the ‘puppet.example.com’ VM.
sudo service puppetmaster status # test that puppet master was installed
sudo service puppetmaster stop
sudo puppet master --verbose --no-daemonize
# Ctrl+C to kill puppet master
sudo service puppetmaster start
sudo puppet cert list --all # check for 'puppet' cert
According to Puppet’s website, ‘these steps will create the CA certificate and the puppet master certificate, with the appropriate DNS names included.’
Bootstrapping Puppet Agent Nodes
Now that the Puppet Master server is running, open a second terminal tab (‘Shift+Ctrl+T’). Use the command, ‘vagrant ssh node01.example.com’, to ssh into the new Puppet Agent node. The agent node bootstrap script should have already executed as part of the Vagrant provisioning process.
#!/bin/sh

# Run on VM to bootstrap Puppet Agent nodes
# http://blog.kloudless.com/2013/07/01/automating-development-environments-with-vagrant-and-puppet/

if ps aux | grep "puppet agent" | grep -v grep 2> /dev/null
then
    echo "Puppet Agent is already installed. Moving on..."
else
    sudo apt-get install -yq puppet
fi

if cat /etc/crontab | grep puppet 2> /dev/null
then
    echo "Puppet Agent is already configured. Exiting..."
else
    sudo apt-get update -yq && sudo apt-get upgrade -yq

    sudo puppet resource cron puppet-agent ensure=present user=root minute=30 \
        command='/usr/bin/puppet agent --onetime --no-daemonize --splay'

    sudo puppet resource service puppet ensure=running enable=true

    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "# Host config for Puppet Master and Agent Nodes" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.5 puppet.example.com puppet" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.10 node01.example.com node01" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.20 node02.example.com node02" | sudo tee --append /etc/hosts 2> /dev/null

    # Add agent section to /etc/puppet/puppet.conf
    echo "" && echo "[agent]\nserver=puppet" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null

    sudo puppet agent --enable
fi
Run the two commands below within both the ‘node01.example.com’ and ‘node02.example.com’ agent nodes.
sudo service puppet status # test that agent was installed
sudo puppet agent --test --waitforcert=60 # initiate certificate signing request (CSR)
The second command above will manually start Puppet’s Certificate Signing Request (CSR) process, to generate the certificates and security credentials (private and public keys) issued by Puppet’s built-in certificate authority (CA). Each Puppet Agent node must first have its certificate signed by the Puppet Master. According to Puppet’s website, “Before puppet agent nodes can retrieve their configuration catalogs, they need a signed certificate from the local Puppet certificate authority (CA). When using Puppet’s built-in CA (that is, not using an external CA), agents will submit a certificate signing request (CSR) to the CA Puppet Master and will retrieve a signed certificate once one is available.”
Back on the Puppet Master Server, run the following commands to sign the certificate(s) from the agent node(s). You may sign each node’s certificate individually, or wait and sign them all at once. Note the agent node(s) will wait for the Puppet Master to sign the certificate, before continuing with the Puppet Agent configuration run.
sudo puppet cert list # should see 'node01.example.com' cert waiting for signature
sudo puppet cert sign --all # sign the agent node certs
sudo puppet cert list --all # check for signed certs
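The commands above sign every pending certificate at once. To sign each node’s certificate individually instead, a small loop around ‘puppet cert sign’ could be used. This is only a sketch; the helper name sign_nodes is hypothetical, and it assumes the named nodes already have requests waiting on the Puppet Master.

```shell
#!/bin/sh
# Hypothetical helper: sign the pending certificate of each named agent node,
# one at a time, stopping at the first failure.
# Usage: sign_nodes <certname> [<certname> ...]
sign_nodes() {
    for node in "$@"; do
        sudo puppet cert sign "$node" || return 1
    done
}

# Example call on the Puppet Master server:
#   sign_nodes node01.example.com node02.example.com
```

Signing by name makes it explicit which agent nodes are being trusted, at the cost of listing each one.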
Once the certificate signing process is complete, the Puppet Agent retrieves the client configuration from the Puppet Master and applies it to the local agent node. The Puppet Agent will execute all applicable steps in the site.pp manifest on the Puppet Master server, designated for that specific Puppet Agent node (i.e., ‘node node02.example.com {...}’).
Below is the main site.pp manifest on the Puppet Master server, applied by Puppet Agent on the agent nodes.
node default {
  # Test message
  notify { "Debug output on ${hostname} node.": }

  include ntp, git
}

node 'node01.example.com', 'node02.example.com' {
  # Test message
  notify { "Debug output on ${hostname} node.": }

  include ntp, git, docker, fig
}
That’s it! You should now have one server VM running Puppet Master, and two agent node VMs running Puppet Agent. Both agent nodes should have successfully been registered with Puppet Master, and configured themselves based on the Puppet Master’s main manifest. Agent node configuration includes installing ntp, git, Fig, and Docker.
Helpful Links
All the source code for this project is on GitHub.
Puppet Glossary (of terms):
https://docs.puppetlabs.com/references/glossary.html
Puppet Labs Open Source Automation Tools:
http://puppetlabs.com/misc/download-options
Puppet Master Overview:
http://ci.openstack.org/puppet.html
Install Puppet on Ubuntu:
https://docs.puppetlabs.com/guides/install_puppet/install_debian_ubuntu.html
Installing Puppet Master:
http://andyhan.linuxdict.com/index.php/sys-adm/item/273-puppet-371-on-centos-65-quick-start-i
Regenerating Node Certificates:
https://docs.puppetlabs.com/puppet/latest/reference/ssl_regenerate_certificates.html
Automating Development Environments with Vagrant and Puppet:
http://blog.kloudless.com/2013/07/01/automating-development-environments-with-vagrant-and-puppet
Managing Windows Servers with Chef, Book Review
Posted by Gary A. Stafford in .NET Development, Build Automation, DevOps, Enterprise Software Development, PowerShell Scripting, Software Development on August 11, 2014
Harness the power of Chef to automate management of Windows-based systems using hands-on examples.
Recently, I had the opportunity to read ‘Managing Windows Servers with Chef’, authored by John Ewart, and published in May 2014 by Packt Publishing. At a svelte 110 pages in paperback form, ‘Managing Windows Servers with Chef’ is a quick read, packed with concise information, relevant examples, and excellent code samples. Available on Packt Publishing’s website for a mere $11.90 for the ebook, it is a worthwhile investment for anyone considering Chef Software’s Chef product for automating their Windows-based infrastructure.
As an IT professional, I use Chef for both Windows and Linux-based IT automation, on a regular basis. In my experience, there is a plethora of information on the Internet about properly implementing and scaling Chef. There is seldom a topic I can’t find the answers to, online. However, it has also been my experience, information is often Linux-centric. That is one reason I really appreciated Ewart’s book, concentrating almost exclusively on Windows-based implementations of Chef.
IT professionals just getting started with Chef, or migrating from Puppet, will find ‘Managing Windows Servers with Chef’ invaluable. Ewart does a good job building the user’s understanding of the Chef ecosystem, before beginning to explain its application to a Windows-based environment. If you are considering Chef versus Puppet Labs’ Puppet for Windows-based IT automation, reading this book will give you a solid overview of Chef.
Seasoned users of Chef will also find ‘Managing Windows Servers with Chef’ useful. Professionals quickly master the Chef principles, and develop the means to automate their specific tasks with Chef. But inevitably, there comes the day when they must automate something new with Chef. That is where the book can serve as a handy reference.
Of all the book’s topics, I especially found value in Chapter 5 (Managing Cloud Services with Chef) and Chapter 6 (Going Beyond the Basics – Testing Recipes). Even large enterprise-scale corporations are moving infrastructure to cloud providers. Ewart demonstrates Chef’s Windows-based integration with Microsoft’s Azure, Amazon’s EC2, and Rackspace’s Cloud offerings. Also, Ewart’s section on testing is a reminder to all of us of the importance of unit testing. I admit I more often practice TAD (‘Testing After Development’) than TDD (Test Driven Development), LOL. Ewart introduces both RSpec and ChefSpec for testing Chef recipes.
I recommend ‘Managing Windows Servers with Chef’ for anyone considering Chef, or who is seeking a good introductory guide to getting started with Chef for Windows-based systems.
Windows PowerShell 4.0 for .NET Developers, Book Review
Posted by Gary A. Stafford in .NET Development, DevOps, PowerShell Scripting, Team Foundation Server (TFS) Development on May 2, 2014
A brief review of ‘Windows PowerShell 4.0 for .NET Developers’, a fast-paced PowerShell guide, enabling you to efficiently administer and maintain your development environment.
Introduction
Recently, I had the opportunity to review ‘Windows PowerShell 4.0 for .NET Developers‘, published by Packt Publishing. According to its author, Sherif Talaat, the book is ‘a fast-paced PowerShell guide, enabling you to efficiently administer and maintain your development environment.‘ Working in a large and complex software development organization, technologies such as PowerShell, which enable increased speed and automation, are essential to our success. Having used PowerShell on a regular basis as a .NET developer for the past few years, I was excited to see what Sherif’s newest book offered.
Requirements
The book recommends the following minimal software configuration to work through the code samples:
- Windows Server 2012 R2 (includes PowerShell 4.0 and .NET 4.5)
- SQL Server 2012
- Visual Studio 2012/2013
- Visual Studio Team Foundation Server (TFS) 2012/2013
To test the book’s samples, I provisioned a fresh VM, and using my MSDN subscription, installed the required Windows Server, SQL Server, and Team Foundation Server. I worked directly on the VM, as well as remotely from a Windows 7 Enterprise-based development machine with Visual Studio 2012 installed. The code samples worked fairly well, with only a few minor problems I found. There is still no errata published for the book as of the time of review.
A key aspect many authors do not address is the complexity of using PowerShell in a corporate environment. Working individually or on a small network, developers don’t always experience the added burden of restrictive network security, LDAP, proxy servers, proxy authentication, XML gateways, firewalls, and centralized computer administration. Any code that requires access to remote servers and systems often requires additional coding to work within a corporate environment. It can be frustrating to debug and extend simple examples to work successfully within an enterprise setting.
Contents
Windows PowerShell 4.0 for .NET Developers, at 115 pages in length, is divided into five chapters:
- Chapter 1: Getting Started with Windows PowerShell
- Chapter 2: Unleashing Your Development Skills with PowerShell
- Chapter 3: PowerShell for Your Daily Administration Tasks
- Chapter 4: PowerShell and Web Technologies
- Chapter 5: PowerShell and Team Foundation Server
Chapter 1 provides a brief introduction to PowerShell. At a scant 30 pages, I would not recommend this book as a way to learn PowerShell for the beginner. For learning PowerShell, I recommend Instant Windows PowerShell, by Vinith Menon, also published by Packt Publishing. Alternatively, I recommend a few books by Manning Publications, including Learn Windows PowerShell in a Month of Lunches, Second Edition.
Chapter 2 discusses PowerShell in relationship to several key Microsoft technologies, including Windows Management Instrumentation (WMI), Common Information Model (CIM), Component Object Model (COM) and Extensible Markup Language (XML). As a .NET developer, it’s almost impossible not to have worked with one, or all, of these technologies. Chapter 2 discusses how PowerShell works with .NET objects, and extends the .NET framework. The chapter also includes an easy-to-follow example of creating, importing, and calling a PowerShell binary module (compiled .NET class library), using Visual Studio.
Chapter 3 explores areas where .NET developers can start leveraging PowerShell for daily administrative tasks. I found the sections on PowerShell Remoting and administering IIS and SQL Server particularly useful. Being able to easily connect to remote web, application, and database servers from the command line (or, PowerShell prompt) and do basic system administration is a huge time savings in an agile development environment.
Chapter 4 focuses on how PowerShell interfaces with SOAP and REST based services, web requests, and JSON. Windows Communication Foundation (WCF) based service-oriented application development has been a trend for the last few years. Being able to manage, test, and monitor SOAP and RESTful services and HTTP requests/responses is important to .NET developers. PowerShell can often be quicker and easier than writing and compiling service utilities in Visual Studio, or using proprietary third-party applications.
Chapter 5 is dedicated to Visual Studio Team Foundation Server (TFS), Microsoft’s end-to-end, Application Lifecycle Management (ALM) solution. Chapter 5 details the installation and use of TFS Power Tools and TFS PowerShell snap-in. Having held the roles of lead developer and Scrum Master, I have personally found some of the best uses for PowerShell in automating various aspects of TFS. Managing TFS often requires repetitive tasks, the place where PowerShell excels. You will need to explore additional resources beyond the scope of this book to really start automating TFS with PowerShell.
Conclusion
Overall, I enjoyed the book and felt it was well worth the time to explore. I applaud Sherif for targeting a PowerShell book specifically to developers. Due to its short length, the book did leave me wanting more information on a few subjects that were barely skimmed. I also found myself expecting guidance on a few subjects the book did not touch upon, such as PowerShell for cloud-based development (Azure), test automation, and build and deployment automation. For more information on some of those subjects, I recommend Sherif’s other book, also published by Packt Publishing, PowerShell 3.0 Advanced Administration Handbook.