Posts Tagged Automation

DevOps for DataOps: Building a CI/CD Pipeline for Apache Airflow DAGs

Build an effective CI/CD pipeline to test and deploy your Apache Airflow DAGs to Amazon MWAA using GitHub Actions

Introduction

In this post, we will learn how to use GitHub Actions to build an effective CI/CD workflow for our Apache Airflow DAGs. We will use the DevOps concepts of Continuous Integration and Continuous Delivery to automate the testing and deployment of Airflow DAGs to Amazon Managed Workflows for Apache Airflow (Amazon MWAA) on AWS.

Fork and pull model of collaborative Airflow development used in this post

Technologies

Apache Airflow

According to the documentation, Apache Airflow is an open-source platform to author, schedule, and monitor workflows programmatically. With Airflow, you author workflows as Directed Acyclic Graphs (DAGs) of tasks written in Python.

Amazon Managed Workflows for Apache Airflow

According to AWS, Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a highly available, secure, and fully-managed workflow orchestration for Apache Airflow. MWAA automatically scales its workflow execution capacity to meet your needs and is integrated with AWS security services to help provide fast and secure access to data.

Example of Apache Airflow UI within Amazon MWAA Environment

GitHub Actions

According to GitHub, GitHub Actions makes it easy to automate software workflows with CI/CD. GitHub Actions allow you to build, test, and deploy code right from GitHub. GitHub Actions are workflows triggered by GitHub events like push, issue creation, or a new release. You can leverage GitHub Actions prebuilt and maintained by the community.

Example of GitHub Action workflow running in the GitHub repository used in this post

If you are new to GitHub Actions, I recommend my previous post, Continuous Integration and Deployment of Docker Images using GitHub Actions.

Terminology

DataOps

According to Wikipedia, DataOps is an automated, process-oriented methodology used by analytic and data teams to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new approach to data analytics.

DataOps applies to the entire data lifecycle from data preparation to reporting and recognizes the interconnected nature of the data analytics team and IT operations. DataOps incorporates the Agile methodology to shorten the software development life cycle (SDLC) of analytics development.

DevOps

According to Wikipedia, DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality.

DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality. -Wikipedia

Fail Fast

According to Wikipedia, a fail-fast system is one that immediately reports any condition that is likely to indicate a failure. Using the DevOps concept of fail fast, we build steps into our workflows to uncover errors sooner in the SDLC. We shift testing as far to the left as possible (referring to the pipeline of steps moving from left to right) and test at multiple points along the way.

Source Code

All source code for this demonstration, including the GitHub Actions, Pytest unit tests, and Git Hooks, is open-sourced and located on GitHub.

Architecture

The diagram below represents the architecture for a recent blog post and video demonstration, Lakehouse Automation on AWS with Apache Airflow. The post and video show how to programmatically load and upload data from Amazon Redshift to an Amazon S3-based data lake using Apache Airflow.

Architecture for the post and video, Lakehouse Automation on AWS with Apache Airflow

In this post, we will review how the DAGs from the previous were developed, tested, and deployed to MWAA using a variety of progressively more effective CI/CD workflows. The workflows demonstrated could also be easily applied to other Airflow resources in addition to DAGs, such as SQL scripts, configuration and data files, Python requirement files, and plugins.

Workflows

No DevOps

Below we see a minimally viable workflow for loading DAGs into Amazon MWAA, which does not use the principles of CI/CD. Changes are made in the local Airflow developer’s environment. The modified DAGs are copied directly to the Amazon S3 bucket, which are then automatically synced with Amazon MWAA, barring any errors. Those changes are also (hopefully) pushed back to the centralized version control or source code management (SCM) system, which is GitHub in this post.

Error-prone, non-DevOps workflow for modifying and syncing DAGs to MWAA

There are at least two significant issues with this error-prone workflow. First, the DAGs are always out of sync between the Amazon S3 bucket and GitHub. These are two independent steps — copying or syncing the DAGs to S3 and pushing the DAGs to GitHub. A developer might continue making changes and pushing DAGs to S3 without pushing to GitHub or vice versa.

Secondly, the DevOps concept of fail-fast is missing. The first time you know your DAG contains errors is likely when it is synced to MWAA and throws an Import Error. By then, the DAG has already been copied to S3, synced to MWAA, and possibly pushed to GitHub, which other developers could then pull.

Example of a typical DAG Import Error, easily caught with a simple test

GitHub Actions

A significant step up from the previous workflow is using GitHub Actions to test and deploy your code after pushing it to GitHub. Although in this workflow, code is still ‘pushed straight to Trunk’ (the main branch in GitHub) and risks other developers in a collaborative environment pulling potentially erroneous code, you have far less chance of DAG errors making it to MWAA.

GitHub Actions allow you to fail faster and catch errors sooner

Using GitHub Actions, you also eliminate human error that could result in the changes to DAGs not being synced to Amazon S3. Lastly, using this workflow improves security by eliminating the need to provide direct access to the Airflow Amazon S3 bucket to Airflow Developers.

Fork and pull model of collaborative Airflow development used in this post (video only)

Types of Tests

The first GitHub Action, test_dags.yml, is triggered on a push to the dags directory in the main branch of the repository. It is also triggered whenever a pull request is made for the main branch. The first GitHub Action runs a battery of tests, including checking Python dependencies, code style, code quality, DAG import errors, and unit tests. The tests catch issues with DAGs before being synced to S3 by a second GitHub Action.

name: Test DAGs

on:
  push:
    paths:
      - 'dags/**'
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.7'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements/requirements.txt
        pip check
    - name: Lint with Flake8
      run: |
        pip install flake8
        flake8 --ignore E501 dags --benchmark -v
    - name: Confirm Black code compliance (psf/black)
      run: |
        pip install pytest-black
        pytest dags --black -v
    - name: Test with Pytest
      run: |
        pip install pytest
        cd tests || exit
        pytest tests.py -v

Successful runs of the ‘Test DAGs’ GitHub Action, shown in the Actions Console

Python Dependencies

The first test installs the modules listed in the requirements.txt file used locally to develop the application. This test is designed to uncover any missing or conflicting modules.

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements/requirements.txt
pip check

It is essential to develop your DAGs against the same version of Python and with the same version of the Python modules used in your Airflow environment. You can use the BashOperator to run shell commands to obtain the versions of Python and module installed in your Airflow environment:

python3 --version; python3 -m pip list

A snippet of log output from DAG showing Python version and Python modules available in MWAA 2.0.2:

Python version and Python modules available in MWAA 2.0.2

The latest stable release of Airflow is currently version 2.2.2, released 2021-11-15. However, as of December 2021, Amazon’s latest version of MWAA 2.x is version 2.0.2, released 2021-04-19. MWAA 2.0.2 currently runs Python3 version 3.7.10.

Available versions of Amazon MWAA as of December 2021

Flake8

Known as ‘your tool for style guide enforcement,’ Flake8 is described as the modular source code checker. It is a command-line utility for enforcing style consistency across Python projects. Flake8 is a wrapper around PyFlakes, pycodestyle, and Ned Batchelder’s McCabe script. The module, pycodestyle, is a tool to check your Python code against some of the style conventions in PEP 8.

Flake8 is highly configurable, with options to ignore specific rules if not required by your development team. For example, in this demonstration, I intentionally ignored rule E501, which states that ‘line length should be limited to 72 characters.

- name: Lint with Flake8
run: |
pip install flake8
flake8 --ignore E501 dags --benchmark -v

Black

Known as ‘the uncompromising code formatter,’ Python code formatted using Black (referred to as Blackened code) looks the same regardless of the project you’re reading. Formatting becomes transparent, allowing teams to focus on the content instead. Black makes code review faster by producing the smallest diffs possible, assuming all developers are using black to format their code.

The Airflow DAGs in this GitHub repository are automatically formatted with black using a pre-commit Git Hooks before being committed and pushed to GitHub. The test confirms black code compliance.

- name: Confirm Black code compliance (psf/black)
run: |
pip install pytest-black
pytest dags --black -v

Pytest

The pytest framework describes itself as a mature, fully-featured Python testing tool that helps you write better programs. The Pytest framework makes it easy to write small tests yet scales to support complex functional testing for applications and libraries.

The GitHub Action in the GitHub project, test_dags.yml, calls the tests.py file, also contained in the project.

- name: Test with Pytest
run: |
pip install pytest
cd tests || exit
pytest tests.py -v

The tests.py file contains several pytest unit tests. The tests are based on my project requirements; your tests will vary. These tests confirm that all DAGs:

  1. Do not contain DAG Import Errors (test catches 75% of my errors);
  2. Follow specific file naming conventions;
  3. Include a description and an owner other than ‘airflow’;
  4. Contain required project tags;
  5. Do not send emails (my projects use SNS or Slack for notifications);
  6. Do not retry more than three times;
import os
import sys
import pytest
from airflow.models import DagBag
sys.path.append(os.path.join(os.path.dirname(__file__), "../dags"))
sys.path.append(os.path.join(os.path.dirname(__file__), "../dags/utilities"))
# Airflow variables called from DAGs under test are stubbed out
os.environ["AIRFLOW_VAR_DATA_LAKE_BUCKET"] = "test_bucket"
os.environ["AIRFLOW_VAR_ATHENA_QUERY_RESULTS"] = "SELECT 1;"
os.environ["AIRFLOW_VAR_SNS_TOPIC"] = "test_topic"
os.environ["AIRFLOW_VAR_REDSHIFT_UNLOAD_IAM_ROLE"] = "test_role_1"
os.environ["AIRFLOW_VAR_GLUE_CRAWLER_IAM_ROLE"] = "test_role_2"
@pytest.fixture(params=["../dags/"])
def dag_bag(request):
return DagBag(dag_folder=request.param, include_examples=False)
def test_no_import_errors(dag_bag):
assert not dag_bag.import_errors
def test_requires_tags(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert dag.tags
def test_requires_specific_tag(dag_bag):
for dag_id, dag in dag_bag.dags.items():
try:
assert dag.tags.index("data lake demo") >= 0
except ValueError:
assert dag.tags.index("redshift demo") >= 0
def test_desc_len_greater_than_fifteen(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert len(dag.description) > 15
def test_owner_len_greater_than_five(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert len(dag.owner) > 5
def test_owner_not_airflow(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert str.lower(dag.owner) != "airflow"
def test_no_emails_on_retry(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert not dag.default_args["email_on_retry"]
def test_no_emails_on_failure(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert not dag.default_args["email_on_failure"]
def test_three_or_less_retries(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert dag.default_args["retries"] <= 3
def test_dag_id_contains_prefix(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert str.lower(dag_id).find("__") != -1
def test_dag_id_requires_specific_prefix(dag_bag):
for dag_id, dag in dag_bag.dags.items():
assert str.lower(dag_id).startswith("data_lake__") \
or str.lower(dag_id).startswith("redshift_demo__")

If you are building custom Airflow Operators, additional unit, functional, and integration tests are recommended.

Fork and Pull

We can improve on the practice of pushing directly to Trunk by implementing one of two collaborative development models, recommended by GitHub:

  1. The Shared repository model: uses ‘topic’ branches, which are reviewed, approved, and merged into the main branch.
  2. Fork and pull model: a repo is forked, changes are made, a pull request is created, the request is reviewed, and if approved, merged into the main branch.

In the fork and pull model, we create a fork of the DAG repository where we make our changes. We then commit and push those changes back to the forked repository. When ready, we create a pull request. If the pull request is approved and passes all the tests, it is manually or automatically merged into the main branch. DAGs are then synced to S3 and, eventually, to MWAA. I usually prefer to trigger merges manually once all tests have passed.

The fork and pull model greatly reduces the chance that bad code is merged to the main branch before passing all tests.

Errors are caught early in the fork and pull model prior to merging code changes

Syncing DAGs to S3

The second GitHub Action in the GitHub project, sync_dags.yml, is triggered when the previous Action, test_dags.yml, completes successfully, or in the case of the folk and pull method, the merge to the main branch is successful.

name: Sync DAGs

on:
workflow_run:
workflows:
- 'Test DAGs'
types:
- completed
pull_request:
types:
- closed

jobs:
deploy:
runs-on: ubuntu-latest
if: ${{ github.event.workflow_run.conclusion == 'success' }}
steps:
- uses: actions/checkout@master
- uses: jakejarvis/s3-sync-action@master
env:
AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_REGION: 'us-east-1'
SOURCE_DIR: 'dags'
DEST_DIR: 'dags'

The GitHub Action, sync_dags.yml, requires three GitHub encrypted secrets, created in advance and associated with the GitHub repository. According to GitHub, secrets are encrypted environment variables you create in an organization, repository, or repository environment. Encrypted secrets allow you to store sensitive information, such as access tokens, in your repository. The secrets that you create are available to use in GitHub Actions workflows.

Encrypted repository secrets used by GitHub Action to sync with Amazon S3

The DAGs are synced to Amazon S3 and, eventually, automatically synced to MWAA.

GitHub Action syncs DAGs to Amazon S3 if tests are successful

Local Testing and Git Hooks

To further improve your CI/CD workflows, you should consider using Git Hooks. Using Git Hooks, we can ensure code is tested locally before committing and pushing changes to GitHub. Testing locally allows us to fail-faster, catching errors during development instead of once code is pushed to GitHub.

Errors are caught even early using Git Hooks

According to the documentation, Git has a way to fire off custom scripts when certain important actions occur. There are two types of hooks: client-side and server-side. Client-side hooks are triggered by operations such as committing and merging, while server-side hooks run on network operations such as receiving pushed commits.

You can use these hooks for all sorts of reasons. I often use a client-side pre-commit hook to format DAGs using black. Using a client-side pre-push Git Hook, we will ensure that tests are run before pushing the DAGs to GitHub. According to Git, The pre-push hook runs when the git push command is executed after the remote refs have been updated but before any objects have been transferred. You can use it to validate a set of ref updates before a push occurs. A non-zero exit code will abort the push. The test could instead be run as part of the pre-commit hook if they are not too time-consuming.

To use the pre-push hook, create the following file within the local repository, .git/hooks/pre-push:

#!/bin/sh
# do nothing if there are no commits to push
if [ -z "$(git log @{u}..)" ]; then
exit 0
fi
sh ./run_tests_locally.sh

Then, run the following chmod command to make the hook executable:

chmod 755 .git/hooks/pre-push

The the pre-push hook runs the shell script, run_tests_locally.sh. The script executes nearly identical tests, locally, as the GitHub Action, test_dags.yml, does remotely on GitHub:

#!/bin/sh
echo "Starting Flake8 test..."
flake8 --ignore E501 dags --benchmark || exit 1
echo "Starting Black test..."
python3 -m pytest --cache-clear
python3 -m pytest dags/ --black -v || exit 1
echo "Starting Pytest tests..."
cd tests || exit
python3 -m pytest tests.py -v || exit 1
echo "All tests completed successfully! 🥳"

Complete CI/CD workflow including running tests locally using a Git Hook (video only)

References

Here are some additional references for testing and deploying Airflow DAGs and the use of GitHub Actions:

This blog represents my own viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.

, , , , , , ,

Leave a comment

Amazon Managed Workflows for Apache Airflow — Configuration: Understanding Amazon MWAA’s Configuration Options

Introduction

For anyone new to Amazon Managed Workflows for Apache Airflow (Amazon MWAA), especially those used to managing their own Apache Airflow platform, Amazon MWAA’s configuration might appear to be a bit of a black box at first. This brief post will explore Amazon MWAA’s configuration — how to inspect it and how to modify it. We will use Airflow DAGs to review an MWAA environment’s airflow.cfg file, environment variables, and Python packages.

Amazon MWAA

Apache Airflow is a popular open-source platform designed to schedule and monitor workflows. According to Wikipedia, Airflow was created at Airbnb in 2014 to manage the company’s increasingly complex workflows. From the beginning, the project was made open source, becoming an Apache Incubator project in 2016 and a top-level Apache Software Foundation project in 2019.

With the announcement of Amazon MWAA in November 2020, AWS customers can now focus on developing workflow automation while leaving the management of Airflow to AWS. Amazon MWAA can be used as an alternative to AWS Step Functions for workflow automation on AWS.

The Amazon MWAA service is available using the AWS Management Console, as well as the Amazon MWAA API using the latest versions of the AWS SDK and AWS CLI. For more information on Amazon MWAA, read my last post, Running Spark Jobs on Amazon EMR with Apache Airflow.

Image for post
Apache Airflow UI

Source Code

The DAGs referenced in this post are available on GitHub. Using this git clone command, download a copy of this post’s GitHub repository to your local environment.

git clone --branch main --single-branch --depth 1 --no-tags \
https://github.com/garystafford/aws-airflow-demo.git

Accessing Configuration

Environment Variables

Environment variables are an essential part of an MWAA environment’s configuration. There are various ways to examine the environment variables. You could use Airflow’s BashOperator to simply call the command, env, or the PythonOperator to call a Python iterator function, as shown below. A sample DAG, dags/get_env_vars.py, is included in the project.

def print_env_vars():
keys = str(os.environ.keys().replace("', '", "'|'").split("|")
keys.sort()
for key in keys:
print(key)
get_env_vars_operator = PythonOperator(
task_id='get_env_vars_task',
python_callable=print_env_vars
)
view raw get_env_vars.py hosted with ❤ by GitHub

The DAG’s PythonOperator will iterate over the MWAA environment’s environment variables and output them to the task’s log. Below is a snippet of an example task’s log.

[2020-12-25 23:59:07,170] {{standard_task_runner.py:78}} INFO – Job 272: Subtask get_env_vars_task
[2020-12-25 23:59:08,423] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CONN_AWS_DEFAULT': 'aws://'
[2020-12-25 23:59:08,516] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CONSOLE_LOGS_ENABLED': 'false'
[2020-12-25 23:59:08,689] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CONSOLE_LOG_LEVEL': 'WARNING'
[2020-12-25 23:59:08,777] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CTX_DAG_EMAIL': 'airflow@example.com'
[2020-12-25 23:59:08,877] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CTX_DAG_ID': 'get_env_vars'
[2020-12-25 23:59:08,970] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CTX_DAG_OWNER': 'airflow'
[2020-12-25 23:59:09,269] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CTX_TASK_ID': 'get_env_vars_task'
[2020-12-25 23:59:09,357] {{logging_mixin.py:112}} INFO – 'AIRFLOW_DAG_PROCESSING_LOGS_ENABLED': 'false'
[2020-12-25 23:59:09,552] {{logging_mixin.py:112}} INFO – 'AIRFLOW_DAG_PROCESSING_LOG_LEVEL': 'WARNING'
[2020-12-25 23:59:09,647] {{logging_mixin.py:112}} INFO – 'AIRFLOW_ENV_NAME': 'MyAirflowEnvironment'
[2020-12-25 23:59:09,729] {{logging_mixin.py:112}} INFO – 'AIRFLOW_HOME': '/usr/local/airflow'
[2020-12-25 23:59:09,827] {{logging_mixin.py:112}} INFO – 'AIRFLOW_SCHEDULER_LOGS_ENABLED': 'false'
[2020-12-25 23:59:12,915] {{logging_mixin.py:112}} INFO – 'AIRFLOW__CORE__DAG_CONCURRENCY': '10000'
[2020-12-25 23:59:12,986] {{logging_mixin.py:112}} INFO – 'AIRFLOW__CORE__EXECUTOR': 'CeleryExecutor'
[2020-12-25 23:59:13,136] {{logging_mixin.py:112}} INFO – 'AIRFLOW__CORE__LOAD_EXAMPLES': 'False'
[2020-12-25 23:59:13,217] {{logging_mixin.py:112}} INFO – 'AIRFLOW__CORE__PARALLELISM': '10000'
[2020-12-25 23:59:14,531] {{logging_mixin.py:112}} INFO – 'AWS_DEFAULT_REGION': 'us-east-1'
[2020-12-25 23:59:14,565] {{logging_mixin.py:112}} INFO – 'AWS_EXECUTION_ENV': 'AWS_ECS_FARGATE'
[2020-12-25 23:59:14,616] {{logging_mixin.py:112}} INFO – 'AWS_REGION': 'us-east-1'
[2020-12-25 23:59:14,647] {{logging_mixin.py:112}} INFO – 'CELERY_LOG_FILE': ''
[2020-12-25 23:59:14,679] {{logging_mixin.py:112}} INFO – 'CELERY_LOG_LEVEL': '20'
[2020-12-25 23:59:14,711] {{logging_mixin.py:112}} INFO – 'CELERY_LOG_REDIRECT': '1'
[2020-12-25 23:59:14,747] {{logging_mixin.py:112}} INFO – 'CELERY_LOG_REDIRECT_LEVEL': 'WARNING'

Airflow Configuration File

According to Airflow, the airflow.cfg file contains Airflow’s configuration. You can edit it to change any of the settings. The first time you run Apache Airflow, it creates an airflow.cfg configuration file in your AIRFLOW_HOME directory and attaches the configurations to your environment as environment variables.

Amazon MWAA doesn’t expose the airflow.cfg in the Apache Airflow UI of an environment. Although you can’t access it directly, you can view the airflow.cfg file. The configuration file is located in your AIRFLOW_HOME directory, /usr/local/airflow (~/airflow by default).

There are multiple ways to examine your MWAA environment’s airflow.cfg file. You could use Airflow’s PythonOperator to call a Python function that reads the contents of the file, as shown below. The function uses the AIRFLOW_HOME environment variable to locate and read the airflow.cfg. A sample DAG, dags/get_airflow_cfg.py, is included in the project.

def print_airflow_cfg():
with open(f"{os.getenv('AIRFLOW_HOME')}/airflow.cfg", 'r') as airflow_cfg:
file_contents = airflow_cfg.read()
print(f'\n{file_contents}')
get_airflow_cfg_operator = PythonOperator(
task_id='get_airflow_cfg_task',
python_callable=print_airflow_cfg
)

The DAG’s task will read the MWAA environment’s airflow.cfg file and output it to the task’s log. Below is a snippet of an example task’s log.

[2020-12-26 00:02:57,163] {{standard_task_runner.py:78}} INFO – Job 274: Subtask get_airflow_cfg_task
[2020-12-26 00:02:57,583] {{logging_mixin.py:112}} INFO –
[core]
# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
# This path must be absolute
dags_folder = /usr/local/airflow/dags
# The folder where airflow should store its log files
# This path must be absolute
base_log_folder = /usr/local/airflow/logs
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Set this to True if you want to enable remote logging.
remote_logging = True
# Users must supply an Airflow connection id that provides access to the storage
# location.
remote_log_conn_id = aws_default
remote_base_log_folder = cloudwatch://arn:aws:logs:::log-group:airflow-logs:*
encrypt_s3_logs = False
# Logging level
logging_level = INFO
# Logging level for Flask-appbuilder UI
fab_logging_level = WARN
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# Example: logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class = log_config.LOGGING_CONFIG
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
[aws_mwaa]
redirect_url = https://console.aws.amazon.com/
session_duration_minutes = 720
[webserver]
# The base url of your website as airflow cannot guess what domain or
# cname you are using. This is used in automated emails that
# airflow sends to point links to the right web server
base_url = http://localhost:8080
# Default timezone to display all dates in the RBAC UI, can be UTC, system, or
# any IANA timezone string (e.g. Europe/Amsterdam). If left empty the
# default value of core/default_timezone will be used
# Example: default_ui_timezone = America/New_York
default_ui_timezone = UTC
# The ip specified when starting the web server
web_server_host = 0.0.0.0
# The port on which to run the web server
web_server_port = 8080

Customizing Airflow Configurations

While AWS doesn’t expose the airflow.cfg in the Apache Airflow UI of your environment, you can change the default Apache Airflow configuration options directly within the Amazon MWAA console and continue using all other settings in airflow.cfg. The configuration options changed in the Amazon MWAA console are translated into environment variables.

To customize the Apache Airflow configuration, change the default options directly on the Amazon MWAA console. Select Edit, add or modify configuration options and values in the Airflow configuration options menu, then select Save. For example, we can change Airflow’s default timezone (core.default_ui_timezone) to America/New_York.

Image for post
Amazon MWAA’s Airflow configuration options

Once the MWAA environment is updated, which may take several minutes, view your changes by re-running the DAG,dags/get_env_vars.py. Note the new configuration item on both lines 2 and 6 of the log snippet shown below. The configuration item appears on its own (AIRFLOW__CORE_DEFAULT__UI_TIMEZONE), as well as part of the AIRFLOW_CONFIG_SECRETS dictionary environment variable.

[2020-12-26 05:00:57,756] {{standard_task_runner.py:78}} INFO – Job 293: Subtask get_env_vars_task
[2020-12-26 05:00:58,158] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CONFIG_SECRETS': '{"AIRFLOW__CORE__DEFAULT_UI_TIMEZONE":"America/New_York"}'
[2020-12-26 05:00:58,190] {{logging_mixin.py:112}} INFO – 'AIRFLOW_CONN_AWS_DEFAULT': 'aws://'
[2020-12-26 05:01:00,537] {{logging_mixin.py:112}} INFO – 'AIRFLOW__CORE__DAG_CONCURRENCY': '10000'
[2020-12-26 05:01:00,578] {{logging_mixin.py:112}} INFO – 'AIRFLOW__CORE__DEFAULT_UI_TIMEZONE': 'America/New_York'
[2020-12-26 05:01:00,630] {{logging_mixin.py:112}} INFO – 'AIRFLOW__CORE__EXECUTOR': 'CeleryExecutor'

Using the MWAA API

We can also make configuration changes using the MWAA API. For example, to change the default Airflow UI timezone, call the MWAA API’s update-environment command using the AWS CLI. Include the --airflow-configuration-option parameter, passing the core.default_ui_timezone key/value pair as a JSON blob.

aws mwaa update-environment \
–name <your_environment_name> \
–airflow-configuration-options """{
\"core.default_ui_timezone\": \"America/Los_Angeles\"
}"""

To review an environment’s configuration, use the get-environment command in combination with jq.

aws mwaa get-environment \
–name <your_environment_name> | \
jq -r '.Environment.AirflowConfigurationOptions'

Below, we see an example of the output.

{
"core.default_ui_timezone": "America/Los_Angeles"
}

Python Packages

Airflow is written in Python, and workflows are created via Python scripts. Python packages are a crucial part of an MWAA environment’s configuration. According to the documentation, an ‘extra package’, is a Python subpackage that is not included in the Apache Airflow base, installed on your MWAA environment. As part of setting up an MWAA environment, you can specify the location of the requirements.txt file in the Airflow S3 bucket. Extra packages are installed using the requirements.txt file.

Image for post
Amazon MWAA environment’s configuration

There are several ways to check your MWAA environment’s installed Python packages and versions. You could use Airflow’s BashOperator to call the command, python3 -m pip list. A sample DAG, dags/get_py_pkgs.py, is included in the project.

list_python_packages_operator = BashOperator(
task_id='list_python_packages',
bash_command='python3 -m pip list'
)
view raw get_py_pkgs.py hosted with ❤ by GitHub

The DAG’s task will output a list of all Python packages and package versions to the task’s log. Below is a snippet of an example task’s log.

[2020-12-26 21:53:06,310] {{bash_operator.py:136}} INFO – Temporary script location: /tmp/airflowtmp2whgp_p8/list_python_packagesxo8slhc6
[2020-12-26 21:53:06,350] {{bash_operator.py:146}} INFO – Running command: python3 -m pip list
[2020-12-26 21:53:06,395] {{bash_operator.py:153}} INFO – Output:
[2020-12-26 21:53:06,750] {{bash_operator.py:157}} INFO – Package Version
[2020-12-26 21:53:06,786] {{bash_operator.py:157}} INFO – ———————- ———
[2020-12-26 21:53:06,815] {{bash_operator.py:157}} INFO – alembic 1.4.2
[2020-12-26 21:53:06,856] {{bash_operator.py:157}} INFO – amqp 2.6.1
[2020-12-26 21:53:06,898] {{bash_operator.py:157}} INFO – apache-airflow 1.10.12
[2020-12-26 21:53:06,929] {{bash_operator.py:157}} INFO – apispec 1.3.3
[2020-12-26 21:53:06,960] {{bash_operator.py:157}} INFO – argcomplete 1.12.0
[2020-12-26 21:53:07,002] {{bash_operator.py:157}} INFO – attrs 19.3.0
[2020-12-26 21:53:07,036] {{bash_operator.py:157}} INFO – Babel 2.8.0
[2020-12-26 21:53:07,071] {{bash_operator.py:157}} INFO – billiard 3.6.3.0
[2020-12-26 21:53:07,960] {{bash_operator.py:157}} INFO – boto3 1.16.10
[2020-12-26 21:53:07,993] {{bash_operator.py:157}} INFO – botocore 1.19.10
[2020-12-26 21:53:08,028] {{bash_operator.py:157}} INFO – cached-property 1.5.1
[2020-12-26 21:53:08,061] {{bash_operator.py:157}} INFO – cattrs 1.0.0
[2020-12-26 21:53:08,096] {{bash_operator.py:157}} INFO – celery 4.4.7
[2020-12-26 21:53:08,130] {{bash_operator.py:157}} INFO – certifi 2020.6.20
[2020-12-26 21:53:12,260] {{bash_operator.py:157}} INFO – pandas 1.1.0
[2020-12-26 21:53:12,289] {{bash_operator.py:157}} INFO – pendulum 1.4.4
[2020-12-26 21:53:12,490] {{bash_operator.py:157}} INFO – pip 9.0.3
[2020-12-26 21:53:12,522] {{bash_operator.py:157}} INFO – prison 0.1.3
[2020-12-26 21:53:12,550] {{bash_operator.py:157}} INFO – prometheus-client 0.8.0
[2020-12-26 21:53:12,580] {{bash_operator.py:157}} INFO – psutil 5.7.2
[2020-12-26 21:53:12,613] {{bash_operator.py:157}} INFO – pycparser 2.20
[2020-12-26 21:53:12,641] {{bash_operator.py:157}} INFO – pycurl 7.43.0.5
[2020-12-26 21:53:12,676] {{bash_operator.py:157}} INFO – Pygments 2.6.1
[2020-12-26 21:53:12,710] {{bash_operator.py:157}} INFO – PyGreSQL 5.2.1
[2020-12-26 21:53:12,746] {{bash_operator.py:157}} INFO – PyJWT 1.7.1

Conclusion

Understanding your Amazon MWAA environment’s airflow.cfg file, environment variables, and Python packages are all important for proper Airflow platform management. This brief post learned more about Amazon MWAA’s configuration — how to inspect it using DAGs and how to modify it through the Amazon MWAA console.

, , , ,

Leave a comment

Running Spark Jobs on Amazon EMR with Apache Airflow: Using the new Amazon Managed Workflows for Apache Airflow (Amazon MWAA) Service on AWS

Introduction

In the first post of this series, we explored several ways to run PySpark applications on Amazon EMR using AWS services, including AWS CloudFormation, AWS Step Functions, and the AWS SDK for Python. This second post in the series will examine running Spark jobs on Amazon EMR using the recently announced Amazon Managed Workflows for Apache Airflow (Amazon MWAA) service.

Amazon EMR

According to AWS, Amazon Elastic MapReduce (Amazon EMR) is a Cloud-based big data platform for processing vast amounts of data using common open-source tools such as Apache SparkHiveHBaseFlinkHudi, and ZeppelinJupyter, and Presto. Using Amazon EMR, data analysts, engineers, and scientists are free to explore, process, and visualize data. EMR takes care of provisioning, configuring, and tuning the underlying compute clusters, allowing you to focus on running analytics.

Amazon EMR Console’s Cluster Summary tab

Users interact with EMR in a variety of ways, depending on their specific requirements. For example, you might create a transient EMR cluster, execute a series of data analytics jobs using Spark, Hive, or Presto, and immediately terminate the cluster upon job completion. You only pay for the time the cluster is up and running. Alternatively, for time-critical workloads or continuously high volumes of jobs, you could choose to create one or more persistent, highly available EMR clusters. These clusters automatically scale compute resources horizontally, including the use of EC2 Spot instances, to meet processing demands, maximizing performance and cost-efficiency.

AWS currently offers 5.x and 6.x versions of Amazon EMR. Each major and minor release of Amazon EMR offers incremental versions of nearly 25 different, popular open-source big-data applications to choose from, which Amazon EMR will install and configure when the cluster is created. The latest Amazon EMR releases are Amazon EMR Release 6.2.0 and Amazon EMR Release 5.32.0.

Amazon MWAA

Apache Airflow is a popular open-source platform designed to schedule and monitor workflows. According to Wikipedia, Airflow was created at Airbnb in 2014 to manage the company’s increasingly complex workflows. From the beginning, the project was made open source, becoming an Apache Incubator project in 2016 and a top-level Apache Software Foundation project (TLP) in 2019.

Many organizations build, manage, and maintain Apache Airflow on AWS using compute services such as Amazon EC2 or Amazon EKS. Amazon recently announced Amazon Managed Workflows for Apache Airflow (Amazon MWAA). With the announcement of Amazon MWAA in November 2020, AWS customers can now focus on developing workflow automation, while leaving the management of Airflow to AWS. Amazon MWAA can be used as an alternative to AWS Step Functions for workflow automation on AWS.

Apache Airflow’s UI

Apache recently announced the release of Airflow 2.0.0 on December 17, 2020. The latest 1.x version of Airflow is 1.10.14, released December 12, 2020. However, at the time of this post, Amazon MWAA was running Airflow 1.10.12, released August 25, 2020. Ensure that when you are developing workflows for Amazon MWAA, you are using the correct Apache Airflow 1.10.12 documentation.

The Amazon MWAA service is available using the AWS Management Console, as well as the Amazon MWAA API using the latest versions of the AWS SDK and AWS CLI.

Airflow has a mechanism that allows you to expand its functionality and integrate with other systems. Given its integration capabilities, Airflow has extensive support for AWS, including Amazon EMR, Amazon S3, AWS Batch, Amazon RedShift, Amazon DynamoDB, AWS Lambda, Amazon Kinesis, and Amazon SageMaker. Outside of support for Amazon S3, most AWS integrations can be found in the HooksSecretsSensors, and Operators of Airflow codebase’s contrib section.

Getting Started

Source Code

Using this git clone command, download a copy of this post’s GitHub repository to your local environment.

git clone --branch main --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/aws-airflow-demo.git

Preliminary Steps

This post assumes the reader has completed the demonstration in the previous post, Running PySpark Applications on Amazon EMR Methods for Interacting with PySpark on Amazon Elastic MapReduce. This post will re-use many of the last post’s AWS resources, including the EMR VPC, Subnets, Security Groups, AWS Glue Data Catalog, Amazon S3 buckets, EMR Roles, EC2 key pair, AWS Systems Manager Parameter Store parameters, PySpark applications, and Kaggle datasets.

Configuring Amazon MWAA

The easiest way to create a new MWAA Environment is through the AWS Management Console. I strongly suggest that you review the pricing for Amazon MWAA before continuing. The service can be quite costly to operate, even when idle, with the smallest Environment class potentially running into the hundreds of dollars per month.

Amazon MWAA Environment Creation Process

Using the Console, create a new Amazon MWAA Environment. The Amazon MWAA interface will walk you through the creation process. Note the current ‘Airflow version’, 1.10.12.

Amazon MWAA Environment Creation Process

Amazon MWAA requires an Amazon S3 bucket to store Airflow assets. Create a new Amazon S3 bucket. According to the documentation, the bucket must start with the prefix airflow-. You must also enable Bucket Versioning on the bucket. Specify a dags folder within the bucket to store Airflow’s Directed Acyclic Graphs (DAG). You can leave the next two options blank since we have no additional Airflow plugins or additional Python packages to install.

Amazon MWAA Environment Creation Process

With Amazon MWAA, your data is secure by default as workloads run within their own Amazon Virtual Private Cloud (Amazon VPC). As part of the MWAA Environment creation process, you are given the option to have AWS create an MWAA VPC CloudFormation stack.

Amazon MWAA Environment Creation Process

For this demonstration, choose to have MWAA create a new VPC and associated networking resources.

AWS CloudFormation Create Stack Console

The MWAA CloudFormation stack contains approximately 22 AWS resources, including a VPC, a pair of public and private subnets, route tables, an Internet Gateway, two NAT Gateways, and associated Elastic IPs (EIP). See the MWAA documentation for more details.

AWS CloudFormation Create Stack Console
Amazon MWAA Environment Creation Process

As part of the Amazon MWAA Networking configuration, you must decide if you want web access to Airflow to be public or private. The details of the network configuration can be found in the MWAA documentation. I am choosing public webserver access for this demonstration, but the recommended choice is private for greater security. With the public option, AWS still requires IAM authentication to sign in to the AWS Management Console in order to access the Airflow UI.

You must select an existing VPC Security Group or have MWAA create a new one. For this demonstration, choose to have MWAA create a Security Group for you.

Lastly, select an appropriately-sized Environment class for Airflow based on the scale of your needs. The mw1.small class will be sufficient for this demonstration.

Amazon MWAA Environment Creation Process

Finally, for Permissions, you must select an existing Airflow execution service role or create a new role. For this demonstration, create a new Airflow service role. We will later add additional permissions.

Amazon MWAA Environment Creation Process

Airflow Execution Role

As part of this demonstration, we will be using Airflow to run Spark jobs on EMR (EMR Steps). To allow Airflow to interact with EMR, we must increase the new Airflow execution role’s default permissions. Additional permissions include allowing the new Airflow role to assume the EMR roles using iam:PassRole. For this demonstration, we will include the two default EMR Service and JobFlow roles, EMR_DefaultRole and EMR_EC2_DefaultRole. We will also include the corresponding custom EMR roles created in the previous post, EMR_DemoRole and EMR_EC2_DemoRole. For this demonstration, the Airflow service role also requires three specific EMR permissions as shown below. Later in the post, Airflow will also read files from S3, which requires s3:GetObject permission.

Create a new policy by importing the project’s JSON file, iam_policy/airflow_emr_policy.json, and attach the new policy to the Airflow service role. Be sure to update the AWS Account ID in the file with your own Account ID.

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"elasticmapreduce:DescribeStep",
"elasticmapreduce:AddJobFlowSteps",
"elasticmapreduce:RunJobFlow"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": [
"arn:aws:iam::123412341234:role/EMR_DemoRole",
"arn:aws:iam::123412341234:role/EMR_EC2_DemoRole",
"arn:aws:iam::123412341234:role/EMR_EC2_DefaultRole",
"arn:aws:iam::123412341234:role/EMR_DefaultRole"
]
}
]
}

The Airflow service role, created by MWAA, is shown below with the new policy attached.

Airflow Execution Service Role with the new Policy Attached

Final Architecture

Below is the final high-level architecture for the post’s demonstration. The diagram shows the approximate route of a DAG Run request, in red. The diagram includes an optional S3 Gateway VPC endpoint, not detailed in the post, but recommended for additional security. According to AWS, a VPC endpoint enables you to privately connect your VPC to supported AWS services and VPC endpoint services powered by AWS PrivateLink without requiring an internet gateway. In this case a private connection between the MWAA VPC and Amazon S3. It is also possible to create an EMR Interface VPC Endpoint to securely route traffic directly to EMR from MWAA, instead of connecting over the Internet.

Demonstration’s Amazon MWAA and Amazon EMR Architecture

Amazon MWAA Environment

The new MWAA Environment will include a link to the Airflow UI.

Amazon MWAA Environment Console

Airflow UI

Using the supplied link, you should be able to access the Airflow UI using your web browser.

Apache Airflow UI

Our First DAG

The Amazon MWAA documentation includes an example DAG, which contains one of several sample programs, SparkPi, which comes with Spark. I have created a similar DAG that is included in the GitHub project, dags/emr_steps_demo.py. The DAG will create a minimally-sized single-node EMR cluster with no Core or Task nodes. The DAG will then use that cluster to submit the calculate_pi job to Spark. Once the job is complete, the DAG will terminate the EMR cluster.

import os
from datetime import timedelta
from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.utils.dates import days_ago
DAG_ID = os.path.basename(__file__).replace('.py', '')
DEFAULT_ARGS = {
'owner': 'airflow',
'depends_on_past': False,
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
}
SPARK_STEPS = [
{
'Name': 'calculate_pi',
'ActionOnFailure': 'CONTINUE',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': ['/usr/lib/spark/bin/run-example', 'SparkPi', '10'],
},
}
]
JOB_FLOW_OVERRIDES = {
'Name': 'demo-cluster-airflow',
'ReleaseLabel': 'emr-6.2.0',
'Applications': [
{
'Name': 'Spark'
},
],
'Instances': {
'InstanceGroups': [
{
'Name': 'Master nodes',
'Market': 'ON_DEMAND',
'InstanceRole': 'MASTER',
'InstanceType': 'm5.xlarge',
'InstanceCount': 1,
}
],
'KeepJobFlowAliveWhenNoSteps': False,
'TerminationProtected': False,
},
'VisibleToAllUsers': True,
'JobFlowRole': 'EMR_EC2_DefaultRole',
'ServiceRole': 'EMR_DefaultRole',
'Tags': [
{
'Key': 'Environment',
'Value': 'Development'
},
{
'Key': 'Name',
'Value': 'Airflow EMR Demo Project'
},
{
'Key': 'Owner',
'Value': 'Data Analytics Team'
}
]
}
with DAG(
dag_id=DAG_ID,
description='Run built-in Spark app on Amazon EMR',
default_args=DEFAULT_ARGS,
dagrun_timeout=timedelta(hours=2),
start_date=days_ago(1),
schedule_interval='@once',
tags=['emr'],
) as dag:
cluster_creator = EmrCreateJobFlowOperator(
task_id='create_job_flow',
job_flow_overrides=JOB_FLOW_OVERRIDES
)
step_adder = EmrAddStepsOperator(
task_id='add_steps',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
aws_conn_id='aws_default',
steps=SPARK_STEPS,
)
step_checker = EmrStepSensor(
task_id='watch_step',
job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
aws_conn_id='aws_default',
)
cluster_creator >> step_adder >> step_checker

Upload the DAG to the Airflow S3 bucket’s dags directory. Substitute your Airflow S3 bucket name in the AWS CLI command below, then run it from the project’s root.

aws s3 cp dags/spark_pi_example.py \
s3://<your_airflow_bucket_name>/dags/

The DAG, spark_pi_example, should automatically appear in the Airflow UI. Click on ‘Trigger DAG’ to create a new EMR cluster and start the Spark job.

Apache Airflow UI’s DAGs tab

The DAG has no optional configuration to input as JSON. Select ‘Trigger’ to submit the job, as shown below.

Apache Airflow UI’s Trigger DAG Page

The DAG should complete all three tasks successfully, as shown in the DAG’s ‘Graph View’ tab below.

Apache Airflow UI’s DAG Graph View

Switching to the EMR Console, you should see the single-node EMR cluster being created.

Amazon EMR Console’s Summary tab

On the ‘Steps’ tab, you should see that the ‘calculate_pi’ Spark job has been submitted and is waiting for the cluster to be ready to be run.

Amazon EMR Console’s Steps tab

Triggering DAGs Programmatically

The Amazon MWAA service is available using the AWS Management Console, as well as the Amazon MWAA API using the latest versions of the AWS SDK and AWS CLI. To automate the DAG Run, we could use the AWS CLI and invoke the Airflow CLI via an endpoint on the Apache Airflow Webserver. The Amazon MWAA documentation and Airflow’s CLI documentation explains how.

Below is an example of triggering the spark_pi_example DAG programmatically using Airflow’s trigger_dag CLI command. You will need to replace the WEB_SERVER_HOSTNAME variable with your own Airflow Web Server’s hostname. The ENVIROMENT_NAME variable assumes only one MWAA environment is returned by jq.

export WEB_SERVER_HOSTNAME="<your_airflow_web_server.us-east-1.airflow.amazonaws.com>"
export ENVIRONMENT_NAME=$(aws mwaa list-environments | jq -r '.Environments | .[]')
export DAG_NAME=spark_pi_example
aws mwaa create-cli-token –name "${ENVIRONMENT_NAME}" | \
export CLI_TOKEN=$(jq -r .CliToken)
curl –request POST "https://${WEB_SERVER_HOSTNAME}/aws_mwaa/cli" \
–header "Authorization: Bearer ${CLI_TOKEN}" \
–header "Content-Type: text/plain" \
–data-raw "trigger_dag ${DAG_NAME}"
view raw trigger_dag.sh hosted with ❤ by GitHub

Analytics Job with Airflow

Next, we will submit an actual analytics job to EMR. If you recall from the previous post, we had four different analytics PySpark applications, which performed analyses on the three Kaggle datasets. For the next DAG, we will run a Spark job that executes the bakery_sales_ssm.py PySpark application. This job should already exist in the processed data S3 bucket.

The DAG, dags/bakery_sales.py, creates an EMR cluster identical to the EMR cluster created with the run_job_flow.py Python script in the previous post. All EMR configuration options available when using AWS Step Functions are available with Airflow’s airflow.contrib.operators and airflow.contrib.sensors packages for EMR.

Airflow leverages Jinja Templating and provides the pipeline author with a set of built-in parameters and macros. The Bakery Sales DAG contains eleven Jinja template variables. Seven variables will be configured in the Airflow UI by importing a JSON file into the ‘Admin’ ⇨ ‘Variables’ tab. These template variables are prefixed with var.value in the DAG. The other three variables will be passed as a DAG Run configuration as a JSON blob, similar to the previous DAG example. These template variables are prefixed with dag_run.conf.

import os
from datetime import timedelta
from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.models import Variable
from airflow.utils.dates import days_ago
# ************** AIRFLOW VARIABLES **************
bootstrap_bucket = Variable.get('bootstrap_bucket')
emr_ec2_key_pair = Variable.get('emr_ec2_key_pair')
job_flow_role = Variable.get('job_flow_role')
logs_bucket = Variable.get('logs_bucket')
release_label = Variable.get('release_label')
service_role = Variable.get('service_role')
work_bucket = Variable.get('work_bucket')
# ***********************************************
DAG_ID = os.path.basename(__file__).replace('.py', '')
DEFAULT_ARGS = {
'owner': 'airflow',
'depends_on_past': False,
'email': ["{{ dag_run.conf['airflow_email'] }}"],
'email_on_failure': ["{{ dag_run.conf['email_on_failure'] }}"],
'email_on_retry': ["{{ dag_run.conf['email_on_retry'] }}"],
}
SPARK_STEPS = [
{
'Name': 'Bakery Sales',
'ActionOnFailure': 'CONTINUE',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
'spark-submit',
'–deploy-mode',
'cluster',
'–master',
'yarn',
'–conf',
'spark.yarn.submit.waitAppCompletion=true',
's3a://{{ var.value.work_bucket }}/analyze/bakery_sales_ssm.py'
]
}
}
]
JOB_FLOW_OVERRIDES = {
'Name': 'demo-cluster-airflow',
'ReleaseLabel': '{{ var.value.release_label }}',
'LogUri': 's3n://{{ var.value.logs_bucket }}',
'Applications': [
{
'Name': 'Spark'
},
],
'Instances': {
'InstanceFleets': [
{
'Name': 'MASTER',
'InstanceFleetType': 'MASTER',
'TargetSpotCapacity': 1,
'InstanceTypeConfigs': [
{
'InstanceType': 'm5.xlarge',
},
]
},
{
'Name': 'CORE',
'InstanceFleetType': 'CORE',
'TargetSpotCapacity': 2,
'InstanceTypeConfigs': [
{
'InstanceType': 'r5.xlarge',
},
],
},
],
'KeepJobFlowAliveWhenNoSteps': False,
'TerminationProtected': False,
'Ec2KeyName': '{{ var.value.emr_ec2_key_pair }}',
},
'BootstrapActions': [
{
'Name': 'string',
'ScriptBootstrapAction': {
'Path': 's3://{{ var.value.bootstrap_bucket }}/bootstrap_actions.sh',
}
},
],
'Configurations': [
{
'Classification': 'spark-hive-site',
'Properties': {
'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
}
}
],
'VisibleToAllUsers': True,
'JobFlowRole': '{{ var.value.job_flow_role }}',
'ServiceRole': '{{ var.value.service_role }}',
'EbsRootVolumeSize': 32,
'StepConcurrencyLevel': 1,
'Tags': [
{
'Key': 'Environment',
'Value': 'Development'
},
{
'Key': 'Name',
'Value': 'Airflow EMR Demo Project'
},
{
'Key': 'Owner',
'Value': 'Data Analytics Team'
}
]
}
with DAG(
dag_id=DAG_ID,
description='Analyze Bakery Sales with Amazon EMR',
default_args=DEFAULT_ARGS,
dagrun_timeout=timedelta(hours=2),
start_date=days_ago(1),
schedule_interval='@once',
tags=['emr'],
) as dag:
cluster_creator = EmrCreateJobFlowOperator(
task_id='create_job_flow',
job_flow_overrides=JOB_FLOW_OVERRIDES
)
step_adder = EmrAddStepsOperator(
task_id='add_steps',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
aws_conn_id='aws_default',
steps=SPARK_STEPS,
)
step_checker = EmrStepSensor(
task_id='watch_step',
job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
aws_conn_id='aws_default',
)
cluster_creator >> step_adder >> step_checker
view raw bakery_sales.py hosted with ❤ by GitHub

Import Variables into Airflow UI

First, to import the required variables, change the values in the project’s airflow_variables/admin_variables_bakery.json file. You will need to update the values for bootstrap_bucket, emr_ec2_key_pair, logs_bucket, and work_bucket. The three S3 buckets should all exist from the previous post.

{
"bootstrap_bucket": "emr-demo-bootstrap-123412341234-us-east-1",
"emr_ec2_key_pair": "emr-demo-123412341234-us-east-1",
"job_flow_role": "EMR_EC2_DemoRole",
"logs_bucket": "emr-demo-logs-123412341234-us-east-1",
"release_label": "emr-6.2.0",
"service_role": "EMR_DemoRole",
"work_bucket": "emr-demo-work-123412341234-us-east-1",
"ec2_subnet_id": "subnet-012abc456efg78900"
}

Next, import the variables file from the ‘Admin’ ⇨ ‘Variables’ tab of the Airflow UI.

Apache Airflow UI’s Admin > Variables tab

Upload the DAG, dags/bakery_sales.py, to the Airflow S3 bucket, similar to the first DAG.

aws s3 cp dags/bakery_sales.py \
s3://<your_airflow_bucket_name>/dags/

The second DAG, bakery_sales, should automatically appear in the Airflow UI. Click on ‘Trigger DAG’ to create a new EMR cluster and start the Spark job.

Apache Airflow UI’s DAGs tab

Input the three required parameters in the ‘Trigger DAG’ interface, used to pass the DAG Run configuration, and select ‘Trigger’. A sample of the JSON blob can be found in the project, airflow_variables/dag_run.conf_bakery.json.

{
    "airflow_email": "analytics_team@example.com",
    "email_on_failure": false,
    "email_on_retry": false
}

This is just for demonstration purposes. To send and receive emails, you will need to configure Airflow.

Apache Airflow UI’s Trigger DAG Screen

Switching to the EMR Console, you should see the ‘Bakery Sales’ Spark job in the ‘Steps’ tab.

Amazon EMR Console’s Steps tab

Multi-Step DAG

In our last example, we will use a single DAG to run four Spark jobs in parallel. The Spark job arguments (EmrAddStepsOperator steps parameter) will be loaded from an external JSON file residing in Amazon S3, instead of defined in the DAG, as in the previous two DAG examples. Additionally, the EMR cluster specifications (EmrCreateJobFlowOperator job_flow_overrides parameter) will also be loaded from an external JSON file. Using this method, we decouple the EMR provisioning and job details from the DAG. DataOps or DevOps Engineers might manage the EMR cluster specifications as code, while Data Analysts manage the Spark job arguments, separately. A third team might manage the DAG itself.

We still maintain the variables in the JSON files. The DAG will read the JSON file-based configuration into the tasks as JSON blobs, then replace the Jinja template variables (expressions) in the DAG with variable values defined in Airflow or input as parameters when the DAG is triggered.

Below we see a snippet of two of the four Spark submit-job job definitions (steps), which have been moved to a separate JSON file, emr_steps/emr_steps.json.

[
{
"Name": "Movie Ratings",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"–deploy-mode",
"cluster",
"–master",
"yarn",
"–conf",
"spark.yarn.submit.waitAppCompletion=true",
"s3a://{{ var.value.work_bucket }}/analyze/movies_avg_ratings_ssm.py",
"–start-date",
"2016-01-01 00:00:00",
"–end-date",
"2016-12-31 23:59:59"
]
}
},
{
"Name": "Stock Volatility",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"–deploy-mode",
"cluster",
"–master",
"yarn",
"–conf",
"spark.yarn.submit.waitAppCompletion=true",
"s3a://{{ var.value.work_bucket }}/analyze/stock_volatility_ssm.py",
"–start-date",
"2017-01-01",
"–end-date",
"2018-12-31"
]
}
}
]
view raw emr_steps.json hosted with ❤ by GitHub

Below are the EMR cluster specifications (job_flow_overrides), which have been moved to a separate JSON file, job_flow_overrides/job_flow_overrides.json.

{
"Name": "demo-cluster-airflow",
"ReleaseLabel": "{{ var.value.release_label }}",
"LogUri": "s3n://{{ var.value.logs_bucket }}",
"Applications": [
{
"Name": "Spark"
}
],
"Instances": {
"InstanceFleets": [
{
"Name": "MASTER",
"InstanceFleetType": "MASTER",
"TargetSpotCapacity": 1,
"InstanceTypeConfigs": [
{
"InstanceType": "m5.xlarge"
}
]
},
{
"Name": "CORE",
"InstanceFleetType": "CORE",
"TargetSpotCapacity": 2,
"InstanceTypeConfigs": [
{
"InstanceType": "r5.2xlarge"
}
]
}
],
"Ec2SubnetId": "{{ var.value.ec2_subnet_id }}",
"KeepJobFlowAliveWhenNoSteps": false,
"TerminationProtected": false,
"Ec2KeyName": "{{ var.value.emr_ec2_key_pair }}"
},
"BootstrapActions": [
{
"Name": "string",
"ScriptBootstrapAction": {
"Path": "s3://{{ var.value.bootstrap_bucket }}/bootstrap_actions.sh"
}
}
],
"Configurations": [
{
"Classification": "spark-hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
],
"VisibleToAllUsers": true,
"JobFlowRole": "{{ var.value.job_flow_role }}",
"ServiceRole": "{{ var.value.service_role }}",
"EbsRootVolumeSize": 32,
"StepConcurrencyLevel": 5,
"Tags": [
{
"Key": "Environment",
"Value": "Development"
},
{
"Key": "Name",
"Value": "Airflow EMR Demo Project"
},
{
"Key": "Owner",
"Value": "Data Analytics Team"
}
]
}

Decoupling the configurations reduces the DAG from well over 200 lines of code to less than 75 lines. Note lines 56 and 63 of the DAG below. Instead of referencing a local object variable, the parameters now reference the function, get_objects(key, bucket_name), which loads the JSON.

import json
import os
from datetime import timedelta
from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.hooks.S3_hook import S3Hook
from airflow.models import Variable
from airflow.utils.dates import days_ago
# ************** AIRFLOW VARIABLES **************
bootstrap_bucket = Variable.get('bootstrap_bucket')
emr_ec2_key_pair = Variable.get('emr_ec2_key_pair')
job_flow_role = Variable.get('job_flow_role')
logs_bucket = Variable.get('logs_bucket')
release_label = Variable.get('release_label')
service_role = Variable.get('service_role')
work_bucket = Variable.get('work_bucket')
# ***********************************************
DAG_ID = os.path.basename(__file__).replace('.py', '')
DEFAULT_ARGS = {
'owner': 'airflow',
'depends_on_past': False,
'email': ["{{ dag_run.conf['airflow_email'] }}"],
'email_on_failure': ["{{ dag_run.conf['email_on_failure'] }}"],
'email_on_retry': ["{{ dag_run.conf['email_on_retry'] }}"],
}
def get_object(key, bucket_name):
"""
Load S3 object as JSON
"""
hook = S3Hook()
content_object = hook.get_key(key=key, bucket_name=bucket_name)
file_content = content_object.get()['Body'].read().decode('utf-8')
return json.loads(file_content)
with DAG(
dag_id=DAG_ID,
description='Run multiple Spark jobs with Amazon EMR',
default_args=DEFAULT_ARGS,
dagrun_timeout=timedelta(hours=2),
start_date=days_ago(1),
schedule_interval=None,
tags=['emr', 'spark', 'pyspark']
) as dag:
cluster_creator = EmrCreateJobFlowOperator(
task_id='create_job_flow',
job_flow_overrides=get_object('job_flow_overrides/job_flow_overrides.json', work_bucket)
)
step_adder = EmrAddStepsOperator(
task_id='add_steps',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
aws_conn_id='aws_default',
steps=get_object('emr_steps/emr_steps.json', work_bucket)
)
step_checker = EmrStepSensor(
task_id='watch_step',
job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
aws_conn_id='aws_default'
)
cluster_creator >> step_adder >> step_checker

This time, we need to upload three files to S3, the DAG to the Airflow S3 bucket, and the two JSON files to the EMR Work S3 bucket. Change the bucket names to match your environment, then run the three AWS CLI commands shown below.

aws s3 cp emr_steps/emr_steps.json \
    s3://emr-demo-work-123412341234-us-east-1/emr_steps/
aws s3 cp job_flow_overrides/job_flow_overrides.json \
    s3://emr-demo-work-123412341234-us-east-1/job_flow_overrides/
aws s3 cp dags/multiple_steps.py \
s3://airflow-123412341234-us-east-1/dags/

The second DAG, multiple_steps, should automatically appear in the Airflow UI. Click on ‘Trigger DAG’ to create a new EMR cluster and start the Spark job. The three required input parameters in the ‘Trigger DAG’ interface are identical to the previous bakery_sales DAG. A sample of that JSON blob can be found in the project at airflow_variables/dag_run.conf_bakery.json.

Apache Airflow UI’s DAGs tab

Below we see that the EMR cluster has completed the four Spark jobs (EMR Steps) and has auto-terminated. Note that all four jobs were started at the exact same time. If you recall from the previous post, this is possible because we preset the ‘Concurrency’ level to 5.

Amazon EMR Console’s Steps tab showing four Steps running in parallel

Triggering DAGs Programmatically

AWS CLI

Similar to the previous example, below we can trigger the multiple_steps DAG programmatically using Airflow’s trigger_dag CLI command. Note the addition of the —-conf named argument, which passes the configuration, containing three key/value pairs, to the trigger command as a JSON blob.

ENVIRONMENT_NAME=$(aws mwaa list-environments | jq -r '.Environments | .[]')
DAG_NAME="multiple_steps"
CONFIG="""'{
\"airflow_email\": \"analytics_team@example.com\",
\"email_on_failure\": true,
\"email_on_retry\": false
}'"""
CLI_JSON=$(aws mwaa create-cli-token –name ${ENVIRONMENT_NAME}) \
&& CLI_TOKEN=$(echo $CLI_JSON | jq -r '.CliToken') \
&& WEB_SERVER_HOSTNAME=$(echo $CLI_JSON | jq -r '.WebServerHostname') \
curl –request POST "https://${WEB_SERVER_HOSTNAME}/aws_mwaa/cli" \
–header "Authorization: Bearer ${CLI_TOKEN}" \
–header "Content-Type: text/plain" \
–data-raw "trigger_dag ${DAG_NAME} –conf ${CONFIG}"

AWS SDK

Airflow DAGs can also be triggered using the AWS SDK. For example, with boto3 for Python, we could use a script, similar to the following to remotely trigger a DAG.

#!/usr/bin/env python3
# MWAA: Trigger an Apache Airflow DAG using SDK
# Author: Gary A. Stafford (February 2021)
import logging
import boto3
import requests
logging.basicConfig(
format='[%(asctime)s] %(levelname)s – %(message)s', level=logging.INFO)
mwaa_client = boto3.client('mwaa')
ENVIRONMENT_NAME = 'Your_Airflow_Environment_Name'
DAG_NAME = 'your_dag_name'
CONFIG = '{"foo": "bar"}'
def main():
response = mwaa_client.create_cli_token(
Name=ENVIRONMENT_NAME
)
logging.info('response: ' + str(response))
token = response['CliToken']
url = 'https://{0}/aws_mwaa/cli'.format(response['WebServerHostname'])
headers = {'Authorization': 'Bearer ' + token, 'Content-Type': 'text/plain'}
payload = 'trigger_dag {0} –conf {1}'.format(DAG_NAME, CONFIG)
response = requests.post(url, headers=headers, data=payload)
logging.info('response: ' + str(response)) # should be <Response [200]>
if __name__ == '__main__':
main()
view raw trigger_dag.py hosted with ❤ by GitHub

Cleaning Up

Once you are done with the MWAA Environment, be sure to delete it as soon as possible to save additional costs. Also, delete the MWAA-VPC CloudFormation stack. These resources, like the two NAT Gateways, will also continue to generate additional costs.

aws mwaa delete-environment --name <your_mwaa_environment_name>
aws cloudformation delete-stack --stack-name MWAA-VPC

Conclusion

In this second post in the series, we explored using the newly released Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run PySpark applications on Amazon Elastic MapReduce (Amazon EMR). In future posts, we will explore the use of Jupyter and Zeppelin notebooks for data science, scientific computing, and machine learning on EMR.

If you are interested in learning more about configuring Amazon MWAA and Airflow, see my recent post, Amazon Managed Workflows for Apache Airflow — Configuration: Understanding Amazon MWAA’s Configuration Options.


This blog represents my own viewpoints and not of my employer, Amazon Web Services. All product names, logos, and brands are the property of their respective owners.

, , , , , , ,

3 Comments

Provision and Deploy a Consul Cluster on AWS, using Terraform, Docker, and Jenkins

Cover2

Introduction

Modern DevOps tools, such as HashiCorp’s Packer and Terraform, make it easier to provision and manage complex cloud architecture. Utilizing a CI/CD server, such as Jenkins, to securely automate the use of these DevOps tools, ensures quick and consistent results.

In a recent post, Distributed Service Configuration with Consul, Spring Cloud, and Docker, we built a Consul cluster using Docker swarm mode, to host distributed configurations for a Spring Boot application. The cluster was built locally with VirtualBox. This architecture is fine for development and testing, but not for use in Production.

In this post, we will deploy a highly available three-node Consul cluster to AWS. We will use Terraform to provision a set of EC2 instances and accompanying infrastructure. The instances will be built from a hybrid AMIs containing the new Docker Community Edition (CE). In a recent post, Baking AWS AMI with new Docker CE Using Packer, we provisioned an Ubuntu AMI with Docker CE, using Packer. We will deploy Docker containers to each EC2 host, containing an instance of Consul server.

All source code can be found on GitHub.

Jenkins

I have chosen Jenkins to automate all of the post’s build, provisioning, and deployment tasks. However, none of the code is written specifically to Jenkins; you may run all of it from the command line.

For this post, I have built four projects in Jenkins, as follows:

  1. Provision Docker CE AMI: Builds Ubuntu AMI with Docker CE, using Packer
  2. Provision Consul Infra AWS: Provisions Consul infrastructure on AWS, using Terraform
  3. Deploy Consul Cluster AWS: Deploys Consul to AWS, using Docker
  4. Destroy Consul Infra AWS: Destroys Consul infrastructure on AWS, using Terraform

Jenkins UI

We will primarily be using the ‘Provision Consul Infra AWS’, ‘Deploy Consul Cluster AWS’, and ‘Destroy Consul Infra AWS’ Jenkins projects in this post. The fourth Jenkins project, ‘Provision Docker CE AMI’, automates the steps found in the recent post, Baking AWS AMI with new Docker CE Using Packer, to build the AMI used to provision the EC2 instances in this post.

Consul AWS Diagram 2

Terraform

Using Terraform, we will provision EC2 instances in three different Availability Zones within the US East 1 (N. Virginia) Region. Using Terraform’s Amazon Web Services (AWS) provider, we will create the following AWS resources:

  • (1) Virtual Private Cloud (VPC)
  • (1) Internet Gateway
  • (1) Key Pair
  • (3) Elastic Cloud Compute (EC2) Instances
  • (2) Security Groups
  • (3) Subnets
  • (1) Route
  • (3) Route Tables
  • (3) Route Table Associations

The final AWS architecture should resemble the following:

Consul AWS Diagram

Production Ready AWS

Although we have provisioned a fairly complete VPC for this post, it is far from being ready for Production. I have created two security groups, limiting the ingress and egress to the cluster. However, to further productionize the environment would require additional security hardening. At a minimum, you should consider adding public/private subnets, NAT gateways, network access control list rules (network ACLs), and the use of HTTPS for secure communications.

In production, applications would communicate with Consul through local Consul clients. Consul clients would take part in the LAN gossip pool from different subnets, Availability Zones, Regions, or VPCs using VPC peering. Communications would be tightly controlled by IAM, VPC, subnet, IP address, and port.

Also, you would not have direct access to the Consul UI through a publicly exposed IP or DNS address. Access to the UI would be removed altogether or locked down to specific IP addresses, and accessed restricted to secure communication channels.

Consul

We will achieve high availability (HA) by clustering three Consul server nodes across the three Elastic Cloud Compute (EC2) instances. In this minimally sized, three-node cluster of Consul servers, we are protected from the loss of one Consul server node, one EC2 instance, or one Availability Zone(AZ). The cluster will still maintain a quorum of two nodes. An additional level of HA that Consul supports, multiple datacenters (multiple AWS Regions), is not demonstrated in this post.

Docker

Having Docker CE already installed on each EC2 instance allows us to execute remote Docker commands over SSH from Jenkins. These commands will deploy and configure a Consul server node, within a Docker container, on each EC2 instance. The containers are built from HashiCorp’s latest Consul Docker image pulled from Docker Hub.

Getting Started

Preliminary Steps

If you have built infrastructure on AWS with Terraform, these steps should be familiar to you:

  1. First, you will need an AMI with Docker. I suggest reading Baking AWS AMI with new Docker CE Using Packer.
  2. You will need an AWS IAM User with the proper access to create the required infrastructure. For this post, I created a separate Jenkins IAM User with PowerUser level access.
  3. You will need to have an RSA public-private key pair, which can be used to SSH into the EC2 instances and install Consul.
  4. Ensure you have your AWS credentials set. I usually source mine from a .env file, as environment variables. Jenkins can securely manage credentials, using secret text or files.
  5. Fork and/or clone the Consul cluster project from  GitHub.
  6. Change the aws_key_name and public_key_path variable values to your own RSA key, in the variables.tf file
  7. Change the aws_amis_base variable values to your own AMI ID (see step 1)
  8. If you are do not want to use the US East 1 Region and its AZs, modify the variables.tf, network.tf, and instances.tf files.
  9. Disable Terraform’s remote state or modify the resource to match your remote state configuration, in the main.tf file. I am using an Amazon S3 bucket to store my Terraform remote state.

Building an AMI with Docker

If you have not built an Amazon Machine Image (AMI) for use in this post already, you can do so using the scripts provided in the previous post’s GitHub repository. To automate the AMI build task, I built the ‘Provision Docker CE AMI’ Jenkins project. Identical to the other three Jenkins projects in this post, this project has three main tasks, which include: 1) SCM: clone the Packer AMI GitHub project, 2) Bindings: set up the AWS credentials, and 3) Build: run Packer.

The SCM and Bindings tasks are identical to the other projects (see below for details), except for the use of a different GitHub repository. The project’s Build step, which runs the packer_build_ami.sh script looks as follows:

jenkins_13

The resulting AMI ID will need to be manually placed in Terraform’s variables.tf file, before provisioning the AWS infrastructure with Terraform. The new AMI ID will be displayed in Jenkin’s build output.

jenkins_14

Provisioning with Terraform

Based on the modifications you made in the Preliminary Steps, execute the terraform validate command to confirm your changes. Then, run the terraform plan command to review the plan. Assuming are were no errors, finally, run the terraform apply command to provision the AWS infrastructure components.

In Jenkins, I have created the ‘Provision Consul Infra AWS’ project. This project has three tasks, which include: 1) SCM: clone the GitHub project, 2) Bindings: set up the AWS credentials, and 3) Build: run Terraform. Those tasks look as follows:

Jenkins_08.png

You will obviously need to use your modified GitHub project, incorporating the configuration changes detailed above, as the SCM source for Jenkins.

Jenkins Credentials

You will also need to configure your AWS credentials.

Jenkins_03.png

The provision_infra.sh script provisions the AWS infrastructure using Terraform. The script also updates Terraform’s remote state. Remember to update the remote state configuration in the script to match your personal settings.


cd tf_env_aws/
terraform remote config \
-backend=s3 \
-backend-config="bucket=your_bucket" \
-backend-config="key=terraform_consul.tfstate" \
-backend-config="region=your_region"
terraform plan
terraform apply

The Jenkins build output should look similar to the following:

jenkins_12.png

Although the build only takes about 90 seconds to complete, the EC2 instances could take a few extra minutes to complete their Status Checks and be completely ready. The final results in the AWS EC2 Management Console should look as follows:

EC2 Management Console

Note each EC2 instance is running in a different US East 1 Availability Zone.

Installing Consul

Once the AWS infrastructure is running and the EC2 instances have completed their Status Checks successfully, we are ready to deploy Consul. In Jenkins, I have created the ‘Deploy Consul Cluster AWS’ project. This project has three tasks, which include: 1) SCM: clone the GitHub project, 2) Bindings: set up the AWS credentials, and 3) Build: run an SSH remote Docker command on each EC2 instance to deploy Consul. The SCM and Bindings tasks are identical to the project above. The project’s Build step looks as follows:

Jenkins_04.png

First, the delete_containers.sh script deletes any previous instances of Consul containers. This is helpful if you need to re-deploy Consul. Next, the deploy_consul.sh script executes a series of SSH remote Docker commands to install and configure Consul on each EC2 instance.


# Advertised Consul IP
export ec2_server1_private_ip=$(aws ec2 describe-instances \
–filters Name='tag:Name,Values=tf-instance-consul-server-1' \
–output text –query 'Reservations[*].Instances[*].PrivateIpAddress')
echo "consul-server-1 private ip: ${ec2_server1_private_ip}"

view raw

consul_01.sh

hosted with ❤ by GitHub


# Deploy Consul Server 1
ec2_public_ip=$(aws ec2 describe-instances \
–filters Name='tag:Name,Values=tf-instance-consul-server-1' \
–output text –query 'Reservations[*].Instances[*].PublicIpAddress')
consul_server="consul-server-1"
ssh -oStrictHostKeyChecking=no -T \
-i ~/.ssh/consul_aws_rsa \
ubuntu@${ec2_public_ip} << EOSSH
docker run -d \
–net=host \
–hostname ${consul_server} \
–name ${consul_server} \
–env "SERVICE_IGNORE=true" \
–env "CONSUL_CLIENT_INTERFACE=eth0" \
–env "CONSUL_BIND_INTERFACE=eth0" \
–volume /home/ubuntu/consul/data:/consul/data \
–publish 8500:8500 \
consul:latest \
consul agent -server -ui -client=0.0.0.0 \
-bootstrap-expect=3 \
-advertise='{{ GetInterfaceIP "eth0" }}' \
-data-dir="/consul/data"
sleep 5
docker logs consul-server-1
docker exec -i consul-server-1 consul members
EOSSH

view raw

consul_02.sh

hosted with ❤ by GitHub


# Deploy Consul Server 2
ec2_public_ip=$(aws ec2 describe-instances \
–filters Name='tag:Name,Values=tf-instance-consul-server-2' \
–output text –query 'Reservations[*].Instances[*].PublicIpAddress')
consul_server="consul-server-2"
ssh -oStrictHostKeyChecking=no -T \
-i ~/.ssh/consul_aws_rsa \
ubuntu@${ec2_public_ip} << EOSSH
docker run -d \
–net=host \
–hostname ${consul_server} \
–name ${consul_server} \
–env "SERVICE_IGNORE=true" \
–env "CONSUL_CLIENT_INTERFACE=eth0" \
–env "CONSUL_BIND_INTERFACE=eth0" \
–volume /home/ubuntu/consul/data:/consul/data \
–publish 8500:8500 \
consul:latest \
consul agent -server -ui -client=0.0.0.0 \
-advertise='{{ GetInterfaceIP "eth0" }}' \
-retry-join="${ec2_server1_private_ip}" \
-data-dir="/consul/data"
sleep 5
docker logs consul-server-2
docker exec -i consul-server-2 consul members
EOSSH

view raw

consul_03.sh

hosted with ❤ by GitHub


# Deploy Consul Server 3
ec2_public_ip=$(aws ec2 describe-instances \
–filters Name='tag:Name,Values=tf-instance-consul-server-3' \
–output text –query 'Reservations[*].Instances[*].PublicIpAddress')
consul_server="consul-server-3"
ssh -oStrictHostKeyChecking=no -T \
-i ~/.ssh/consul_aws_rsa \
ubuntu@${ec2_public_ip} << EOSSH
docker run -d \
–net=host \
–hostname ${consul_server} \
–name ${consul_server} \
–env "SERVICE_IGNORE=true" \
–env "CONSUL_CLIENT_INTERFACE=eth0" \
–env "CONSUL_BIND_INTERFACE=eth0" \
–volume /home/ubuntu/consul/data:/consul/data \
–publish 8500:8500 \
consul:latest \
consul agent -server -ui -client=0.0.0.0 \
-advertise='{{ GetInterfaceIP "eth0" }}' \
-retry-join="${ec2_server1_private_ip}" \
-data-dir="/consul/data"
sleep 5
docker logs consul-server-3
docker exec -i consul-server-3 consul members
EOSSH

view raw

consul_04.sh

hosted with ❤ by GitHub


# Output Consul Web UI URL
ec2_public_ip=$(aws ec2 describe-instances \
–filters Name='tag:Name,Values=tf-instance-consul-server-1' \
–output text –query 'Reservations[*].Instances[*].PublicIpAddress')
echo " "
echo "*** Consul UI: http://${ec2_public_ip}:8500/ui/ ***"

view raw

consul_05.sh

hosted with ❤ by GitHub

The entire Jenkins build process only takes about 30 seconds. Afterward, the output from a successful Jenkins build should show that all three Consul server instances are running, have formed a quorum, and have elected a Leader.

Jenkins_05.png

Persisting State

The Consul Docker image exposes VOLUME /consul/data, which is a path were Consul will place its persisted state. Using Terraform’s remote-exec provisioner, we create a directory on each EC2 instance, at /home/ubuntu/consul/config. The docker run command bind-mounts the container’s /consul/data path to the EC2 host’s /home/ubuntu/consul/config directory.

According to Consul, the Consul server container instance will ‘store the client information plus snapshots and data related to the consensus algorithm and other state, like Consul’s key/value store and catalog’ in the /consul/data directory. That container directory is now bind-mounted to the EC2 host, as demonstrated below.

jenkins_15

Accessing Consul

Following a successful deployment, you should be able to use the public URL, displayed in the build output of the ‘Deploy Consul Cluster AWS’ project, to access the Consul UI. Clicking on the Nodes tab in the UI, you should see all three Consul server instances, one per EC2 instance, running and healthy.

Consul UI

Destroying Infrastructure

When you are finished with the post, you may want to remove the running infrastructure, so you don’t continue to get billed by Amazon. The ‘Destroy Consul Infra AWS’ project destroys all the AWS infrastructure, provisioned as part of this post, in about 60 seconds. The project’s SCM and Bindings tasks are identical to the both previous projects. The Build step calls the destroy_infra.sh script, which is included in the GitHub project. The script executes the terraform destroy -force command. It will delete all running infrastructure components associated with the post and update Terraform’s remote state.

Jenkins_09

Conclusion

This post has demonstrated how modern DevOps tooling, such as HashiCorp’s Packer and Terraform, make it easy to build, provision and manage complex cloud architecture. Using a CI/CD server, such as Jenkins, to securely automate the use of these tools, ensures quick and consistent results.

All opinions in this post are my own and not necessarily the views of my current employer or their clients.

, , , , , , , , , ,

5 Comments

Baking AWS AMI with new Docker CE Using Packer

AWS for Docker

Introduction

On March 2 (less than a week ago as of this post), Docker announced the release of Docker Enterprise Edition (EE), a new version of the Docker platform optimized for business-critical deployments. As part of the release, Docker also renamed the free Docker products to Docker Community Edition (CE). Both products are adopting a new time-based versioning scheme for both Docker EE and CE. The initial release of Docker CE and EE, the 17.03 release, is the first to use the new scheme.

Along with the release, Docker delivered excellent documentation on installing, configuring, and troubleshooting the new Docker EE and CE. In this post, I will demonstrate how to partially bake an existing Amazon Machine Image (Amazon AMI) with the new Docker CE, preparing it as a base for the creation of Amazon Elastic Compute Cloud (Amazon EC2) compute instances.

Adding Docker and similar tooling to an AMI is referred to as partially baking an AMI, often referred to as a hybrid AMI. According to AWS, ‘hybrid AMIs provide a subset of the software needed to produce a fully functional instance, falling in between the fully baked and JeOS (just enough operating system) options on the AMI design spectrum.

Installing Docker CE on an AWS AMI should not be confused with Docker’s also recently announced Docker Community Edition (CE) for AWS. Docker for AWS offers multiple CloudFormation templates for Docker EE and CE. According to Docker, Docker for AWS ‘provides a Docker-native solution that avoids operational complexity and adding unneeded additional APIs to the Docker stack.

Base AMI

Docker provides detailed directions for installing Docker CE and EE onto several major Linux distributions. For this post, we will choose a widely used Linux distro, Ubuntu. According to Docker, currently Docker CE and EE can be installed on three popular Ubuntu releases:

  • Yakkety 16.10
  • Xenial 16.04 (LTS)
  • Trusty 14.04 (LTS)

To provision a small EC2 instance in Amazon’s US East (N. Virginia) Region, I will choose Ubuntu 16.04.2 LTS Xenial Xerus . According to Canonical’s Amazon EC2 AMI Locator website, a Xenial 16.04 LTS AMI is available, ami-09b3691f, for US East 1, as a t2.micro EC2 instance type.

Packer

HashiCorp Packer will be used to partially bake the base Ubuntu Xenial 16.04 AMI with Docker CE 17.03. HashiCorp describes Packer as ‘a tool for creating machine and container images for multiple platforms from a single source configuration.’ The JSON-format Packer file is as follows:

{
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "us_east_1_ami": "ami-09b3691f",
    "name": "aws-docker-ce-base",
    "us_east_1_name": "ubuntu-xenial-docker-ce-base",
    "ssh_username": "ubuntu"
  },
  "builders": [
    {
      "name": "{{user `us_east_1_name`}}",
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "us-east-1",
      "vpc_id": "",
      "subnet_id": "",
      "source_ami": "{{user `us_east_1_ami`}}",
      "instance_type": "t2.micro",
      "ssh_username": "{{user `ssh_username`}}",
      "ssh_timeout": "10m",
      "ami_name": "{{user `us_east_1_name`}} {{timestamp}}",
      "ami_description": "{{user `us_east_1_name`}} AMI",
      "run_tags": {
        "ami-create": "{{user `us_east_1_name`}}"
      },
      "tags": {
        "ami": "{{user `us_east_1_name`}}"
      },
      "ssh_private_ip": false,
      "associate_public_ip_address": true
    }
  ],
  "provisioners": [
    {
      "type": "file",
      "source": "bootstrap_docker_ce.sh",
      "destination": "/tmp/bootstrap_docker_ce.sh"
    },
    {
          "type": "file",
          "source": "cleanup.sh",
          "destination": "/tmp/cleanup.sh"
    },
    {
      "type": "shell",
      "execute_command": "echo 'packer' | sudo -S sh -c '{{ .Vars }} {{ .Path }}'",
      "inline": [
        "whoami",
        "cd /tmp",
        "chmod +x bootstrap_docker_ce.sh",
        "chmod +x cleanup.sh",
        "ls -alh /tmp",
        "./bootstrap_docker_ce.sh",
        "sleep 10",
        "./cleanup.sh"
      ]
    }
  ]
}

The Packer file uses Packer’s amazon-ebs builder type. This builder is used to create Amazon AMIs backed by Amazon Elastic Block Store (EBS) volumes, for use in EC2.

Bootstrap Script

To install Docker CE on the AMI, the Packer file executes a bootstrap shell script. The bootstrap script and subsequent cleanup script are executed using  Packer’s remote shell provisioner. The bootstrap is like the following:

#!/bin/sh

sudo apt-get remove docker docker-engine

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y docker-ce

sudo groupadd docker
sudo usermod -aG docker ubuntu

sudo systemctl enable docker

This script closely follows directions provided by Docker, for installing Docker CE on Ubuntu. After removing any previous copies of Docker, the script installs Docker CE. To ensure sudo is not required to execute Docker commands on any EC2 instance provisioned from resulting AMI, the script adds the ubuntu user to the docker group.

The bootstrap script also uses systemd to start the Docker daemon. Starting with Ubuntu 15.04, Systemd System and Service Manager is used by default instead of the previous init system, Upstart. Systemd ensures Docker will start on boot.

Cleaning Up

It is best good practice to clean up your activities after baking an AMI. I have included a basic clean up script. The cleanup script is as follows:

#!/bin/sh

set -e

echo 'Cleaning up after bootstrapping...'
sudo apt-get -y autoremove
sudo apt-get -y clean
sudo rm -rf /tmp/*
cat /dev/null > ~/.bash_history
history -c
exit

Partially Baking

Before running Packer to build the Docker CE AMI, I set both my AWS access key and AWS secret access key. The Packer file expects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Running the packer build ubuntu_docker_ce_ami.json command builds the AMI. The abridged output should look similar to the following:

$ packer build docker_ami.json
ubuntu-xenial-docker-ce-base output will be in this color.

==> ubuntu-xenial-docker-ce-base: Prevalidating AMI Name...
    ubuntu-xenial-docker-ce-base: Found Image ID: ami-09b3691f
==> ubuntu-xenial-docker-ce-base: Creating temporary keypair: packer_58bc7a49-9e66-7f76-ce8e-391a67d94987
==> ubuntu-xenial-docker-ce-base: Creating temporary security group for this instance...
==> ubuntu-xenial-docker-ce-base: Authorizing access to port 22 the temporary security group...
==> ubuntu-xenial-docker-ce-base: Launching a source AWS instance...
    ubuntu-xenial-docker-ce-base: Instance ID: i-0ca883ecba0c28baf
==> ubuntu-xenial-docker-ce-base: Waiting for instance (i-0ca883ecba0c28baf) to become ready...
==> ubuntu-xenial-docker-ce-base: Adding tags to source instance
==> ubuntu-xenial-docker-ce-base: Waiting for SSH to become available...
==> ubuntu-xenial-docker-ce-base: Connected to SSH!
==> ubuntu-xenial-docker-ce-base: Uploading bootstrap_docker_ce.sh => /tmp/bootstrap_docker_ce.sh
==> ubuntu-xenial-docker-ce-base: Uploading cleanup.sh => /tmp/cleanup.sh
==> ubuntu-xenial-docker-ce-base: Provisioning with shell script: /var/folders/kf/637b0qns7xb0wh9p8c4q0r_40000gn/T/packer-shell189662158
    ...
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: E: Unable to locate package docker-engine
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: ca-certificates is already the newest version (20160104ubuntu1).
    ubuntu-xenial-docker-ce-base: apt-transport-https is already the newest version (1.2.19).
    ubuntu-xenial-docker-ce-base: curl is already the newest version (7.47.0-1ubuntu2.2).
    ubuntu-xenial-docker-ce-base: software-properties-common is already the newest version (0.96.20.5).
    ubuntu-xenial-docker-ce-base: 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
    ubuntu-xenial-docker-ce-base: OK
    ubuntu-xenial-docker-ce-base: pub   4096R/0EBFCD88 2017-02-22
    ubuntu-xenial-docker-ce-base: Key fingerprint = 9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
    ubuntu-xenial-docker-ce-base: uid                  Docker Release (CE deb) 
    ubuntu-xenial-docker-ce-base: sub   4096R/F273FCD8 2017-02-22
    ubuntu-xenial-docker-ce-base:
    ubuntu-xenial-docker-ce-base: Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial InRelease
    ubuntu-xenial-docker-ce-base: Get:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial-updates InRelease [102 kB]
    ...
    ubuntu-xenial-docker-ce-base: Get:27 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [89.5 kB]
    ubuntu-xenial-docker-ce-base: Fetched 10.6 MB in 2s (4,065 kB/s)
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: Calculating upgrade...
    ubuntu-xenial-docker-ce-base: 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: The following additional packages will be installed:
    ubuntu-xenial-docker-ce-base: aufs-tools cgroupfs-mount libltdl7
    ubuntu-xenial-docker-ce-base: Suggested packages:
    ubuntu-xenial-docker-ce-base: mountall
    ubuntu-xenial-docker-ce-base: The following NEW packages will be installed:
    ubuntu-xenial-docker-ce-base: aufs-tools cgroupfs-mount docker-ce libltdl7
    ubuntu-xenial-docker-ce-base: 0 upgraded, 4 newly installed, 0 to remove and 0 not upgraded.
    ubuntu-xenial-docker-ce-base: Need to get 19.4 MB of archives.
    ubuntu-xenial-docker-ce-base: After this operation, 89.4 MB of additional disk space will be used.
    ubuntu-xenial-docker-ce-base: Get:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial/universe amd64 aufs-tools amd64 1:3.2+20130722-1.1ubuntu1 [92.9 kB]
    ...
    ubuntu-xenial-docker-ce-base: Get:4 https://download.docker.com/linux/ubuntu xenial/stable amd64 docker-ce amd64 17.03.0~ce-0~ubuntu-xenial [19.3 MB]
    ubuntu-xenial-docker-ce-base: debconf: unable to initialize frontend: Dialog
    ubuntu-xenial-docker-ce-base: debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
    ubuntu-xenial-docker-ce-base: debconf: falling back to frontend: Readline
    ubuntu-xenial-docker-ce-base: debconf: unable to initialize frontend: Readline
    ubuntu-xenial-docker-ce-base: debconf: (This frontend requires a controlling tty.)
    ubuntu-xenial-docker-ce-base: debconf: falling back to frontend: Teletype
    ubuntu-xenial-docker-ce-base: dpkg-preconfigure: unable to re-open stdin:
    ubuntu-xenial-docker-ce-base: Fetched 19.4 MB in 1s (17.8 MB/s)
    ubuntu-xenial-docker-ce-base: Selecting previously unselected package aufs-tools.
    ubuntu-xenial-docker-ce-base: (Reading database ... 53844 files and directories currently installed.)
    ubuntu-xenial-docker-ce-base: Preparing to unpack .../aufs-tools_1%3a3.2+20130722-1.1ubuntu1_amd64.deb ...
    ubuntu-xenial-docker-ce-base: Unpacking aufs-tools (1:3.2+20130722-1.1ubuntu1) ...
    ...
    ubuntu-xenial-docker-ce-base: Setting up docker-ce (17.03.0~ce-0~ubuntu-xenial) ...
    ubuntu-xenial-docker-ce-base: Processing triggers for libc-bin (2.23-0ubuntu5) ...
    ubuntu-xenial-docker-ce-base: Processing triggers for systemd (229-4ubuntu16) ...
    ubuntu-xenial-docker-ce-base: Processing triggers for ureadahead (0.100.0-19) ...
    ubuntu-xenial-docker-ce-base: groupadd: group 'docker' already exists
    ubuntu-xenial-docker-ce-base: Synchronizing state of docker.service with SysV init with /lib/systemd/systemd-sysv-install...
    ubuntu-xenial-docker-ce-base: Executing /lib/systemd/systemd-sysv-install enable docker
    ubuntu-xenial-docker-ce-base: Cleanup...
    ubuntu-xenial-docker-ce-base: Reading package lists...
    ubuntu-xenial-docker-ce-base: Building dependency tree...
    ubuntu-xenial-docker-ce-base: Reading state information...
    ubuntu-xenial-docker-ce-base: 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
==> ubuntu-xenial-docker-ce-base: Stopping the source instance...
==> ubuntu-xenial-docker-ce-base: Waiting for the instance to stop...
==> ubuntu-xenial-docker-ce-base: Creating the AMI: ubuntu-xenial-docker-ce-base 1288227081
    ubuntu-xenial-docker-ce-base: AMI: ami-e9ca6eff
==> ubuntu-xenial-docker-ce-base: Waiting for AMI to become ready...
==> ubuntu-xenial-docker-ce-base: Modifying attributes on AMI (ami-e9ca6eff)...
    ubuntu-xenial-docker-ce-base: Modifying: description
==> ubuntu-xenial-docker-ce-base: Modifying attributes on snapshot (snap-058a26c0250ee3217)...
==> ubuntu-xenial-docker-ce-base: Adding tags to AMI (ami-e9ca6eff)...
==> ubuntu-xenial-docker-ce-base: Tagging snapshot: snap-043a16c0154ee3217
==> ubuntu-xenial-docker-ce-base: Creating AMI tags
==> ubuntu-xenial-docker-ce-base: Creating snapshot tags
==> ubuntu-xenial-docker-ce-base: Terminating the source AWS instance...
==> ubuntu-xenial-docker-ce-base: Cleaning up any extra volumes...
==> ubuntu-xenial-docker-ce-base: No volumes to clean up, skipping
==> ubuntu-xenial-docker-ce-base: Deleting temporary security group...
==> ubuntu-xenial-docker-ce-base: Deleting temporary keypair...
Build 'ubuntu-xenial-docker-ce-base' finished.

==> Builds finished. The artifacts of successful builds are:
--> ubuntu-xenial-docker-ce-base: AMIs were created:

us-east-1: ami-e9ca6eff

Results

The result is an Ubuntu 16.04 AMI in US East 1 with Docker CE 17.03 installed. To confirm the new AMI is now available, I will use the AWS CLI to examine the resulting AMI:

aws ec2 describe-images \
  --filters Name=tag-key,Values=ami Name=tag-value,Values=ubuntu-xenial-docker-ce-base \
  --query 'Images[*].{ID:ImageId}'

Resulting output:

{
    "Images": [
        {
            "VirtualizationType": "hvm",
            "Name": "ubuntu-xenial-docker-ce-base 1488747081",
            "Tags": [
                {
                    "Value": "ubuntu-xenial-docker-ce-base",
                    "Key": "ami"
                }
            ],
            "Hypervisor": "xen",
            "SriovNetSupport": "simple",
            "ImageId": "ami-e9ca6eff",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "SnapshotId": "snap-048a16c0250ee3227",
                        "VolumeSize": 8,
                        "VolumeType": "gp2",
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/sdb",
                    "VirtualName": "ephemeral0"
                },
                {
                    "DeviceName": "/dev/sdc",
                    "VirtualName": "ephemeral1"
                }
            ],
            "Architecture": "x86_64",
            "ImageLocation": "931066906971/ubuntu-xenial-docker-ce-base 1488747081",
            "RootDeviceType": "ebs",
            "OwnerId": "931066906971",
            "RootDeviceName": "/dev/sda1",
            "CreationDate": "2017-03-05T20:53:41.000Z",
            "Public": false,
            "ImageType": "machine",
            "Description": "ubuntu-xenial-docker-ce-base AMI"
        }
    ]
}

Finally, here is the new AMI as seen in the AWS EC2 Management Console:

EC2 Management Console - AMI

Terraform

To confirm Docker CE is installed and running, I can provision a new EC2 instance, using HashiCorp Terraform. This post is too short to detail all the Terraform code required to stand up a complete environment. I’ve included the complete code in the GitHub repo for this post. Not, the Terraform code is only used to testing. No security, including the use of a properly configured security groups, public/private subnets, and a NAT server, is configured.

Below is a greatly abridged version of the Terraform code I used to provision a new EC2 instance, using Terraform’s aws_instance resource. The resulting EC2 instance should have Docker CE available.

# test-docker-ce instance
resource "aws_instance" "test-docker-ce" {
  connection {
    user        = "ubuntu"
    private_key = "${file("~/.ssh/test-docker-ce")}"
    timeout     = "${connection_timeout}"
  }

  ami               = "ami-e9ca6eff"
  instance_type     = "t2.nano"
  availability_zone = "us-east-1a"
  count             = "1"

  key_name               = "${aws_key_pair.auth.id}"
  vpc_security_group_ids = ["${aws_security_group.test-docker-ce.id}"]
  subnet_id              = "${aws_subnet.test-docker-ce.id}"

  tags {
    Owner       = "Gary A. Stafford"
    Terraform   = true
    Environment = "test-docker-ce"
    Name        = "tf-instance-test-docker-ce"
  }
}

By using the AWS CLI, once again, we can confirm the new EC2 instance was built using the correct AMI:

aws ec2 describe-instances \
  --filters Name='tag:Name,Values=tf-instance-test-docker-ce' \
  --output text --query 'Reservations[*].Instances[*].ImageId'

Resulting output looks good:

ami-e9ca6eff

Finally, here is the new EC2 as seen in the AWS EC2 Management Console:

EC2 Management Console - EC2

SSHing into the new EC2 instance, I should observe that the operating system is Ubuntu 16.04.2 LTS and that Docker version 17.03.0-ce is installed and running:

Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-64-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

0 packages can be updated.
0 updates are security updates.

Last login: Sun Mar  5 22:06:01 2017 from 

ubuntu@ip-:~$ docker --version
Docker version 17.03.0-ce, build 3a232c8

Conclusion

Docker EE and CE represent a significant step forward in expanding Docker’s enterprise-grade toolkit. Replacing or installing Docker EE or CE on your AWS AMIs is easy, using Docker’s guide along with HashiCorp Packer.

All source code for this post can be found on GitHub.

All opinions in this post are my own and not necessarily the views of my current employer or their clients.

, , , , , , , , , , ,

2 Comments

Scaffolding Modern Web Applications

Scaffold Three Full-Stack Modern Web Application Examples using Yeoman with npm, Bower, and Gruntarticle background

Introduction

The capabilities of modern web applications have quickly matched and surpassed those of most traditional desktop applications. However, with the increase in capabilities, comes an increase in architectural complexity of web applications. To help deal with the increase in complexity, developers have benefitted from a plethora of popular support libraries, frameworks, API’s, and similar tooling. Examples of these include AngularJSReact.js, Play!Node.js, Express, npmYeoman, Bower, Grunt, and Gulp.

With so many choices, selecting an optimal toolset to construct a modern web application can be overwhelming. In this post, we will examine three distinct modern web application architectures, predominantly JavaScript-based. The post will discuss how to select and install the required tools, and how to use those tools to scaffold the applications. By the end of the post, we will have three working web applications, ready for further development.

Generators

The post’s examples use Yeoman generators. What is a generator? According to Wikipedia, Yeoman’s generator concept was inspired by Ruby on Rails. Using a generator’s blueprint, Yeoman scaffolds an application’s project directory and file structure, and installs required vendor libraries and dependencies. Yeoman generators usually run interactively, guiding the developer through a series of configuration questions. The configuration choices determine the project’s physical structure and components installed by Yeoman.

Generators are inherently opinionated. They dictate a particular application architecture they feel represents current best practices. However, good generators also allow developers to select from a range of architectural choices to meet the requirements of a developer’s environment. For example, a generator might allow Grunt or Gulp for task automation, or allow either Less or Sass for styling the UI. Similar to npm, RubyGems, Bower, Docker Hub, and Puppet Forge, Yeoman provides a searchable public repository. This allows developers to choose from a variety of generators to meet their specific needs.

Preparing the Development Environment

The examples in this post were built on Mac OS X. However, all the tools discussed in the post are available on the three major platforms, Linux, Mac, and Windows. Installation and configuration will vary.

An IDE is not required to scaffold the post’s application examples. However, for further development of the applications, I strongly recommend JetBrain’s WebStorm. According to Slant, WebStorm is a popular, highly rated IDE for building modern web applications. WebStorm is available on all three major platforms. A paid license is required, but well worth the reasonable investment based on the IDE’s rich feature set. WebStorm is integrated with many popular JavaScript frameworks. Additionally, there are hundreds of plug-ins available to extend WebStorm’s functionality.

Each of the post’s examples varies, architecturally. However, each also shares several common components, which we will install. They include:

npm
We will use npm (aka Node Package Manager), a leading server-side package manager, to manage the application’s server-side JavaScript dependencies.

Node.js
We will use Node.js, a JavaScript runtime, to power the JavaScript-based web application examples.

Bower
Similar to npm, Bower, is a popular client-side package manager. We will use Bower to manage the application’s client-side JavaScript dependencies.

Yeoman
We will use Yeoman, a leading web application scaffolding tool, to quickly build the frameworks for the example applications based on best practices and tooling.

Grunt
We will use Grunt, a leading JavaScript task runner, to automate common tasks such as minification, compilation, unit-testing, linting, and packaging of applications for deployment. At least two of the three examples also offer Gulp as an alternative.

Express
We will use Express, a Node.js web application framework that provides a robust set of features for web applications development, to support our example applications.

MongoDB
We will use MongoDB, a leading open-source NoSQL document database, for all three examples. For two of the examples, you can easily substitute alternate databases, such as MySQL, when configuring the application with Yeoman. The choice of database is of secondary importance in this post.

First install Node.js, which comes packaged with npm. Then, use npm to install Bower, Yeoman, and Grunt. Make sure you run the command to fix the permissions for npm. If permissions are set correctly, you should not have to use sudo with your npm commands.

The global mode option (-g) installs packages globally. Packages are usually installed globally, only if they are used as a command line tool, such as with Bower, Yeoman, and Grunt.

The easiest way to install Node.js and npm on OS X, is with Homebrew:

# install homebrew and confirm
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew update
brew doctor
export PATH="/usr/local/bin:$PATH"
brew --version
# install node and npm with brew and confirm
brew install node
node --version; npm --version

Alternatively, install Node and npm by downloading the package installer for Mac OS X (x64) from nodejs.org. If you are not a currently a Homebrew user, I suggest this route. At the time of this post, you have the choice of Node.js version v4.2.3 LTS or v5.1.1 Stable.

Fix the potential permission problem with npm, so sudo is not required:

sudo chown -R `whoami` $(npm config get prefix)
view raw fix_npm.bash hosted with ❤ by GitHub

Then, install Yeoman, Bower, and Grunt, globally:

npm install -g bower yo grunt-cli
bower --version; yo --version; grunt --version

Lastly, install MongoDB. Similar to Node, we can use Homebrew, or download and install MongoDB from the MongoDB website. Review this page for detailed installation and configuration instructions. To install MongoDB with Homebrew, we would issue the following commands:

brew update
brew install mongod
mongod --version
sudo mkdir -p /data/db
sudo chown -R `whoami` /data/db/
mongod

Example #1: Server-Centric Express Application

For our first example, we will scaffold a server-centric JavaScript web application using Pete Cooper’s Express Generator (v2.9.2). According to the project’s GitHub site, the Express Generator is ‘An Expressjs generator for Yeoman, based on the express command line tool.’ I suggest reading the project’s documentation before continuing; it describes the generator’s functionality in greater detail.

To install, download the Express Generator with npm, and install with Yeoman, as follows:

npm install -g generator-express
yo express

As part of the Express Generator’s configuration process, Yeoman will ask a series of configuration questions. The Express generator offers several choices for scaffolding the application. For this example, we will choose the following options: MVC, Marko, Sass, MongoDB, and Grunt.

Express-Generator with Yeoman'

Using Express-Generator with Yeoman

For those not as familiar with developing full-stack JavaScript applications, some of the generator’s choices may be unfamiliar, such as view engines, css preprocessors, and build tools. For this example, we will select Marko, a highly regarded JavaScript templating engine (aka view engine), for the first application. You can compare different engines on Slant. For CSS preprocessors, you can also refer to Slant for a comparison of leading candidates. We will choose Sass.

Lastly, for a build tool (aka task runner) we will choose Grunt. Grunt and Gulp are the two most popular choices. Either is a proven tool for automation tasks such as minification, compilation, unit-testing, linting, and packaging applications for deployment.

As shown below, Yeoman creates a series of files and directories and installs JavaScript libraries with npm and Bower. Choices are based on best practices, as prescribed by the generator.

Default Express-Generator File Structure

Default Express-Generator File Structure

npm

Yeoman uses npm and bower to install the generator’s required packages. Based on our five configuration choices for the Express Generator, npm installed over 225 packages in the project’s local node_module directory. This includes primary and secondary npm package dependencies. For example, Marko, one of the choices, which npm installed, has 24 dependencies it requires. In turn, each of those packages may have more dependencies. You quickly see why npm, and other similar package managers, are invaluable to building and managing a modern web application. The npm dependencies are declared in the package.json file, in the project’s root directory.

Partial List of npm Packages Installed

Partial List of npm Packages Installed

We will still need to install a few more items. We chose Sass as an Express generator option. Sass requires Ruby, which comes preinstalled on Mac OS X. If you wish, you can upgrade your pre-installed version of Ruby with Homebrew, but it is not required. Sass is installed with RubyGems, a package manager for Ruby. To automate the Sass-related tasks with Grunt, we also need to install the Grunt plug-in for Sass, grunt-contrib-sass, using npm:

# optional install updated version of Ruby and confirm
brew install ruby
ruby --version
# install grunt-contrib-sass
npm install grunt-contrib-sass --save-dev
# install sass and confirm
gem install sass
gem list sass

The Express Generator’s test are written in Mocha. Mocha is an asynchronous JavaScript test framework running on Node.js. The website suggests installing Mocha globally with npm. Mocha can be run from the command line:

npm install -g mocha
mocha --version

Up and Running

Simply running the grunt command will start the default Generator-Express MVC application. While in development, I prefer to run Grunt with the debug option (grunt -d) and/or with the verbose option (grunt -v or grunt -dv). These options offer enhanced feedback, especially on which tasks are run by Grunt. Review the terminal output to make sure the application started properly.

Starting the Application with Grunt

Starting the Application with Grunt

To confirm the Express application started correctly, in a second terminal window, curl the application using curl -I localhost:3000. Easier yet, point your web browser to localhost:3000. You should see the following default web page.

Default Express-Generator Application

Default Express-Generator Application

Example #2: Full-Stack MEAN Application

In our second example, we will scaffold a true full-stack JavaScript web application using Tyler Henkel’s popular AngularJS Full-Stack Generator for Yeoman (v3.0.2). According to the project’s GitHub site, the generator is a ‘Yeoman generator for creating MEAN stack applications, using MongoDB, Express, AngularJS, and Node – lets you quickly set up a project following best practices.’ As with the earlier example, I suggest you read the project’s documentation before continuing.

To install theAngularJS Full-Stack Generator, download the with npm and install with Yeoman:

npm install -g generator-angular-fullstack
mkdir demo-web-app-2; cd $_
yo angular-fullstack meanapp

Similar to the Express example, Yeoman will ask a series of configuration questions. We will choose the following options: Jade, Less, ngRoute, Bootstrap, UI Bootstrap, and MongoDB with Mongoose. AngularJS, Express, and Grunt are installed by the generator, automatically. For the sake of brevity, we will not include other available options, including Babel for ES6, OAuth authentication, or socket.io.

AngularJS Full-Stack Generator Config Options

AngularJS Full-Stack Generator Config Options

PhantomJS

After generating the AngularJS Full-Stack project, I received errors regarding PhantomJS. According to several sources, this is not uncommon. The AngularJS Full-Stack project uses PhantomJS as the default browser for Karma, the popular test runner, designed by the AngularJS team. Although npm installed PhantomJS locally, as part of the project, Karma complained about missing the path to the PhantomJS binary. To eliminate the issue, I installed PhantomJS globally with npm. I then manually added the PhantomJS binary path to the $PATH environment variable:

npm install -g phantomjs
sudo vi ~/.bash_profile
# add PHANTOMJS_BIN environment variable (next two lines)
export PHANTOMJS_BIN=/usr/local/lib/node_modules/phantomjs/bin
# your file/line may vary
resetexport PATH=${PHANTOMJS_BIN}:$PATH
phantomjs --version

To test Karma, with PhantomJS, run the grunt test command. This should result in error-free output, similar to the following.

Running "karma:unit" (karma) task

Running “karma:unit” (karma) task

Client/Server Architecture

Similar to the previous example, Yeoman creates a series of files and directories, and installs JavaScript packages on the server and client sides with npm and Bower.

AngularJS Full-Stack Generator File Structure

AngularJS Full-Stack Generator File Structure

Both the Express and AngularJS examples share several common files and directories. However, one major difference between the two is the client/server oriented directory structure of the AngularJS generator. Unlike the Express example, the AngularJS example has both a client and a server directory. The server-side of the application (aka back-end) is driven primarily by Express and Node. Mongoose serves as an interface between our application’s domain model and MongoDB, on the server-side. Also, on the server-side, Jade is used for HTML templating. The client-side of the application (aka front-end) is driven primarily by AngularJS. Twitter’s Bootstrap and Bootstrap UI offer a responsive web interface for our example application.

Full-Stack Server/Client File Structure

Full-Stack Server/Client File Structure

Up and Running

Running the grunt serve command will eventually start the default AngularJS Full-Stack application, after running a series of pre-defined Grunt tasks.

AngularJS Full-Stack App Starting with Grunt

AngularJS Full-Stack App Starting with Grunt

Review the terminal output to make sure the application started properly. You may see some warnings, suggesting the installation of several dependencies globally. You may also see warnings about dependency versions being outdated. Outdated versions are one of the challenges with generators that are not constantly kept refreshed and tested with the latest package dependencies. Warnings shouldn’t prevent the application from starting, only Errors.

AngularJS Full-Stack App Started with Grunt

AngularJS Full-Stack App Started with Grunt

To confirm the application started, in a second terminal window, curl the application using curl -I localhost:9000. Easier yet, point your web browser at localhost:9000. The default web page for the AngularJS Full-Stack web application is much more elaborate than the previous Express example. This is thanks to the Bootstrap and AngularJS client-side components.

AngularJS Full-Stack App Running in Browser

AngularJS Full-Stack App Running in Browser

Additional Generator Features

The AngularJS Full-Stack generator is capable of generating more than just the default application project. The AngularJS Full-Stack generator contains a set of generators. Beyond generating the basic application framework, you may use the generator to create boilerplate code for AngularJS and Node.js components for endpoints, services, routes, models, controllers, directives, and filters. The generators also provide the ability to prepare your application for deployment to OpenStack and Heroku.

The best place to review available options for the generators is on the GitHub sites. You can display a high-level list of the generator’s features using the yo --help command. Below are the three generators used in this post.

Below, is an example of generating additional application components using the AngularJS Full-Stack generator. First scaffold a server-side Express RESTful API endpoint, called ‘user’. The single command generates a server-side directory structure and several boilerplate files, including an Express model, controller, and router, and Mocha tests.

List of Installed Yeoman Generators

List of Installed Yeoman Generators

Next, generate a client-side AngularJS service, which connects to the server-side, Express RESTful ‘user’ endpoint above. The command creates a boilerplate AngularJS service and Mocha test. Lastly, create an AngularJS route. This generator command creates a boilerplate AngularJS route and controller, Mocha test, Jade view template, and less file.

AngularJS Full-Stack Component Generators

AngularJS Full-Stack Component Generators

Example #3: Java Hipster Application

In the third and last example, we will scaffold another full-stack web application. However, this time, we will use a generator that relies on Java EE as the primary development platform on the server-side, as opposed to JavaScript. JavaScript will be relegated largely to the client-side.

Again, we will use a Yeoman generator, JHipster, built by Julien Dubois and team, to scaffold the application (v2.25.0). According to the project’s GitHub site, JHipster uses a robust server-side Java EE stack with Spring Boot and Maven. JHipster’s mobile-first front-end is enabled with AngularJS and Bootstrap. Being the most complex of the three examples, it’s important to review the project documentation.

JHipster offers three ways to install the application, which are 1) locally, 2) a Docker container, or 3) a Vagrant VM. We will install the application framework locally as not to introduce additional complexity. To install the generator, download the JHipster generator with npm, and install with Yeoman:

npm install -g generator-jhipster
mkdir demo-web-app-3; cd $_
yo jhipster

Again, Yeoman will ask a series of configuration questions on behalf of JHipster. For this example, we will choose the following options: token-based authentication, MongoDB, Maven, Grunt, LibSass (Sass), and Gatling for testing. AngularJS and Bootstrap are installed automatically. We have chosen not to include other configuration options in this example, such as Angular Translate, WebSockets, and clustered HTTP sessions.

Default JHipster Generator Options

Default JHipster Generator Options

Once Yeoman finishes scaffolding the application, you should see the following output.

JHipster Generator Install Complete

JHipster Generator Install Complete

Maven Project Structure

The file and directory structure of JHipster is very different from the previous two examples. The first two example’s project structure is typical of a JavaScript project. In contrast, the JHipster example’s structure is more typical of a Maven-based Java project. In the JHipster project, the client-side JavaScript files are in the /src/main/webapp/ directory. The presence of the webapp directory is based on part of the project’s reliance on the Spring MVC web framework. Additionally, npm has loaded required server-side JavaScript packages into the node_modules directory, in the project’s root directory.

Default JHipster File Structure

Default JHipster File Structure

Up and Running

Running the mvn command will start the default JHipster application. The URL for the JHipster application is included in the terminal output.

JHipster Application Running with Maven

JHipster Application Running with Maven

To confirm the application has started, curl the application in a second terminal window, using curl -I localhost:8080. Easier yet, point your web browser to localhost:8080. Again, thanks to Bootstrap and AngularJS, the application presents a rich client UI.

Default JHipster Application Running

Default JHipster Application Running

Conclusion

The post’s examples represent a narrow sampling of available modern web application stacks, which can be easily scaffolded with generators. The JavaScript space continues to evolve rapidly. Even within the realm of JavaScript-based solutions, we didn’t examine several other popular frameworks, such as Meteor, FaceBook’s ReactJS, Ember, Backbone, and Polymer. They are all worth exploring, along with the hundreds of popular supporting frameworks, libraries, and API’s.

Useful Links

  • Post: JavaScript Frameworks: The Best 10 for Modern Web Apps (link)
  • Post: Best Web Frameworks (link)
  • Post: State Of Web Development 2014 (link)
  • YouTube: Modern Front-end Engineering (link)
  • YouTube: WebStorm – Things You Probably Didn’t Know (link)
  • Website: MEAN Stack (link)
  • Website: Full Stack Python (link)
  • Post: Best Full Stack Web Framework (link)

, , , , , , , , , , , , ,

1 Comment

Installing Foreman and Puppet Agent on Multiple VMs Using Vagrant and VirtualBox

Automatically install and configure Foreman, the open source infrastructure lifecycle management tool, and multiple Puppet Agent VMs using Vagrant and VirtualBox.

Foreman - Overview

Introduction

In the last post, Installing Puppet Master and Agents on Multiple VM Using Vagrant and VirtualBox, we installed Puppet Master/Agent on VirtualBox VMs using Vagrant. Puppet Master is an excellent tool, but lacks the ease-of-use of Puppet Enterprise or Foreman. In this post, we will build an almost identical environment, substituting Foreman for Puppet Master.

According to Foreman’s website, “Foreman is an open source project that helps system administrators manage servers throughout their lifecycle, from provisioning and configuration to orchestration and monitoring. Using Puppet or Chef and Foreman’s smart proxy architecture, you can easily automate repetitive tasks, quickly deploy applications, and proactively manage change, both on-premise with VMs and bare-metal or in the cloud.

Combined with Puppet Labs’ Open Source Puppet, Foreman is an effective solution to manage infrastructure and system configuration. Again, according to Foreman’s website, the Foreman installer is a collection of Puppet modules that installs everything required for a full working Foreman setup. The installer uses native OS packaging and adds necessary configuration for the complete installation. By default, the Foreman installer will configure:

  • Apache HTTP with SSL (using a Puppet-signed certificate)
  • Foreman running under mod_passenger
  • Smart Proxy configured for Puppet, TFTP and SSL
  • Puppet master running under mod_passenger
  • Puppet agent configured
  • TFTP server (under xinetd on Red Hat platforms)

For the average Systems Engineer or Software Developer, installing and configuring Foreman, Puppet Master, Apache, Puppet Agent, and the other associated software packages listed above, is daunting. If the installation doesn’t work properly, you must troubleshooting, or trying to remove and reinstall some or all the components.

A better solution is to automate the installation of Foreman into a Docker container, or on to a VM using Vagrant. Automating the installation process guarantees accuracy and consistency. The Vagrant VirtualBox VM can be snapshotted, moved to another host, or simply destroyed and recreated, if needed.

All code for this post is available on GitHub. However, it been updated as of 8/23/2015. Changes were required to fix compatibility issues with the latest versions of Puppet 4.x and Foreman. Additionally, the version of CentOS on all VMs was updated from 6.6 to 7.1 and the version of Foreman was updated from 1.7 to 1.9.

The Post’s Example

In this post, we will use Vagrant and VirtualBox to create three VMs. The VMs in this post will be build from a standard CentOS 6.5 x64 base Vagrant Box, located on Atlas. We will use a single JSON-format configuration file to automatically build all three VMs with Vagrant. As part of the provisioning process, using Vagrant’s shell provisioner, we will execute a bootstrap shell script. The script will install Foreman and it’s associated software on the first VM, and Puppet Agent on the two remaining VMs (aka Puppet ‘agent nodes’ or Foreman ‘hosts’).

Foreman does have the ability to provision on bare-metal infrastructure and public or private clouds. However, this example would simulate an environment where you have existing nodes you want to manage with Foreman.

The Foreman bootstrap script will also download several Puppet modules. To test Foreman once the provisioning is complete, import those module’s classes into Foreman and assign the classes to the hosts. The hosts will fetch and apply the configurations. You can then test for the installed instances of those module’s components on the puppet agent hosts.

Vagrant

To begin the process, we will use the JSON-format configuration file to create the three VMs, using Vagrant and VirtualBox.

{
  "nodes": {
    "theforeman.example.com": {
      ":ip": "192.168.35.5",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-foreman.sh"
    },
    "agent01.example.com": {
      ":ip": "192.168.35.10",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    },
    "agent02.example.com": {
      ":ip": "192.168.35.20",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    }
  }
}

The Vagrantfile uses the JSON-format configuration file, to provision the three VMs, using a single ‘vagrant up‘ command. That’s it, less than 30 lines of actual code in the Vagrantfile to create as many VMs as you want. For this post’s example, we will not need to add any VirtualBox port mappings. However, that can also done from the JSON configuration file (see the READM.md for more directions).

 

Vagrant Provisioning the VMs

Vagrant Provisioning the VMs

If you have not used the CentOS Vagrant Box, it will take a few minutes the first time for Vagrant to download the it to the local Vagrant Box repository.

# -*- mode: ruby -*-
# vi: set ft=ruby :

# Builds single Foreman server and
# multiple Puppet Agent Nodes using JSON config file
# Gary A. Stafford - 01/15/2015

# read vm and chef configurations from JSON files
nodes_config = (JSON.parse(File.read("nodes.json")))['nodes']

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "chef/centos-6.5"

  nodes_config.each do |node|
    node_name   = node[0] # name of node
    node_values = node[1] # content of node

    config.vm.define node_name do |config|
      # configures all forwarding ports in JSON array
      ports = node_values['ports']
      ports.each do |port|
        config.vm.network :forwarded_port,
          host:  port[':host'],
          guest: port[':guest'],
          id:    port[':id']
      end

      config.vm.hostname = node_name
      config.vm.network :private_network, ip: node_values[':ip']

      config.vm.provider :virtualbox do |vb|
        vb.customize ["modifyvm", :id, "--memory", node_values[':memory']]
        vb.customize ["modifyvm", :id, "--name", node_name]
      end

      config.vm.provision :shell, :path => node_values[':bootstrap']
    end
  end
end

Once provisioned, the three VMs, also called ‘Machines’ by Vagrant, should appear in Oracle VM VirtualBox Manager.

Oracle VM VirtualBox Manager View

Oracle VM VirtualBox Manager View

The name of the VMs, referenced in Vagrant commands, is the parent node name in the JSON configuration file (node_name), such as, ‘vagrant ssh theforeman.example.com‘.

Vagrant Status

Bootstrapping Foreman

As part of the Vagrant provisioning process (‘vagrant up‘ command), a bootstrap script is executed on the VMs (shown below). This script will do almost of the installation and configuration work. Below is script for bootstrapping the Foreman VM.

#!/bin/sh

# Run on VM to bootstrap Foreman server
# Gary A. Stafford - 01/15/2015

if ps aux | grep "/usr/share/foreman" | grep -v grep 2> /dev/null
then
    echo "Foreman appears to all already be installed. Exiting..."
else
    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.35.5    theforeman.example.com   theforeman" | sudo tee --append /etc/hosts 2> /dev/null

    # Update system first
    sudo yum update -y

    # Install Foreman for CentOS 6
    sudo rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm && \
    sudo yum -y install epel-release http://yum.theforeman.org/releases/1.7/el6/x86_64/foreman-release.rpm && \
    sudo yum -y install foreman-installer && \
    sudo foreman-installer

    # First run the Puppet agent on the Foreman host which will send the first Puppet report to Foreman,
    # automatically creating the host in Foreman's database
    sudo puppet agent --test --waitforcert=60

    # Install some optional puppet modules on Foreman server to get started...
    sudo puppet module install -i /etc/puppet/environments/production/modules puppetlabs-ntp
    sudo puppet module install -i /etc/puppet/environments/production/modules puppetlabs-git
    sudo puppet module install -i /etc/puppet/environments/production/modules puppetlabs-docker
fi

Bootstrapping Puppet Agent Nodes

Below is script for bootstrapping the puppet agent nodes. The agent node bootstrap script was executed as part of the Vagrant provisioning process.

#!/bin/sh

# Run on VM to bootstrap Puppet Agent nodes
# Gary A. Stafford - 01/15/2015

if ps aux | grep "puppet agent" | grep -v grep 2> /dev/null
then
    echo "Puppet Agent is already installed. Moving on..."
else
    # Update system first
    sudo yum update -y

    # Install Puppet for CentOS 6
    sudo rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm && \
    sudo yum -y install puppet

    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.35.5    theforeman.example.com   theforeman" | sudo tee --append /etc/hosts 2> /dev/null

    # Add agent section to /etc/puppet/puppet.conf (sets run interval to 120 seconds)
    echo "" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null && \
    echo "    server = theforeman.example.com" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null && \
    echo "    runinterval = 120" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null

    sudo service puppet stop
    sudo service puppet start

    sudo puppet resource service puppet ensure=running enable=true
    sudo puppet agent --enable
fi

Now that the Foreman is running, use the command, ‘vagrant ssh agent01.example.com‘, to ssh into the first puppet agent node. Run the command below.

sudo puppet agent --test --waitforcert=60

The command above manually starts Puppet’s Certificate Signing Request (CSR) process, to generate the certificates and security credentials (private and public keys) generated by Puppet’s built-in certificate authority (CA). Each puppet agent node must have it certificate signed by the Foreman, first. According to Puppet’s website, “Before puppet agent nodes can retrieve their configuration catalogs, they need a signed certificate from the local Puppet certificate authority (CA). When using Puppet’s built-in CA (that is, not using an external CA), agents will submit a certificate signing request (CSR) to the CA Puppet Master (Foreman) and will retrieve a signed certificate once one is available.

Waiting for Certificate to be Signed by Foreman

Waiting for Certificate to be Signed by Foreman

Open the Foreman browser-based interface, running at https://theforeman.example.com. Proceed to the ‘Infrastructure’ -> ‘Smart Proxies’ tab. Sign the certificate(s) from the agent nodes (shown below). The agent node will wait for the Foreman to sign the certificate, before continuing with the initial configuration.

Certificate Waiting to be Signed in Foreman

Certificate Waiting to be Signed in Foreman

Once the certificate signing process is complete, the host retrieves the client configuration from the Foreman and applies it to the hosts.

Foreman Puppet Configuration Applied to Agent Node

Foreman Puppet Configuration Applied to Agent Node

That’s it, you should now have one host running Foreman and two puppet agent nodes.

Testing Foreman

To test Foreman, import the classes from the Puppet modules installed with the Foreman bootstrap script.

Foreman - Puppet Classes

Foreman – Puppet Classes

Next, apply  ntp, git, and Docker classes to both agent nodes (aka, Foreman ‘hosts’), as well as the Foreman node, itself.

Foreman - Agents Puppet Classes

Foreman – Agents Puppet Classes

Every two minutes, the two agent node hosts should fetch their latest configuration from Foreman and apply it. In a few minutes, check the times reported in the ‘Last report’ column on the ‘All Hosts’ tab. If the times are two minutes or less, Foreman and Puppet Agent are working. Note we changed the runinterval to 120 seconds (‘120s’) in the bootstrap script to speed up the Puppet Agent updates for the sake of the demo. The normal default interval is 30 minutes. I recommend changing the agent node’s runinterval back to 30 minutes (’30m’) on the hosts, once everything is working to save unnecessary use of resources.

Foreman - Hosts Reporting Back

Foreman – Hosts Reporting Back

Finally, to verify that the configuration was successfully applied to the hosts, check if ntp, git, and Docker are now running on the hosts.

Agent Node with ntp and git Now Installed

Agent Node with ntp and git Now Installed

Helpful Links

All the source code this project is on Github.

Foreman:
http://theforeman.org

Atlas – Discover Vagrant Boxes:
https://atlas.hashicorp.com/boxes/search

Learning Puppet – Basic Agent/Master Puppet
https://docs.puppetlabs.com/learning/agent_master_basic.html

Puppet Glossary (of terms):
https://docs.puppetlabs.com/references/glossary.html

, , , , , , , , ,

9 Comments

Installing Puppet Master and Agents on Multiple VM Using Vagrant and VirtualBox

 Automatically provision multiple VMs with Vagrant and VirtualBox. Automatically install, configure, and test Puppet Master and Puppet Agents on those VMs.

Puppet Master Agent Vagrant (3)

Introduction

Note this post and accompanying source code was updated on 12/16/2014 to v0.2.1. It contains several improvements to improve and simplify the install process.

Puppet Labs’ Open Source Puppet Agent/Master architecture is an effective solution to manage infrastructure and system configuration. However, for the average System Engineer or Software Developer, installing and configuring Puppet Master and Puppet Agent can be challenging. If the installation doesn’t work properly, the engineer’s stuck troubleshooting, or trying to remove and re-install Puppet.

A better solution, automate the installation of Puppet Master and Puppet Agent on Virtual Machines (VMs). Automating the installation process guarantees accuracy and consistency. Installing Puppet on VMs means the VMs can be snapshotted, cloned, or simply destroyed and recreated, if needed.

In this post, we will use Vagrant and VirtualBox to create three VMs. The VMs will be build from a  Ubuntu 14.04.1 LTS (Trusty Tahr) Vagrant Box, previously on Vagrant Cloud, now on Atlas. We will use a single JSON-format configuration file to build all three VMs, automatically. As part of the Vagrant provisioning process, we will run a bootstrap shell script to install Puppet Master on the first VM (Puppet Master server) and Puppet Agent on the two remaining VMs (agent nodes).

Lastly, to test our Puppet installations, we will use Puppet to install some basic Puppet modules, including ntp and git on the server, and ntpgitDocker and Fig, on the agent nodes.

All the source code this project is on Github.

Vagrant

To begin the process, we will use the JSON-format configuration file to create the three VMs, using Vagrant and VirtualBox.

{
  "nodes": {
    "puppet.example.com": {
      ":ip": "192.168.32.5",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-master.sh"
    },
    "node01.example.com": {
      ":ip": "192.168.32.10",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    },
    "node02.example.com": {
      ":ip": "192.168.32.20",
      "ports": [],
      ":memory": 1024,
      ":bootstrap": "bootstrap-node.sh"
    }
  }
}

The Vagrantfile uses the JSON-format configuration file, to provision the three VMs, using a single ‘vagrant up‘ command. That’s it, less than 30 lines of actual code in the Vagrantfile to create as many VMs as we need. For this post’s example, we will not need to add any port mappings, which can be done from the JSON configuration file (see the READM.md for more directions). The Vagrant Box we are using already has the correct ports opened.

If you have not previously used the Ubuntu Vagrant Box, it will take a few minutes the first time for Vagrant to download the it to the local Vagrant Box repository.

# vi: set ft=ruby :

# Builds Puppet Master and multiple Puppet Agent Nodes using JSON config file
# Author: Gary A. Stafford

# read vm and chef configurations from JSON files
nodes_config = (JSON.parse(File.read("nodes.json")))['nodes']

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "ubuntu/trusty64"

  nodes_config.each do |node|
    node_name   = node[0] # name of node
    node_values = node[1] # content of node

    config.vm.define node_name do |config|
      # configures all forwarding ports in JSON array
      ports = node_values['ports']
      ports.each do |port|
        config.vm.network :forwarded_port,
          host:  port[':host'],
          guest: port[':guest'],
          id:    port[':id']
      end

      config.vm.hostname = node_name
      config.vm.network :private_network, ip: node_values[':ip']

      config.vm.provider :virtualbox do |vb|
        vb.customize ["modifyvm", :id, "--memory", node_values[':memory']]
        vb.customize ["modifyvm", :id, "--name", node_name]
      end

      config.vm.provision :shell, :path => node_values[':bootstrap']
    end
  end
end

Once provisioned, the three VMs, also referred to as ‘Machines’ by Vagrant, should appear, as shown below, in Oracle VM VirtualBox Manager.

Vagrant Machines in VM VirtualBox Manager

Vagrant Machines in VM VirtualBox Manager

The name of the VMs, referenced in Vagrant commands, is the parent node name in the JSON configuration file (node_name), such as, ‘vagrant ssh puppet.example.com‘.

Vagrant Machine Names

Vagrant Machine Names

Bootstrapping Puppet Master Server

As part of the Vagrant provisioning process, a bootstrap script is executed on each of the VMs (script shown below). This script will do 98% of the required work for us. There is one for the Puppet Master server VM, and one for each agent node.

#!/bin/sh

# Run on VM to bootstrap Puppet Master server

if ps aux | grep "puppet master" | grep -v grep 2> /dev/null
then
    echo "Puppet Master is already installed. Exiting..."
else
    # Install Puppet Master
    wget https://apt.puppetlabs.com/puppetlabs-release-trusty.deb && \
    sudo dpkg -i puppetlabs-release-trusty.deb && \
    sudo apt-get update -yq && sudo apt-get upgrade -yq && \
    sudo apt-get install -yq puppetmaster

    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "# Host config for Puppet Master and Agent Nodes" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.5    puppet.example.com  puppet" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.10   node01.example.com  node01" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.20   node02.example.com  node02" | sudo tee --append /etc/hosts 2> /dev/null

    # Add optional alternate DNS names to /etc/puppet/puppet.conf
    sudo sed -i 's/.*\[main\].*/&\ndns_alt_names = puppet,puppet.example.com/' /etc/puppet/puppet.conf

    # Install some initial puppet modules on Puppet Master server
    sudo puppet module install puppetlabs-ntp
    sudo puppet module install garethr-docker
    sudo puppet module install puppetlabs-git
    sudo puppet module install puppetlabs-vcsrepo
    sudo puppet module install garystafford-fig

    # symlink manifest from Vagrant synced folder location
    ln -s /vagrant/site.pp /etc/puppet/manifests/site.pp
fi

There are a few last commands we need to run ourselves, from within the VMs. Once the provisioning process is complete,  ‘vagrant ssh puppet.example.com‘ into the newly provisioned Puppet Master server. Below are the commands we need to run within the ‘puppet.example.com‘ VM.

sudo service puppetmaster status # test that puppet master was installed
sudo service puppetmaster stop
sudo puppet master --verbose --no-daemonize
# Ctrl+C to kill puppet master
sudo service puppetmaster start
sudo puppet cert list --all # check for 'puppet' cert

According to Puppet’s website, ‘these steps will create the CA certificate and the puppet master certificate, with the appropriate DNS names included.

Bootstrapping Puppet Agent Nodes

Now that the Puppet Master server is running, open a second terminal tab (‘Shift+Ctrl+T‘). Use the command, ‘vagrant ssh node01.example.com‘, to ssh into the new Puppet Agent node. The agent node bootstrap script should have already executed as part of the Vagrant provisioning process.

#!/bin/sh

# Run on VM to bootstrap Puppet Agent nodes
# http://blog.kloudless.com/2013/07/01/automating-development-environments-with-vagrant-and-puppet/

if ps aux | grep "puppet agent" | grep -v grep 2> /dev/null
then
    echo "Puppet Agent is already installed. Moving on..."
else
    sudo apt-get install -yq puppet
fi

if cat /etc/crontab | grep puppet 2> /dev/null
then
    echo "Puppet Agent is already configured. Exiting..."
else
    sudo apt-get update -yq && sudo apt-get upgrade -yq

    sudo puppet resource cron puppet-agent ensure=present user=root minute=30 \
        command='/usr/bin/puppet agent --onetime --no-daemonize --splay'

    sudo puppet resource service puppet ensure=running enable=true

    # Configure /etc/hosts file
    echo "" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "# Host config for Puppet Master and Agent Nodes" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.5    puppet.example.com  puppet" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.10   node01.example.com  node01" | sudo tee --append /etc/hosts 2> /dev/null && \
    echo "192.168.32.20   node02.example.com  node02" | sudo tee --append /etc/hosts 2> /dev/null

    # Add agent section to /etc/puppet/puppet.conf
    echo "" && echo "[agent]\nserver=puppet" | sudo tee --append /etc/puppet/puppet.conf 2> /dev/null

    sudo puppet agent --enable
fi

Run the two commands below within both the ‘node01.example.com‘ and ‘node02.example.com‘ agent nodes.

sudo service puppet status # test that agent was installed
sudo puppet agent --test --waitforcert=60 # initiate certificate signing request (CSR)

The second command above will manually start Puppet’s Certificate Signing Request (CSR) process, to generate the certificates and security credentials (private and public keys) generated by Puppet’s built-in certificate authority (CA). Each Puppet Agent node must have it certificate signed by the Puppet Master, first. According to Puppet’s website, “Before puppet agent nodes can retrieve their configuration catalogs, they need a signed certificate from the local Puppet certificate authority (CA). When using Puppet’s built-in CA (that is, not using an external CA), agents will submit a certificate signing request (CSR) to the CA Puppet Master and will retrieve a signed certificate once one is available.

Agent Node Starting Puppet's Certificate Signing Request (CSR) Process

Agent Node Starting Puppet’s Certificate Signing Request (CSR) Process

Back on the Puppet Master Server, run the following commands to sign the certificate(s) from the agent node(s). You may sign each node’s certificate individually, or wait and sign them all at once. Note the agent node(s) will wait for the Puppet Master to sign the certificate, before continuing with the Puppet Agent configuration run.

sudo puppet cert list # should see 'node01.example.com' cert waiting for signature
sudo puppet cert sign --all # sign the agent node certs
sudo puppet cert list --all # check for signed certs

Puppet Master Completing Puppet's Certificate Signing Request (CSR) Process

Puppet Master Completing Puppet’s Certificate Signing Request (CSR) Process

Once the certificate signing process is complete, the Puppet Agent retrieves the client configuration from the Puppet Master and applies it to the local agent node. The Puppet Agent will execute all applicable steps in the site.pp manifest on the Puppet Master server, designated for that specific Puppet Agent node (ie.’node node02.example.com {...}‘).

Configuration Run Completed on Puppet Agent Node

Configuration Run Completed on Puppet Agent Node

Below is the main site.pp manifest on the Puppet Master server, applied by Puppet Agent on the agent nodes.

node default {
# Test message
  notify { "Debug output on ${hostname} node.": }

  include ntp, git
}

node 'node01.example.com', 'node02.example.com' {
# Test message
  notify { "Debug output on ${hostname} node.": }

  include ntp, git, docker, fig
}

That’s it! You should now have one server VM running Puppet Master, and two agent node VMs running Puppet Agent. Both agent nodes should have successfully been registered with Puppet Master, and configured themselves based on the Puppet Master’s main manifest. Agent node configuration includes installing ntp, git, Fig, and Docker.

Helpful Links

All the source code this project is on Github.

Puppet Glossary (of terms):
https://docs.puppetlabs.com/references/glossary.html

Puppet Labs Open Source Automation Tools:
http://puppetlabs.com/misc/download-options

Puppet Master Overview:
http://ci.openstack.org/puppet.html

Install Puppet on Ubuntu:
https://docs.puppetlabs.com/guides/install_puppet/install_debian_ubuntu.html

Installing Puppet Master:
http://andyhan.linuxdict.com/index.php/sys-adm/item/273-puppet-371-on-centos-65-quick-start-i

Regenerating Node Certificates:
https://docs.puppetlabs.com/puppet/latest/reference/ssl_regenerate_certificates.html

Automating Development Environments with Vagrant and Puppet:
http://blog.kloudless.com/2013/07/01/automating-development-environments-with-vagrant-and-puppet

, , , , , , , , , ,

8 Comments

Managing Windows Servers with Chef, Book Review

Harness the power of Chef to automate management of Windows-based systems using hands-on examples.

Managing Windows Servers with Chef

Recently, I had the opportunity to read, ‘Managing Windows Servers with Chef’, authored John Ewart, and published in May, 2014 by Packt Publishing. At a svelte 110 pages in paperback form, ‘Managing Windows Servers with Chef’, is a quick read, packed with concise information, relevant examples, and excellent code samples. Available on Packt Publishing’s website for a mere $11.90 for the ebook, it a worthwhile investment for anyone considering Chef Software’s Chef product for automating their Windows-based infrastructure.

As an IT professional, I use Chef for both Windows and Linux-based IT automation, on a regular basis. In my experience, there is a plethora of information on the Internet about properly implementing and scaling Chef. There is seldom a topic I can’t find the answers to, online. However, it has also been my experience, information is often Linux-centric. That is one reason I really appreciated Ewart’s book, concentrating almost exclusively on Windows-based implementations of Chef.

IT professionals, just getting starting with Chef, or migrating from Puppet, will find the ‘Managing Windows Servers with Chef’ invaluable. Ewart does a good job building the user’s understanding of the Chef ecosystem, before beginning to explain its application to a Windows-based environment. If you are considering Chef versus Puppet Lab’s Puppet for Windows-based IT automation, reading this book will give you a solid overview of Chef.

Seasoned users of Chef will also find the ‘Managing Windows Servers with Chef’ useful. Professionals quickly master the Chef principles, and develop the means to automate their specific tasks with Chef. But inevitably, there comes the day when they must automate something new with Chef. That is where the book can serve as a handy reference.

Of all the books topics, I especially found value in Chapter 5 (Managing Cloud Services with Chef) and Chapter 6 (Going Beyond the Basics – Testing Recipes). Even large enterprise-scale corporations are moving infrastructure to cloud providers. Ewart demonstrates Chef’s Windows-based integration with Microsoft’s Azure, Amazon’s EC2, and Rackspace’s Cloud offerings. Also, Ewart’s section on testing is a reminder to all of us, of the importance of unit testing. I admit I more often practice TAD (‘Testing After Development’) than TDD (Test Driven Development), LOL. Ewart introduces both RSpec and ChefSpec for testing Chef recipes.

I recommend ‘Managing Windows Servers with Chef’ for anyone considering Chef, or who is seeking a good introductory guide to getting started with Chef for Windows-based systems.

 

, , , , , ,

Leave a comment

Windows PowerShell 4.0 for .NET Developers, Book Review

A brief review of ‘Windows PowerShell 4.0 for .NET Developers’, a fast-paced PowerShell guide, enabling you to efficiently administer and maintain your development environment.

Windows PowerShell 4.0 for .NET Developers

Introduction

Recently, I had the opportunity to review ‘Windows PowerShell 4.0 for .NET Developers‘, published by Packt Publishing. According to its author, Sherif Talaat, the book is ‘a fast-paced PowerShell guide, enabling you to efficiently administer and maintain your development environment.‘ Working in a large and complex software development organization, technologies such as PowerShell, which enable increased speed and automation, are essential to our success. Having used PowerShell on a regular basis as a .NET developer for the past few years, I was excited to see what Sherif’s newest book offered.

Requirements

The book recommends the following minimal software configuration to work through the code samples:

  • Windows Server 2012 R2 (includes PowerShell 4.0 and .NET 4.5)
  • SQL Server 2012
  • Visual Studio 2012/2013
  • Visual Studio Team Foundation Server (TFS) 2012/2013

To test the book’s samples, I provisioned a fresh VM, and using my MSDN subscription, installed the required Windows Server, SQL Server, and Team Foundation Server. I worked directly on the VM, as well as remotely from a Windows 7 Enterprise-based development machine with Visual Studio 2012 installed. The code samples worked fairly well, with only a few minor problems I found. There is still no errata published for the book as of the time of review.

A key aspect many authors do not address, is the complexities of using PowerShell in a corporate environment. Working individually or on a small network, developers don’t always experience the added burden of restrictive network security, LDAP, proxy servers, proxy authentication, XML gateways, firewalls, and centralized computer administration. Any code that requires access to remote servers and systems, often requires additional coding to work within a corporate environment. It can be frustrating to debug and extend simple examples to work successfully within an enterprise setting.

Contents

Windows PowerShell 4.0 for .NET Developers, at 115 pages in length, is divided into five chapters:

  • Chapter 1: Getting Started with Windows PowerShell
  • Chapter 2: Unleashing Your Development Skills with PowerShell
  • Chapter 3: PowerShell for Your Daily Administration Tasks
  • Chapter 4: PowerShell and Web Technologies
  • Chapter 5: PowerShell and Team Foundation Server

Chapter 1 provides a brief introduction to PowerShell. At a scant 30 pages, I would not recommend this book as a way to learn PowerShell for the beginner. For learning PowerShell, I recommend Instant Windows PowerShell, by Vinith Menon, also published by Packt Publishing. Alternatively, I recommend a few books by Manning Publications, including Learn Windows PowerShell in a Month of Lunches, Second Edition.

Chapter 2 discusses PowerShell in relationship to several key Microsoft technologies, including Windows Management Instrumentation (WMI), Common Information Model (CIM), Component Object Model (COM) and Extensible Markup Language (XML). As a .NET developer, it’s almost impossible not to have worked with one, or all of these technologies. Chapter 2 discusses how PowerShell works with .NET objects, and extend the .NET framework. The chapter also includes an easy-to-follow example of creating, importing, and calling a PowerShell binary module (compiled .NET class library), using Visual Studio.

Chapter 3 explores areas where .NET developer can start leveraging PowerShell for daily administrative tasks. In particular, I found the sections on PowerShell Remoting and administering IIS and SQL Server particularly useful. Being able to easily connect to remote web, application, and database servers from the command line (or, PowerShell prompt) and do basic system administration is a huge time savings in an agile development environment.

Chapters 4 focuses on how PowerShell interfaces with SOAP and REST based services, web requests, and JSON. Windows Communication Foundation (WCF) based service-oriented application development has been a trend for the last few years. Being able to manage, test, and monitor SOAP and RESTful services and HTTP requests/responses is important to .NET developers. PowerShell can often quicker and easier than writing and compiling service utilities in Visual Studio, or using proprietary third-party applications.

Chapter 5 is dedicated to Visual Studio Team Foundation Server (TFS), Microsoft’s end-to-end, Application Lifecycle Management (ALM) solution. Chapter 5 details the installation and use of TFS Power Tools and TFS PowerShell snap-in. Having held the roles of lead developer and Scrum Master, I have personally found some of the best uses for PowerShell in automating various aspects of TFS. Managing TFS often requires repetitive tasks, the place where PowerShell excels. You will need to explore additional resources beyond the scope of this book to really start automating TFS with PowerShell.

Conclusion

Overall, I enjoyed the book and felt it was well worth the time to explore. I applaud Sherif for targeting a PowerShell book specifically to developers. Due to its short length, the book did leave me wanting more information on a few subjects that were barely skimmed. I also found myself expecting guidance on a few subjects the book did not touch upon, such as PowerShell for cloud-based development (Azure), test automation, and build and deployment automation. For more information on some of those subjects, I recommend Sherif’s other book, also published by Packt Publishing, PowerShell 3.0 Advanced Administration Handbook.

, , , , , , , ,

Leave a comment