Posts Tagged Infrastructure
Managing AWS Infrastructure as Code using Ansible, CloudFormation, and CodeBuild
Posted by Gary A. Stafford in AWS, Build Automation, Cloud, Continuous Delivery, DevOps, Python, Technology Consulting on July 30, 2019
Introduction
When it comes to provisioning and configuring resources on the AWS cloud platform, there is a wide variety of services, tools, and workflows you could choose from. You could decide to exclusively use the cloud-based services provided by AWS, such as CodeBuild, CodePipeline, CodeStar, and OpsWorks. Alternatively, you could choose open-source software (OSS) for provisioning and configuring AWS resources, such as community editions of Jenkins, HashiCorp Terraform, Pulumi, Chef, and Puppet. You might also choose to use licensed products, such as Octopus Deploy, TeamCity, CloudBees Core, Travis CI Enterprise, and XebiaLabs XL Release. You might even decide to write your own custom tools or scripts in Python, Go, JavaScript, Bash, or other common languages.
The reality in most enterprises I have worked with is that teams integrate a combination of AWS services, open-source software, custom scripts, and occasionally licensed products to construct complete, end-to-end, infrastructure-as-code workflows for provisioning and configuring AWS resources. Choices are most often based on team experience, vendor relationships, and an enterprise’s specific business use cases.
In the following post, we will explore one such set of easily-integrated tools for provisioning and configuring AWS resources. The tool-stack is comprised of Red Hat Ansible, AWS CloudFormation, and AWS CodeBuild, along with several complementary AWS technologies. Using these tools, we will provision a relatively simple AWS environment, then deploy, configure, and test a highly-available set of Apache HTTP Servers. The demonstration is similar to the one featured in a previous post, Getting Started with Red Hat Ansible for Google Cloud Platform.
Why Ansible?
With its simplicity, ease of use, and broad compatibility with most major cloud, database, network, storage, and identity providers, amongst other categories, Ansible has been a popular choice of engineering teams for configuration management since 2012. Given the wide variety of polyglot technologies used within modern enterprises and the growing predominance of multi-cloud and hybrid-cloud architectures, Ansible provides a common platform for enabling mature DevOps and infrastructure-as-code practices. Ansible is easily integrated with higher-level orchestration systems, such as AWS CodeBuild, Jenkins, or Red Hat AWX and Tower.
Technologies
The primary technologies used in this post include the following.
Red Hat Ansible
Ansible, purchased by Red Hat in October 2015, seamlessly provides workflow orchestration with configuration management, provisioning, and application deployment in a single platform. Unlike similar tools, Ansible’s workflow automation is agentless, relying on Secure Shell (SSH) and Windows Remote Management (WinRM). If you are interested in learning more about the advantages of Ansible, they have published a whitepaper on The Benefits of Agentless Architecture.
According to G2 Crowd, Ansible is a clear leader in the Configuration Management Software category, ranked right behind GitLab. Competitors in the category include GitLab, AWS Config, Puppet, Chef, Codenvy, HashiCorp Terraform, Octopus Deploy, and JetBrains TeamCity.
AWS CloudFormation
According to AWS, CloudFormation provides a common language to describe and provision all the infrastructure resources within AWS-based cloud environments. CloudFormation allows you to use a JSON- or YAML-based template to model and provision, in an automated and secure manner, all the resources needed for your applications across all AWS regions and accounts.
Codifying your infrastructure, often referred to as ‘Infrastructure as Code,’ allows you to treat your infrastructure as just code. You can author it with any IDE, check it into a version control system, and review the files with team members before deploying it.
AWS CodeBuild
According to AWS, CodeBuild is a fully managed continuous integration service that compiles your source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue.
CodeBuild integrates seamlessly with other AWS Developer Tools, including CodeStar, CodeCommit, CodeDeploy, and CodePipeline.
According to G2 Crowd, the main competitors to AWS CodeBuild, in the Build Automation Software category, include Jenkins, CircleCI, CloudBees Core and CodeShip, Travis CI, JetBrains TeamCity, and Atlassian Bamboo.
Other Technologies
In addition to the major technologies noted above, we will also be leveraging the following services and tools to a lesser extent, in the demonstration:
- AWS CodeCommit
- AWS CodePipeline
- AWS Systems Manager Parameter Store
- Amazon Simple Storage Service (S3)
- AWS Identity and Access Management (IAM)
- AWS Command Line Interface (CLI)
- CloudFormation Linter
- Apache HTTP Server
Demonstration
Source Code
All source code for this post is contained in two GitHub repositories. The CloudFormation templates and associated files are in the ansible-aws-cfn GitHub repository. The Ansible Roles and related files are in the ansible-aws-roles GitHub repository. Both repositories may be cloned using the following commands.
git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/ansible-aws-cfn.git

git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/ansible-aws-roles.git
Development Process
The general process we will follow for provisioning and configuring resources in this demonstration is as follows:
- Create an S3 bucket to store the validated CloudFormation templates
- Create an Amazon EC2 Key Pair for Ansible
- Create two AWS CodeCommit Repositories to store the project’s source code
- Put parameters in Parameter Store
- Write and test the CloudFormation templates
- Configure Ansible and AWS Dynamic Inventory script
- Write and test the Ansible Roles and Playbooks
- Write the CodeBuild build specification files
- Create an IAM Role for CodeBuild and CodePipeline
- Create and test CodeBuild Projects and CodePipeline Pipelines
- Provision, deploy, and configure the complete web platform to AWS
- Test the final web platform
Prerequisites
For this demonstration, I will assume you already have an AWS account, the AWS CLI, Python, and Ansible installed locally, an S3 bucket to store the final CloudFormation templates, and an Amazon EC2 Key Pair for Ansible to use for SSH.
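If you still need the S3 bucket and EC2 Key Pair, the AWS CLI commands below are one minimal way to create them; the bucket name is a placeholder you must change, and the private key is written to the ~/.ssh/ansible path referenced later in the Ansible configuration.

# create an S3 bucket for the validated CloudFormation templates (bucket name is a placeholder)
aws s3 mb s3://your-cfn-template-bucket --region us-east-1

# create an EC2 Key Pair for Ansible and save the private key locally
aws ec2 create-key-pair --key-name ansible \
  --query 'KeyMaterial' --output text > ~/.ssh/ansible
chmod 600 ~/.ssh/ansible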
Continuous Integration and Delivery Overview
In this demonstration, we will be building multiple CI/CD pipelines for provisioning and configuring our resources to AWS, using several AWS services. These services include CodeCommit, CodeBuild, CodePipeline, Systems Manager Parameter Store, and Amazon Simple Storage Service (S3). The diagram below shows the complete CI/CD workflow we will build using these AWS services, along with Ansible.
AWS CodeCommit
According to Amazon, AWS CodeCommit is a fully-managed source control service that makes it easy to host secure and highly scalable private Git repositories. CodeCommit eliminates the need to operate your own source control system or worry about scaling its infrastructure.
Start by creating two AWS CodeCommit repositories to hold the two GitHub projects you cloned earlier. Commit both projects to your own AWS CodeCommit repositories.
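If you prefer the command line, the repositories can be created and populated with the AWS CLI and Git; the commands below are a sketch, and the repository names and region are my assumptions, so adjust them to your environment.

# create the two CodeCommit repositories
aws codecommit create-repository --repository-name ansible-aws-cfn
aws codecommit create-repository --repository-name ansible-aws-roles

# add CodeCommit as a remote and push (shown for the CloudFormation project)
cd ansible-aws-cfn
git remote add codecommit \
  https://git-codecommit.us-east-1.amazonaws.com/v1/repos/ansible-aws-cfn
git push codecommit master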
Configuration Management
We have several options for storing the configuration values necessary to provision and configure the resources on AWS. We could set configuration values as environment variables directly in CodeBuild. We could set configuration values from within our Ansible Roles. We could use AWS Systems Manager Parameter Store to store configuration values. For this demonstration, we will use a combination of all three options.
AWS Systems Manager Parameter Store
According to Amazon, AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. You can store data such as passwords, database strings, and license codes as parameter values, as either plain text or encrypted.
The demonstration uses two CloudFormation templates. The two templates have several parameters. A majority of those parameter values will be stored in Parameter Store, retrieved by CodeBuild, and injected into the CloudFormation template during provisioning.
The Ansible GitHub project includes a shell script, parameter_store_values.sh, to put the necessary parameters into Parameter Store. The script requires the AWS Command Line Interface (CLI) to be installed locally. You will need to change the KEY_PATH value in the script (snippet shown below) to match the location of your private key, part of the Amazon EC2 Key Pair you created earlier for use by Ansible.
KEY_PATH="/path/to/private/key" # put encrypted parameter to Parameter Store aws ssm put-parameter \ --name $PARAMETER_PATH/ansible_private_key \ --type SecureString \ --value "file://${KEY_PATH}" \ --description "Ansible private key for EC2 instances" \ --overwrite
SecureString
Whereas all other parameters are stored in Parameter Store as String datatypes, the private key is stored as a SecureString datatype. Parameter Store uses an AWS Key Management Service (KMS) customer master key (CMK) to encrypt the SecureString parameter value. The IAM Role used by CodeBuild (discussed later) will have the correct permissions to use the KMS key to retrieve and decrypt the private key SecureString parameter value.
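To confirm the SecureString parameter was stored and can be decrypted, you could retrieve it locally with the AWS CLI; this is a sketch, and the parameter path assumes the /ansible_demo prefix used elsewhere in the demonstration.

# retrieve and decrypt the private key parameter (path assumes the /ansible_demo prefix)
aws ssm get-parameter \
  --name /ansible_demo/ansible_private_key \
  --with-decryption \
  --query 'Parameter.Value' \
  --output text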
CloudFormation
The demonstration uses two CloudFormation templates. The first template, network-stack.template, contains the AWS network stack resources. The template includes one VPC, one Internet Gateway, two NAT Gateways, four Subnets, two Elastic IP Addresses, and associated Route Tables and Security Groups. The second template, compute-stack.template, contains the webserver compute stack resources. The template includes an Auto Scaling Group, Launch Configuration, Application Load Balancer (ALB), ALB Listener, ALB Target Group, and an Instance Security Group. Both templates originated from the AWS CloudFormation template sample library, and were modified for this demonstration.
The two templates are located in the cfn_templates directory of the CloudFormation project, as shown below in the tree view.
.
├── LICENSE.md
├── README.md
├── buildspec_files
│   ├── build.sh
│   └── buildspec.yml
├── cfn_templates
│   ├── compute-stack.template
│   └── network-stack.template
├── codebuild_projects
│   ├── build.sh
│   └── cfn-validate-s3.json
├── codepipeline_pipelines
│   ├── build.sh
│   └── cfn-validate-s3.json
└── requirements.txt
The templates require no modifications for the demonstration. All parameters are in Parameter Store or set by the Ansible Roles, and consumed by the Ansible Playbooks via CodeBuild.
Ansible
We will use Red Hat Ansible to provision the network and compute resources by interacting directly with CloudFormation, deploy and configure Apache HTTP Server, and finally, perform final integration tests of the system. In my opinion, the closest equivalent to Ansible on the AWS platform is AWS OpsWorks. OpsWorks lets you use Chef and Puppet (direct competitors to Ansible) to automate how servers are configured, deployed, and managed across Amazon EC2 instances or on-premises compute environments.
Ansible Config
To use Ansible with AWS and CloudFormation, you will first want to customize your project’s ansible.cfg file to enable the aws_ec2 inventory plugin. Below is part of my configuration file as a reference.
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 300
host_key_checking = False
roles_path = roles
inventory = inventories/hosts
remote_user = ec2-user
private_key_file = ~/.ssh/ansible

[inventory]
enable_plugins = host_list, script, yaml, ini, auto, aws_ec2
Ansible Roles
According to Ansible, Roles are ways of automatically loading certain variable files, tasks, and handlers based on a known file structure. Grouping content by roles also allows easy sharing of roles with other users. For the demonstration, I have written four roles, located in the roles directory, as shown below in the project tree view. The default common role is not used in this demonstration.
.
├── LICENSE.md
├── README.md
├── ansible.cfg
├── buildspec_files
│   ├── buildspec_compute.yml
│   ├── buildspec_integration_tests.yml
│   ├── buildspec_network.yml
│   └── buildspec_web_config.yml
├── codebuild_projects
│   ├── ansible-test.json
│   ├── ansible-web-config.json
│   ├── build.sh
│   ├── cfn-compute.json
│   ├── cfn-network.json
│   └── notes.md
├── filter_plugins
├── group_vars
├── host_vars
├── inventories
│   ├── aws_ec2.yml
│   ├── ec2.ini
│   ├── ec2.py
│   └── hosts
├── library
├── module_utils
├── notes.md
├── parameter_store_values.sh
├── playbooks
│   ├── 10_cfn_network.yml
│   ├── 20_cfn_compute.yml
│   ├── 30_web_config.yml
│   └── 40_integration_tests.yml
├── production
├── requirements.txt
├── roles
│   ├── cfn_compute
│   ├── cfn_network
│   ├── common
│   ├── httpd
│   └── integration_tests
├── site.yml
└── staging
The four roles include a role for provisioning the network, the cfn_network role; a role for configuring the compute resources, the cfn_compute role; a role for deploying and configuring the Apache servers, the httpd role; and finally, a role to perform final integration tests of the platform, the integration_tests role. The individual roles help separate the project’s major parts (network, compute, and middleware) into logical code files. Each role was initially built using Ansible Galaxy (ansible-galaxy init). They follow Galaxy’s standard file structure, as shown in the tree view below of the cfn_network role.
.
├── README.md
├── defaults
│   └── main.yml
├── files
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   ├── create.yml
│   ├── delete.yml
│   └── main.yml
├── templates
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml
Testing Ansible Roles
In addition to checking each role during development and on each code commit with Ansible Lint, each role contains a set of unit tests, in the tests directory, to confirm the success or failure of the role’s tasks. Below we see a basic set of tests for the cfn_compute role. First, we gather Facts about the deployed EC2 instances. Facts are information Ansible can automatically derive from your remote systems. We check the facts for expected properties of the running EC2 instances, including timezone, operating system, major OS version, and the UserID. Note the use of the failed_when conditional. This Ansible playbook error-handling conditional is used to confirm the success or failure of tasks.
---
- name: Test cfn_compute Ansible role
  gather_facts: True
  hosts: tag_Group_webservers

  pre_tasks:
  - name: List all ansible facts
    debug:
      msg: "{{ ansible_facts }}"

  tasks:
  - name: Check if EC2 instance's timezone is set to 'UTC'
    debug:
      msg: Timezone is UTC
    failed_when: ansible_facts['date_time']['tz'] != 'UTC'

  - name: Check if EC2 instance's OS is 'Amazon'
    debug:
      msg: OS is Amazon
    failed_when: ansible_facts['distribution_file_variety'] != 'Amazon'

  - name: Check if EC2 instance's OS major version is '2018'
    debug:
      msg: OS major version is 2018
    failed_when: ansible_facts['distribution_major_version'] != '2018'

  - name: Check if EC2 instance's UserID is 'ec2-user'
    debug:
      msg: UserID is ec2-user
    failed_when: ansible_facts['user_id'] != 'ec2-user'
If we were to run the tests on their own, against the two correctly provisioned and configured EC2 web servers, we would see results similar to the following.
In the cfn_network role unit tests, below, note the use of the Ansible cloudformation_facts module. This module allows us to obtain facts about the successfully completed AWS CloudFormation stack. We can then use these facts to drive additional provisioning and configuration, or testing. In the task below, we get the network CloudFormation stack’s Outputs. These are the exact same values we would see in the stack’s Outputs tab, in the AWS CloudFormation management console.
---
- name: Test cfn_network Ansible role
  gather_facts: False
  hosts: localhost

  pre_tasks:
  - name: Get facts about the newly created cfn network stack
    cloudformation_facts:
      stack_name: "ansible-cfn-demo-network"
    register: cfn_network_stack_facts

  - name: List 'stack_outputs' from cached facts
    debug:
      msg: "{{ cloudformation['ansible-cfn-demo-network'].stack_outputs }}"

  tasks:
  - name: Check if the AWS Region of the VPC is {{ lookup('env','AWS_REGION') }}
    debug:
      msg: "AWS Region of the VPC is {{ lookup('env','AWS_REGION') }}"
    failed_when: cloudformation['ansible-cfn-demo-network'].stack_outputs['VpcRegion'] != lookup('env','AWS_REGION')
Similar to the CloudFormation templates, the Ansible roles require no modifications. Most of the project’s parameters are decoupled from the code and stored in Parameter Store or CodeBuild buildspec files (discussed next). The few parameters found in the roles, in the defaults/main.yml files, are neither account- nor environment-specific.
Ansible Playbooks
The roles will be called by our Ansible Playbooks. There is a create and a delete set of tasks for the cfn_network and cfn_compute roles. Either the create or the delete tasks are accessible through the role, using the main.yml file and referencing the create or delete Ansible Tags.
---
- import_tasks: create.yml
  tags:
    - create

- import_tasks: delete.yml
  tags:
    - delete
Below, we see the create tasks for the cfn_network role, create.yml, referenced above by main.yml. The use of the cloudformation module in the first task allows us to create or delete AWS CloudFormation stacks and demonstrates the real power of Ansible: the ability to execute complex AWS resource provisioning by extending its core functionality via a module. By switching the cloud module, we could just as easily provision resources on Google Cloud, Azure, AliCloud, OpenStack, or VMware, to name but a few.
---
- name: create a stack, pass in the template via an S3 URL
  cloudformation:
    stack_name: "{{ stack_name }}"
    state: present
    region: "{{ lookup('env','AWS_REGION') }}"
    disable_rollback: false
    template_url: "{{ lookup('env','TEMPLATE_URL') }}"
    template_parameters:
      VpcCIDR: "{{ lookup('env','VPC_CIDR') }}"
      PublicSubnet1CIDR: "{{ lookup('env','PUBLIC_SUBNET_1_CIDR') }}"
      PublicSubnet2CIDR: "{{ lookup('env','PUBLIC_SUBNET_2_CIDR') }}"
      PrivateSubnet1CIDR: "{{ lookup('env','PRIVATE_SUBNET_1_CIDR') }}"
      PrivateSubnet2CIDR: "{{ lookup('env','PRIVATE_SUBNET_2_CIDR') }}"
      TagEnv: "{{ lookup('env','TAG_ENVIRONMENT') }}"
    tags:
      Stack: "{{ stack_name }}"
The CloudFormation parameters in the above task are mainly derived from environment variables, whose values were retrieved from Parameter Store by CodeBuild and set in the environment. We obtain these external values using Ansible’s Lookup Plugins. The stack_name variable’s value is derived from the role’s defaults/main.yml file. The task variables use Jinja2 templating syntax.
The associated Ansible Playbooks, which call the tasks, are located in the playbooks directory, as shown previously in the tree view. Each playbook defines a few required parameters, such as where the list of hosts will be derived, and calls the appropriate roles. For our simple demonstration, only a single role is called per playbook. Typically, in a larger project, you would call multiple roles from a single playbook. Below, we see the Network playbook, playbooks/10_cfn_network.yml, which calls the cfn_network role.
---
- name: Provision VPC and Subnets
  hosts: localhost
  connection: local
  gather_facts: False

  roles:
    - role: cfn_network
Dynamic Inventory
Another principal feature of Ansible is demonstrated in the Web Server Configuration playbook, playbooks/30_web_config.yml, shown below. Note the hosts to which we want to deploy and configure Apache HTTP Server are selected based on an AWS tag value, indicated by the reference to tag_Group_webservers. This indirectly refers to an AWS tag, named Group, with the value of webservers, which was applied to our EC2 hosts by CloudFormation. The ability to generate a Dynamic Inventory, using a dynamic external inventory system, is a key feature of Ansible.
---
- name: Configure Apache Web Servers
  hosts: tag_Group_webservers
  gather_facts: False
  become: yes
  become_method: sudo

  roles:
    - role: httpd
To generate a dynamic inventory of EC2 hosts, we are using the Ansible AWS EC2 Dynamic Inventory script, the inventories/ec2.py and inventories/ec2.ini files. The script dynamically queries AWS for all the EC2 hosts containing specific AWS tags, belonging to a particular Security Group, Region, Availability Zone, and so forth.
I have customized the AWS EC2 Dynamic Inventory script’s configuration in the inventories/aws_ec2.yml file. Amongst other configuration items, the file defines keyed_groups. This instructs the script to inventory EC2 hosts according to their unique AWS tags and tag values.
plugin: aws_ec2
remote_user: ec2-user
private_key_file: ~/.ssh/ansible
regions:
  - us-east-1
keyed_groups:
  - key: tags.Name
    prefix: tag_Name_
    separator: ''
  - key: tags.Group
    prefix: tag_Group_
    separator: ''
hostnames:
  - dns-name
  - ip-address
  - private-dns-name
  - private-ip-address
compose:
  ansible_host: ip_address
Once you have built the CloudFormation compute stack, later in the demonstration, you can build the dynamic EC2 inventory of hosts using the following command.
ansible-inventory -i inventories/aws_ec2.yml --graph
You would then see an inventory of all your EC2 hosts, resembling the following.
@all:
  |--@aws_ec2:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@tag_Group_webservers:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@tag_Name_Apache_Web_Server:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@ungrouped:
Note the two EC2 web server instances listed under tag_Group_webservers. They represent the target inventory onto which we will install Apache HTTP Server. We could also use the Name tag group, tag_Name_Apache_Web_Server.
AWS CodeBuild
Recalling our diagram, you will note the use of CodeBuild is a vital part of each of our five DevOps workflows. CodeBuild is used to 1) validate the CloudFormation templates, 2) provision the network resources, 3) provision the compute resources, 4) install and configure the web servers, and 5) run integration tests.
Splitting these processes into separate workflows, we can redeploy the web servers without impacting the compute resources, or redeploy the compute resources without affecting the network resources. Often, different teams within a large enterprise are responsible for each of these resource categories: architecture, security (IAM), network, compute, web servers, and code deployments. Separating concerns makes a shared-ownership model easier to manage.
Build Specifications
CodeBuild projects rely on a build specification, or buildspec, file for their configuration, as shown below. CodeBuild’s buildspec file is analogous to Jenkins’ Jenkinsfile. Each of our five workflows will use CodeBuild. Each CodeBuild project references a separate buildspec file, included in the two GitHub projects, which by now you have pushed to your two CodeCommit repositories.
Below we see an example of the buildspec file for the CodeBuild project that deploys our AWS network resources, buildspec_files/buildspec_network.yml.
version: 0.2

env:
  variables:
    TEMPLATE_URL: "https://s3.amazonaws.com/garystafford_cloud_formation/cf_demo/network-stack.template"
    AWS_REGION: "us-east-1"
    TAG_ENVIRONMENT: "ansible-cfn-demo"
  parameter-store:
    VPC_CIDR: "/ansible_demo/vpc_cidr"
    PUBLIC_SUBNET_1_CIDR: "/ansible_demo/public_subnet_1_cidr"
    PUBLIC_SUBNET_2_CIDR: "/ansible_demo/public_subnet_2_cidr"
    PRIVATE_SUBNET_1_CIDR: "/ansible_demo/private_subnet_1_cidr"
    PRIVATE_SUBNET_2_CIDR: "/ansible_demo/private_subnet_2_cidr"

phases:
  install:
    runtime-versions:
      python: 3.7
    commands:
      - pip install -r requirements.txt -q
  build:
    commands:
      - ansible-playbook -i inventories/aws_ec2.yml playbooks/10_cfn_network.yml --tags create -v
  post_build:
    commands:
      - ansible-playbook -i inventories/aws_ec2.yml roles/cfn_network/tests/test.yml
There are several distinct sections to the buildspec file. First, in the variables section, we define our variables. They are a combination of three static variable values and five variable values retrieved from Parameter Store. Any of these may be overwritten at build time, using the AWS CLI, SDK, or the CodeBuild management console. You will need to update some of the variables to match your particular environment, such as TEMPLATE_URL, which must match your S3 bucket path.
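For reference, a variable such as TEMPLATE_URL could also be overridden when starting a build from the AWS CLI; the sketch below assumes the cfn-network project name used later in the post and a placeholder bucket path.

aws codebuild start-build \
  --project-name cfn-network \
  --environment-variables-override \
    name=TEMPLATE_URL,value=https://s3.amazonaws.com/your-bucket/network-stack.template,type=PLAINTEXT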
Next come the phases of our build. Again, if you are familiar with Jenkins, think of these as Stages with multiple Steps. The first phase, install, builds a Docker container in which the build process is executed. Here we are using Python 3.7. We also run a pip command to install the required Python packages from our requirements.txt file. Next, we perform our build phase by executing an Ansible command.
ansible-playbook \
  -i inventories/aws_ec2.yml \
  playbooks/10_cfn_network.yml --tags create -v
The command calls our playbook, playbooks/10_cfn_network.yml. The command references the create tag. This causes the playbook to run the cfn_network role’s create tasks (roles/cfn_network/tasks/create.yml), as defined in the main.yml file (roles/cfn_network/tasks/main.yml). Lastly, in our post_build phase, we execute our role’s unit tests (roles/cfn_network/tests/test.yml), using a second Ansible command.
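That second command, taken from the buildspec’s post_build phase, is shown below.

ansible-playbook \
  -i inventories/aws_ec2.yml \
  roles/cfn_network/tests/test.yml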
CodeBuild Projects
Next, we need to create CodeBuild projects. You can do this using the AWS CLI or from the CodeBuild management console (shown below). I have included individual templates and a creation script in each project, in the codebuild_projects directory, which you could use to build the projects using the AWS CLI. You would have to modify the JSON templates, replacing all references to my specific, unique AWS resources with your own. For the demonstration, I suggest creating the five projects manually in the CodeBuild management console, using the supplied CodeBuild project templates as a guide.
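If you do take the AWS CLI route, creating a project from one of the modified JSON templates would look similar to the following sketch, shown here for the cfn-network.json template.

aws codebuild create-project \
  --cli-input-json file://codebuild_projects/cfn-network.json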
CodeBuild IAM Role
To execute our CodeBuild projects, we need an IAM Role, or Roles, for CodeBuild, with permissions to access resources such as CodeCommit, S3, and CloudWatch. For this demonstration, I chose to create a single IAM Role for all workflows. I then allowed CodeBuild to assign the required policies to the Role as needed, which is a feature of CodeBuild.
CodePipeline Pipeline
In addition to CodeBuild, we are using CodePipeline for our first of five workflows. CodePipeline validates the CloudFormation templates and pushes them to our S3 bucket. The pipeline calls the corresponding CodeBuild project to validate each template, then deploys the valid CloudFormation templates to S3.
In true CI/CD fashion, the pipeline is automatically executed every time source code from the CloudFormation project is committed to the CodeCommit repository.
CodePipeline calls CodeBuild, which performs a build based on its buildspec file. This particular CodeBuild buildspec file also demonstrates another ability of CodeBuild: executing an external script. When we have a complex build phase, we may choose to call an external script, such as a Bash or Python script, versus embedding the commands in the buildspec.
version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.7
  pre_build:
    commands:
      - pip install -r requirements.txt -q
      - cfn-lint -v
  build:
    commands:
      - sh buildspec_files/build.sh

artifacts:
  files:
    - '**/*'
  base-directory: 'cfn_templates'
  discard-paths: yes
Below, we see the script that is called. Here we are using both the CloudFormation Linter, cfn-lint, and the cloudformation validate-template command to validate our templates for comparison. The two tools give slightly different, yet relevant, linting results.
#!/usr/bin/env bash

set -e

for filename in cfn_templates/*.*; do
    cfn-lint -t ${filename}
    aws cloudformation validate-template \
      --template-body file://${filename}
done
Similar to the CodeBuild project templates, I have included a CodePipeline template, in the codepipeline_pipelines directory, which you could modify and create using the AWS CLI. Alternatively, I suggest using the CodePipeline management console to create the pipeline for the demo, using the supplied CodePipeline template as a guide.
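For reference, creating the pipeline from the modified JSON template with the AWS CLI would resemble the following sketch, using the cfn-validate-s3.json template name from the project tree.

aws codepipeline create-pipeline \
  --cli-input-json file://codepipeline_pipelines/cfn-validate-s3.json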
Below, we see the stage view of the final CodePipeline pipeline.
Build the Platform
With all the resources, code, and DevOps workflows in place, we should be ready to build our platform on AWS. The CodePipeline project comes first, to validate the CloudFormation templates and place them into your S3 bucket. Since you are probably not committing new code to the CloudFormation project’s CodeCommit repository, which would trigger the pipeline, you can start the pipeline using the AWS CLI (shown below) or via the management console.
# list names of pipelines
aws codepipeline list-pipelines

# execute the validation pipeline
aws codepipeline start-pipeline-execution --name cfn-validate-s3
The pipeline should complete within a few seconds.
Next, execute each of the four CodeBuild projects in the following order.
# list the names of the projects
aws codebuild list-projects

# execute the builds in order
aws codebuild start-build --project-name cfn-network
aws codebuild start-build --project-name cfn-compute

# ensure EC2 instance checks are complete before starting
# the ansible-web-config build!
aws codebuild start-build --project-name ansible-web-config
aws codebuild start-build --project-name ansible-test
As the code comment above states, be careful not to start the ansible-web-config build until you have confirmed the EC2 instance Status Checks have completed and passed, as shown below. The previous cfn-compute build will complete when CloudFormation finishes building the new compute stack. However, the fact that CloudFormation finished does not indicate that the EC2 instances are fully up and running. Failure to wait will result in a failed build of the ansible-web-config CodeBuild project, which installs and configures the Apache HTTP Servers.
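One way to wait for the Status Checks from the command line is a sketch like the one below, which assumes the Group=webservers tag applied by the CloudFormation compute stack.

# look up the instance IDs of the web servers by their Group tag
INSTANCE_IDS=$(aws ec2 describe-instances \
  --filters "Name=tag:Group,Values=webservers" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)

# block until both instance and system status checks pass
aws ec2 wait instance-status-ok --instance-ids $INSTANCE_IDS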
Below, we see the cfn_network CodeBuild project first building a Python-based Docker container, within which to perform the build. Each build is executed in a fresh, separate Docker container, something that can trip you up if you are expecting a previous cache of Ansible Facts or previously defined environment variables to persist across multiple builds.
Below, we see the two completed CloudFormation Stacks, a result of our CodeBuild projects and Ansible.
The fifth and final CodeBuild build tests our platform by attempting to hit the Apache HTTP Server’s default home page, using the Application Load Balancer’s public DNS name.
Below, we see an example of what happens when a build fails. In this case, one of the final integration tests failed to return the expected results from the ALB endpoint.
Below, with the bug fixed, we rerun the build, which re-executes the tests, successfully.
We can manually confirm the platform is working by hitting the same public DNS name of the ALB as our tests, in our browser. The request should be load-balanced to one of the two running web servers’ default home pages. Normally, at this point, you would deploy your application to Apache, using a software continuous deployment tool, such as Jenkins, CodeDeploy, Travis CI, TeamCity, or Bamboo.
Cleaning Up
To clean up the running AWS resources from the demonstration, first delete the CloudFormation compute stack, then delete the network stack. To do so, execute the following commands, one at a time. The commands call the same playbooks we called to create the stacks, except this time we use the delete tag, as opposed to the create tag.
# first delete cfn compute stack
ansible-playbook \
  -i inventories/aws_ec2.yml \
  playbooks/20_cfn_compute.yml -t delete -v

# then delete cfn network stack
ansible-playbook \
  -i inventories/aws_ec2.yml \
  playbooks/10_cfn_network.yml -t delete -v
You should observe the following output, indicating both CloudFormation stacks have been deleted.
Confirm the stacks were deleted from the CloudFormation management console or from the AWS CLI.
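One way to confirm the deletions from the AWS CLI is to list stacks that have reached the DELETE_COMPLETE status, as sketched below.

aws cloudformation list-stacks \
  --stack-status-filter DELETE_COMPLETE \
  --query 'StackSummaries[].StackName'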
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Azure Kubernetes Service (AKS) Observability with Istio Service Mesh
Posted by Gary A. Stafford in Azure, Bash Scripting, Cloud, DevOps, Go, JavaScript, Kubernetes, Software Development on March 31, 2019
In the last two-part post, Kubernetes-based Microservice Observability with Istio Service Mesh, we deployed Istio, along with its observability tools, Prometheus, Grafana, Jaeger, and Kiali, to Google Kubernetes Engine (GKE). Following that post, I received several questions about using Istio’s observability tools with other popular managed Kubernetes platforms, primarily Azure Kubernetes Service (AKS). In most cases, including with AKS, both Istio and the observability tools are compatible.
In this short follow-up of the last post, we will replace the GKE-specific cluster setup commands, found in part one of the last post, with new commands to provision a similar AKS cluster on Azure. The new AKS cluster will run Istio 1.1.3, released 4/15/2019, alongside the latest available version of AKS (Kubernetes), 1.12.6. We will replace Google’s Stackdriver logging with Azure Monitor logs. We will retain the external MongoDB Atlas cluster and the external CloudAMQP cluster dependencies.
Previous articles about AKS include First Impressions of AKS, Azure’s New Managed Kubernetes Container Service (November 2017) and Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB (December 2017).
Source Code
All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository.
git clone \
  --branch master --single-branch \
  --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git
The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend repository. You will not need to clone the Angular UI project for this post’s demonstration.
Setup
This post assumes you have a Microsoft Azure account with the necessary resource providers registered, and the Azure Command-Line Interface (CLI), az, installed and available to your command shell. You will also need Helm and Istio 1.1.3 installed and configured, which is covered in the last post.
Start by logging into Azure from your command shell.
az login \
  --username {{ your_username_here }} \
  --password {{ your_password_here }}
Resource Providers
If you are new to Azure or AKS, you may need to register some additional resource providers to complete this demonstration.
az provider list --output table
If you are missing required resource providers, you will see errors similar to the one shown below. Simply activate the particular provider corresponding to the error.
Operation failed with status:'Bad Request'. Details: Required resource provider registrations Microsoft.Compute, Microsoft.Network are missing.
To register the necessary providers, use the Azure CLI or the Azure Portal UI.
az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Compute
Resource Group
AKS requires an Azure Resource Group. According to Azure, a resource group is a container that holds related resources for an Azure solution. The resource group includes those resources that you want to manage as a group. I chose to create a new resource group associated with my closest geographic Azure Region, East US, using the Azure CLI.
az group create \
  --resource-group aks-observability-demo \
  --location eastus
Create the AKS Cluster
Before creating the AKS cluster, check for the latest versions of AKS. At the time of this post, the latest version of AKS was 1.12.6.
az aks get-versions \
  --location eastus \
  --output table
Using the latest AKS version, create the AKS managed cluster. There are many configuration options available with the az aks create command. For this post, I am creating three worker nodes using the Azure Standard_DS3_v2 VM type, which will give us a total of 12 vCPUs and 42 GB of memory. Anything smaller and all the Pods may not be schedulable. Instead of supplying an existing SSH key, I will let Azure create a new one. You should have no need to SSH into the worker nodes. I am also enabling the monitoring add-on. According to Azure, the add-on sets up Azure Monitor for containers, announced in December 2018, which monitors the performance of workloads deployed to Kubernetes environments hosted on AKS.
time az aks create \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --node-count 3 \
  --node-vm-size Standard_DS3_v2 \
  --enable-addons monitoring \
  --generate-ssh-keys \
  --kubernetes-version 1.12.6
Using the time command, we observe that the cluster took approximately 5m48s to provision; I have seen times of up to almost 10 minutes. AKS provisioning is not as fast as GKE, which in my experience is at least 2x to 3x faster than AKS for a similarly sized cluster.
After the cluster creation completes, retrieve your AKS cluster credentials.
az aks get-credentials \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --overwrite-existing
Examine the Cluster
Use the following command to confirm the cluster is ready by examining the status of three worker nodes.
kubectl get nodes --output=wide
Observe that Azure currently uses Ubuntu 16.04.5 LTS for the worker node’s host operating system. If you recall, GKE offers both Ubuntu as well as a Container-Optimized OS from Google.
Kubernetes Dashboard
Unlike GKE, there is no custom AKS dashboard. Therefore, we will use the Kubernetes Web UI (dashboard), which is installed by default with AKS, unlike GKE. According to Azure, to make full use of the dashboard, since the AKS cluster uses RBAC, a ClusterRoleBinding must be created before you can correctly access the dashboard.
kubectl create clusterrolebinding kubernetes-dashboard \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:kubernetes-dashboard
Next, we must create a proxy tunnel on local port 8001 to the dashboard running on the AKS cluster. This CLI command creates a proxy between your local system and the Kubernetes API, and opens your web browser to the Kubernetes dashboard.
az aks browse \
  --name aks-observability-demo \
  --resource-group aks-observability-demo
Although you should use the Azure CLI, PowerShell, or SDK for all your AKS configuration tasks, using the dashboard for monitoring the cluster and the resources running on it is invaluable.
The Kubernetes dashboard also provides access to raw container logs. Azure Monitor provides the ability to construct complex log queries, but for quick troubleshooting, you may just want to see the raw logs a specific container is outputting, from the dashboard.
Azure Portal
Logging into the Azure Portal, we can observe the AKS cluster, within the new Resource Group.
In addition to the Azure Resource Group we created, there will be a second Resource Group created automatically during the creation of the AKS cluster. This group contains all the resources that compose the AKS cluster. These resources include the three worker node VM instances, and their corresponding storage disks and NICs. The group also includes a network security group, route table, virtual network, and an availability set.
Deploy Istio
From this point on, the process to deploy the Istio Service Mesh and the Go-based microservices platform follows the previous post and uses the exact same scripts. After modifying the Kubernetes resource files, to deploy Istio, use the bash script, part4_install_istio.sh. I have added a few more pauses in the script to account for the apparently slower response times from AKS as opposed to GKE. It definitely takes longer to spin up the Istio resources on AKS than on GKE, which can result in errors if you do not pause between each stage of the deployment process.
Using the Kubernetes dashboard, we can view the Istio resources running in the istio-system Namespace, as shown below. Confirm that all resource Pods are running and healthy before deploying the Go-based microservices platform.
Deploy the Platform
Deploy the Go-based microservices platform, using bash deploy script, part5a_deploy_resources.sh.
The script deploys two replicas (Pods) of each of the eight microservices, Service-A through Service-H, and the Angular UI, to the dev and test Namespaces, for a total of 36 Pods. Each Pod will have the Istio sidecar proxy (Envoy Proxy) injected into it, alongside the microservice or UI.
Azure Load Balancer
If we return to the Resource Group created automatically when the AKS cluster was created, we will now see two additional resources. There is now an Azure Load Balancer and Public IP Address.
Similar to the GKE cluster in the last post, when the Istio Ingress Gateway is deployed as part of the platform, it is materialized as an Azure Load Balancer. The front-end of the load balancer is the new public IP address. The back-end of the load-balancer is a pool containing the three AKS worker node VMs. The load balancer is associated with a set of rules and health probes.
DNS
I have associated the new Azure public IP address, connected with the front-end of the load balancer, with the four subdomains I am using to represent the UI and the edge service, Service-A, in both Namespaces. If Azure is your primary Cloud provider, then Azure DNS is a good choice to manage your domain’s DNS records. For this demo, you will require your own domain.
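If you do use Azure DNS, adding an A record for one of the four subdomains to the load balancer’s public IP would look something like the sketch below; the resource group, zone, record name, and IP address are placeholders for your own values.

az network dns record-set a add-record \
  --resource-group dns-resource-group \
  --zone-name example-api.com \
  --record-set-name api.dev \
  --ipv4-address 40.121.0.100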
Testing the Platform
With everything deployed, test that the platform is responding and generate HTTP traffic for the observability tools to record. Similar to last time, I have chosen hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab). Unlike ab, hey supports HTTP/2. Below, I am running hey directly from Azure Cloud Shell. The tool is simulating 10 concurrent users, generating a total of 500 HTTP GET requests to Service A.
# quick setup from Azure Shell using Bash
go get -u github.com/rakyll/hey
cd go/src/github.com/rakyll/hey/
go build

./hey -n 500 -c 10 -h2 http://api.dev.example-api.com/api/ping
We had 100% success with all 500 calls resulting in an HTTP 200 OK success status response code. Based on the results, we can observe the platform was capable of approximately 4 requests/second, with an average response time of 2.48 seconds and a mean time of 2.80 seconds. Almost all of that time was the result of waiting for the response, as the details indicate.
Logging
In this post, we have replaced GCP’s Stackdriver logging with Azure Monitor logs. According to Microsoft, Azure Monitor maximizes the availability and performance of applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from cloud and on-premises environments. In my opinion, Stackdriver is a superior solution for searching and correlating the logs of distributed applications running on Kubernetes. I find the interface and query language of Stackdriver easier and more intuitive than Azure Monitor, which, although powerful, requires substantial query knowledge to obtain meaningful results. For example, here is a query to view the log entries from the services in the dev Namespace, within the last day.
let startTimestamp = ago(1d);
KubePodInventory
| where TimeGenerated > startTimestamp
| where ClusterName =~ "aks-observability-demo"
| where Namespace == "dev"
| where Name contains "service-"
| distinct ContainerID
| join
(
    ContainerLog
    | where TimeGenerated > startTimestamp
)
on ContainerID
| project LogEntrySource, LogEntry, TimeGenerated, Name
| order by TimeGenerated desc
| render table
Below, we see the Logs interface with the search query and log entry results.
Below, we see a detailed view of a single log entry from Service A.
Observability Tools
The previous post goes into greater detail on the features of each of the observability tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.
We can use the exact same kubectl port-forward commands to connect to the tools on AKS as we did on GKE. According to Google, since Kubernetes v1.10, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to. We forward a local port to a port on the tool’s pod.
# Grafana
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=grafana \
  -o jsonpath='{.items[0].metadata.name}') 3000:3000 &

# Prometheus
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=prometheus \
  -o jsonpath='{.items[0].metadata.name}') 9090:9090 &

# Jaeger
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=jaeger \
  -o jsonpath='{.items[0].metadata.name}') 16686:16686 &

# Kiali
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=kiali \
  -o jsonpath='{.items[0].metadata.name}') 20001:20001 &
Prometheus and Grafana
Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.
Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also allows users to visually define alert rules for their most important metrics. Grafana will continuously evaluate rules and can send notifications.
According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see one of the pre-configured dashboards, the Istio Service Dashboard.
Jaeger
According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.
Below, we see a typical, distributed trace of the services, starting with the ingress gateway and passing across the upstream service dependencies.
Kiali
According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.
There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin; the password is 1f2d1e2e67df.
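If the credentials have been changed, you can read them back from the Secret. The sketch below assumes the Secret is named kiali, with username and passphrase keys, in the istio-system Namespace, which are the Istio 1.1 defaults.

# decode the Kiali username and passphrase from the Kubernetes Secret
kubectl -n istio-system get secret kiali \
  -o jsonpath='{.data.username}' | base64 --decode; echo
kubectl -n istio-system get secret kiali \
  -o jsonpath='{.data.passphrase}' | base64 --decode; echo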
Below, we see a detailed view of our platform, running in the dev Namespace, on AKS.
Delete AKS Cluster
Once you are finished with this demo, use the following two commands to tear down the AKS cluster and remove the cluster context from your local configuration.
time az aks delete \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --yes

kubectl config delete-context aks-observability-demo
Conclusion
In this brief, follow-up post, we have explored how the current set of observability tools, part of the latest version of Istio Service Mesh, integrates with Azure Kubernetes Service (AKS).
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Getting Started with Red Hat Ansible for Google Cloud Platform
Posted by Gary A. Stafford in Bash Scripting, Build Automation, DevOps, GCP on January 30, 2019
In this post, we will explore the use of Ansible, the open source community project sponsored by Red Hat, for automating the provisioning, configuration, deployment, and testing of resources on the Google Cloud Platform (GCP). We will start by using Ansible to configure and deploy applications to existing GCP compute resources. We will then expand our use of Ansible to provision and configure GCP compute resources using the Ansible/GCP native integration with GCP modules.
Red Hat Ansible
Ansible, purchased by Red Hat in October 2015, seamlessly provides workflow orchestration with configuration management, provisioning, and application deployment in a single platform. Unlike similar tools, Ansible’s workflow automation is agentless, relying on Secure Shell (SSH) and Windows Remote Management (WinRM). Ansible has published a whitepaper on The Benefits of Agentless Architecture.
According to G2 Crowd, Ansible is a clear leader in the Configuration Management Software category, ranked right behind GitLab. Some of Ansible’s main competitors in the category include GitLab, AWS Config, Puppet, Chef, Codenvy, HashiCorp Terraform, Octopus Deploy, and TeamCity. There are dozens of published articles, comparing Ansible to Puppet, Chef, SaltStack, and more recently, Terraform.
Google Compute Engine
According to Google, Google Compute Engine (GCE) delivers virtual machines (VMs) running in Google’s data centers and on their worldwide fiber network. Compute Engine’s tooling and workflow support enables scaling from single instances to global, load-balanced cloud computing.
Comparable products to GCE in the IaaS category include Amazon Elastic Compute Cloud (EC2), Azure Virtual Machines, IBM Cloud Virtual Servers, and Oracle Compute Cloud Service.
Apache HTTP Server
According to Apache, the Apache HTTP Server (“httpd”) is an open-source HTTP server for modern operating systems including Linux and Windows. The Apache HTTP Server provides a secure, efficient, and extensible server that provides HTTP services in sync with the current HTTP standards. The Apache HTTP Server was launched in 1995 and it has been the most popular web server on the Internet since 1996. We will deploy Apache HTTP Server to GCE VMs, using Ansible.
Demonstration
In this post, we will demonstrate two different workflows with Ansible on GCP. First, we will use Ansible to configure and deploy the Apache HTTP Server to an existing GCE instance.
- Provision and configure a GCE VM instance, disk, firewall rule, and external IP, using the Google Cloud (gcloud) CLI tool;
- Deploy and configure the Apache HTTP Server and associated packages, using an Ansible Playbook containing an httpd Ansible Role;
- Manually test the GCP resources and Apache HTTP Server;
- Clean up the GCP resources using the gcloud CLI tool;
In the second workflow, we will use Ansible to provision and configure the GCP resources, as well as deploy the Apache HTTP Server to the new GCE VM.
- Provision and configure a VM instance, disk, VPC global network, subnetwork, firewall rules, and external IP address, using an Ansible Playbook containing an Ansible Role, as opposed to the gcloud CLI tool;
- Deploy and configure the Apache HTTP Server and associated packages, using an Ansible Playbook containing an httpd Ansible Role;
- Test the GCP resources and Apache HTTP Server using role-based test tasks;
- Clean up all the GCP resources using an Ansible Playbook containing an Ansible Role;
Source Code
The source code for this post may be found on the master branch of the ansible-gcp-demo GitHub repository.
git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/ansible-gcp-demo.git
The project has the following file structure.
.
├── LICENSE
├── README.md
├── _unused
│   ├── httpd_playbook.yml
├── ansible
│   ├── ansible.cfg
│   ├── group_vars
│   │   └── webservers.yml
│   ├── inventories
│   │   ├── hosts
│   │   └── webservers_gcp.yml
│   ├── playbooks
│   │   ├── 10_webserver_infra.yml
│   │   └── 20_webserver_config.yml
│   ├── roles
│   │   ├── gcpweb
│   │   └── httpd
│   └── site.yml
├── part0_source_creds.sh
├── part1_create_vm.sh
└── part2_clean_up.sh
Source code samples in this post are displayed as GitHub Gists which may not display correctly on all mobile and social media browsers, such as LinkedIn.
Setup New GCP Project
For this demonstration, I have created a new GCP Project containing a new service account and public SSH key. The project’s service account will be used by the gcloud CLI tool and Ansible to access and provision compute resources within the project. The SSH key will be used by both tools to SSH into the GCE VM within the project. Start by creating a new GCP Project.
Add a new service account to the project on the IAM & admin ⇒ Service accounts tab.
Grant the new service account permission to the ‘Compute Admin’ Role, within the project, using the Role drop-down menu. The principle of least privilege (PoLP) suggests we should limit the service account’s permissions to only the role(s) necessary to provision the required compute resources.
Create a private key for the service account, on the IAM & admin ⇒ Service accounts tab. This private key is different than the SSH key we will add to the project, next. This private key contains the credentials for the service account.
Choose the JSON key type.
Download the private key JSON file and place it in a safe location, accessible to Ansible. Be careful not to check this file into source control. Again, this file contains the service account’s credentials used to programmatically access GCP and administer compute resources.
We should now have a service account, associated with the new GCP project, with permissions to the ‘Compute Admin’ role, and a private key which has been downloaded and is accessible to Ansible. Note the Email address of the service account, in my case, ansible@ansible-gce-demo.iam.gserviceaccount.com; you will need to reference this later in your configuration.
Next, create an SSH public/private key pair. The SSH key will be used to programmatically access the GCE VM. Creating a separate key pair allows you to limit its use to just the new GCP project. If compromised, the key pair is easily deleted and replaced in the GCP project and in the Ansible configuration. On a Mac, you can use the following commands to create a new key pair and copy the public key to the clipboard.
ssh-keygen -t rsa -b 4096 -C "ansible"

cat ~/.ssh/ansible.pub | pbcopy
Add your new public key clipboard contents to the project, on the Compute Engine ⇒ Metadata ⇒ SSH Keys tab. Adding the key here means it is usable by any VM in the project unless you explicitly block this option when provisioning a new VM and configure a key specifically for that VM.
Note the name, ansible, associated with the key; you will need to reference this later in your configuration.
Setup Ansible
Although this post is not a primer on Ansible, I will cover a few setup steps I have done to prepare for this demo. On my Mac, I am running Python 3.7, pip 18.1, and Ansible 2.7.6. With Python and pip installed, the easiest way to install Ansible on Mac or Linux is using pip.
pip install ansible
You will also need to install two additional packages in order to gather information about GCP-based hosts using GCE Dynamic Inventory, explained later in the post.
pip install requests google-auth
Ansible Configuration
I created a simple Ansible ansible.cfg file for this project, located in the ansible/ sub-directory. The Ansible configuration file contains the location of the project’s roles and inventory, which is explained later. The file also contains two configuration items associated with the SSH key pair we just created. If your key is named differently or in a different location, update the file (gist).
[defaults]
host_key_checking = False
roles_path = roles
inventory = inventories/hosts
remote_user = ansible
private_key_file = ~/.ssh/ansible

[inventory]
enable_plugins = host_list, script, yaml, ini, auto, gcp_compute
Ansible provides a complete example of the configuration file's parameters on GitHub.
Ansible Environment Variables
To decouple our specific GCP project’s credentials from the Ansible playbooks and roles, Ansible recommends setting those required module parameters as environment variables, as opposed to including them in the playbooks. Additionally, I have set the GCP project name as an environment variable, in order to also decouple it from the playbooks. To set those environment variables, source the script in the project’s root directory, using the
source
command (gist).
source ./ansible_gcp_creds.sh
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Source Ansible/GCP credentials
# usage: source ./ansible_gcp_creds.sh

# Constants - CHANGE ME!
export GCP_PROJECT='ansible-gce-demo'
export GCP_AUTH_KIND='serviceaccount'
export GCP_SERVICE_ACCOUNT_FILE='path/to/your/credentials/file.json'
export GCP_SCOPES='https://www.googleapis.com/auth/compute'
GCP CLI/Ansible Hybrid Workflow
Oftentimes, enterprises employ a mix of DevOps tooling to provision, configure, and deploy to compute resources. In this first workflow, we will use Ansible to configure and deploy a web server to an existing GCE VM, created in advance with the gcloud
CLI tool.
Create GCP Resources
First, use the gcloud
CLI tool to create a GCE VM and associated resources, including an external IP address and firewall rule for port 80 (HTTP). For simplicity, we will use the existing GCP default
Virtual Private Cloud (VPC) network and the default
us-east1 subnetwork. Execute the part1_create_vm.sh
script in the project’s root directory. The default
network should already have port 22 (SSH) open on the firewall. Note the SERVICE_ACCOUNT
variable, in the script, is the service account email found on the IAM & admin ⇒ Service accounts tab, shown in the previous section (gist).
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Create GCP VM instance and associated resources
# usage: sh ./part1_create_vm.sh

# Constants - CHANGE ME!
readonly PROJECT='ansible-gce-demo'
readonly SERVICE_ACCOUNT='ansible@ansible-gce-demo.iam.gserviceaccount.com'
readonly ZONE='us-east1-b'

# Create GCE VM with disk storage
time gcloud compute instances create web-1 \
  --project $PROJECT \
  --zone $ZONE \
  --machine-type n1-standard-1 \
  --network default \
  --subnet default \
  --network-tier PREMIUM \
  --maintenance-policy MIGRATE \
  --service-account $SERVICE_ACCOUNT \
  --scopes https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
  --tags apache-http-server \
  --image centos-7-v20190116 \
  --image-project centos-cloud \
  --boot-disk-size 200GB \
  --boot-disk-type pd-standard \
  --boot-disk-device-name compute-disk

# Create firewall rule to allow ingress traffic from port 80
time gcloud compute firewall-rules create default-allow-http \
  --project $PROJECT \
  --description 'Allow HTTP from anywhere' \
  --direction INGRESS \
  --priority 1000 \
  --network default \
  --action ALLOW \
  --rules tcp:80 \
  --source-ranges 0.0.0.0/0 \
  --target-tags apache-http-server
The output from the script should look similar to the following. Note the external IP address associated with the VM; you will need to reference this later in the post.
Using the gcloud
CLI tool or Google Cloud Console, we should be able to view our newly provisioned resources on GCP. First, our new GCE VM, using the Compute Engine ⇒ VM instances ⇒ Details tab.
Next, examine the Network interface details tab. Here we see details about the network and subnetwork our VM is running within. We see the internal and external IP addresses of the VM. We also see the firewall rules, including our new rule, allowing TCP ingress traffic on port 80.
Lastly, examine the new firewall rule, which will allow TCP traffic on port 80 from any IP address to our VM, located in the default
network. Note the other, pre-existing rules controlling access to the default
network.
The final GCP architecture looks as follows.
GCE Dynamic Inventory
Two core concepts in Ansible are hosts and inventory. We need an inventory of the hosts on which to run our Ansible playbooks. If we had long-lived hosts, often referred to as 'pets', that had long-lived static IP addresses or DNS entries, then we could manually add the hosts to a static hosts file, similar to the example below.
[webservers]
34.73.171.5
34.73.170.97
34.73.172.153

[dbservers]
db1.example.com
db2.example.com
However, given the ephemeral nature of the cloud, where hosts (often referred to as ‘cattle’), IP addresses, and even DNS entries are often short-lived, we will use the Ansible concept of Dynamic Inventory.
If you recall, we pip-installed two packages, requests and google-auth, during our Ansible setup, for use with GCE Dynamic Inventory. According to Ansible, the best way to interact with your GCE VM hosts is to use the gcp_compute inventory plugin. The plugin allows Ansible to dynamically query GCE for the nodes that can be managed. With the gcp_compute inventory plugin, we can also selectively classify the hosts we find into Groups. We will then run playbooks, containing roles, on a group or groups of hosts.
To demonstrate how to dynamically find the new GCE host, and add it to a group, execute the following command, using the Ansible Inventory CLI.
ansible-inventory --graph -i inventories/webservers_gcp.yml
The command calls the webservers_gcp.yml
file, which contains logic necessary to associate the GCE hosts with the webservers
host group. Ansible’s current documentation is pretty sparse on this subject. Thanks to Matthieu Remy for his great post, How to Use Ansible GCP Compute Inventory Plugin. For this demo, we are only looking for hosts in us-east1-b, which have ‘web-’ in their name. (gist).
---
plugin: gcp_compute
zones:
  - us-east1-b
projects:
  - ansible-gce-demo
filters: []
groups:
  webservers: "'web-' in name"
scopes:
  - https://www.googleapis.com/auth/compute
service_account_file: ~/Documents/Personal/gcp_creds/ansible-gce-demo-a0dbb4ac2ff7.json
auth_kind: serviceaccount
The output from the command should look similar to the following. We should observe that our new VM, as indicated by its external IP address, is assigned to the webservers group. We will use the power of Dynamic Inventory to apply a playbook to all the hosts within the webservers group.
We can also view details about hosts by modifying the inventory command.
ansible-inventory --list -i inventories/webservers_gcp.yml --yaml
The output from the command should look similar to the following. This particular example was run against an earlier host, with a different external IP address.
Apache HTTP Server Playbook
For our first taste of Ansible on GCP, we will run an Ansible Playbook to install and configure the Apache HTTP Server on the new CentOS-based VM. According to Ansible, playbooks, which are YAML-based, can declare configurations; they can also orchestrate the steps of any manually ordered process, even when different steps must bounce back and forth between sets of machines in particular orders, and they can launch tasks synchronously or asynchronously. Playbooks are used to orchestrate tasks, as opposed to using Ansible's ad-hoc task execution mode.
A playbook can be 'monolithic' in nature, containing all the required Variables, Tasks, and Handlers to achieve the desired outcome. If we wrote a single playbook to deploy and configure our Apache HTTP Server, it might look like the httpd_playbook.yml playbook, below (gist).
---
- name: Install Apache HTTP Server
  hosts: webservers
  become: yes
  vars:
    greeting: 'Hello Ansible on GCP!'

  tasks:
    - name: upgrade all packages
      yum:
        name: '*'
        state: latest

    - name: ensure the latest list of packages are installed
      yum:
        name: "{{ packages }}"
        state: latest
      vars:
        packages:
          - httpd
          - httpd-tools
          - php

    - name: deploy apache config file
      template:
        src: server-status.conf
        dest: /etc/httpd/conf.d/server-status.conf
      notify:
        - restart apache

    - name: deploy php document to DocumentRoot
      template:
        src: info.php
        dest: /var/www/html/info.php

    - name: deploy html document to DocumentRoot
      template:
        src: index.html.j2
        dest: /var/www/html/index.html

    - name: ensure apache is running
      service:
        name: httpd
        state: started

  handlers:
    - name: restart apache
      service:
        name: httpd
        state: restarted
We could run this playbook with the following command to deploy the Apache HTTP Server, but we won’t. Instead, next, we will run a playbook that applies the httpd
role.
ansible-playbook \
-i inventories/webservers_gcp.yml \
playbooks/httpd_playbook.yml
Ansible Roles
According to Ansible, Roles are ways of automatically loading certain vars_files, tasks, and handlers based on a known file structure. Grouping content by roles also allows easy sharing of roles with other users. The usage of roles is preferred as it provides a nice organizational system.
The httpd
role is identical in functionality to the httpd_playbook.yml
, used in the first workflow. However, the primary parts of the playbook have been decomposed into individual resource files, as described by Ansible. This structure is created using the Ansible Galaxy CLI. Ansible Galaxy is Ansible’s official hub for sharing Ansible content.
ansible-galaxy init httpd
This ansible-galaxy
command creates the following structure. I added the files and Jinja2 template, afterward.
.
├── README.md
├── defaults
│   └── main.yml
├── files
│   ├── info.php
│   └── server-status.conf
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   └── main.yml
├── templates
│   └── index.html.j2
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml
Within the httpd role:

- Variables are stored in the defaults/main.yml file (see the sketch after this list);
- Tasks are stored in the tasks/main.yml file;
- Handlers are stored in the handlers/main.yml file;
- Files are stored in the files/ sub-directory;
- Jinja2 templates are stored in the templates/ sub-directory;
- Tests are stored in the tests/ sub-directory;
- Other sub-directories and files contain metadata about the role.
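For reference, the defaults file is where the greeting variable used by the home page template lives. If you were recreating the role from scratch, it could be seeded with something like the following minimal sketch; the heredoc form is just a convenience, and the exact contents of the real file in the repository may differ.

cat > roles/httpd/defaults/main.yml <<'EOF'
---
# default variables for the httpd role
greeting: 'Hello Ansible on GCP!'
EOF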
To apply the httpd
role, we will run the 20_webserver_config.yml
playbook. Compare this playbook, below, with the previous, monolithic httpd_playbook.yml
playbook. All of the logic has now been decomposed across the httpd
role’s separate backing files (gist).
---
- name: Configure GCP webserver(s)
  hosts: webservers
  gather_facts: no
  become: yes
  roles:
    - role: httpd
We can start by running our playbook using Ansible’s Check Mode (“Dry Run”). When ansible-playbook
is run with --check
, Ansible will not make any actual changes to the remote systems. According to Ansible, Check mode is just a simulation, and if you have steps that use conditionals that depend on the results of prior commands, it may be less useful for you. However, it is great for one-node-at-a-time basic configuration management use cases. Execute the following command using Check mode.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml --check
The output from the command should look similar to the following. It shows that if we execute the actual command, we should expect seven changes to occur.
If everything looks good, then run the same command without using Check mode.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml
The output from the command should look similar to the following. Note the number of items changed, seven, is identical to the results of using Check mode, above.
If we were to execute the command using Check mode for a second time, we should observe zero changed items. This means the last command successfully applied all changes and no new changes are present in the playbook.
Testing the Results
There are a number of methods and tools we could use to test the deployments of the Apache HTTP Server and server tools. First, we can use an ad-hoc ansible
CLI command to confirm the httpd
process is running on the VM, by calling systemctl
. The systemctl
application is used to introspect and control the state of the systemd
system and service manager, running on the CentOS-based VM.
ansible webservers \
  -i inventories/webservers_gcp.yml \
  -a "systemctl status httpd"
The output from the command should look similar to the following. We see the Apache HTTP Server service details. We also see it being stopped and started as required by the tasks and handler in the role.
We can also check that the home page and PHP info documents, we deployed as part of the playbook, are in the correct location on the VM.
ansible webservers \
  -i inventories/webservers_gcp.yml \
  -a "ls -al /var/www/html"
The output from the command should look similar to the following. We see the two documents we deployed are in the root of the website directory.
Next, view our website’s home page by pointing your web browser to the external IP address we created earlier and associated with the VM, on port 80 (HTTP). We should observe the variable value in the playbook, ‘Hello Ansible on GCP!’, was injected into the Jinja2 template file, index.html.j2
, and the page deployed correctly to the VM.
If you recall from the httpd
role, we had a task to deploy the server status configuration file. This configuration file exposes the /server-status
endpoint, as shown below. The status page shows the internal and the external IP addresses assigned to the VM. It also shows the current version of Apache HTTP Server and PHP, server uptime, traffic, load, CPU usage, number of requests, number of running processes, and so forth.
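The contents of the role's files/server-status.conf are not shown in the post, but a minimal mod_status configuration exposing the endpoint described above would look something like the sketch below (Apache 2.4 syntax, as shipped with CentOS 7). Treat it as an illustration rather than the exact file from the repository; written here as a heredoc so it could be dropped into the role.

cat > roles/httpd/files/server-status.conf <<'EOF'
# Enable the richer status output (uptime, traffic, CPU usage, request counts)
ExtendedStatus On

<Location "/server-status">
    SetHandler server-status
    # Wide open for this demo; restrict with 'Require ip ...' in real use
    Require all granted
</Location>
EOF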
Testing with Apache Bench
Apache Bench (ab
) is the Apache HTTP server benchmarking tool. We can use Apache Bench locally, to generate CPU, memory, file, and network I/O loads on the VM. For example, using the following command, we can generate 100K requests to the server-status page, simulating 100 concurrent users.
ab -kc 100 -n 100000 http://your_vms_external_ip/server-status
The output from the command should look similar to the following. Observe this command successfully resulted in a sustained load on the web server for approximately 17.5 minutes.
Using the Compute Engine ⇒ VM instances ⇒ Monitoring tab, we see the corresponding Apache Bench CPU, memory, file, and network load on the VM, starting at about 10:03 AM, soon after running the playbook to install Apache HTTP Server.
Destroy GCP Resources
After exploring the results of our workflow, tear down the existing GCE resources before we continue to the next workflow. To delete resources, execute the part2_clean_up.sh
script in the project’s root directory (gist).
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Delete GCP VM instance, IP address, and firewall rule
# usage: sh ./part2_clean_up.sh

# Constants - CHANGE ME!
readonly PROJECT='ansible-gce-demo'
readonly ZONE='us-east1-b'

time yes | gcloud compute instances delete web-1 \
  --project $PROJECT --zone $ZONE

time yes | gcloud compute firewall-rules delete default-allow-http \
  --project $PROJECT
The output from the script should look similar to the following.
Ansible Workflow
In the second workflow, we will provision and configure the GCP resources, and deploy Apache HTTP Server to the new GCE VM, using Ansible. We will be using the same Project, Region, and Zone as in the previous example. However, this time we will create a new global VPC network instead of using the default network, a new subnetwork instead of using the default subnetwork, and a new firewall with ingress rules to open ports 22 and 80. Lastly, we will create an external IP address and assign it to the VM.
Provision GCP Resources
Instead of using the gcloud
CLI tool, we will use Ansible to provision the GCP resources. To accomplish this, I have created one playbook, 10_webserver_infra.yml
, with one role, gcpweb
, but two sets of tasks, one to create the GCE resources, create.yml
, and one to delete the GCP resources, delete.yml
. This is a typical Ansible playbook pattern. The standard file directory structure of the role looks as follows, similar to the httpd
role.
.
├── README.md
├── defaults
│   └── main.yml
├── files
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   ├── create.yml
│   ├── delete.yml
│   └── main.yml
├── templates
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml
To provision the GCE resources, we run the 10_webserver_infra.yml
playbook (gist).
---
- name: Create GCP webserver(s) resources
  hosts: localhost
  gather_facts: no
  connection: local
  roles:
    - role: gcpweb
This playbook runs the gcpweb
role. The role’s default main.yml
task file imports two other sets of tasks, one for create and one for delete. Each set of tasks has a corresponding tag associated with it (gist).
---
- import_tasks: create.yml
  tags:
    - create

- import_tasks: delete.yml
  tags:
    - delete
By calling the playbook and passing the 'create' tag, the role will apply the associated set of create tasks. Tags are a powerful construct in Ansible. Execute the following command, passing the create
tag.
ansible-playbook -t create playbooks/10_webserver_infra.yml
In the case of this playbook, the Check mode, used earlier, would fail here. If you recall, this feature is not designed to work with playbooks that have steps that use conditionals that depend on the results of prior commands, such as with this playbook.
The create.yml
file contains six tasks, which leverage Ansible GCP Modules. The tasks create a global VPC network, subnetwork in the us-east1 Region, firewall and rules, external IP address, disk, and VM instance (gist).
---
- name: create a network
  gcp_compute_network:
    name: ansible-network
    auto_create_subnetworks: yes
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: network

- name: create a subnetwork
  gcp_compute_subnetwork:
    name: ansible-subnet
    region: "{{ region }}"
    network: "{{ network }}"
    ip_cidr_range: "{{ ip_cidr_range }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: subnet

- name: create a firewall
  gcp_compute_firewall:
    name: ansible-firewall
    network: "projects/{{ lookup('env','GCP_PROJECT') }}/global/networks/{{ network.name }}"
    allowed:
      - ip_protocol: tcp
        ports: ['80','22']
    target_tags:
      - apache-http-server
    source_ranges: ['0.0.0.0/0']
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: firewall

- name: create an address
  gcp_compute_address:
    name: "{{ instance_name }}"
    region: "{{ region }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: address

- name: create a disk
  gcp_compute_disk:
    name: "{{ instance_name }}"
    size_gb: "{{ size_gb }}"
    source_image: 'projects/centos-cloud/global/images/centos-7-v20190116'
    zone: "{{ zone }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: disk

- name: create an instance
  gcp_compute_instance:
    state: present
    name: "{{ instance_name }}"
    machine_type: "{{ machine_type }}"
    disks:
      - auto_delete: true
        boot: true
        source: "{{ disk }}"
    network_interfaces:
      - network: "{{ network }}"
        subnetwork: "{{ subnet }}"
        access_configs:
          - name: External NAT
            nat_ip: "{{ address }}"
            type: ONE_TO_ONE_NAT
    zone: "{{ zone }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    tags:
      items:
        - apache-http-server
        - webserver
  register: instance
If you're interested in what is actually happening during the execution of the playbook, add the verbose option (-v
or -vv
) to the above command. This can be very helpful in learning Ansible.
The output from the command should look similar to the following. Note the changes applied to localhost. Since no GCE VM host(s) exist on GCP until the resources are provisioned, we reference localhost. The entire process took less than two minutes to create a global VPC network, subnetwork, firewall rules, VM, attached disk, and assign a public IP address.
All GCP resources are now provisioned and configured. Below, we see the new GCE VM created by Ansible.
Below, we see the new GCE VM’s network interface details console page, showing details about the VM, NIC, internal and external IP addresses, network, subnetwork, and ingress firewall rules.
Below, we see the VPC details showing each of the automatically-created regional subnets, and our new ‘ansible-subnet’, in the us-east1 region, and spanning 14 IP addresses in the 172.16.0.0/28 CIDR (Classless Inter-Domain Routing) block.
To deploy and configure Apache HTTP Server, run the httpd
role exactly the same way we did in the first workflow.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml
Role-based Testing
In the first workflow, we manually tested our results using a number of ad-hoc commands and by viewing web pages in our browser. These methods of testing do not lend themselves to DevOps automation. A more effective strategy is writing tests, which are part of the role and may be run each time the role is applied, as part of a CI/CD pipeline. Each role in this project contains a few simple tests to confirm the success of the tasks in the role. First, run the gcpweb
role’s tests with the following command.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  roles/gcpweb/tests/test.yml
The playbook gathers facts about the GCE hosts in the host group and runs a total of five test tasks against those hosts. The tasks confirm the host’s timezone, vCPU count, OS type, OS major version, and hostname, using the facts gathered (gist).
---
- name: Test gcpweb Ansible role
  hosts: webservers
  gather_facts: yes

  tasks:
    # - name: List all ansible facts
    #   debug:
    #     msg: "{{ ansible_facts }}"

    - name: Check if timezone is UTC
      debug:
        msg: Timezone is UTC
      failed_when: ansible_facts['date_time']['tz'] != 'UTC'

    - name: Check if processor vCPUs count is 1
      debug:
        msg: Processor vCPUs count is 1
      failed_when: ansible_facts['processor_vcpus'] != 1

    - name: Check if distribution is CentOS
      debug:
        msg: Distribution is CentOS
      failed_when: ansible_facts['distribution'] != 'CentOS'

    - name: Check if distribution major version is 7
      debug:
        msg: Distribution major version is 7
      failed_when: ansible_facts['distribution_major_version'] != '7'

    - name: Check if hostname contains 'web-'
      debug:
        msg: Hostname contains 'web-'
      failed_when: "'web-' not in ansible_facts['hostname']"
The output from the command should look similar to the following. Observe that all five tasks ran successfully.
Next, run the httpd
role’s tests.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  roles/httpd/tests/test.yml
Similarly, the output from the command should look similar to the following. The playbook runs four test tasks this time. The tasks confirm both files are present, the home page is accessible, and that the server-status page displays properly. Below, we see all four ran successfully.
Making a Playbook Change
To observe what happens if we apply a change to a playbook, let’s change the greeting
variable value in the /roles/httpd/defaults/main.yml
file in the httpd
role. Recall, the original home page looked as follows.
Change the greeting
variable value and re-run the playbook, using the same command.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml
The output from the command should look similar to the following. As expected, we should observe that only one task, deploying the home page, was changed.
Viewing the home page again, or by modifying the associated test task, we should observe the new value is injected into the Jinja2 template file, index.html.j2
, and the new page deployed correctly.
Destroy GCP Resources with Ansible
Once you are finished, you can destroy all the GCP resources by calling the 10_webserver_infra.yml playbook and passing the delete tag; the role will apply the associated set of delete tasks.
ansible-playbook -t delete playbooks/10_webserver_infra.yml
With Ansible, we delete GCP resources by changing the state
from present
to absent
. The playbook will delete the resources in a particular order, to avoid dependency conflicts, such as trying to delete the network before the VM. Note we do not have to explicitly delete the disk since, if you recall, we provisioned the VM instance with the disks.auto_delete=true
option (gist).
---
- name: delete an instance
  gcp_compute_instance:
    name: "{{ instance_name }}"
    zone: "{{ zone }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: delete an address
  gcp_compute_address:
    name: "{{ instance_name }}"
    region: "{{ region }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: delete a firewall
  gcp_compute_firewall:
    name: ansible-firewall
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: register the existing network
  gcp_compute_network:
    name: ansible-network
    project: "{{ lookup('env','GCP_PROJECT') }}"
  register: network

# - debug:
#     var: network

- name: delete a subnetwork
  gcp_compute_subnetwork:
    name: ansible-subnet
    region: "{{ region }}"
    network: "{{ network }}"
    ip_cidr_range: "{{ ip_cidr_range }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: delete a network
  gcp_compute_network:
    name: ansible-network
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent
The output from the command should look similar to the following. We see the VM instance, attached disk, firewall, rules, external IP address, subnetwork, and finally, the network, each being deleted.
Conclusion
In this post, we saw how easy it is to get started with Ansible on the Google Cloud Platform. Using Ansible’s 300+ cloud modules, provisioning, configuring, deploying to, and testing a wide range of GCP, Azure, and AWS resources are easy, repeatable, and completely automatable.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Docker Enterprise Edition: Multi-Environment, Single Control Plane Architecture for AWS
Posted by Gary A. Stafford in AWS, Cloud, Continuous Delivery, DevOps, Enterprise Software Development, Software Development on September 6, 2017
Designing a successful, cloud-based containerized application platform requires a balance of performance and security with cost, reliability, and manageability. Ensuring that a platform meets all functional and non-functional requirements, while remaining within budget and is easily maintainable, can be challenging.
As Cloud Architect and DevOps Team Lead, I recently participated in the development of two architecturally similar, lightweight, cloud-based containerized application platforms. From the start, both platforms were architected to maximize security and performance, while minimizing cost and operational complexity. The later platform was built on AWS with Docker Enterprise Edition.
Docker Enterprise Edition
Released in March of this year, Docker Enterprise Edition (Docker EE) is a secure, full-featured container-based management platform. There are currently eight versions of Docker EE, available for Windows Server, Azure, AWS, and multiple Linux distros, including RHEL, CentOS, Ubuntu, SUSE, and Oracle.
Docker EE is one of several production-grade container orchestration Platforms as a Service (PaaS). Some of the other container platforms in this category include:
- Google Container Engine (GCE, based on Google’s Kubernetes)
- AWS EC2 Container Service (ECS)
- Microsoft Azure Container Service (ACS)
- Mesosphere Enterprise DC/OS (based on Apache Mesos)
- Red Hat OpenShift (based on Kubernetes)
- Rancher Labs (based on Kubernetes, Cattle, Mesos, or Swarm)
Docker Community Edition (CE), Kubernetes, and Apache Mesos are free and open-source. Some providers, such as Rancher Labs, offer enterprise support for an additional fee. Cloud-based services, such as Red Hat Openshift Online, AWS, GCE, and ACS, charge the typical usage monthly fee. Docker EE, similar to Mesosphere Enterprise DC/OS and Red Hat OpenShift, is priced on a per node/per year annual subscription model.
Docker EE is currently offered in three subscription tiers, including Basic, Standard, and Advanced. Additionally, Docker offers Business Day and Business Critical support. Docker EE’s Advanced Tier adds several significant features, including secure multi-tenancy with node-based isolation, and image security scanning and continuous vulnerability scanning, as part of Docker EE’s Docker Trusted Registry.
Architecting for Affordability and Maintainability
Building an enterprise-scale application platform, using public cloud infrastructure, such as AWS, and a licensed Containers-as-a-Service (CaaS) platform, such as Docker EE, can quickly become complex and costly to build and maintain. Based on current list pricing, the cost of a single Linux node ranges from USD 75 per month for basic support, up to USD 300 per month for Docker Enterprise Edition Advanced with Business Critical support. Although cost is relative to the value generated by the application platform, nonetheless, architects should always strive to avoid unnecessary complexity and cost.
Recurring operational costs, such as licensed software subscriptions, support contracts, and monthly cloud-infrastructure charges, are often overlooked by project teams during the build phase. Accurately forecasting the recurring costs of a fully functional Production platform, under expected normal load, is essential. Teams often overlook how Docker image registries, databases, data lakes, and data warehouses quickly swell, inflating the monthly cloud-infrastructure charges to maintain the platform. The need to control cloud costs has led to the growth of third-party cloud management solutions, such as CloudCheckr Cloud Management Platform (CMP).
Shared Docker Environment Model
Most software development projects require multiple environments in which to continuously develop, test, demonstrate, stage, and release code. Creating separate environments, replete with their own Docker EE Universal Control Plane (aka Control Plane or UCP), Docker Trusted Registry (DTR), AWS infrastructure, and third-party components, would guarantee a high-level of isolation and performance. However, replicating all elements in each environment would add considerable build and run costs, as well as unnecessary complexity.
On both recent projects, we chose to create a single AWS Virtual Private Cloud (VPC), which contained all of the non-production environments required by our project teams. In parallel, we built an entirely separate Production VPC for the Production environment. I've seen this same pattern repeated with Red Hat OpenStack and Microsoft Azure.
Production
Isolating Production from the lower environments is essential to ensure security, and to eliminate non-production traffic from impacting the performance of Production. Corporate compliance and regulatory policies often dictate complete Production isolation. Having separate infrastructure, security appliances, role-based access controls (RBAC), configuration and secret management, and encryption keys and SSL certificates, are all required.
For complete separation of Production, different AWS accounts are frequently used. Separate AWS accounts provide separate billing, usage reporting, and AWS Identity and Access Management (IAM), amongst other advantages.
Performance and Staging
Unlike Production, there are few reasons to completely isolate lower-environments from one another. The exception I’ve encountered is Performance and Staging. These two environments are frequently separated from other environments to ensure the accuracy of performance testing and release staging activities. Performance testing, in particular, can generate enormous load on systems, which if not isolated, will impair adjacent environments, applications, and monitoring systems.
On a few recent projects, to reduce cost and complexity, we repurposed the UAT environment for performance testing, once user-acceptance testing was complete. Performance testing was conducted during off-peak development and testing periods, with access to adjacent environments blocked.
The multi-purpose UAT environment further served as a Staging environment. Applications were deployed and released to the UAT and Performance environments, following a nearly-identical process used for Production. Hotfixes to Production were also tested in this environment.
Example of Shared Environments
To demonstrate how to architect a shared non-production Docker EE environment, which minimizes cost and complexity, let’s examine the example shown below. In the example, built on AWS with Docker EE, there are four typical non-production environments, CI/CD, Development, Test, and UAT, and one Production environment.
In the example, there are two separate VPCs, the Production VPC, and the Non-Production VPC. There is no reason to configure VPC Peering between the two VPCs, as there is no need for direct communication between the two. Within the Non-Production VPC, to the left in the diagram, there is a cluster of three Docker EE UCP Manager EC2 nodes, a cluster of three DTR Worker EC2 nodes, and the four environments, consisting of varying numbers of EC2 Worker nodes. Production, to the right of the diagram, has its own cluster of three UCP Manager EC2 nodes and a cluster of six EC2 Worker nodes.
Single Non-Production UCP
As a primary means of reducing cost and complexity, in the example, a single minimally-sized Docker EE UCP cluster of three Manager nodes orchestrates activities across all four non-production environments. Alternately, you would have to create a UCP cluster for each environment; that means nine more UCP Manager nodes to configure and maintain.
The UCP users, teams, organizations, access controls, Docker Secrets, overlay networks, and other UCP features, for all non-production environments, are managed through the single Control Plane. All deployments to all the non-production environments, from the CI/CD server, are performed through the single Control Plane. Each UCP Manager node is deployed to a different AWS Availability Zone (AZ) to ensure high-availability.
Shared DTR
As another means of reducing cost and complexity, in the example, a Docker EE DTR cluster of three Worker nodes contains all Docker image repositories. Both the non-production and the Production environments use this DTR as a secure source of all Docker images. Not having to replicate image repositories, access controls, and infrastructure, or figure out how to migrate images between two separate DTR clusters, is a significant time, cost, and complexity savings. Each DTR Worker node is also deployed to a different AZ to ensure high-availability.
Using a shared DTR between non-production and Production is an important security consideration your project team needs to consider. A single DTR, shared between non-production and Production, comes with inherent availability and security risks, which should be understood in advance.
Separate Non-Production Worker Nodes
In the shared non-production environments example, each environment has dedicated AWS EC2 instances configured as Docker EE Worker nodes. The number of Worker nodes is determined by the requirements for each environment, as dictated by the project’s Development, Testing, Security, and DevOps teams. Like the UCP and DTR clusters, each Worker node, within an individual environment, is deployed to a different AZ to ensure high-availability and mimic the Production architecture.
Minimizing the number of Worker nodes in each environment, as well as the type and size of each EC2 node, offers a significant potential cost and administrative savings.
Separate Environment Ingress
In the example, the UCP, DTR, and each of the four environments are accessed through separate URLs, using AWS Hosted Zone CNAME records (subdomains). Encrypted HTTPS traffic is routed through a series of security appliances, depending on traffic type, to individual private AWS Elastic Load Balancers (ELB), one for both UCPs, the DTR, and each of the environments. Each ELB load-balances traffic to the Docker EE nodes associated with the specific traffic. All firewalls, ELBs, and the UCP and DTR are secured with a high-grade wildcard SSL certificate.
Separate Data Sources
In the shared non-production environments example, there is one Amazon Relational Database Service (RDS) instance in non-Production and one in Production. Both RDS instances are replicated across multiple Availability Zones. Within the single shared non-production RDS instance, there are four separate databases, one per non-production environment. This architecture trades the potential database performance of separate RDS instances for reduced cost and complexity.
Maintaining Environment Separation
Node Labels
To obtain sufficient environment separation while using a single UCP, each Docker EE Worker node is tagged with an environment
node label. The node label indicates which environment the Worker node is associated with. For example, in the screenshot below, a Worker node is assigned to the Development environment by tagging it with the key of environment
and the value of dev
.
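Node labels can be applied through the UCP UI, as shown above, or directly against a Swarm manager with the standard Docker CLI. The commands below are a sketch; the node name web-worker-dev-1 is hypothetical.

# Add the environment label to a Worker node
docker node update --label-add environment=dev web-worker-dev-1

# Verify the label was applied
docker node inspect web-worker-dev-1 --format '{{ .Spec.Labels }}'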
* The Docker EE screens shown here are from UCP 2.1.5, not the recently released 2.2.x, which has an updated UI appearance.

Each service's Docker Compose file uses deployment placement constraints, which indicate where Docker should or should not deploy services. In the hello-world Docker Compose file example below, the node.labels.environment
constraint is set to the ENVIRONMENT
variable, which is set during container deployment by the CI/CD server. This constraint directs Docker to only deploy the hello-world service to nodes which contain the placement constraint of node.labels.environment
, whose value matches the ENVIRONMENT
variable value.
# Hello World Service Stack
# DTR_URL: Docker Trusted Registry URL
# IMAGE: Docker Image to deploy
# ENVIRONMENT: Environment to deploy into

version: '3.2'

services:
  hello-world:
    image: ${DTR_URL}/${IMAGE}
    deploy:
      placement:
        constraints:
          - node.role == worker
          - node.labels.environment == ${ENVIRONMENT}
      replicas: 4
      update_config:
        parallelism: 4
        delay: 10s
      restart_policy:
        condition: any
        max_attempts: 3
        delay: 10s
    logging:
      driver: fluentd
      options:
        tag: docker.{{.Name}}
        env: SERVICE_NAME,ENVIRONMENT
    environment:
      SERVICE_NAME: hello-world
      ENVIRONMENT: ${ENVIRONMENT}
    command: "java \
      -Dspring.profiles.active=${ENVIRONMENT} \
      -Djava.security.egd=file:/dev/./urandom \
      -jar hello-world.jar"
Deploying from CI/CD Server
The ENVIRONMENT
value is set as an environment variable, which is then used by the CI/CD server, running a docker stack deploy
or a docker service update
command, within a deployment pipeline. Below is an example of how to use the environment variable as part of a Jenkins pipeline as code Jenkinsfile.
#!/usr/bin/env groovy

// Deploy Hello World Service Stack

node('java') {
  properties([parameters([
    choice(choices: ["ci", "dev", "test", "uat"].join("\n"),
      description: 'Environment', name: 'ENVIRONMENT')
  ])])

  stage('Git Checkout') {
    dir('service') {
      git branch: 'master',
        credentialsId: 'jenkins_github_credentials',
        url: 'ssh://git@garystafford/hello-world.git'
    }
    dir('credentials') {
      git branch: 'master',
        credentialsId: 'jenkins_github_credentials',
        url: 'ssh://git@garystafford/ucp-bundle-jenkins.git'
    }
  }

  dir('service') {
    stage('Build and Unit Test') {
      sh './gradlew clean cleanTest build'
    }

    withEnv(["IMAGE=hello-world:${BUILD_NUMBER}"]) {
      stage('Docker Build Image') {
        withCredentials([[$class: 'StringBinding',
          credentialsId: 'docker_username',
          variable: 'DOCKER_PASSWORD'],
          [$class: 'StringBinding',
          credentialsId: 'docker_username',
          variable: 'DOCKER_USERNAME']]) {
          sh "docker login -u ${DOCKER_USERNAME} -p ${DOCKER_PASSWORD} ${DTR_URL}"
        }
        sh "docker build --no-cache -t ${DTR_URL}/${IMAGE} ."
      }

      stage('Docker Push Image') {
        sh "docker push ${DTR_URL}/${IMAGE}"
      }

      withEnv(['DOCKER_TLS_VERIFY=1',
        "DOCKER_CERT_PATH=${WORKSPACE}/credentials/",
        "DOCKER_HOST=${DOCKER_HOST}"]) {
        stage('Docker Stack Deploy') {
          try {
            sh "docker service rm ${params.ENVIRONMENT}_hello-world"
            sh 'sleep 30s' // wait for service to be completely removed if it exists
          } catch (err) {
            echo "Error: ${err}" // catch and move on if it doesn't already exist
          }
          sh "docker stack deploy \
            --compose-file=docker-compose.yml ${params.ENVIRONMENT}"
        }
      }
    }
  }
}
Centralized Logging and Metrics Collection
Centralized logging and metrics collection systems are used for application and infrastructure dashboards, monitoring, and alerting. In the shared non-production environment examples, the centralized logging and metrics collection systems are internal to each VPC, but reside on separate EC2 instances and are not registered with the Control Plane. In this way, the logging and metrics collection systems should not impact the reliability, performance, and security of the applications running within Docker EE. In the example, Worker nodes run a containerized copy of fluentd, which collects and pushes logs to ELK’s Elasticsearch.
Logging and metrics collection systems could also be supplied by external cloud-based SaaS providers, such as Loggly, Sysdig and Datadog, or by the platform’s cloud-provider, such as Amazon CloudWatch.
With four environments running multiple containerized copies of each service, figuring out which log entry came from which service instance, requires multiple data points. As shown in the example Kibana UI below, the environment value, along with the service name and container ID, as well as the git commit hash and branch, are added to each log entry for easier troubleshooting. To include the environment, the value of the ENVIRONMENT
variable is passed to Docker’s fluentd log driver as an env
option. This same labeling method is used to tag metrics.
Separate Docker Service Stacks
For further environment separation within the single Control Plane, services are deployed as part of the same Docker service stack. Each service stack contains all services that comprise an application running within a single environment. Multiple stacks may be required to support multiple, distinct applications within the same environment.
For example, in the screenshot below, a hello-world service container, built with a Docker image, tagged with build 59 of the Jenkins continuous integration pipeline, is deployed as part of both the Development (dev) and Test service stacks. The CD and UAT service stacks each contain different versions of the hello-world service.
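Because the stack name carries the environment, the same Compose file can be deployed once per environment simply by varying the stack name and the ENVIRONMENT variable. A sketch from the command line, assuming the DTR_URL and IMAGE variables used by the Compose file are already exported in the shell:

# Deploy the same Compose file as two environment-specific stacks
export ENVIRONMENT=dev
docker stack deploy --compose-file=docker-compose.yml ${ENVIRONMENT}

export ENVIRONMENT=test
docker stack deploy --compose-file=docker-compose.yml ${ENVIRONMENT}

# List the resulting services, e.g. dev_hello-world and test_hello-world
docker stack services dev
docker stack services test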
Separate Docker Overlay Networks
For additional environment separation within the single non-production UCP, all Docker service stacks associated with an environment, reside on the same Docker overlay network. Overlay networks manage communications among the Docker Worker nodes, enabling service-to-service communication for all services on the same overlay network while isolating services running on one network from services running on another network.
In the example screenshot below, the hello-world service, a member of the test service stack, is running on the test_default
overlay network.
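A quick way to see this separation from the CLI is to list the overlay networks and inspect which containers and services are attached to one of them. The sketch below assumes the test_default network name shown in the screenshot above, which Docker creates automatically for the test stack.

# List the overlay networks created for each environment's stack
docker network ls --filter driver=overlay

# Show which containers and services are attached to the test environment's network
docker network inspect test_default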
Cleaning Up
Having distinct environment-centric Docker service stacks and overlay networks makes it easy to clean up an environment, without impacting adjacent environments. Both service stacks and overlay networks can be removed to clear an environment’s contents.
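A sketch of that cleanup for the test environment, again using the standard Docker CLI; note that removing a stack also removes the overlay network the stack created, so the explicit network removal is only needed for networks created outside the stack.

# Remove every service in the test environment's stack
docker stack rm test

# Remove the environment's overlay network, if it was created outside the stack
docker network rm test_default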
Separate Performance Environment
In the alternative example below, a Performance environment has been added to the Non-Production VPC. To ensure a higher level of isolation, the Performance environment has its own UCP, RDS, and ELBs. The Performance environment shares the DTR, as well as the security, logging, and monitoring components, with the rest of the non-production environments.
Below, the Performance environment has half the number of Worker nodes as Production. Performance results can be scaled for expected Production performance, given more nodes. Alternately, the number of nodes can be scaled up temporarily to match Production, then scaled back down to a minimum after testing is complete.
Shared DevOps Tooling
All environments leverage shared Development and DevOps resources, deployed to a separate VPC. Resources include Agile Application Lifecycle Management (ALM), such as JIRA or CA Agile Central, source control repository management (SCM), such as GitLab or Bitbucket, binary repository management, such as Artifactory or Nexus, and a CI/CD solution, such as Jenkins, TeamCity, or Bamboo.
From the DevOps VPC, Docker images are pushed and pulled from the DTR in the Non-Production VPC. Deployments of container-based application are executed from the DevOps VPC CI/CD server to the non-production, Performance, and Production UCPs. Separate DevOps CI/CD pipelines and access controls are essential in maintaining the separation of the non-production and Production environments.
Complete Platform
Several common components found in a Docker EE cloud-based AWS platform were discussed in the post. However, a complete AWS application platform has many more moving parts. Below is a comprehensive list of components, including DevOps tooling, organized into two categories: 1) common components that can potentially be shared across the non-production environments to save cost and complexity, and 2) components that should be replicated in each non-production environment for security and performance.
Shared Non-Production Components:
- AWS
  - Virtual Private Cloud (VPC), Region, Availability Zones
  - Route Tables, Network ACLs, Internet Gateways
  - Subnets
  - Some Security Groups
  - IAM Groups, User, Roles, Policies (RBAC)
  - Relational Database Service (RDS)
  - ElastiCache
  - API Gateway, Lambdas
  - S3 Buckets
  - Bastion Servers, NAT Gateways
  - Route 53 Hosted Zone (Registered Domain)
  - EC2 Key Pairs
  - Hardened Linux AMI
- Docker EE
  - UCP and EC2 Manager Nodes
  - DTR and EC2 Worker Nodes
  - UCP and DTR Users, Teams, Organizations
  - DTR Image Repositories
  - Secret Management
- Third-Party Components/Products
  - SSL Certificates
  - Security Components: Firewalls, Virus Scanning, VPN Servers
  - Container Security
  - End-User IAM
  - Directory Service
  - Log Aggregation
  - Metric Collection
  - Monitoring, Alerting
  - Configuration and Secret Management
- DevOps
  - CI/CD Pipelines as Code
  - Infrastructure as Code
  - Source Code Repositories
  - Binary Artifact Repositories
Isolated Non-Production Components:
- AWS
  - Route 53 Hosted Zones and Associated Records
  - Elastic Load Balancers (ELB)
  - Elastic Compute Cloud (EC2) Worker Nodes
  - Elastic IPs
  - ELB and EC2 Security Groups
  - RDS Databases (Single RDS Instance with Separate Databases)
All opinions in this post are my own and not necessarily the views of my current employer or their clients.
Infrastructure as Code Maturity Model
Posted by Gary A. Stafford in Build Automation, Continuous Delivery, DevOps, Enterprise Software Development on November 25, 2016
Systematically Evolving an Organization’s Infrastructure
Infrastructure and software development teams are increasingly building and managing infrastructure using automated tools that have been described as “infrastructure as code.” – Kief Morris (Infrastructure as Code)
The process of managing and provisioning computing infrastructure and their configuration through machine-processable, declarative, definition files, rather than physical hardware configuration or the use of interactive configuration tools. – Wikipedia (abridged)
Convergence of CD, Cloud, and IaC
In 2011, co-authors Jez Humble, formerly of ThoughtWorks, and David Farley, published their ground-breaking book, Continuous Delivery. Humble and Farley’s book set out, in their words, to automate the ‘painful, risky, and time-consuming process’ of the software ‘build, deployment, and testing process.’
Over the next five years, Humble and Farley’s Continuous Delivery made a significant contribution to the modern phenomena of DevOps. According to Wikipedia, DevOps is the ‘culture, movement or practice that emphasizes the collaboration and communication of both software developers and other information-technology (IT) professionals while automating the process of software delivery and infrastructure changes.’
In parallel with the growth of DevOps, Cloud Computing continued to grow at an explosive rate. Amazon pioneered modern cloud computing in 2006 with the launch of its Elastic Compute Cloud. Two years later, in 2008, Microsoft launched its cloud platform, Azure. In 2010, Rackspace launched OpenStack.
Today, there is a flock of ‘cloud’ providers. Their services fall into three primary service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Since we will be discussing infrastructure, we will focus on IaaS and PaaS. Leaders in this space include Google Cloud Platform, RedHat, Oracle Cloud, Pivotal Cloud Foundry, CenturyLink Cloud, Apprenda, IBM SmartCloud Enterprise, and Heroku, to mention just a few.
Finally, fast forward to June 2016: O'Reilly released Infrastructure as Code: Managing Servers in the Cloud, by Kief Morris, of ThoughtWorks. This crucial work bridges many of the concepts first introduced in Humble and Farley's Continuous Delivery with the evolving processes and practices to support cloud computing.
This post examines how to apply the principles found in the Continuous Delivery Maturity Model, an analysis tool detailed in Humble and Farley’s Continuous Delivery, and discussed herein, to the best practices found in Morris’ Infrastructure as Code.
Infrastructure as Code
Before we continue, we need a shared understanding of infrastructure as code. Below are four examples of infrastructure as code, as Wikipedia defined them, ‘machine-processable, declarative, definition files.’ The code was written using four popular tools, including HashiCorp Packer, Docker, AWS CloudFormation, and HashiCorp Terraform. Executing the code provisions virtualized cloud infrastructure.
HashiCorp Packer
Packer definition of an AWS EBS-backed AMI, based on Ubuntu.
{ "variables": { "aws_access_key": "", "aws_secret_key": "" }, "builders": [{ "type": "amazon-ebs", "access_key": "{{user `aws_access_key`}}", "secret_key": "{{user `aws_secret_key`}}", "region": "us-east-1", "source_ami": "ami-fce3c696", "instance_type": "t2.micro", "ssh_username": "ubuntu", "ami_name": "packer-example {{timestamp}}" }] }
Docker
Dockerfile, used to create a Docker image, and subsequently a Docker container, running MongoDB.
FROM ubuntu:16.04
MAINTAINER Docker

RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv EA312927

RUN echo "deb http://repo.mongodb.org/apt/ubuntu \
    $(cat /etc/lsb-release | grep DISTRIB_CODENAME | cut -d= -f2)/mongodb-org/3.2 multiverse" | \
    tee /etc/apt/sources.list.d/mongodb-org-3.2.list

RUN apt-get update && apt-get install -y mongodb-org

RUN mkdir -p /data/db

EXPOSE 27017

ENTRYPOINT ["/usr/bin/mongod"]
AWS CloudFormation
AWS CloudFormation declaration for three services enabled on a running instance.
services:
  sysvinit:
    nginx:
      enabled: "true"
      ensureRunning: "true"
      files:
        - "/etc/nginx/nginx.conf"
      sources:
        - "/var/www/html"
    php-fastcgi:
      enabled: "true"
      ensureRunning: "true"
      packages:
        yum:
          - "php"
          - "spawn-fcgi"
    sendmail:
      enabled: "false"
      ensureRunning: "false"
HashiCorp Terraform
Terraform definition of an AWS m1.small EC2 instance, running NGINX on Ubuntu.
resource "aws_instance" "web" { connection { user = "ubuntu" } instance_type = "m1.small" Ami = "${lookup(var.aws_amis, var.aws_region)}" Key_name = "${aws_key_pair.auth.id}" vpc_security_group_ids = ["${aws_security_group.default.id}"] Subnet_id = "${aws_subnet.default.id}" provisioner "remote-exec" { inline = [ "sudo apt-get -y update", "sudo apt-get -y install nginx", "sudo service nginx start", ] } }
Cloud-based Infrastructure as a Service
The previous examples provide but the narrowest of views into the potential breadth of infrastructure as code. Leading cloud providers, such as Amazon and Microsoft, offer hundreds of unique offerings, most of which may be defined and manipulated through code — infrastructure as code.
What Infrastructure Can Be Defined as Code?
The question many ask is, what types of infrastructure can be defined as code? Although vendors and cloud providers use their own unique names and descriptions, most infrastructure falls into a few broad categories (a short example follows the list):
- Compute
- Databases, Caching, and Messaging
- Storage, Backup, and Content Delivery
- Networking
- Security and Identity
- Monitoring, Logging, and Analytics
- Management Tooling
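To make a few of these categories concrete, below is a minimal, hypothetical AWS CloudFormation sketch that touches three of them: Networking (a VPC and subnet), Security and Identity (a security group), and Compute (an EC2 instance). The logical resource names and the AMI ID are placeholders chosen for illustration, not values from any of the earlier examples.

AWSTemplateFormatVersion: "2010-09-09"
Description: Hypothetical sketch spanning the networking, security, and compute categories
Resources:
  DemoVpc:                        # Networking
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
  DemoSubnet:                     # Networking
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref DemoVpc
      CidrBlock: 10.0.0.0/24
  DemoSecurityGroup:              # Security and Identity
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow inbound HTTP from anywhere
      VpcId: !Ref DemoVpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
  DemoInstance:                   # Compute
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-12345678       # placeholder AMI ID
      SubnetId: !Ref DemoSubnet
      SecurityGroupIds:
        - !Ref DemoSecurityGroup

Deploying this single template provisions all four resources together as one versioned, repeatable unit, which is precisely the point of treating infrastructure as code.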
Continuous Delivery Maturity Model
We also need a common understanding of the Continuous Delivery Maturity Model. According to Humble and Farley, the Continuous Delivery Maturity Model ‘helps to identify where an organization stands in terms of the maturity of its processes and practices and defines a progression that an organization can work through to improve.’
The Continuous Delivery Maturity Model is a 5×6 matrix, consisting of six areas of practice and five levels of maturity. Each of the matrix’s 30 elements defines a required discipline an organization needs to follow, to be considered at that level of maturity within that practice.
Areas of Practice
The CD Maturity Model examines six broad areas of practice found in most enterprise software organizations:
- Build Management and Continuous Integration
- Environments and Deployment
- Release Management and Compliance
- Testing
- Data Management
- Configuration Management
Levels of Maturity
The CD Maturity Model defines five levels of increasing maturity, from a score of -1 to 3, from Regressive to Optimizing:
- Level 3: Optimizing – Focus on process improvement
- Level 2: Quantitatively Managed – Process measured and controlled
- Level 1: Consistent – Automated processes applied across whole application lifecycle
- Level 0: Repeatable – Process documented and partly automated
- Level -1: Regressive – Processes unrepeatable, poorly controlled, and reactive
Maturity Model Analysis
The CD Maturity Model is an analysis tool. In my experience, organizations use the maturity model in one of two ways. First, an organization completes an impartial evaluation of its existing levels of maturity across all areas of practice. Then, the organization focuses on improving the overall organization’s maturity, attempting to achieve a consistent level of maturity across all areas of practice. Alternatively, the organization concentrates on a subset of practices that have the greatest business value or, given their relative immaturity, are a detriment to the other practices.
* CD Maturity Model Analysis Tool available on GitHub.
Infrastructure as Code Maturity Levels
Although infrastructure as code is not explicitly called out as a practice in the CD Maturity Model, many of its best practices can be found in the maturity model. For example, the model prescribes automated environment provisioning, orchestrated deployments, and the use of metrics for continuous improvement.
Instead of trying to retrofit infrastructure as code into the existing CD Maturity Model, I believe it is more effective to independently apply the model’s five levels of maturity to infrastructure as code. To that end, I have selected many of the best practices from the book, Infrastructure as Code, as well as from my experiences. Those selected practices have been distributed across the model’s five levels of maturity.
The result is the first pass at an evolving Infrastructure as Code Maturity Model. This model may be applied alongside the broader CD Maturity Model, or independently, to evaluate and further develop an organization’s infrastructure practices.
IaC Level -1: Regressive
Processes unrepeatable, poorly controlled, and reactive
- Limited infrastructure is provisioned and managed as code
- Infrastructure provisioning still requires many manual processes
- Infrastructure code is not written using industry-standard tooling and patterns
- Infrastructure code is not built, unit-tested, provisioned, or managed as part of a pipeline
- Infrastructure code, processes, and procedures are inconsistently documented, and not available to all required parties
IaC Level 0: Repeatable
Processes documented and partly automated
- All infrastructure code and configuration are stored in a centralized version control system
- Testing, provisioning, and management of infrastructure are done as part of an automated pipeline (see the sketch after this list)
- Infrastructure is deployable as individual components
- Leverages programmatic interfaces into physical devices
- Automated security inspection of components and dependencies
- Self-service CLI or API, where internal customers provision their resources
- All code, processes, and procedures documented and available
- Immutable infrastructure and processes
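As a minimal sketch of the automated-pipeline practice above, here is a hypothetical AWS CodeBuild buildspec that lints, validates, and deploys a CloudFormation template on every commit. The template path, stack name, and choice of cfn-lint are assumptions made for illustration, not prescriptions of the maturity model.

version: 0.2

phases:
  install:
    commands:
      - pip install cfn-lint    # static analysis for CloudFormation templates
  pre_build:
    commands:
      - cfn-lint templates/network.yaml    # hypothetical template path
      - aws cloudformation validate-template --template-body file://templates/network.yaml
  build:
    commands:
      - aws cloudformation deploy --stack-name demo-network --template-file templates/network.yaml --no-fail-on-empty-changeset

Wiring a buildspec like this to a version-controlled repository gives Level 0’s centralized version control and automated pipeline practices a concrete, auditable form.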
IaC Level 1: Consistent
Automated processes applied across whole application lifecycle
- Fully automated provisioning and management of infrastructure
- Minimal use of unsupported, ‘home-grown’ infrastructure tooling
- Unit-tests meet code-coverage requirements
- Code is continuously tested upon every check-in to version control system
- Continuously available infrastructure using zero-downtime provisioning
- Uses configuration registries
- Templatized configuration files (no awk/sed magic)
- Secrets are securely managed
- Auto-scaling based on user-defined load characteristics (see the sketch after this list)
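Two of the practices above, zero-downtime provisioning and auto-scaling, can be sketched in CloudFormation as an Auto Scaling group with a rolling update policy and a target-tracking scaling policy. The launch configuration (WebLaunchConfig), subnet IDs, and target CPU value are placeholder assumptions.

Resources:
  WebGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "6"
      LaunchConfigurationName: !Ref WebLaunchConfig    # assumed to be defined elsewhere
      VPCZoneIdentifier:
        - subnet-11111111                              # placeholder subnet IDs
        - subnet-22222222
    UpdatePolicy:
      AutoScalingRollingUpdate:                        # replace instances in small batches
        MinInstancesInService: 2
        MaxBatchSize: 1
        PauseTime: PT5M
  CpuTargetTracking:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60.0                              # illustrative load characteristic

The UpdatePolicy keeps a minimum number of instances in service while batches are replaced, one common route to zero-downtime provisioning, while the scaling policy expresses a user-defined load characteristic as a target CPU utilization.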
IaC Level 2: Quantitatively Managed
Processes measured and controlled
- Uses infrastructure definition files
- Capable of automated rollbacks
- Infrastructure and supporting systems are highly available and fault tolerant
- Externalized configuration, no black box API to modify configuration
- Fully monitored infrastructure with configurable alerting (see the sketch after this list)
- Aggregated, auditable infrastructure logging
- All code, processes, and procedures are well documented in a Knowledge Management System
- Infrastructure code uses a declarative, versus imperative, programming model, maybe…
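As an illustration of configurable alerting, the sketch below defines a hypothetical CloudWatch alarm on the average CPU of the WebGroup Auto Scaling group from the previous sketch and publishes to an SNS topic; the threshold, period, and topic are illustrative assumptions. Automated rollback, another practice above, comes largely for free with CloudFormation, which rolls back failed stack operations by default.

Resources:
  HighCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Alert when average CPU across the web tier exceeds 80%
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref WebGroup          # hypothetical group from the previous sketch
      Statistic: Average
      Period: 300                       # evaluate five-minute averages
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref OpsNotificationTopic     # hypothetical SNS topic for operational alerts
  OpsNotificationTopic:
    Type: AWS::SNS::Topic

Because the alarm is itself code, its thresholds and notification targets are versioned, reviewed, and measured alongside the infrastructure it watches.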
IaC Level 3: Optimizing
Focus on process improvement
- Self-healing, self-configurable, self-optimizing, infrastructure
- Performance tested and monitored against business KPIs
- Maximal infrastructure utilization and workload density
- Adheres to Cloud Native and 12-Factor patterns
- Cloud-agnostic code that minimizes cloud vendor lock-in
All opinions in this post are my own and not necessarily the views of my current employer or their clients.