Posts Tagged Cloud

Managing AWS Infrastructure as Code using Ansible, CloudFormation, and CodeBuild

Introduction

When it comes to provisioning and configuring resources on the AWS cloud platform, there is a wide variety of services, tools, and workflows you could choose from. You could decide to exclusively use the cloud-based services provided by AWS, such as CodeBuild, CodePipeline, CodeStar, and OpsWorks. Alternatively, you could choose open-source software (OSS) for provisioning and configuring AWS resources, such as community editions of Jenkins, HashiCorp Terraform, Pulumi, Chef, and Puppet. You might also choose to use licensed products, such as Octopus Deploy, TeamCity, CloudBees Core, Travis CI Enterprise, and XebiaLabs XL Release. You might even decide to write your own custom tools or scripts in Python, Go, JavaScript, Bash, or other common languages.

The reality in most enterprises I have worked with, teams integrate a combination of AWS services, open-source software, custom scripts, and occasionally licensed products to construct complete, end-to-end, infrastructure as code-based workflows for provisioning and configuring AWS resources. Choices are most often based on team experience, vendor relationships, and an enterprise’s specific business use cases.

In the following post, we will explore one such set of easily-integrated tools for provisioning and configuring AWS resources. The tool-stack is comprised of Red Hat Ansible, AWS CloudFormation, and AWS CodeBuild, along with several complementary AWS technologies. Using these tools, we will provision a relatively simple AWS environment, then deploy, configure, and test a highly-available set of Apache HTTP Servers. The demonstration is similar to the one featured in a previous post, Getting Started with Red Hat Ansible for Google Cloud Platform.

ansible-aws-stack2.png

Why Ansible?

With its simplicity, ease-of-use, broad compatibility with most major cloud, database, network, storage, and identity providers amongst other categories, Ansible has been a popular choice of Engineering teams for configuration-management since 2012. Given the wide variety of polyglot technologies used within modern Enterprises and the growing predominance of multi-cloud and hybrid cloud architectures, Ansible provides a common platform for enabling mature DevOps and infrastructure as code practices. Ansible is easily integrated with higher-level orchestration systems, such as AWS CodeBuild, Jenkins, or Red Hat AWX and Tower.

Technologies

The primary technologies used in this post include the following.

Red Hat Ansible

ansibleAnsible, purchased by Red Hat in October 2015, seamlessly provides workflow orchestration with configuration management, provisioning, and application deployment in a single platform. Unlike similar tools, Ansible’s workflow automation is agentless, relying on Secure Shell (SSH) and Windows Remote Management (WinRM). If you are interested in learning more on the advantages of Ansible, they’ve published a whitepaper on The Benefits of Agentless Architecture.

According to G2 Crowd, Ansible is a clear leader in the Configuration Management Software category, ranked right behind GitLab. Competitors in the category include GitLab, AWS Config, Puppet, Chef, Codenvy, HashiCorp Terraform, Octopus Deploy, and JetBrains TeamCity.

AWS CloudFormation

Deployment__Management_copy_AWS_CloudFormation-512

According to AWS, CloudFormation provides a common language to describe and provision all the infrastructure resources within AWS-based cloud environments. CloudFormation allows you to use a JSON- or YAML-based template to model and provision, in an automated and secure manner, all the resources needed for your applications across all AWS regions and accounts.

Codifying your infrastructure, often referred to as ‘Infrastructure as Code,’ allows you to treat your infrastructure as just code. You can author it with any IDE, check it into a version control system, and review the files with team members before deploying it.

AWS CodeBuild

code-build-console-iconAccording to AWS, CodeBuild is a fully managed continuous integration service that compiles your source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue.

CloudBuild integrates seamlessly with other AWS Developer tools, including CodeStar, CodeCommit, CodeDeploy, and CodePipeline.

According to G2 Crowd, the main competitors to AWS CodeBuild, in the Build Automation Software category, include Jenkins, CircleCI, CloudBees Core and CodeShip, Travis CI, JetBrains TeamCity, and Atlassian Bamboo.

Other Technologies

In addition to the major technologies noted above, we will also be leveraging the following services and tools to a lesser extent, in the demonstration:

  • AWS CodeCommit
  • AWS CodePipeline
  • AWS Systems Manager Parameter Store
  • Amazon Simple Storage Service (S3)
  • AWS Identity and Access Management (IAM)
  • AWS Command Line Interface (CLI)
  • CloudFormation Linter
  • Apache HTTP Server

Demonstration

Source Code

All source code for this post is contained in two GitHub repositories. The CloudFormation templates and associated files are in the ansible-aws-cfn GitHub repository. The Ansible Roles and related files are in the ansible-aws-roles GitHub repository. Both repositories may be cloned using the following commands.

git clone --branch master --single-branch --depth 1 --no-tags \ 
  https://github.com/garystafford/ansible-aws-cfn.git

git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/ansible-aws-roles.git

Development Process

The general process we will follow for provisioning and configuring resources in this demonstration are as follows:

  • Create an S3 bucket to store the validated CloudFormation templates
  • Create an Amazon EC2 Key Pair for Ansible
  • Create two AWS CodeCommit Repositories to store the project’s source code
  • Put parameters in Parameter Store
  • Write and test the CloudFormation templates
  • Configure Ansible and AWS Dynamic Inventory script
  • Write and test the Ansible Roles and Playbooks
  • Write the CodeBuild build specification files
  • Create an IAM Role for CodeBuild and CodePipeline
  • Create and test CodeBuild Projects and CodePipeline Pipelines
  • Provision, deploy, and configure the complete web platform to AWS
  • Test the final web platform

Prerequisites

For this demonstration, I will assume you already have an AWS account, the AWS CLI, Python, and Ansible installed locally, an S3 bucket to store the final CloudFormation templates and an Amazon EC2 Key Pair for Ansible to use for SSH.

 Continuous Integration and Delivery Overview

In this demonstration, we will be building multiple CI/CD pipelines for provisioning and configuring our resources to AWS, using several AWS services. These services include CodeCommit, CodeBuild, CodePipeline, Systems Manager Parameter Store, and Amazon Simple Storage Service (S3). The diagram below shows the complete CI/CD workflow we will build using these AWS services, along with Ansible.

aws_devops

AWS CodeCommit

According to Amazon, AWS CodeCommit is a fully-managed source control service that makes it easy to host secure and highly scalable private Git repositories. CodeCommit eliminates the need to operate your own source control system or worry about scaling its infrastructure.

Start by creating two AWS CodeCommit repositories to hold the two GitHub projects your cloned earlier. Commit both projects to your own AWS CodeCommit repositories.

screen_shot_2019-07-26_at_9_02_54_pm

Configuration Management

We have several options for storing the configuration values necessary to provision and configure the resources on AWS. We could set configuration values as environment variables directly in CodeBuild. We could set configuration values from within our Ansible Roles. We could use AWS Systems Manager Parameter Store to store configuration values. For this demonstration, we will use a combination of all three options.

AWS Systems Manager Parameter Store

According to Amazon, AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. You can store data such as passwords, database strings, and license codes as parameter values, as either plain text or encrypted.

The demonstration uses two CloudFormation templates. The two templates have several parameters. A majority of those parameter values will be stored in Parameter Store, retrieved by CloudBuild, and injected into the CloudFormation template during provisioning.

screen_shot_2019-07-26_at_9_38_33_pm

The Ansible GitHub project includes a shell script, parameter_store_values.sh, to put the necessary parameters into Parameter Store. The script requires the AWS Command Line Interface (CLI) to be installed locally. You will need to change the KEY_PATH key value in the script (snippet shown below) to match the location your private key, part of the Amazon EC2 Key Pair you created earlier for use by Ansible.

KEY_PATH="/path/to/private/key"

# put encrypted parameter to Parameter Store
aws ssm put-parameter \
  --name $PARAMETER_PATH/ansible_private_key \
  --type SecureString \
  --value "file://${KEY_PATH}" \
  --description "Ansible private key for EC2 instances" \
  --overwrite

SecureString

Whereas all other parameters are stored in Parameter Store as String datatypes, the private key is stored as a SecureString datatype. Parameter Store uses an AWS Key Management Service (KMS) customer master key (CMK) to encrypt the SecureString parameter value. The IAM Role used by CodeBuild (discussed later) will have the correct permissions to use the KMS key to retrieve and decrypt the private key SecureString parameter value.

screen_shot_2019-07-26_at_9_41_42_pm

CloudFormation

The demonstration uses two CloudFormation templates. The first template, network-stack.template, contains the AWS network stack resources. The template includes one VPC, one Internet Gateway, two NAT Gateways, four Subnets, two Elastic IP Addresses, and associated Route Tables and Security Groups. The second template, compute-stack.template, contains the webserver compute stack resources. The template includes an Auto Scaling Group, Launch Configuration, Application Load Balancer (ALB), ALB Listener, ALB Target Group, and an Instance Security Group. Both templates originated from the AWS CloudFormation template sample library, and were modified for this demonstration.

The two templates are located in the cfn_templates directory of the CloudFormation project, as shown below in the tree view.

.
├── LICENSE.md
├── README.md
├── buildspec_files
│   ├── build.sh
│   └── buildspec.yml
├── cfn_templates
│   ├── compute-stack.template
│   └── network-stack.template
├── codebuild_projects
│   ├── build.sh
│   └── cfn-validate-s3.json
├── codepipeline_pipelines
│   ├── build.sh
│   └── cfn-validate-s3.json
└── requirements.txt

The templates require no modifications for the demonstration. All parameters are in Parameter store or set by the Ansible Roles, and consumed by the Ansible Playbooks via CodeBuild.

Ansible

We will use Red Hat Ansible to provision the network and compute resources by interacting directly with CloudFormation, deploy and configure Apache HTTP Server, and finally, perform final integration tests of the system. In my opinion, the closest equivalent to Ansible on the AWS platform is AWS OpsWorks. OpsWorks lets you use Chef and Puppet (direct competitors to Ansible) to automate how servers are configured, deployed, and managed across Amazon EC2 instances or on-premises compute environments.

Ansible Config

To use Ansible with AWS and CloudFormation, you will first want to customize your project’s ansible.cfg file to enable the aws_ec2 inventory plugin. Below is part of my configuration file as a reference.

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 300

host_key_checking = False
roles_path = roles
inventory = inventories/hosts
remote_user = ec2-user
private_key_file = ~/.ssh/ansible

[inventory]
enable_plugins = host_list, script, yaml, ini, auto, aws_ec2

Ansible Roles

According to Ansible, Roles are ways of automatically loading certain variable files, tasks, and handlers based on a known file structure. Grouping content by roles also allows easy sharing of roles with other users. For the demonstration, I have written four roles, located in the roles directory, as shown below in the project tree view. The default, common role is not used in this demonstration.

.
├── LICENSE.md
├── README.md
├── ansible.cfg
├── buildspec_files
│   ├── buildspec_compute.yml
│   ├── buildspec_integration_tests.yml
│   ├── buildspec_network.yml
│   └── buildspec_web_config.yml
├── codebuild_projects
│   ├── ansible-test.json
│   ├── ansible-web-config.json
│   ├── build.sh
│   ├── cfn-compute.json
│   ├── cfn-network.json
│   └── notes.md
├── filter_plugins
├── group_vars
├── host_vars
├── inventories
│   ├── aws_ec2.yml
│   ├── ec2.ini
│   ├── ec2.py
│   └── hosts
├── library
├── module_utils
├── notes.md
├── parameter_store_values.sh
├── playbooks
│   ├── 10_cfn_network.yml
│   ├── 20_cfn_compute.yml
│   ├── 30_web_config.yml
│   └── 40_integration_tests.yml
├── production
├── requirements.txt
├── roles
│   ├── cfn_compute
│   ├── cfn_network
│   ├── common
│   ├── httpd
│   └── integration_tests
├── site.yml
└── staging

The four roles include a role for provisioning the network, the cfn_network role. A role for configuring the compute resources, the cfn_compute role. A role for deploying and configuring the Apache servers, the httpd role. Finally, a role to perform final integration tests of the platform, the integration_tests role. The individual roles help separate the project’s major parts, network, compute, and middleware, into logical code files. Each role was initially built using Ansible Galaxy (ansible-galaxy init). They follow Galaxy’s standard file structure, as shown in the tree view below, of the cfn_network role.

.
├── README.md
├── defaults
│   └── main.yml
├── files
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   ├── create.yml
│   ├── delete.yml
│   └── main.yml
├── templates
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml

Testing Ansible Roles

In addition to checking each role during development and on each code commit with Ansible Lint, each role contains a set of unit tests, in the tests directory, to confirm the success or failure of the role’s tasks. Below we see a basic set of tests for the cfn_compute role. First, we gather Facts about the deployed EC2 instances. Facts information Ansible can automatically derive from your remote systems. We check the facts for expected properties of the running EC2 instances, including timezone, Operating System, major OS version, and the UserID. Note the use of the failed_when conditional. This Ansible playbook error handling conditional is used to confirm the success or failure of tasks.

---
- name: Test cfn_compute Ansible role
  gather_facts: True
  hosts: tag_Group_webservers

  pre_tasks:
  - name: List all ansible facts
    debug:
      msg: "{{ ansible_facts }}"

  tasks:
  - name: Check if EC2 instance's timezone is set to 'UTC'
    debug:
      msg: Timezone is UTC
    failed_when: ansible_facts['date_time']['tz'] != 'UTC'

  - name: Check if EC2 instance's OS is 'Amazon'
    debug:
      msg: OS is Amazon
    failed_when: ansible_facts['distribution_file_variety'] != 'Amazon'

  - name: Check if EC2 instance's OS major version is '2018'
    debug:
      msg: OS major version is 2018
    failed_when: ansible_facts['distribution_major_version'] != '2018'

  - name: Check if EC2 instance's UserID is 'ec2-user'
    debug:
      msg: UserID is ec2-user
    failed_when: ansible_facts['user_id'] != 'ec2-user'

If we were to run the test on their own, against the two correctly provisioned and configured EC2 web servers, we would see results similar to the following.

screen_shot_2019-07-26_at_6_55_04_pm

In the cfn_network role unit tests, below, note the use of the Ansible cloudformation_facts module. This module allows us to obtain facts about the successfully completed AWS CloudFormation stack. We can then use these facts to drive additional provisioning and configuration, or testing. In the task below, we get the network CloudFormation stack’s Outputs. These are the exact same values we would see in the stack’s Output tab, from the AWS CloudFormation management console.

---
- name: Test cfn_network Ansible role
  gather_facts: False
  hosts: localhost

  pre_tasks:
    - name: Get facts about the newly created cfn network stack
      cloudformation_facts:
        stack_name: "ansible-cfn-demo-network"
      register: cfn_network_stack_facts

    - name: List 'stack_outputs' from cached facts
      debug:
        msg: "{{ cloudformation['ansible-cfn-demo-network'].stack_outputs }}"

  tasks:
  - name: Check if the AWS Region of the VPC is {{ lookup('env','AWS_REGION') }}
    debug:
      msg: "AWS Region of the VPC is {{ lookup('env','AWS_REGION') }}"
    failed_when: cloudformation['ansible-cfn-demo-network'].stack_outputs['VpcRegion'] != lookup('env','AWS_REGION')

Similar to the CloudFormation templates, the Ansible roles require no modifications. Most of the project’s parameters are decoupled from the code and stored in Parameter Store or CodeBuild buildspec files (discussed next). The few parameters found in the roles, in the defaults/main.yml files are neither account- or environment-specific.

Ansible Playbooks

The roles will be called by our Ansible Playbooks. There is a create and a delete set of tasks for the cfn_network and cfn_compute roles. Either create or delete tasks are accessible through the role, using the main.yml file and referencing the create or delete Ansible Tags.

---
- import_tasks: create.yml
  tags:
    - create

- import_tasks: delete.yml
  tags:
    - delete

Below, we see the create tasks for the cfn_network role, create.yml, referenced above by main.yml. The use of the cloudcormation module in the first task allows us to create or delete AWS CloudFormation stacks and demonstrates the real power of Ansible—the ability to execute complex AWS resource provisioning, by extending its core functionality via a module. By switching the Cloud module, we could just as easily provision resources on Google Cloud, Azure, AliCloud, OpenStack, or VMWare, to name but a few.

---
- name: create a stack, pass in the template via an S3 URL
  cloudformation:
    stack_name: "{{ stack_name }}"
    state: present
    region: "{{ lookup('env','AWS_REGION') }}"
    disable_rollback: false
    template_url: "{{ lookup('env','TEMPLATE_URL') }}"
    template_parameters:
      VpcCIDR: "{{ lookup('env','VPC_CIDR') }}"
      PublicSubnet1CIDR: "{{ lookup('env','PUBLIC_SUBNET_1_CIDR') }}"
      PublicSubnet2CIDR: "{{ lookup('env','PUBLIC_SUBNET_2_CIDR') }}"
      PrivateSubnet1CIDR: "{{ lookup('env','PRIVATE_SUBNET_1_CIDR') }}"
      PrivateSubnet2CIDR: "{{ lookup('env','PRIVATE_SUBNET_2_CIDR') }}"
      TagEnv: "{{ lookup('env','TAG_ENVIRONMENT') }}"
    tags:
      Stack: "{{ stack_name }}"

The CloudFormation parameters in the above task are mainly derived from environment variables, whose values were retrieved from the Parameter Store by CodeBuild and set in the environment. We obtain these external values using Ansible’s Lookup Plugins. The stack_name variable’s value is derived from the role’s defaults/main.yml file. The task variables use the Python Jinja2 templating system style of encoding.

variables

The associated Ansible Playbooks, which call the tasks, are located in the playbooks directory, as shown previously in the tree view. The playbooks define a few required parameters, like where the list of hosts will be derived and calls the appropriate roles. For our simple demonstration, only a single role is called per playbook. Typically, in a larger project, you would call multiple roles from a single playbook. Below, we see the Network playbook, playbooks/10_cfn_network.yml, which calls the cfn_network role.

---
- name: Provision VPC and Subnets
  hosts: localhost
  connection: local
  gather_facts: False

  roles:
    - role: cfn_network

Dynamic Inventory

Another principal feature of Ansible is demonstrated in the Web Server Configuration playbook, playbooks/30_web_config.yml, shown below. Note the hosts to which we want to deploy and configure Apache HTTP Server is based on an AWS tag value, indicated by the reference to tag_Group_webservers. This indirectly refers to an AWS tag, named Group, with the value of webservers, which was applied to our EC2 hosts by CloudFormation. The ability to generate a Dynamic Inventory, using a dynamic external inventory system, is a key feature of Ansible.

---
- name: Configure Apache Web Servers
  hosts: tag_Group_webservers
  gather_facts: False
  become: yes
  become_method: sudo

  roles:
    - role: httpd

To generate a dynamic inventory of EC2 hosts, we are using the Ansible AWS EC2 Dynamic Inventory script, inventories/ec2.py and inventories/ec2.ini files. The script dynamically queries AWS for all the EC2 hosts containing specific AWS tags, belonging to a particular Security Group, Region, Availability Zone, and so forth.

I have customized the AWS EC2 Dynamic Inventory script’s configuration in the inventories/aws_ec2.yml file. Amongst other configuration items, the file defines  keyed_groups. This instructs the script to inventory EC2 hosts according to their unique AWS tags and tag values.

plugin: aws_ec2
remote_user: ec2-user
private_key_file: ~/.ssh/ansible
regions:
  - us-east-1
keyed_groups:
  - key: tags.Name
    prefix: tag_Name_
    separator: ''
  - key: tags.Group
    prefix: tag_Group_
    separator: ''
hostnames:
  - dns-name
  - ip-address
  - private-dns-name
  - private-ip-address
compose:
  ansible_host: ip_address

Once you have built the CloudFormation compute stack in the proceeding section of the demonstration, to build the dynamic EC2 inventory of hosts, you would use the following command.

ansible-inventory -i inventories/aws_ec2.yml --graph

You would then see an inventory of all your EC2 hosts, resembling the following.

@all:
  |--@aws_ec2:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@tag_Group_webservers:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@tag_Name_Apache_Web_Server:
  |  |--ec2-18-234-137-73.compute-1.amazonaws.com
  |  |--ec2-3-95-215-112.compute-1.amazonaws.com
  |--@ungrouped:

Note the two EC2 web servers instances, listed under tag_Group_webservers. They represent the target inventory onto which we will install Apache HTTP Server. We could also use the tag, Name, with the value tag_Name_Apache_Web_Server.

AWS CodeBuild

Recalling our diagram, you will note the use of CodeBuild is a vital part of each of our five DevOps workflows. CodeBuild is used to 1) validate the CloudFormation templates, 2) provision the network resources,  3) provision the compute resources, 4) install and configure the web servers, and 5) run integration tests.

aws_devops

Splitting these processes into separate workflows, we can redeploy the web servers without impacting the compute resources or redeploy the compute resources without affecting the network resources. Often, different teams within a large enterprise are responsible for each of these resources categories—architecture, security (IAM), network, compute, web servers, and code deployments. Separating concerns makes a shared ownership model easier to manage.

Build Specifications

CodeBuild projects rely on a build specification or buildspec file for its configuration, as shown below. CodeBuild’s buildspec file is synonymous to Jenkins’ Jenkinsfile. Each of our five workflows will use CodeBuild. Each CodeBuild project references a separate buildspec file, included in the two GitHub projects, which by now you have pushed to your two CodeCommit repositories.

screen_shot_2019-07-26_at_6_10_59_pm

Below we see an example of the buildspec file for the CodeBuild project that deploys our AWS network resources, buildspec_files/buildspec_network.yml.

version: 0.2

env:
  variables:
    TEMPLATE_URL: "https://s3.amazonaws.com/garystafford_cloud_formation/cf_demo/network-stack.template"
    AWS_REGION: "us-east-1"
    TAG_ENVIRONMENT: "ansible-cfn-demo"
  parameter-store:
    VPC_CIDR: "/ansible_demo/vpc_cidr"
    PUBLIC_SUBNET_1_CIDR: "/ansible_demo/public_subnet_1_cidr"
    PUBLIC_SUBNET_2_CIDR: "/ansible_demo/public_subnet_2_cidr"
    PRIVATE_SUBNET_1_CIDR: "/ansible_demo/private_subnet_1_cidr"
    PRIVATE_SUBNET_2_CIDR: "/ansible_demo/private_subnet_2_cidr"

phases:
  install:
    runtime-versions:
      python: 3.7
    commands:
      - pip install -r requirements.txt -q
  build:
    commands:
      - ansible-playbook -i inventories/aws_ec2.yml playbooks/10_cfn_network.yml --tags create  -v
  post_build:
    commands:
      - ansible-playbook -i inventories/aws_ec2.yml roles/cfn_network/tests/test.yml

There are several distinct sections to the buildspec file. First, in the variables section, we define our variables. They are a combination of three static variable values and five variable values retrieved from the Parameter Store. Any of these may be overwritten at build-time, using the AWS CLI, SDK, or from the CodeBuild management console. You will need to update some of the variables to match your particular environment, such as the TEMPLATE_URL to match your S3 bucket path.

Next, the phases of our build. Again, if you are familiar with Jenkins, think of these as Stages with multiple Steps. The first phase, install, builds a Docker container, in which the build process is executed. Here we are using Python 3.7. We also run a pip command to install the required Python packages from our requirements.txt file. Next, we perform our build phase by executing an Ansible command.

 ansible-playbook \
  -i inventories/aws_ec2.yml \
  playbooks/10_cfn_network.yml --tags create -v

The command calls our playbook, playbooks/10_cfn_network.yml. The command references the create tag. This causes the playbook to run to cfn_network role’s create tasks (roles/cfn_network/tasks/create.yml), as defined in the main.yml file (roles/cfn_network/tasks/main.yml). Lastly, in our post_build phase, we execute our role’s unit tests (roles/cfn_network/tests/test.yml), using a second Ansible command.

CodeBuild Projects

Next, we need to create CodeBuild projects. You can do this using the AWS CLI or from the CodeBuild management console (shown below). I have included individual templates and a creation script in each project, in the codebuild_projects directory, which you could use to build the projects, using the AWS CLI. You would have to modify the JSON templates, replacing all references to my specific, unique AWS resources, with your own. For the demonstration, I suggest creating the five projects manually in the CodeBuild management console, using the supplied CodeBuild project templates as a guide.

screen_shot_2019-07-26_at_6_10_12_pm

CodeBuild IAM Role

To execute our CodeBuild projects, we need an IAM Role or Roles CodeBuild with permission to such resources as CodeCommit, S3, and CloudWatch. For this demonstration, I chose to create a single IAM Role for all workflows. I then allowed CodeBuild to assign the required policies to the Role as needed, which is a feature of CodeBuild.

screen_shot_2019-07-26_at_6_52_23_pm

CodePipeline Pipeline

In addition to CodeBuild, we are using CodePipeline for our first of five workflows. CodePipeline validates the CloudFormation templates and pushes them to our S3 bucket. The pipeline calls the corresponding CodeBuild project to validate each template, then deploys the valid CloudFormation templates to S3.

codepipeline

In true CI/CD fashion, the pipeline is automatically executed every time source code from the CloudFormation project is committed to the CodeCommit repository.

screen_shot_2019-07-26_at_6_12_51_pm

CodePipeline calls CodeBuild, which performs a build, based its buildspec file. This particular CodeBuild buildspec file also demonstrates another ability of CodeBuild, executing an external script. When we have a complex build phase, we may choose to call an external script, such as a Bash or Python script, verses embedding the commands in the buildspec.

version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.7
  pre_build:
    commands:
      - pip install -r requirements.txt -q
      - cfn-lint -v
  build:
    commands:
      - sh buildspec_files/build.sh

artifacts:
  files:
    - '**/*'
  base-directory: 'cfn_templates'
  discard-paths: yes

Below, we see the script that is called. Here we are using both the CloudFormation Linter, cfn-lint, and the cloudformation validate-template command to validate our templates for comparison. The two tools give slightly different, yet relevant, linting results.

#!/usr/bin/env bash

set -e

for filename in cfn_templates/*.*; do
    cfn-lint -t ${filename}
    aws cloudformation validate-template \
      --template-body file://${filename}
done

Similar to the CodeBuild project templates, I have included a CodePipeline template, in the codepipeline_pipelines directory, which you could modify and create using the AWS CLI. Alternatively, I suggest using the CodePipeline management console to create the pipeline for the demo, using the supplied CodePipeline template as a guide.

screen_shot_2019-07-26_at_6_11_51_pm

Below, the stage view of the final CodePipleine pipeline.

screen_shot_2019-07-26_at_6_12_26_pm

Build the Platform

With all the resources, code, and DevOps workflows in place, we should be ready to build our platform on AWS. The CodePipeline project comes first, to validate the CloudFormation templates and place them into your S3 bucket. Since you are probably not committing new code to the CloudFormation file CodeCommit repository,  which would trigger the pipeline, you can start the pipeline using the AWS CLI (shown below) or via the management console.

# list names of pipelines
aws codepipeline list-pipelines

# execute the validation pipeline
aws codepipeline start-pipeline-execution --name cfn-validate-s3

screen_shot_2019-07-26_at_6_08_03_pm

The pipeline should complete within a few seconds.

screen_shot_2019-07-26_at_10_12_53_pm.png

Next, execute each of the four CodeBuild projects in the following order.

# list the names of the projects
aws codebuild list-projects

# execute the builds in order
aws codebuild start-build --project-name cfn-network
aws codebuild start-build --project-name cfn-compute

# ensure EC2 instance checks are complete before starting
# the ansible-web-config build!
aws codebuild start-build --project-name ansible-web-config
aws codebuild start-build --project-name ansible-test

As the code comment above states, be careful not to start the ansible-web-config build until you have confirmed the EC2 instance Status Checks have completed and have passed, as shown below. The previous, cfn-compute build will complete when CloudFormation finishes building the new compute stack. However, the fact CloudFormation finished does not indicate that the EC2 instances are fully up and running. Failure to wait will result in a failed build of the ansible-web-config CodeBuild project, which installs and configures the Apache HTTP Servers.

screen_shot_2019-07-26_at_6_27_52_pm

Below, we see the cfn_network CodeBuild project first building a Python-based Docker container, within which to perform the build. Each build is executed in a fresh, separate Docker container, something that can trip you up if you are expecting a previous cache of Ansible Facts or previously defined environment variables, persisted across multiple builds.

screen_shot_2019-07-26_at_6_15_12_pm

Below, we see the two completed CloudFormation Stacks, a result of our CodeBuild projects and Ansible.

screen_shot_2019-07-26_at_6_44_43_pm

The fifth and final CodeBuild build tests our platform by attempting to hit the Apache HTTP Server’s default home page, using the Application Load Balancer’s public DNS name.

screen_shot_2019-07-26_at_6_32_09_pm

Below, we see an example of what happens when a build fails. In this case, one of the final integration tests failed to return the expected results from the ALB endpoint.

screen_shot_2019-07-26_at_6_40_37_pm

Below, with the bug is fixed, we rerun the build, which re-executed the tests, successfully.

screen_shot_2019-07-26_at_6_38_21_pm

We can manually confirm the platform is working by hitting the same public DNS name of the ALB as our tests in our browser. The request should load-balance our request to one of the two running web server’s default home page. Normally, at this point, you would deploy your application to Apache, using a software continuous deployment tool, such as Jenkins, CodeDeploy, Travis CI, TeamCity, or Bamboo.

screen_shot_2019-07-26_at_6_39_26_pm

Cleaning Up

To clean up the running AWS resources from the demonstration, first delete the CloudFormation compute stack, then delete the network stack. To do so, execute the following commands, one at a time. The commands call the same playbooks we called to create the stacks, except this time, we use the delete tag, as opposed to the create tag.

# first delete cfn compute stack
ansible-playbook \ 
  -i inventories/aws_ec2.yml \ 
  playbooks/20_cfn_compute.yml -t delete -v

# then delete cfn network stack
ansible-playbook \ 
  -i inventories/aws_ec2.yml \ 
  playbooks/10_cfn_network.yml -t delete -v

You should observe the following output, indicating both CloudFormation stacks have been deleted.

screen_shot_2019-07-26_at_7_12_38_pm

Confirm the stacks were deleted from the CloudFormation management console or from the AWS CLI.

 

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , ,

Leave a comment

Azure Kubernetes Service (AKS) Observability with Istio Service Mesh

In the last two-part post, Kubernetes-based Microservice Observability with Istio Service Mesh, we deployed Istio, along with its observability tools, Prometheus, Grafana, Jaeger, and Kiali, to Google Kubernetes Engine (GKE). Following that post, I received several questions about using Istio’s observability tools with other popular managed Kubernetes platforms, primarily Azure Kubernetes Service (AKS). In most cases, including with AKS, both Istio and the observability tools are compatible.

In this short follow-up of the last post, we will replace the GKE-specific cluster setup commands, found in part one of the last post, with new commands to provision a similar AKS cluster on Azure. The new AKS cluster will run Istio 1.1.3, released 4/15/2019, alongside the latest available version of AKS (Kubernetes), 1.12.6. We will replace Google’s Stackdriver logging with Azure Monitor logs. We will retain the external MongoDB Atlas cluster and the external CloudAMQP cluster dependencies.

Previous articles about AKS include First Impressions of AKS, Azure’s New Managed Kubernetes Container Service (November 2017) and Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB (December 2017).

Source Code

All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository.

git clone \
  --branch master --single-branch \
  --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git

The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend repository. You will not need to clone the Angular UI project for this post’s demonstration.

Setup

This post assumes you have a Microsoft Azure account with the necessary resource providers registered, and the Azure Command-Line Interface (CLI), az, installed and available to your command shell. You will also need Helm and Istio 1.1.3 installed and configured, which is covered in the last post.

screen_shot_2019-03-27_at_1_35_46_pm

Start by logging into Azure from your command shell.

az login \
  --username {{ your_username_here }} \
  --password {{ your_password_here }}

Resource Providers

If you are new to Azure or AKS, you may need to register some additional resource providers to complete this demonstration.

az provider list --output table

screen_shot_2019-03-27_at_5_37_46_pm

If you are missing required resource providers, you will see errors similar to the one shown below. Simply activate the particular provider corresponding to the error.

Operation failed with status:'Bad Request'. 
Details: Required resource provider registrations 
Microsoft.Compute, Microsoft.Network are missing.

To register the necessary providers, use the Azure CLI or the Azure Portal UI.

az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Compute

Resource Group

AKS requires an Azure Resource Group. According to Azure, a resource group is a container that holds related resources for an Azure solution. The resource group includes those resources that you want to manage as a group. I chose to create a new resource group associated with my closest geographic Azure Region, East US, using the Azure CLI.

az group create \
  --resource-group aks-observability-demo \
  --location eastus

screen_shot_2019-03-26_at_6_54_39_pm

Create the AKS Cluster

Before creating the GKE cluster, check for the latest versions of AKS. At the time of this post, the latest versions of AKS was 1.12.6.

az aks get-versions \
  --location eastus \
  --output table

screen_shot_2019-03-26_at_6_56_38_pm

Using the latest GKE version, create the GKE managed cluster. There are many configuration options available with the az aks create command. For this post, I am creating three worker nodes using the Azure Standard_DS3_v2 VM type, which will give us a total of 12 vCPUs and 42 GB of memory. Anything smaller and all the Pods may not be schedulable. Instead of supplying an existing SSH key, I will let Azure create a new one. You should have no need to SSH into the worker nodes. I am also enabling the monitoring add-on. According to Azure, the add-on sets up Azure Monitor for containers, announced in December 2018, which monitors the performance of workloads deployed to Kubernetes environments hosted on AKS.

time az aks create \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --node-count 3 \
  --node-vm-size Standard_DS3_v2 \
  --enable-addons monitoring \
  --generate-ssh-keys \
  --kubernetes-version 1.12.6

Using the time command, we observe that the cluster took approximately 5m48s to provision; I have seen times up to almost 10 minutes. AKS provisioning is not as fast as GKE, which in my experience is at least 2x-3x faster than AKS for a similarly sized cluster.

screen_shot_2019-03-26_at_7_03_49_pm

After the cluster creation completes, retrieve your AKS cluster credentials.

az aks get-credentials \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --overwrite-existing

Examine the Cluster

Use the following command to confirm the cluster is ready by examining the status of three worker nodes.

kubectl get nodes --output=wide

screen_shot_2019-03-27_at_6_06_10_pm.png

Observe that Azure currently uses Ubuntu 16.04.5 LTS for the worker node’s host operating system. If you recall, GKE offers both Ubuntu as well as a Container-Optimized OS from Google.

Kubernetes Dashboard

Unlike GKE, there is no custom AKS dashboard. Therefore, we will use the Kubernetes Web UI (dashboard), which is installed by default with AKS, unlike GKE. According to Azure, to make full use of the dashboard, since the AKS cluster uses RBAC, a ClusterRoleBinding must be created before you can correctly access the dashboard.

kubectl create clusterrolebinding kubernetes-dashboard \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:kubernetes-dashboard

Next, we must create a proxy tunnel on local port 8001 to the dashboard running on the AKS cluster. This CLI command creates a proxy between your local system and the Kubernetes API and opens your web browser to the Kubernetes dashboard.

az aks browse \
  --name aks-observability-demo \
  --resource-group aks-observability-demo

screen_shot_2019-03-26_at_7_08_54_pm

Although you should use the Azure CLI, PowerShell, or SDK for all your AKS configuration tasks, using the dashboard for monitoring the cluster and the resources running on it, is invaluable.

screen_shot_2019-03-26_at_7_06_57_pm

The Kubernetes dashboard also provides access to raw container logs. Azure Monitor provides the ability to construct complex log queries, but for quick troubleshooting, you may just want to see the raw logs a specific container is outputting, from the dashboard.

screen_shot_2019-03-29_at_9_23_57_pm

Azure Portal

Logging into the Azure Portal, we can observe the AKS cluster, within the new Resource Group.

screen_shot_2019-03-26_at_7_08_25_pm

In addition to the Azure Resource Group we created, there will be a second Resource Group created automatically during the creation of the AKS cluster. This group contains all the resources that compose the AKS cluster. These resources include the three worker node VM instances, and their corresponding storage disks and NICs. The group also includes a network security group, route table, virtual network, and an availability set.

screen_shot_2019-03-26_at_7_08_04_pm

Deploy Istio

From this point on, the process to deploy Istio Service Mesh and the Go-based microservices platform follows the previous post and use the exact same scripts. After modifying the Kubernetes resource files, to deploy Istio, use the bash script, part4_install_istio.sh. I have added a few more pauses in the script to account for the apparently slower response times from AKS as opposed to GKE. It definitely takes longer to spin up the Istio resources on AKS than on GKE, which can result in errors if you do not pause between each stage of the deployment process.

screen_shot_2019-03-26_at_7_11_44_pm

screen_shot_2019-03-26_at_7_18_26_pm

Using the Kubernetes dashboard, we can view the Istio resources running in the istio-system Namespace, as shown below. Confirm that all resource Pods are running and healthy before deploying the Go-based microservices platform.

screen_shot_2019-03-26_at_7_16_50_pm

Deploy the Platform

Deploy the Go-based microservices platform, using bash deploy script, part5a_deploy_resources.sh.

screen_shot_2019-03-26_at_7_20_05_pm

The script deploys two replicas (Pods) of each of the eight microservices, Service-A through Service-H, and the Angular UI, to the dev and test Namespaces, for a total of 36 Pods. Each Pod will have the Istio sidecar proxy (Envoy Proxy) injected into it, alongside the microservice or UI.

screen_shot_2019-03-26_at_7_21_24_pm

Azure Load Balancer

If we return to the Resource Group created automatically when the AKS cluster was created, we will now see two additional resources. There is now an Azure Load Balancer and Public IP Address.

screen_shot_2019-03-26_at_7_21_56_pm

Similar to the GKE cluster in the last post, when the Istio Ingress Gateway is deployed as part of the platform, it is materialized as an Azure Load Balancer. The front-end of the load balancer is the new public IP address. The back-end of the load-balancer is a pool containing the three AKS worker node VMs. The load balancer is associated with a set of rules and health probes.

screen_shot_2019-03-26_at_7_22_51_pm

DNS

I have associated the new Azure public IP address, connected with the front-end of the load balancer, with the four subdomains I am using to represent the UI and the edge service, Service-A, in both Namespaces. If Azure is your primary Cloud provider, then Azure DNS is a good choice to manage your domain’s DNS records. For this demo, you will require your own domain.

screen_shot_2019-03-28_at_9_43_42_pm

Testing the Platform

With everything deployed, test the platform is responding and generate HTTP traffic for the observability tools to record. Similar to last time, I have chosen hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab). Unlike ab, hey supports HTTP/2. Below, I am running hey directly from Azure Cloud Shell. The tool is simulating 10 concurrent users, generating a total of 500 HTTP GET requests to Service A.

# quick setup from Azure Shell using Bash
go get -u github.com/rakyll/hey
cd go/src/github.com/rakyll/hey/
go build
  
./hey -n 500 -c 10 -h2 http://api.dev.example-api.com/api/ping

We had 100% success with all 500 calls resulting in an HTTP 200 OK success status response code. Based on the results, we can observe the platform was capable of approximately 4 requests/second, with an average response time of 2.48 seconds and a mean time of 2.80 seconds. Almost all of that time was the result of waiting for the response, as the details indicate.

screen_shot_2019-03-26_at_7_57_03_pm

Logging

In this post, we have replaced GCP’s Stackdriver logging with Azure Monitor logs. According to Microsoft, Azure Monitor maximizes the availability and performance of applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from Cloud and on-premises environments. In my opinion, Stackdriver is a superior solution for searching and correlating the logs of distributed applications running on Kubernetes. I find the interface and query language of Stackdriver easier and more intuitive than Azure Monitor, which although powerful, requires substantial query knowledge to obtain meaningful results. For example, here is a query to view the log entries from the services in the dev Namespace, within the last day.

let startTimestamp = ago(1d);
KubePodInventory
| where TimeGenerated > startTimestamp
| where ClusterName =~ "aks-observability-demo"
| where Namespace == "dev"
| where Name contains "service-"
| distinct ContainerID
| join
(
    ContainerLog
    | where TimeGenerated > startTimestamp
)
on ContainerID
| project LogEntrySource, LogEntry, TimeGenerated, Name
| order by TimeGenerated desc
| render table

Below, we see the Logs interface with the search query and log entry results.

screen_shot_2019-03-29_at_9_13_37_pm

Below, we see a detailed view of a single log entry from Service A.

screen_shot_2019-03-29_at_9_18_12_pm

Observability Tools

The previous post goes into greater detail on the features of each of the observability tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.

We can use the exact same kubectl port-forward commands to connect to the tools on AKS as we did on GKE. According to Google, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to since Kubernetes v1.10. We forward a local port to a port on the tool’s pod.

# Grafana
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=grafana \
  -o jsonpath='{.items[0].metadata.name}') 3000:3000 &
  
# Prometheus
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=prometheus \
  -o jsonpath='{.items[0].metadata.name}') 9090:9090 &
  
# Jaeger
kubectl port-forward -n istio-system \
$(kubectl get pod -n istio-system -l app=jaeger \
-o jsonpath='{.items[0].metadata.name}') 16686:16686 &
  
# Kiali
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=kiali \
  -o jsonpath='{.items[0].metadata.name}') 20001:20001 &

screen_shot_2019-03-26_at_8_04_24_pm

Prometheus and Grafana

Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.

Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.

According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see one of the pre-configured dashboards, the Istio Service Dashboard.

screen_shot_2019-03-26_at_8_16_52_pm

Jaeger

According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.

screen_shot_2019-03-26_at_8_03_31_pm

Below, we see a typical, distributed trace of the services, starting ingress gateway and passing across the upstream service dependencies.

screen_shot_2019-03-26_at_8_03_45_pm

Kaili

According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.

There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin, the password is 1f2d1e2e67df.

screen_shot_2019-03-26_at_7_59_17_pm

Below, we see a detailed view of our platform, running in the dev namespace, on AKS.

screen_shot_2019-03-26_at_8_02_38_pm

Delete AKS Cluster

Once you are finished with this demo, use the following two commands to tear down the AKS cluster and remove the cluster context from your local configuration.

time az aks delete \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --yes

kubectl config delete-context aks-observability-demo

Conclusion

In this brief, follow-up post, we have explored how the current set of observability tools, part of the latest version of Istio Service Mesh, integrates with Azure Kubernetes Service (AKS).

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , , ,

Leave a comment

Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service

There is little question, big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last few years. Behind the hype curves and marketing buzz, these technologies are having a significant influence on all aspects of our modern lives.

However, installing, configuring, and managing the technologies that support big data analytics, data science, ML, and AI, at scale and in Production, often demands an advanced level of familiarity with Linux, distributed systems, cloud- and container-based platforms, databases, and data-streaming applications. The mere ability to manage terabytes and petabytes of transient data is beyond the capability of many enterprises, let alone performing analysis of that data.

To ease the burden of implementing these technologies, the three major cloud providers, AWS, Azure, and Google Cloud, all have multiple Big Data Analytics-, AI-, and ML-as-a-Service offerings. In this post, we will explore one such cloud-based service offering in the field of big data analytics, Google Cloud Dataproc. We will focus on Cloud Dataproc’s ability to quickly and efficiently run Spark jobs written in Java and Python, two widely adopted enterprise programming languages.

Featured Technologies

The following technologies are featured prominently in this post.

dataproc

Google Cloud Dataproc

dataproc_logoAccording to Google, Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running the Apache Spark and Apache Hadoop ecosystem on Google Cloud Platform. Dataproc is a complete platform for data processing, analytics, and machine learning. Dataproc offers per-second billing, so you only pay for exactly the resources you consume. Dataproc offers frequently updated and native versions of Apache Spark, Hadoop, Pig, and Hive, as well as other related applications. Dataproc has built-in integrations with other Google Cloud Platform (GCP) services, such as Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring. Dataproc’s clusters are configurable and resizable from a three to hundreds of nodes, and each cluster action takes less than 90 seconds on average.

Similar Platform as a Service (PaaS) offerings to Dataproc, include Amazon Elastic MapReduce (EMR), Microsoft Azure HDInsight, and Qubole Data Service. Qubole is offered on AWS, Azure, and Oracle Cloud Infrastructure (Oracle OCI).

According to Google, Cloud Dataproc and Cloud Dataflow, both part of GCP’s Data Analytics/Big Data Product offerings, can both be used for data processing, and there’s overlap in their batch and streaming capabilities. Cloud Dataflow is a fully-managed service for transforming and enriching data in stream and batch modes. Dataflow uses the Apache Beam SDK to provide developers with Java and Python APIs, similar to Spark.

Apache Spark

spark_logoAccording to Apache, Spark is a unified analytics engine for large-scale data processing, used by well-known, modern enterprises, such as Netflix, Yahoo, and eBay. With in-memory speeds up to 100x faster than Hadoop, Apache Spark achieves high performance for static, batch, and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer, and a physical execution engine.

According to a post by DataFlair, ‘the DAG in Apache Spark is a set of Vertices and Edges, where vertices represent the RDDs and the edges represent the Operation to be applied on RDD (Resilient Distributed Dataset). In Spark DAG, every edge directs from earlier to later in the sequence. On the calling of Action, the created DAG submits to DAG Scheduler which further splits the graph into the stages of the task.’ Below, we see a three-stage DAG visualization, displayed using the Spark History Server Web UI, from a job demonstrated in this post.

Screen Shot 2018-12-15 at 11.20.57 PM

Spark’s polyglot programming model allows users to write applications in Scala, Java, Python, R, and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib (Machine Learning), GraphX (Graph Processing), and DStreams (Spark Streaming). Spark may be run using its standalone cluster mode or on Apache Hadoop YARNMesos, and Kubernetes.

PySpark

pyspark_logoThe Spark Python API, PySpark, exposes the Spark programming model to Python. PySpark is built on top of Spark’s Java API. Data is processed in Python and cached and shuffled in the JVM. According to Apache, Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM.

Apache Hadoop

hadoop_logo1According to Apache, the Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. This is a rather modest description of such a significant and transformative project. When we talk about Hadoop, often it is in the context of the project’s well-known modules, which includes:

  • Hadoop Common: The common utilities that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
  • Hadoop YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management, also known as ‘Hadoop NextGen’
  • Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
  • Hadoop Ozone: An object store for Hadoop

Based on the Powered by Apache Hadoop list, there are many well-known enterprises and academic institutions using Apache Hadoop, including Adobe, eBay, Facebook, Hulu, LinkedIn, and The New York Times.

Spark vs. Hadoop

There are many articles and posts that delve into the Spark versus Hadoop debate, this post is not one of them. Although both are mature technologies, Spark, the new kid on the block, reached version 1.0.0 in May 2014, whereas Hadoop reached version 1.0.0, earlier, in December 2011. According to Google Trends, interest in both technologies has remained relatively high over the last three years. However, interest in Spark, based on the volume of searches, has been steadily outpacing Hadoop for well over a year now. The in-memory speed of Spark over HDFS-based Hadoop and ease of Spark SQL for working with structured data are likely big differentiators for many users coming from a traditional relational database background and users with large or streaming datasets, requiring near real-time processing.

spark-to-hadoop

In this post, all examples are built to run on Spark. This is not meant to suggest Spark is necessarily superior or that Spark runs better on Dataproc than Hadoop. In fact, Dataproc’s implementation of Spark relies on Hadoop’s core HDFS and YARN technologies to run.

Demonstration

To show the capabilities of Cloud Dataproc, we will create both a single-node Dataproc cluster and three-node cluster, upload Java- and Python-based analytics jobs and data to Google Cloud Storage, and execute the jobs on the Spark cluster. Finally, we will enable monitoring and notifications for the Dataproc clusters and the jobs running on the clusters with Stackdriver. The post will demonstrate the use of the Google Cloud Console, as well as Google’s Cloud SDK’s command line tools, for all tasks.

In this post, we will be uploading and running individual jobs on the Dataproc Spark cluster, as opposed to using the Cloud Dataproc Workflow Templates. According to Google, Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs. Workflow Templates are useful for automating your Datapoc workflows, however, automation is not the primary topic of this post.

Source Code

All open-sourced code for this post can be found on GitHub in two repositories, one for Java with Spark and one for Python with PySpark. Source code samples are displayed as GitHub Gists, which may not display correctly on all mobile and social media browsers.

Cost

Of course, there is a cost associated with provisioning cloud services. However, if you manage the Google Cloud Dataproc resources prudently, the costs are negligible. Regarding pricing, according to Google, Cloud Dataproc pricing is based on the size of Cloud Dataproc clusters and the duration of time that they run. The size of a cluster is based on the aggregate number of virtual CPUs (vCPUs) across the entire cluster, including the master and worker nodes. The duration of a cluster is the length of time, measured in minutes, between cluster creation and cluster deletion.

Over the course of writing the code for this post, as well as writing the post itself, the entire cost of all the related resources was a minuscule US$7.50. The cost includes creating, running, and deleting more than a dozen Dataproc clusters and uploading and executing approximately 75-100 Spark and PySpark jobs. Given the quick creation time of a cluster, 2 minutes on average or less in this demonstration, there is no reason to leave a cluster running longer than it takes to complete your workloads.

Kaggle Datasets

To explore the features of Dataproc, we will use a publicly-available dataset from Kaggle. Kaggle is a popular open-source resource for datasets used for big-data and ML applications. Their tagline is ‘Kaggle is the place to do data science projects’.

For this demonstration, I chose the IBRD Statement Of Loans Data dataset, from World Bank Financial Open Data, and available on Kaggle. The International Bank for Reconstruction and Development (IBRD) loans are public and publicly guaranteed debt extended by the World Bank Group. IBRD loans are made to, or guaranteed by, countries that are members of IBRD. This dataset contains historical snapshots of the Statement of Loans including the latest available snapshots.

screen_shot_2018-12-04_at_7.02.53_pm

There are two data files available. The ‘Statement of Loans’ latest available snapshots data file contains 8,713 rows of loan data (~3 MB), ideal for development and testing. The ‘Statement of Loans’ historic data file contains approximately 750,000 rows of data (~265 MB). Although not exactly ‘big data’, the historic dataset is large enough to sufficiently explore Dataproc. Both IBRD files have an identical schema with 33 columns of data (gist).

In this demonstration, both the Java and Python jobs will perform the same simple analysis of the larger historic dataset. For the analysis, we will ascertain the top 25 historic IBRD borrower, we will determine their total loan disbursements, current loan obligations, and the average interest rates they were charged for all loans. This simple analysis will be performed using Spark’s SQL capabilities. The results of the analysis, a Spark DataFrame containing 25 rows, will be saved as a CSV-format data file.

SELECT country, country_code,
       Format_number(total_disbursement, 0) AS total_disbursement,
       Format_number(total_obligation, 0) AS total_obligation,
       Format_number(avg_interest_rate, 2) AS avg_interest_rate
FROM   (SELECT country,
               country_code,
               Sum(disbursed) AS total_disbursement,
               Sum(obligation) AS total_obligation,
               Avg(interest_rate) AS avg_interest_rate
        FROM   loans
        GROUP  BY country, country_code
        ORDER  BY total_disbursement DESC
        LIMIT  25)

Google Cloud Storage

First, we need a location to store our Spark jobs, data files, and results, which will be accessible to Dataproc. Although there are a number of choices, the simplest and most convenient location for Dataproc is a Google Cloud Storage bucket. According to Google, Cloud Storage offers the highest level of availability and performance within a single region and is ideal for compute, analytics, and ML workloads in a particular region. Cloud Storage buckets are nearly identical to Amazon Simple Storage Service (Amazon S3), their object storage service.

Using the Google Cloud Console, Google’s Web Admin UI, create a new, uniquely named Cloud Storage bucket. Our Dataproc clusters will eventually be created in a single regional location. We need to ensure our new bucket is created in the same regional location as the clusters; I chose us-east1.

screen_shot_2018-12-04_at_7.04.45_pm

We will need the new bucket’s link, to use within the Java and Python code as well from the command line with gsutil. The gsutil tool is a Python application that lets you access Cloud Storage from the command line. The bucket’s link may be found on the Storage Browser Console’s Overview tab. A bucket’s link is always in the format, gs://bucket-name.

screen_shot_2018-12-04_at_7.06.06_pm

Alternatively, we may also create the Cloud Storage bucket using gsutil with the make buckets (mb) command, as follows:

# Always best practice since features are updated frequently
gcloud components update
  
export PROJECT=your_project_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name
  
# Make sure you are creating resources in the correct project
gcloud config set project $PROJECT
  
gsutil mb -p $PROJECT -c regional -l $REGION $BUCKET_NAME

Cloud Dataproc Cluster

Next, we will create two different Cloud Dataproc clusters for demonstration purposes. If you have not used Cloud Dataproc previously in your GCP Project, you will first need to enable the API for Cloud Dataproc.

screen_shot_2018-12-04_at_7.15.05_pm

Single Node Cluster

We will start with a single node cluster with no worker nodes, suitable for development and testing Spark and Hadoop jobs, using small datasets. Create a single-node Dataproc cluster using the Single Node Cluster mode option. Create the cluster in the same region as the new Cloud Storage bucket. This will allow the Dataproc cluster access to the bucket without additional security or IAM configuration. I used the n1-standard-1 machine type, with 1 vCPU and 3.75 GB of memory. Observe the resources assigned to Hadoop YARN for Spark job scheduling and cluster resource management.

screen_shot_2018-12-04_at_7.19.37_pm

The new cluster, consisting of a single node and no worker nodes, should be ready for use in a few minutes or less.

screen_shot_2018-12-04_at_7.38.23_pm

Note the Image version, 1.3.16-deb9. According to Google, Dataproc uses image versions to bundle operating system, big data components, and Google Cloud Platform connectors into one package that is deployed on a cluster.  This image, released in November 2018, is the latest available version at the time of this post. The image contains:

  • Apache Spark 2.3.1
  • Apache Hadoop 2.9.0
  • Apache Pig 0.17.0
  • Apache Hive 2.3.2
  • Apache Tez 0.9.0
  • Cloud Storage connector 1.9.9-hadoop2
  • Scala 2.11.8
  • Python 2.7

To avoid lots of troubleshooting, make sure your code is compatible with the image’s versions. It is important to note the image does not contain a version of Python 3. You will need to ensure your Python code is built to run with Python 2.7. Alternatively, use Dataproc’s --initialization-actions flag along with bootstrap and setup shell scripts to install Python 3 on the cluster using pip or conda. Tips for installing Python 3 on Datapoc be found on Stack Overflow and elsewhere on the Internet.

As as an alternative to the Google Cloud Console, we are able to create the cluster using a REST command. Google provides the Google Cloud Console’s equivalent REST request, as shown in the example below.

screen_shot_2018-12-04_at_7.20.07_pm

Additionally, we have the option of using the gcloud command line tool. This tool provides the primary command-line interface to Google Cloud Platform and is part of Google’s Cloud SDK, which also includes the aforementioned gsutil. Here again, Google provides the Google Cloud Console’s equivalent gcloud command. This is a great way to learn to use the command line.

screen_shot_2018-12-04_at_7.20.21_pm

Using the dataproc clusters create command, we are able to create the same cluster as shown above from the command line, as follows:

export PROJECT=your_project_name
export CLUSTER_1=your_single_node_cluster_name 
export REGION=us-east1
export ZONE=us-east1-b
export MACHINE_TYPE_SMALL=n1-standard-1
  
gcloud dataproc clusters create $CLUSTER_1 \
  --region $REGION \
  --zone $ZONE \
  --single-node \
  --master-machine-type $MACHINE_TYPE_SMALL \
  --master-boot-disk-size 500 \
  --image-version 1.3-deb9 \
  --project $PROJECT

There are a few useful commands to inspect your running Dataproc clusters. The dataproc clusters describe command, in particular, provides detailed information about all aspects of the cluster’s configuration and current state.

gcloud dataproc clusters list --region $REGION

gcloud dataproc clusters describe $CLUSTER_2 \
  --region $REGION --format json

Standard Cluster

In addition to the single node cluster, we will create a second three-node Dataproc cluster. We will compare the speed of a single-node cluster to that of a true cluster with multiple worker nodes. Create a new Dataproc cluster using the Standard Cluster mode option. Again, make sure to create the cluster in the same region as the new Storage bucket.

screen_shot_2018-12-04_at_10.15.05_pm

The second cluster contains a single master node and two worker nodes. All three nodes use the n1-standard-4 machine type, with 4 vCPU and 15 GB of memory. Although still considered a minimally-sized cluster, this cluster represents a significant increase in compute power over the first single-node cluster, which had a total of 2 vCPU, 3.75 GB of memory, and no worker nodes on which to distribute processing. Between the two workers in the second cluster, we have 8 vCPU and 30 GB of memory for computation.

screen_shot_2018-12-04_at_10.18.54_pm

Again, we have the option of using the gcloud command line tool to create the cluster:

export PROJECT=your_project_name
export CLUSTER_2=your_three_node_cluster_name 
export REGION=us-east1
export ZONE=us-east1-b
export NUM_WORKERS=2
export MACHINE_TYPE_LARGE=n1-standard-4
  
gcloud dataproc clusters create $CLUSTER_2 \
  --region $REGION \
  --zone $ZONE \
  --master-machine-type $MACHINE_TYPE_LARGE \
  --master-boot-disk-size 500 \
  --num-workers $NUM_WORKERS \
  --worker-machine-type $MACHINE_TYPE_LARGE \
  --worker-boot-disk-size 500 \
  --image-version 1.3-deb9 \
  --project $PROJECT

Cluster Creation Speed: Cloud Dataproc versus Amazon EMS?

In a series of rather unscientific tests, I found the three-node Dataproc cluster took less than two minutes on average to be created. Compare that time to a similar three-node cluster built with Amazon’s EMR service using their general purpose m4.4xlarge Amazon EC2 instance type. In a similar series of tests, I found the EMR cluster took seven minutes on average to be created. The EMR cluster took 3.5 times longer to create than the comparable Dataproc cluster. Again, although not a totally accurate comparison, since both services offer different features, it gives you a sense of the speed of Dataproc as compared to Amazon EMR.

Staging Buckets

According to Google, when you create a cluster, Cloud Dataproc creates a Cloud Storage staging bucket in your project or reuses an existing Cloud Dataproc-created bucket from a previous cluster creation request. Staging buckets are used to stage miscellaneous configuration and control files that are needed by your cluster. Below, we see the staging buckets created for the two Dataproc clusters.

screen_shot_2018-12-04_at_10.26.49_pm

Project Files

Before uploading the jobs and running them on the Cloud Dataproc clusters, we need to understand what is included in the two GitHub projects. If you recall from the Kaggle section of the post, both projects are basically the same but, written in different languages, Java and Python. The jobs they contain all perform the same basic analysis on the dataset.

Java Project

The dataproc-java-demo Java-based GitHub project contains three classes, each which are jobs to run by Spark. The InternationalLoansApp Java class is only intended to be run locally with the smaller 8.7K rows of data in the snapshot CSV file (gist).

On line 20, the Spark Session’s Master URL, .master("local[*]"), directs Spark to run locally with as many worker threads as logical cores on the machine. There are several options for setting the Master URL, detailed here.

On line 30, the path to the data file, and on line 84, the output path for the data file, is a local relative file path.

On lines 38–42, we do a bit of clean up on the column names, for only those columns we are interested in for the analysis. Be warned, the column names of the IBRD data are less than ideal for SQL-based analysis, containing mixed-cased characters, word spaces, and brackets.

On line 79, we call Spark DataFrame’s repartition method, dfDisbursement.repartition(1). The repartition method allows us to recombine the results of our analysis and output a single CSV file to the bucket. Ordinarily, Spark splits the data into partitions and executes computations on the partitions in parallel. Each partition’s data is written to separate CSV files when a DataFrame is written back to the bucket.

Using coalesce(1) or repartition(1) to recombine the resulting 25-Row DataFrame on a single node is okay for the sake of this demonstration, but is not practical for recombining partitions from larger DataFrames. There are more efficient and less costly ways to manage the results of computations, depending on the intended use of the resulting data.

screen_shot_2018-12-05_at_4.04.24_pm

The InternationalLoansAppDataprocSmall class is intended to be run on the Dataproc clusters, analyzing the same smaller CSV data file. The InternationalLoansAppDataprocLarge class is also intended to be run on the Dataproc clusters, however, it analyzes the larger 750K rows of data in the IRBD historic CSV file (gist).

On line 20, note the Spark Session’s Master URL, .master(yarn), directs Spark to connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode when submitting the job. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. Recall, the Dataproc cluster runs Spark on YARN.

Also, note on line 30, the path to the data file, and on line 63, the output path for the data file, is to the Cloud Storage bucket we created earlier (.load("gs://your-bucket-name/your-data-file.csv"). Cloud Dataproc clusters automatically install the Cloud Storage connector. According to Google, there are a number of benefits to choosing Cloud Storage over traditional HDFS including data persistence, reliability, and performance.

These are the only two differences between the local version of the Spark job and the version of the Spark job intended for Dataproc. To build the project’s JAR file, which you will later upload to the Cloud Storage bucket, compile the Java project using the gradle build command from the root of the project. For convenience, the JAR file is also included in the GitHub repository.

screen_shot_2018-12-07_at_12.57.55_pm

Python Project

The dataproc-python-demo Python-based GitHub project contains two Python scripts to be run using PySpark for this post. The international_loans_local.py Python script is only intended to be run locally with the smaller 8.7K rows of data in the snapshot CSV file. It does a few different analysis with the smaller dataset. (gist).

Identical to the corresponding Java class, note on line 12, the Spark Session’s Master URL, .master("local[*]"), directs Spark to run locally with as many worker threads as logical cores on the machine.

Also identical to the corresponding Java class, note on line 26, the path to the data file, and on line 66, the output path for the resulting data file, is a local relative file path.

screen_shot_2018-12-05_at_4.02.50_pm

The international_loans_dataproc-large.py Python script is intended to be run on the Dataproc clusters, analyzing the larger 750K rows of data in the IRBD historic CSV file (gist).

On line 12, note the Spark Session’s Master URL, .master(yarn), directs Spark to connect to a YARN cluster.

Again, note on line 26, the path to the data file, and on line 59, the output path for the data file, is to the Cloud Storage bucket we created earlier (.load("gs://your-bucket-name/your-data-file.csv").

These are the only two differences between the local version of the PySpark job and the version of the PySpark job intended for Dataproc. With Python, there is no pre-compilation necessary. We will upload the second script, directly.

Uploading Job Resources to Cloud Storage

In total, we need to upload four items to the new Cloud Storage bucket we created previously. The items include the two Kaggle IBRD CSV files, the compiled Java JAR file from the dataproc-java-demo project, and the Python script from the dataproc-python-demo project. Using the Google Cloud Console, upload the four files to the new Google Storage bucket, as shown below. Make sure you unzip the two Kaggle IRBD CSV data files before uploading.

screen_shot_2018-12-05_at_12.52.51_pm

Like before, we also have the option of using gsutil with the copy (cp) command to upload the four files. The cp command accepts wildcards, as shown below.

export PROJECT=your_project_name
export BUCKET_NAME=gs://your_bucket_name
  
gsutil cp data/ibrd-statement-of-loans-*.csv $BUCKET_NAME
gsutil cp build/libs/dataprocJavaDemo-1.0-SNAPSHOT.jar $BUCKET_NAME
gsutil cp international_loans_dataproc_large.py $BUCKET_NAME

If our Java or Python jobs were larger, or more complex and required multiple files to run, we could also choose to upload ZIP or other common compression formatted archives using the --archives flag.

Running Jobs on Dataproc

The easiest way to run a job on the Dataproc cluster is by submitting a job through the Dataproc Jobs UI, part of the Google Cloud Console.

screen_shot_2018-12-05_at_11.29.34_pm

Dataproc has the capability of running multiple types of jobs, including:

  • Hadoop
  • Spark
  • SparkR
  • PySpark
  • Hive
  • SparkSql
  • Pig

We will be running both Spark and PySpark jobs as part of this demonstration.

Spark Jobs

To run a Spark job using the JAR file, select Job type Spark. The Region will match your Dataproc cluster and bucket locations, us-east-1 in my case. You should have a choice of both clusters in your chosen region. Run both jobs at least twice, once on both clusters, for a total of four jobs.

screen_shot_2018-12-05_at_12.57.55_pm

Lastly, you will need to input the main class and the path to the JAR file. The JAR location will be:

gs://your_bucket_name/dataprocJavaDemo-1.0-SNAPSHOT.jar

The main class for the smaller dataset will be:

org.example.dataproc.InternationalLoansAppDataprocSmall

The main class for the larger dataset will be:

org.example.dataproc.InternationalLoansAppDataprocLarge

During or after job execution, you may view details in the Output tab of the Dataproc Jobs console.

screen_shot_2018-12-04_at_7.53.27_pm

Like every other step in this demonstration, we can also use the gcloud command line tool, instead of the web console, to submit our Spark jobs to each cluster. Here, I am submitting the larger dataset Spark job to the three-node cluster.

export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name
  
gcloud dataproc jobs submit spark \
  --region $REGION \
  --cluster $CLUSTER_2 \
  --class org.example.dataproc.InternationalLoansAppDataprocLarge \
  --jars $BUCKET_NAME/dataprocJavaDemo-1.0-SNAPSHOT.jar \
  --async

PySpark Jobs

To run a Spark job using the Python script, select Job type PySpark. The Region will match your Dataproc cluster and bucket locations, us-east-1 in my case. You should have a choice of both clusters. Run the job at least twice, once on both clusters.

screen_shot_2018-12-05_at_12.53.36_pm

Lastly, you will need to input the main Python file path. There is only one Dataproc Python script, which analyzes the larger dataset. The script location will be:

gs://your_bucket_name/international_loans_dataproc_large.py

Like every other step in this demonstration, we can also use the gcloud command line tool instead of the web console to submit our PySpark jobs to each cluster. Below, I am submitting the PySpark job to the three-node cluster.

export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name
  
gcloud dataproc jobs submit pyspark \
  $BUCKET_NAME/international_loans_dataproc_large.py \
  --region $REGION \
  --cluster $CLUSTER_2 \
  --async

Including the optional --async flag with any of the dataproc jobs submit command, the job will be sent to the Dataproc cluster and immediately release the terminal back to the user. If you do not to use the --async flag, the terminal will be unavailable until the job is finished.

However, without the flag, we will get the standard output (stdout) and standard error (stderr) from Dataproc. The output includes some useful information, including different stages of the job execution lifecycle and execution times.

screen_shot_2018-12-05_at_10.38.52_pm

File Output

During development and testing, outputting results to the console is useful. However, in Production, the output from jobs is most often written to Apache Parquet, Apache Avro, CSV, JSON, or XML format files, persisted Apache Hive, SQL, or NoSQL database, or streamed to another system for post-processing, using technologies such as Apache Kafka.

Once both the Java and Python jobs have run successfully on the Dataproc cluster, you should observe the results have been saved back to the Storage bucket. Each script saves its results to a single CSV file in separate directories, as shown below.

screen_shot_2018-12-05_at_4.09.31_pm.png

The final dataset, written to the CSV file, contains the results of the analysis results (gist).

Cleaning Up

When you are finished, make sure to delete your running clusters. This may be done through the Google Cloud Console. Deletion of the three-node cluster took, on average, slightly more than one minute.

screen_shot_2018-12-04_at_11.11.40_pm

As usual, we can also use the gcloud command line tool instead of the web console to delete the Dataproc clusters.

export CLUSTER_1=your_single_node_cluster_name
export CLUSTER_2=your_three_node_cluster_name 
export REGION=us-east1
  
yes | gcloud dataproc clusters delete $CLUSTER_1 --region $REGION
yes | gcloud dataproc clusters delete $CLUSTER_2 --region $REGION

Results

Some observations, based on approximately 75 successful jobs. First, both the Python job and the Java jobs ran in nearly the same amount of time on the single-node cluster and then on the three-node cluster. This is beneficial since, although, a lot of big data analysis is performed with Python, Java is still the lingua franca of many large enterprises.

screen_shot_2018-12-05_at_1.49.01_pm

Consecutive Execution

Below are the average times for running the larger dataset on both clusters, in Java, and in Python. The jobs were all run consecutively as opposed to concurrently. The best time was 59 seconds on the three-node cluster compared to the best time of 150 seconds on the single-node cluster, a difference of 256%. Given the differences in the two clusters, this large variation is expected. The average difference between the two clusters for running the large dataset was 254%.

chart2

Concurrent Execution

It is important to understand the impact of concurrently running multiple jobs on the same Dataproc cluster. To demonstrate this, both the Java and Python jobs were also run concurrently. In one such test, ten copies of the Python job were run concurrently on the three-node cluster.

concurrent-jobs

Observe that the execution times of the concurrent jobs increase in near-linear time. The first job completes in roughly the same time as the consecutively executed jobs, shown above, but each proceeding job’s execution time increases linearly.

chart1

According to Apache, when running on a cluster, each Spark application gets an independent set of executor JVMs that only run tasks and store data for that application. Each application is given a maximum amount of resources it can use and holds onto them for its whole duration. Note no tuning was done to the Dataproc clusters to optimize for concurrent execution.

Really Big Data?

Although there is no exact definition of ‘big data’, 750K rows of data at 265 MB is probably not generally considered big data. Likewise, the three-node cluster used in this demonstration is still pretty diminutive. Lastly, the SQL query was less than complex. To really test the abilities of Dataproc would require a multi-gigabyte or multi-terabyte-sized dataset, divided amongst multiple files, computed on a much beefier cluster with more workers nodes and more computer resources.

Monitoring and Instrumentation

In addition to viewing the results of running and completed jobs, there are a number of additional monitoring resources, including the Hadoop Yarn Resource Manager, HDFS NameNode, and Spark History Server Web UIs, and Google Stackdriver. I will only briefly introduce these resources, and not examine any of these interfaces in detail. You’re welcome to investigate the resources for your own clusters. Apache lists other Spark monitoring and instrumentation resources in their documentation.

To access the Hadoop Yarn Resource Manager, HDFS NameNode, and Spark History Server Web UIs, you must create an SSH tunnel and run Chrome through a proxy. Google Dataproc provides both commands and a link to documentation in the Dataproc Cluster tab, to connect.

screen_shot_2018-12-15_at_9.09.00_pm

Hadoop Yarn Resource Manager Web UI

Once you are connected to the Dataproc cluster, via the SSH tunnel and proxy, the Hadoop Yarn Resource Manager Web UI is accessed on port 8088. The UI allows you to view all aspects of the YARN cluster and the distributed applications running on the YARN system.

screen_shot_2018-12-15_at_9.15.27_pm

HDFS NameNode Web UI

Once you are connected to the Dataproc cluster, via the SSH tunnel and proxy, the HDFS NameNode Web UI may is accessed on port 9870. According to the Hadoop Wiki, the NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks were across the cluster the file data is kept. It does not store the data of these files itself.

screen_shot_2018-12-15_at_9.41.19_pm

Spark History Server Web UI

We can view the details of all completed jobs using the Spark History Server Web UI. Once you are connected to the cluster, via the SSH tunnel and proxy, the Spark History Server Web UI is accessed on port 18080. Of all the methods of reviewing aspects of a completed Spark job, the History Server provides the most detailed.

screen_shot_2018-12-15_at_9.15.09_pm

Using the History Server UI, we can drill into fine-grained details of each job, including the event timeline.

screen_shot_2018-12-15_at_9.22.32_pm

Also, using the History Server UI, we can see a visualization of the Spark job’s DAG (Directed Acyclic Graph). DataBricks published an excellent post on learning how to interpret the information visualized in the Spark UI.

screen_shot_2018-12-15_at_9.19.15_pm

Not only can view the DAG and drill into each Stage of the DAG, from the UI.

screen_shot_2018-12-15_at_10.22.33_pm

Stackdriver

We can also enable Google Stackdriver for monitoring and management of services, containers, applications, and infrastructure. Stackdriver offers an impressive array of services, including debugging, error reporting, monitoring, alerting, tracing, logging, and dashboards, to mention only a few Stackdriver features.

screen_shot_2018-12-05_at_3.18.31_pm

There are dozens of metrics available, which collectively, reflect the health of the Dataproc clusters. Below we see the states of one such metric, the YARN virtual cores (vcores). A YARN vcore, introduced in Hadoop 2.4, is a usage share of a host CPU.  The number of YARN virtual cores is equivalent to the number of worker nodes (2) times the number of vCPUs per node (4), for a total of eight YARN virtual cores. Below, we see that at one point in time, 5 of the 8 vcores have been allocated, with 2 more available.

screen_shot_2018-12-05_at_3.29.47_pm

Next, we see the states of the YARN memory size. YARN memory size is calculated as the number of worker nodes (2) times the amount of memory on each node (15 GB) times the fraction given to YARN (0.8), for a total of 24 GB (2 x 15 GB x 0.8). Below, we see that at one point in time, 20 GB of RAM is allocated with 4 GB available. At that instant in time, the workload does not appear to be exhausting the cluster’s memory.

screen_shot_2018-12-05_at_3.30.39_pm

Notifications

Since no one actually watches dashboards all day, waiting for something to fail, how do know when we have an issue with Dataproc? Stackdrive offers integrations with most popular notification channels, including email, SMS, Slack, PagerDuty, HipChat, and Webhooks. With Stackdriver, we define a condition which describes when a service is considered unhealthy. When triggered, Stackdriver sends a notification to one or more channels.

notifications

Below is a preview of two alert notifications in Slack. I enabled Slack as a notification channel and created an alert which is triggered each time a Dataproc job fails. Whenever a job fails, such as the two examples below, I receive a Slack notification through the Slack Channel defined in Stackdriver.

slack.png

Slack notifications contain a link, which routes you back to Stackdriver, to an incident which was opened on your behalf, due to the job failure.

incident

For convenience, the incident also includes a pre-filtered link directly to the log entries at the time of the policy violation. Stackdriver logging offers advanced filtering capabilities to quickly find log entries, as shown below.screen_shot_2018-12-09_at_12.52.51_pm

With Stackdriver, you get monitoring, logging, alerting, notification, and incident management as a service, with minimal cost and upfront configuration. Think about how much time and effort it takes the average enterprise to achieve this level of infrastructure observability on their own, most never do.

Conclusion

In this post, we have seen the ease-of-use, extensive feature-set, out-of-the-box integration ability with other cloud services, low cost, and speed of Google Cloud Dataproc, to run big data analytics workloads. Couple this with the ability of Stackdriver to provide monitoring, logging, alerting, notification, and incident management for Dataproc with minimal up-front configuration. In my opinion, based on these features, Google Cloud Dataproc leads other cloud competitors for fully-managed Spark and Hadoop Cluster management.

In future posts, we will examine the use of Cloud Dataproc Workflow Templates for process automation, the integration capabilities of Dataproc with services such as BigQuery, Bigtable, Cloud Dataflow, and Google Cloud Pub/Sub, and finally, DevOps for Big Data with Dataproc and tools like Spinnaker and Jenkins on GKE.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers, their clients, nor Apache or Google.

, , , , , , , , , , , ,

2 Comments

Building Serverless Actions for Google Assistant with Google Cloud Functions, Cloud Datastore, and Cloud Storage

Introduction

In this post, we will create an Action for Google Assistant using the ‘Actions on Google’ development platform, Google Cloud Platform’s serverless Cloud Functions, Cloud Datastore, and Cloud Storage, and the current LTS version of Node.js. According to Google, Actions are pieces of software, designed to extend the functionality of the Google Assistant, Google’s virtual personal assistant, across a multitude of Google-enabled devices, including smartphones, cars, televisions, headphones, watches, and smart-speakers.

Here is a brief YouTube video preview of the final Action for Google Assistant, we will explore in this post, running on an Apple iPhone 8.

If you want to compare the development of an Action for Google Assistant with those of AWS and Azure, in addition to this post, please read my previous two posts in this series, Building and Integrating LUIS-enabled Chatbots with Slack, using Azure Bot Service, Bot Builder SDK, and Cosmos DB and Building Asynchronous, Serverless Alexa Skills with AWS Lambda, DynamoDB, S3, and Node.js. All three of the article’s demonstrations are written in Node.js, all three leverage their cloud platform’s machine learning-based Natural Language Understanding services, and all three take advantage of NoSQL database and storage services available on their respective cloud platforms.

Google Technologies

The final architecture of our Action for Google Assistant will look as follows.

Google Assistant Architecture v2

Here is a brief overview of the key technologies we will incorporate into our architecture.

Actions on Google

According to Google, Actions on Google is the platform for developers to extend the Google Assistant. Similar to Amazon’s Alexa Skills Kit Development Console for developing Alexa Skills, Actions on Google is a web-based platform that provides a streamlined user-experience to create, manage, and deploy Actions. We will use the Actions on Google platform to develop our Action in this post.

Dialogflow

According to Google, Dialogflow is an enterprise-grade Natural language understanding (NLU) platform that makes it easy for developers to design and integrate conversational user interfaces into mobile apps, web applications, devices, and bots. Dialogflow is powered by Google’s machine learning for Natural Language Processing (NLP). Dialogflow was initially known as API.AI prior being renamed by Google in late 2017.

We will use the Dialogflow web-based development platform and version 2 of the Dialogflow API, which became GA in April 2018, to build our Action for Google Assistant’s rich, natural-language conversational interface.

Google Cloud Functions

Google Cloud Functions are the event-driven serverless compute platform, part of the Google Cloud Platform (GCP). Google Cloud Functions are comparable to Amazon’s AWS Lambda and Azure Functions. Cloud Functions is a relatively new service from Google, released in beta in March 2017, and only recently becoming GA at Cloud Next ’18 (July 2018). The main features of Cloud Functions include automatic scaling, high availability, fault tolerance, no servers to provision, manage, patch or update, and a payment model based on the function’s execution time. The programmatic logic behind our Action for Google Assistant will be handled by a Cloud Function.

Node.js LTS

We will write our Action’s Google Cloud Function using the Node.js 8 runtime. Google just released the ability to write Google Cloud Functions in Node 8.11.1 and Python 3.7.0, at Cloud Next ’18 (July 2018). It is still considered beta functionality. Previously, you had to write your functions in Node version 6 (currently, 6.14.0).

Node 8, also known as Project Carbon, was the first Long Term Support (LTS) version of Node to support async/await with Promises. Async/await is the new way of handling asynchronous operations in Node.js. We will make use of async/await and Promises within our Action’s Cloud Function.

Google Cloud Datastore

Google Cloud Datastore is a highly-scalable NoSQL database. Cloud Datastore is similar in features and capabilities to Azure Cosmos DB and Amazon DynamoDB. Datastore automatically handles sharding and replication and offers features like a RESTful interface, ACID transactions, SQL-like queries, and indexes. We will use Datastore to persist the information returned to the user from our Action for Google Assistant.

Google Cloud Storage

The last technology, Google Cloud Storage is secure and durable object storage, nearly identical to Amazon Simple Storage Service (Amazon S3) and Azure Blob Storage. We will store publicly accessible images in a Google Cloud Storage bucket, which will be displayed in Google Assistant Basic Card responses.

Demonstration

To demonstrate Actions for Google Assistant, we will build an informational Action that responds to the user with interesting facts about Azure, Microsoft’s Cloud computing platform (Google talking about Azure, ironic). Note this is not intended to be an official Microsoft bot and is only used for demonstration purposes.

Source Code

All open-sourced code for this post can be found on GitHub. Note code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

Development Process

This post will focus on the development and integration of an Action with Google Cloud Platform’s serverless and asynchronous Cloud Functions, Cloud Datastore, and Cloud Storage. The post is not intended to be a general how-to on developing and publishing Actions for Google Assistant, or how to specifically use services on the Google Cloud Platform.

Building the Action will involve the following steps.

  • Design the Action’s conversation model;
  • Import the Azure Facts Entities into Cloud Datastore on GCP;
  • Create and upload the images to Cloud Storage on GCP;
  • Create the new Actions on Google project using the Actions on Google console;
  • Develop the Action’s Intent using the Dialogflow console;
  • Bulk import the Action’s Entities using the Dialogflow console;
  • Configure the Dialogflow Actions on Google Integration;
  • Develop and deploy the Cloud Function to GCP;
  • Test the Action using Actions on Google Simulator;

Let’s explore each step in more detail.

Conversational Model

The conversational model design of the Azure Tech Facts Action for Google Assistant is similar to the Azure Tech Facts Alexa Custom Skill, detailed in my previous post. We will have the option to invoke the Action in two ways, without initial intent (Explicit Invocation) and with intent (Implicit Invocation), as shown below. On the left, we see an example of an explicit invocation of the Action. Google Assistant then queries the user for more information. On the right, an implicit invocation of the Action includes the intent, being the Azure fact they want to learn about. Google Assistant responds directly, both verbally and visually with the fact.

preview_3

Each fact returned by Google Assistant will include a Simple ResponseBasic Card and Suggestions response types for devices with a display, as shown below. The user may continue to ask for additional facts or choose to cancel the Action at any time.

preview_1

Lastly, as part of the conversational model, we will include the option of asking for a random fact, as well as asking for help. Examples of both are shown below. Again, Google Assistant responds to the user, vocally and, optionally, visually, for display-enabled devices.

preview_2

GCP Account and Project

The following steps assume you have an existing GCP account and you have created a project on GCP to house the Cloud Function, Cloud Storage Bucket, and Cloud Datastore Entities. The post also assumes that you have the Google Cloud SDK installed on your development machine, and have authenticated your identity from the command line (gist).

Google Cloud Storage

First, the images, actually Azure icons available from Microsoft, displayed in the responses shown above, are uploaded to a Google Storage Bucket. To handle these tasks, we will use the gsutil CLI to create, upload, and manage the images. The gsutil CLI tool, like gcloud, is part of the Google Cloud SDK. The gsutil mb (make bucket) command creates the bucket, gsutil cp (copy files and objects) command is used to copy the images to the new bucket, and finally, the gsutil iam (get, set, or change bucket and/or object IAM permissions) command is used to make the images public. I have included a shell scriptbucket-uploader.sh, to make this process easier. (gist).

From the Storage Console on GCP, you should observe the images all have publicly accessible URLs. This will allow the Cloud Function to access the bucket, and retrieve and display the images. There are more secure ways to store and display the images from the function. However, this is the simplest method since we are not concerned about making the images public.

assistant-003

We will need the URL of the new Storage bucket, later, when we develop to our Action’s Cloud Function. The bucket URL can be obtained from the Storage Console on GCP, as shown below in the Link URL.

assistant-004

Google Cloud Datastore

In Cloud Datastore, the category data object is referred to as a Kind, similar to a Table in a relational database. In Datastore, we will have an ‘AzureFact’ Kind of data. In Datastore, a single object is referred to as an Entity, similar to a Row in a relational database. Each one of our entities represents a unique reference value from our Azure Facts Intent’s facts entities, such as ‘competition’ and ‘certifications’. Individual data is known as a Property in Datastore, similar to a Column in a relational database. We will have four Properties for each entity: name, response, title, and image. Lastly, a Key in Datastore is similar to a Primary Key in a relational database. The Key we will use for our entities is the unique reference value string from our Azure Facts Intent’s facts entities, such as ‘competition’ or ‘certifications’. The Key value is stored within the entity’s name Property.

There are a number of ways to create the Datastore entities for our Action, including manually from the Datastore console on GCP. However, to automate the process, we will use a script, written in Node.js and using the Google Cloud Datastore Node.js Client, to create the entities. We will use the Client API’s Datastore Class upsert method, which will create or update an entire collection of entities with one call and returns a callback. The script , upsert-entities.js, is included in source control and can be run with the following command. Below is a snippet of the script, which shows the structure of the entities (gist).

Once the upsert command completes successfully, you should observe a collection of ‘AzureFact’ Type Datastore Entities in the Datastore console on GCP.

assistant-006

Below, we see the structure of a single Datastore Entity, the ‘certifications’ Entity, containing the fact response, title, and name of the image, which is stored in our Google Storage bucket.

assistant-007

New ‘Actions on Google’ Project

With the images uploaded and the database entries created, we can start building our Actions for Google Assistant. Using the Actions on Google web console, we first create a new Actions project.

assistant-010

The Directory Information tab is where we define metadata about the project. This information determines how it will look in the Actions directory and is required to publish your project. The Actions directory is where users discover published Actions on the web and mobile devices.

assistant-018

Actions and Intents

Our project will contain a series of related Actions. According to Google, an Action is ‘an interaction you build for the Assistant that supports a specific intent and has a corresponding fulfillment that processes the intent.’ To build our Actions, we first want to create our Intents. To do so, we will want to switch from the Actions on Google console to the Dialogflow console. Actions on Google provides a link for switching to Dialogflow in the Actions tab.

assistant-027.png

We will build our Action’s Intents in Dialogflow. The term Intent, used by Dialogflow, is standard terminology across other voice-assistant platforms, such as Amazon’s Alexa and Microsoft’s Azure Bot Service and LUIS. In Dialogflow, will be building Intents—the Azure Facts Intent, Welcome Intent, and the Fallback Intent.

assistant-030.png

Below, we see the Azure Facts Intent. The Azure Facts Intent is the main Intent, responsible for handling our user’s requests for facts about Azure. The Intent includes a fair number, but certainly not an exhaustive list, of training phrases. These represent all the possible ways a user might express intent when invoking the Action. According to Google, the greater the number of natural language examples in the Training Phrases section of Intents, the better the classification accuracy.

assistant-011

Intent Entities

Each of the highlighted words in the training phrases maps to the facts parameter, which maps to a collection of @facts Entities. Entities represent a list of intents the Action is trained to understand.  According to Google, there are three types of entities: system (defined by Dialogflow), developer (defined by a developer), and user (built for each individual end-user in every request) entities. We will be creating developer type entities for our Action’s Intent.

assistant-012

Synonyms

An entity contains Synonyms. Multiple synonyms may be mapped to a single reference value. The reference value is the value passed to the Cloud Function by the Action. For example, take the reference value of ‘competition’. A user might ask Google about Azure’s competition. However, the user might also substitute the words ‘competitor’ or ‘competitors’ for ‘competition’. Using synonyms, if the user utters any of these three words in their intent, they will receive the same response.

assistant-014

Although our Azure Facts Action is a simple example, typical Actions might contain hundreds of entities or more, each with several synonyms. Dialogflow provides the option of copy and pasting bulk entities, in either JSON or CSV format. The project’s source code includes both JSON or CSV formats, which may be input in this manner.

assistant-015

Automated Expansion

Not every possible fact, which will have a response, returned by Google Assistant, needs an entity defined. For example, we created a ‘compliance’ Cloud Datastore Entity. The Action understands the term ‘compliance’ and will return a response to the user if they ask about Azure compliance. However, ‘compliance’ is not defined as an Intent Entity, since we have chosen not to define any synonyms for the term ‘compliance’.

In order to allow this, you must enable Allow Automated Expansion. According to Google, this option allows an Agent to recognize values that have not been explicitly listed in the entity. Google describes Agents as NLU (Natural Language Understanding) modules.

Actions on Google Integration

Another configuration item in Dialogflow that needs to be completed is the Dialogflow’s Actions on Google integration. This will integrate the Azure Tech Facts Action with Google Assistant. Google provides more than a dozen different integrations, as shown below.

assistant-026.png

The Dialogflow’s Actions on Google integration configuration is simple, just choose the Azure Facts Intent as our Action’s Implicit Invocation intent, in addition to the default Welcome Intent, which is our Action’s Explicit Invocation intent. According to Google, integration allows our Action to reach users on every device where the Google Assistant is available.

assistant-017

Action Fulfillment

When an intent is received from the user, it is fulfilled by the Action. In the Dialogflow Fulfillment console, we see the Action has two fulfillment options, a Webhook or a Cloud Function, which can be edited inline. A Webhook allows us to pass information from a matched intent into a web service and get a result back from the service. In our example, our Action’s Webhook will call our Cloud Function, using the Cloud Function’s URL endpoint. We first need to create our function in order to get the endpoint, which we will do next.

assistant-016

Google Cloud Functions

Our Cloud Function, called by our Action, is written in Node.js 8. As stated earlier, Node 8 LTS was the first LTS version to support async/await with Promises. Async/await is the new way of handling asynchronous operations in Node.js, replacing callbacks.

Our function, index.js, is divided into four sections: constants, intent handlers, helper functions, and the function’s entry point. The Cloud Function attempts to follow many of the coding practices from Google’s code examples on Github.

Constants

The section defines the global constants used within the function. Note the constant for the URL of our new Cloud Storage bucket, on line 30 below, IMAGE_BUCKET, references an environment variable, process.env.IMAGE_BUCKET. This value is set in the .env.yaml file. All environment variables in the .env.yaml file will be set during the Cloud Function’s deployment, explained later in this post. Environment variables were recently released, and are still considered beta functionality (gist).

The npm package dependencies declared in the constants section, are defined in the dependencies section of the package.json file. Function dependencies include Actions on Google, Firebase Functions, and Cloud Datastore (gist).

Intent Handlers

The three intent handlers correspond to the three intents in the Dialogflow console: Azure Facts Intent, Welcome Intent, and Fallback Intent. Each handler responds in a very similar fashion. The handlers all return a SimpleResponse for audio-only and display-enabled devices. Optionally, a BasicCard is returned for display-enabled devices (gist).

The Welcome Intent handler handles explicit invocations of our Action. The Fallback Intent handler handles both help requests, as well as cases when Dialogflow cannot match any of the user’s input. Lastly, the Azure Facts Intent handler handles implicit invocations of our Action, returning a fact to the user from Cloud Datastore, based on the user’s requested fact.

Helper Functions

The next section of the function contains two helper functions. The primary function is the buildFactResponse function. This is the function that queries Google Cloud Datastore for the fact. The second function, the selectRandomFact, handles the fact value of ‘random’, by selecting a random fact value to query Datastore. (gist).

Async/Await, Promises, and Callbacks

Let’s look closer at the relationship and asynchronous nature of the Azure Facts Intent intent handler and buildFactResponse function. Below, note the async function on line 1 in the intent and the await function on line 3, which is part of the buildFactResponse function call. This is typically how we see async/await applied when calling an asynchronous function, such as buildFactResponse. The await function allows the intent’s execution to wait for the buildFactResponse function’s Promise to be resolved, before attempting to use the resolved value to construct the response.

The buildFactResponse function returns a Promise, as seen on line 28. The Promise’s payload contains the results of the successful callback from the Datastore API’s runQuery function. The runQuery function returns a callback, which is then resolved and returned by the Promise, as seen on line 40 (gist).

The payload returned by Google Datastore, through the resolved Promise to the intent handler,  will resemble the example response, shown below. Note the image, response, and title key/value pairs in the textPayload section of the response payload. These are what are used to format the SimpleResponse and BasicCard responses (gist).

Cloud Function Deployment

To deploy the Cloud Function to GCP, use the gcloud CLI with the beta version of the functions deploy command. According to Google, gcloud is a part of the Google Cloud SDK. You must download and install the SDK on your system and initialize it before you can use gcloud. You should ensure that your function is deployed to the same region as your Google Storage Bucket. Currently, Cloud Functions are only available in four regions. I have included a shell scriptdeploy-cloud-function.sh, to make this step easier. (gist).

The creation or update of the Cloud Function can take up to two minutes. Note the .gcloudignore file referenced in the verbose output below. This file is created the first time you deploy a new function. Using the the .gcloudignore file, you can limit the deployed files to just the function (index.js) and the package.json file. There is no need to deploy any other files to GCP.

assistant-028

If you recall, the URL endpoint of the Cloud Function is required in the Dialogflow Fulfillment tab. The URL can be retrieved from the deployment output (shown above), or from the Cloud Functions Console on GCP (shown below). The Cloud Function is now deployed and will be called by the Action when a user invokes the Action.

assistant-009

Simulation Testing and Debugging

With our Action and all its dependencies deployed and configured, we can test the Action using the Simulation console on Actions on Google. According to Google, the Action Simulation console allows us to manually test our Action by simulating a variety of Google-enabled hardware devices and their settings. You can also access debug information such as the request and response that your fulfillment receives and sends.

Below, in the Action Simulation console, we see the successful display of the initial Azure Tech Facts containing the expected Simple Response, Basic Card, and Suggestions, triggered by a user’s explicit invocation of the Action.

The simulated response indicates that the Google Cloud Function was called, and it responded successfully. It also indicates that the Google Cloud Function was able to successfully retrieve the correct image from Google Cloud Storage.

assistant-019

Below, we see the successful response to the user’s implicit invocation of the Action, in which they are seeking a fact about Azure’s Cognitive Services. The simulated response indicates that the Google Cloud Function was called, and it responded successfully. It also indicates that the Google Cloud Function was able to successfully retrieve the correct Entity from Google Cloud Datastore, as well as the correct image from Google Cloud Storage.

assistant-020

If we had issues with the testing, the Action Simulation console also contains tabs containing the request and response objects sent to and from the Cloud Function, the audio response, a debug console, and any errors.

Logging and Analytics

In addition to the Simulation console’s ability to debug issues with our service, we also have Google Stackdriver Logging. The Stackdriver logs, which are viewed from the GCP management console, contain the complete requests and responses, to and from the Cloud Function, from the Google Assistant Action. The Stackdriver logs will also contain any logs entries you have explicitly placed in the Cloud Function.

assistant-021

We also have the ability to view basic Analytics about our Action from within the Dialogflow Analytics console. Analytics displays metrics, such as the number of sessions, the number of queries, the number of times each Intent was triggered, how often users exited the Action from an intent, and Sessions flows, shown below.

In simple Action such as this one, the Session flow is not very beneficial. However, in more complex Actions, with multiple Intents and a variety potential user interactions, being able to visualize Session flows becomes essential to understanding the user’s conversational path through the Action.

assistant-031.png

Conclusion

In this post, we have seen how to use the Actions on Google development platform and the latest version of the Dialogflow API to build Google Actions. Google Actions rather effortlessly integrate with the breath Google Cloud Platform’s many serverless offerings, including Google Cloud Functions, Cloud Datastore, and Cloud Storage.

We have seen how Google is quickly maturing their serverless functions, to compete with AWS and Azure, with the recently announced support of LTS version 8 of Node.js and Python, to create an Actions for Google Assistant.

Impact of Serverless

As an Engineer, I have spent endless days, late nights, and thankless weekends, building, deploying and managing servers, virtual machines, container clusters, persistent storage, and database servers. I think what is most compelling about platforms like Actions on Google, but even more so, serverless technologies on GCP, is that I spend the majority of my time architecting and developing compelling software. I don’t spend time managing infrastructure, worrying about capacity, configuring networking and security, and doing DevOps.

¹Azure is a trademark of Microsoft

All opinions expressed in this post are my own and not necessarily the views of my current or past employers, their clients, or Google and Microsoft.

, , , , , , , ,

6 Comments

Developing Cloud-Native Data-Centric Spring Boot Applications for Pivotal Cloud Foundry

In this post, we will explore the development of a cloud-native, data-centric Spring Boot 2.0 application, and its deployment to Pivotal Software’s hosted Pivotal Cloud Foundry service, Pivotal Web Services. We will add a few additional features, such as Spring Data, Lombok, and Swagger, to enhance our application.

According to Pivotal, Spring Boot makes it easy to create stand-alone, production-grade Spring-based Applications. Spring Boot takes an opinionated view of the Spring platform and third-party libraries. Spring Boot 2.0 just went GA on March 1, 2018. This is the first major revision of Spring Boot since 1.0 was released almost 4 years ago. It is also the first GA version of Spring Boot that provides support for Spring Framework 5.0.

Pivotal Web Services’ tagline is ‘The Agile Platform for the Agile Team Powered by Cloud Foundry’. According to Pivotal,  Pivotal Web Services (PWS) is a hosted environment of Pivotal Cloud Foundry (PCF). PWS is hosted on AWS in the US-East region. PWS utilizes two availability zones for redundancy. PWS provides developers a Spring-centric PaaS alternative to AWS Elastic Beanstalk, Elastic Container Service (Amazon ECS), and OpsWorks. With PWS, you get the reliability and security of AWS, combined with the rich-functionality and ease-of-use of PCF.

To demonstrate the feature-rich capabilities of the Spring ecosystem, the Spring Boot application shown in this post incorporates the following complimentary technologies:

  • Spring Boot Actuator: Sub-project of Spring Boot, adds several production grade services to Spring Boot applications with little developer effort
  • Spring Data JPA: Sub-project of Spring Data, easily implement JPA based repositories and data access layers
  • Spring Data REST: Sub-project of Spring Data, easily build hypermedia-driven REST web services on top of Spring Data repositories
  • Spring HATEOAS: Create REST representations that follow the HATEOAS principle from Spring-based applications
  • Springfox Swagger 2: We are using the Springfox implementation of the Swagger 2 specification, an automated JSON API documentation for API’s built with Spring
  • Lombok: The @Data annotation generates boilerplate code that is typically associated with simple POJOs (Plain Old Java Objects) and beans: @ToString, @EqualsAndHashCode, @Getter, @Setter, and @RequiredArgsConstructor

Source Code

All source code for this post can be found on GitHub. To get started quickly, use one of the two following commands (gist).

For this post, I have used JetBrains IntelliJ IDEA and Git Bash on Windows for development. However, all code should be compatible with most popular IDEs and development platforms. The project assumes you have Docker and the Cloud Foundry Command Line Interface (cf CLI) installed locally.

Code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

Demo Application

The Spring Boot application demonstrated in this post is a simple election-themed RESTful API. The app allows API consumers to create, read, update, and delete, candidates, elections, and votes, via its exposed RESTful HTTP-based resources.

The Spring Boot application consists of (7) JPA Entities that mirror the tables and views in the database, (7) corresponding Spring Data Repositories, (2) Spring REST Controller, (4) Liquibase change sets, and associated Spring, Liquibase, Swagger, and PCF configuration files. I have intentionally chosen to avoid the complexities of using Data Transfer Objects (DTOs) for brevity, albeit a security concern, and directly expose the entities as resources.

img022_Final_Project

Controller Resources

This application is a simple CRUD application. The application contains a few simple HTTP GET resources in each of the two controller classes, as an introduction to Spring REST controllers. For example, the CandidateController contains the /candidates/summary and /candidates/summary/{election} resources (shown below in Postman). Typically, you would expose your data to the end-user as controller resources, as opposed to exposing the entities directly. The ease of defining controller resources is one of the many powers of Spring Boot.

img025_CustomResource.PNG

Paging and Sorting

As an introduction to Spring Data’s paging and sorting features, both the VoteRepository and VotesElectionsViewRepository Repository Interfaces extend Spring Data’s PagingAndSortingRepository<T,ID> interface, instead of the default CrudRepository<T,ID> interface. With paging and sorting enabled, you may both sort and limit the amount of data returned in the response payload. For example, to reduce the size of your response payload, you might choose to page through the votes in blocks of 25 votes at a time. In that case, as shown below in Postman, if you needed to return just votes 26-50, you would append the /votes resource with ?page=1&size=25. Since paging starts on page 0 (zero), votes 26-50 will on page 1.

img024_Paging

Swagger

This project also includes the Springfox implementation of the Swagger 2 specification. Along with the Swagger 2 dependency, the project takes a dependency on io.springfox:springfox-swagger-ui. The Springfox Swagger UI dependency allows us to view and interactively test our controller resources through Swagger’s browser-based UI, as shown below.

img027B_Swagger

All Swagger configuration can be found in the project’s SwaggerConfig Spring Configuration class.

Gradle

This post’s Spring Boot application is built with Gradle, although it could easily be converted to Maven if desired. According to Gradle, Gradle is the modern tool used to build, automate and deliver everything from mobile apps to microservices.

Data

In real life, most applications interact with one or more data sources. The Spring Boot application demonstrated in this post interacts with data from a PostgreSQL database. PostgreSQL, or simply Postgres, is the powerful, open-source object-relational database system, which has supported production-grade applications for 15 years. The application’s single elections database consists of (6) tables, (3) views, and (2) function, which are both used to generate random votes for this demonstration.

img020_Database_Diagram

Spring Data makes interacting with PostgreSQL easy. In addition to the features of Spring Data, we will use Liquibase. Liquibase is known as the source control for your database. With Liquibase, your database development lifecycle can mirror your Spring development lifecycle. Both DDL (Data Definition Language) and DML (Data Manipulation Language) changes are versioned controlled, alongside the Spring Boot application source code.

Locally, we will take advantage of Docker to host our development PostgreSQL database, using the official PostgreSQL Docker image. With Docker, there is no messy database installation and configuration of your local development environment. Recreating and deleting your PostgreSQL database is simple.

To support the data-tier in our hosted PWS environment, we will implement ElephantSQL, an offering from the Pivotal Services Marketplace. ElephantSQL is a hosted version of PostgreSQL, running on AWS. ElephantSQL is referred to as PostgreSQL as a Service, or more generally, a Database as a Service or DBaaS. As a Pivotal Marketplace service, we will get easy integration with our PWS-hosted Spring Boot application, with near-zero configuration.

Docker

First, set up your local PostgreSQL database using Docker and the official PostgreSQL Docker image. Since this is only a local database instance, we will not worry about securing our database credentials (gist).

Your running PostgreSQL container should resemble the output shown below.

img001_docker

Data Source

Most IDEs allow you to create and save data sources. Although this is not a requirement, it makes it easier to view the database’s resources and table data. Below, I have created a data source in IntelliJ from the running PostgreSQL container instance. The port, username, password, and database name were all taken from the above Docker command.

img002_IntelliJ_Data_Source

Liquibase

There are multiple strategies when it comes to managing changes to your database. With Liquibase, each set of changes are handled as change sets. Liquibase describes a change set as an atomic change that you want to apply to your database. Liquibase offers multiple formats for change set files, including XML, JSON, YAML, and SQL. For this post, I have chosen SQL, specifically PostgreSQL SQL dialect, which can be designated in the IntelliJ IDE. Below is an example of the first changeset, which creates four tables and two indexes.

img023_Change_Set

As shown below, change sets are stored in the db/changelog/changes sub-directory, as configured in the master change log file (db.changelog-master.yaml). Change set files follow an incremental naming convention.

img003C_IntelliJ_Liquibase_Changesets

The empty PostgreSQL database, before any Liquibase changes, should resemble the screengrab shown below.

img003_IntelliJ_Blank_Database_cropped

To automatically run Liquibase database migrations on startup, the org.liquibase:liquibase-core dependency must be added to the project’s build.gradle file. To apply the change sets to your local, empty PostgreSQL database, simply start the service locally with the gradle bootRun command. As the app starts after being compiled, any new Liquibase change sets will be applied.

img004_Gradle_bootRun

You might ask how does Liquibase know the change sets are new. During the initial startup of the Spring Boot application, in addition to any initial change sets, Liquibase creates two database tables to track changes, the databasechangelog and databasechangeloglock tables. Shown below are the two tables, along with the results of the four change sets included in the project, and applied by Liquibase to the local PostgreSQL elections database.

img005_IntelliJ_Initial_Database_cropped

Below we see the contents of the databasechangelog table, indicating that all four change sets were successfully applied to the database. Liquibase checks this table before applying change sets.

img006B_IntelliJ_Database_Change_Log

ElephantSQL

Before we can deploy our Spring Boot application to PWS, we need an accessible PostgreSQL instance in the Cloud; I have chosen ElephantSQL. Available through the Pivotal Services Marketplace, ElephantSQL currently offers one free and three paid service plans for their PostgreSQL as a Service. I purchased the Panda service plan as opposed to the free Turtle service plan. I found the free service plan was too limited in the maximum number of database connections for multiple service instances.

Previewing and purchasing an ElephantSQL service plan from the Pivotal Services Marketplace, assuming you have an existing PWS account, literally takes a single command (gist).

The output of the command should resemble the screengrab below. Note the total concurrent connections and total storage for each plan.

img007_PCF_ElephantSQL_Service_Purchase

To get details about the running ElephantSQL service, use the cf service elections command.

img007_PCF_ElephantSQL_Service_Info

From the ElephantSQL Console, we can obtain the connection information required to access our PostgreSQL elections database. You will need the default database name, username, password, and URL.

img012_PWS_ElephantSQL_Details

Service Binding

Once you have created the PostgreSQL database service, you need to bind the database service to the application. We will bind our application and the database, using the PCF deployment manifest file (manifest.yml), found in the project’s root directory. Binding is done using the services section (shown below).

The key/value pairs in the env section of the deployment manifest will become environment variables, local to the deployed Spring Boot service. These key/value pairs in the manifest will also override any configuration set in Spring’s external application properties file (application.yml). This file is located in the resources sub-directory. Note the SPRING_PROFILES_ACTIVE: test environment variable in the manifest.yml file. This variable designates which Spring Profile will be active from the multiple profiles defined in the application.yml file.

img008B_PCF_Manifest

Deployment to PWS

Next, we run gradle build followed by cf push to deploy one instance of the Spring Boot service to PWS and associate it with our ElephantSQL database instance. Below is the expected output from the cf push command.

img008_PCF_CF_Push

Note the route highlighted below. This is the URL where your Spring Boot service will be available.

img009_PCF_CF_Push2

To confirm your ElephantSQL database was populated by Liquibase when PWS started the deployed Spring application instance, we can check the ElephantSQL Console’s Stats tab. Note the database tables and rows in each table, signifying Liquibase ran successfully. Alternately, you could create another data source in your IDE, connected to ElephantSQL; this can be helpful for troubleshooting.

img013_Candidates

To access the running service and check that data is being returned, point your browser (or Postman) to the URL returned from the cf push command output (see second screengrab above) and hit the /candidates resource. Obviously, your URL, referred to as a route by PWS, will be different and unique. In the response payload, you should observe a JSON array of eight candidate objects. Each candidate was inserted into the Candidate table of the database, by Liquibase, when Liquibase executed the second of the four change sets on start-up.

img012_PWS_ElephantSQL

With Spring Boot Actuator and Spring Data REST, our simple Spring Boot application has dozens of resources exposed automatically, without extensive coding of resource controllers. Actuator exposes resources to help manage and troubleshoot the application, such as info, health, mappings (shown below), metrics, env, and configprops, among others. All Actuator resources are exposed explicitly, thus they can be disabled for Production deployments. With Spring Boot 2.0, all Actuator resources are now preceded with /actuator/ .

img029_Postman_Mappings

According to Pivotal, Spring Data REST builds on top of Spring Data repositories, analyzes an application’s domain model and exposes hypermedia-driven HTTP resources for aggregates contained in the model, such as our /candidates resource. A partial list of the application’s exposed resources are listed in the GitHub project’s README file.

In Spring’s approach to building RESTful web services, HTTP requests are handled by a controller. Spring Data REST automatically exposes CRUD resources for our entities. With Spring Data JPA, POJOs like our Candidate class are annotated with @Entity, indicating that it is a JPA entity. Lacking a @Table annotation, it is assumed that this entity will be mapped to a table named Candidate.

With Spring’s Data REST’s RESTful HTTP-based API, traditional database Create, Read, Update, and Delete commands for each PostgreSQL database table are automatically mapped to equivalent HTTP methods, including POST, GET, PUT, PATCH, and DELETE.

Below is an example, using Postman, to create a new Candidate using an HTTP POST method.

img029_Postman_Post

Below is an example, using Postman, to update a new Candidate using an HTTP PUT method.

img029_Postman_Put.PNG

With Spring Data REST, we can even retrieve data from read-only database Views, as shown below. This particular JSON response payload was returned from the candidates_by_elections database View, using the /election-candidates resource.

img028_Postman_View.PNG

Scaling Up

Once your application is deployed and you have tested its functionality, you can easily scale out or scale in the number instances, memory, and disk, with the cf scale command (gist).

Below is sample output from scaling up the Spring Boot application to two instances.

img016_Scale_Up2

Optionally, you can activate auto-scaling, which will allow the application to scale based on load.

img016_Autoscaling.PNG

Following the PCF architectural model, auto-scaling is actually another service from the Pivotal Services Marketplace, PCF App Autoscaler, as seen below, running alongside our ElephantSQL service.

img016_Autoscaling2.PNG

With PCF App Autoscaler, you set auto-scaling minimum and maximum instance limits and define scaling rules. Below, I have configured auto-scaling to scale out the number of application instances when the average CPU Utilization of all instances hits 80%. Conversely, the application will scale in when the average CPU Utilization recedes below 40%. In addition to CPU Utilization, PCF App Autoscaler also allows you to set scaling rules based on HTTP Throughput, HTTP Latency, RabbitMQ Depth (queue depth), and Memory Utilization.

Furthermore, I set the auto-scaling minimum number of instances to two and the maximum number of instances to four. No matter how much load is placed on the application, PWS will not scale above four instances. Conversely, PWS will maintain a minimum of two running instances at all times.

img016_Autoscaling3

Conclusion

This brief post demonstrates both the power and simplicity of Spring Boot to quickly develop highly-functional, data-centric RESTful API applications with minimal coding. Further, when coupled with Pivotal Cloud Foundry, Spring developers have a highly scalable, resilient cloud-native application hosting platform.

, , , , , , , , , , , ,

Leave a comment

Deploying and Configuring Istio on Google Kubernetes Engine (GKE)

GKE_021B

Introduction

Unquestionably, Kubernetes has quickly become the leading Container-as-a-Service (CaaS) platform. In late September 2017, Rancher Labs announced the release of Rancher 2.0, based on Kubernetes. In mid-October, at DockerCon Europe 2017, Docker announced they were integrating Kubernetes into the Docker platform. In late October, Microsoft released the public preview of Managed Kubernetes for Azure Container Service (AKS). In November, Google officially renamed its Google Container Engine to Google Kubernetes Engine. Most recently, at AWS re:Invent 2017, Amazon announced its own manged version of Kubernetes, Amazon Elastic Container Service for Kubernetes (Amazon EKS).

The recent abundance of Kuberentes-based CaaS offerings makes deploying, scaling, and managing modern distributed applications increasingly easier. However, as Craig McLuckie, CEO of Heptio, recently stated, “…it doesn’t matter who is delivering Kubernetes, what matters is how it runs.” Making Kubernetes run better is the goal of a new generation of tools, such as Istio, EnvoyProject Calico, Helm, and Ambassador.

What is Istio?

One of those new tools and the subject of this post is Istio. Released in Alpha by Google, IBM and Lyft, in May 2017, Istio is an open platform to connect, manage, and secure microservices. Istio describes itself as, “…an easy way to create a network of deployed services with load balancing, service-to-service authentication, monitoring, and more, without requiring any changes in service code. You add Istio support to services by deploying a special sidecar proxy throughout your environment that intercepts all network communication between microservices, configured and managed using Istio’s control plane functionality.

Istio contains several components, split between the data plane and a control plane. The data plane includes the Istio Proxy (an extended version of Envoy proxy). The control plane includes the Istio Mixer, Istio Pilot, and Istio-Auth. The Istio components work together to provide behavioral insights and operational control over a microservice-based service mesh. Istio describes a service mesh as a “transparently injected layer of infrastructure between a service and the network that gives operators the controls they need while freeing developers from having to bake solutions to distributed system problems into their code.

In this post, we will deploy the latest version of Istio, v0.4.0, on Google Cloud Platform, using the latest version of Google Kubernetes Engine (GKE), 1.8.4-gke.1. Both versions were just released in mid-December, as this post is being written. Google, as you probably know, was the creator of Kubernetes, now an open-source CNCF project. Google was the first Cloud Service Provider (CSP) to offer managed Kubernetes in the Cloud, starting in 2014, with Google Container Engine (GKE), which used Kubernetes. This post will outline the installation of Istio on GKE, as well as the deployment of a sample application, integrated with Istio, to demonstrate Istio’s observability features.

Getting Started

All code from this post is available on GitHub. You will need to change some variables within the code, to meet your own project’s needs (gist).

The scripts used in this post are as follows, in order of execution (gist).

Code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

Creating GKE Cluster

First, we create the Google Kubernetes Engine (GKE) cluster. The GKE cluster creation is highly-configurable from either the GCP Cloud Console or from the command line, using the Google Cloud Platform gcloud CLI. The CLI will be used throughout the post. I have chosen to create a highly-available, 3-node cluster (1 node/zone) in GCP’s South Carolina us-east1 region (gist).

Once built, we need to retrieve the cluster’s credentials.

Having chosen to use Kubernetes’ Alpha Clusters feature, the following warning is displayed, warning the Alpha cluster will be deleted in 30 days (gist).

The resulting GKE cluster will have the following characteristics (gist).

Installing Istio

With the GKE cluster created, we can now deploy Istio. There are at least two options for deploying Istio on GCP. You may choose to manually install and configure Istio in a GKE cluster, as I will do in this post, following these instructions. Alternatively, you may choose to use the Istio GKE Deployment Manager. This all-in-one GCP service will create your GKE cluster, and install and configure Istio and the Istio add-ons, including their Book Info sample application.

G002_DeployCluster

There were a few reasons I chose not to use the Istio GKE Deployment Manager option. First, until very recently, you could not install the latest versions of Istio with this option (as of 12/21 you can now deploy v0.3.0 and v0.4.0). Secondly, currently, you only have the choice of GKE version 1.7.8-gke.0. I wanted to test the latest v1.8.4 release with a stable GA version of RBAC. Thirdly, at least three out of four of my initial attempts to use the Istio GKE Deployment Manager failed during provisioning for unknown reasons. Lastly, you will learn more about GKE, Kubernetes, and Istio by doing it yourself, at least the first time.

Istio Code Changes

Before installing Istio, I had to make several minor code changes to my existing Kubernetes resource files. The requirements are detailed in Istio’s Pod Spec Requirements. These changes are minor, but if missed, cause errors during deployment, which can be hard to identify and resolve.

First, you need to name your Service ports in your Service resource files. More specifically, you need to name your service ports, http, as shown in the Candidate microservice’s Service resource file, below (note line 10) (gist).

Second, an app label is required for Istio. I added an app label to each Deployment and Service resource file, as shown below in the Candidate microservice’s Deployment resource files (note lines 5 and 6) (gist).

The next set of code changes were to my existing Ingress resource file. The requirements for an Ingress resource using Istio are explained here. The first change, Istio ignores all annotations other than kubernetes.io/ingress.class: istio (note line 7, below). The next change, if using HTTPS, the secret containing your TLS/SSL certificate and private key must be called istio-ingress-certs; all other names will be ignored (note line 10, below). Related and critically important, that secret must be deployed to the istio-system namespace, not the application’s namespace. The last change, for my particular my prefix match routing rules, I needed to change the rules from /{service_name} to /{service_name}/.*. The /.* is a special Istio notation that is used to indicate a prefix match (note lines 14, 18, and 22, below) (gist).

Installing Istio

To install Istio, you first will need to download and uncompress the correct distribution of Istio for your OS. Istio provides instructions for installation on various platforms.

My install-istio.sh script contains a variable, ISTIO_HOME, which should point to the root of your local Istio directory. We will also deploy all the current Istio add-ons, including Prometheus, Grafana, ZipkinService Graph, and Zipkin-to-Stackdriver (gist).

Once installed, from the GCP Cloud Console, an alternative to the native Kubernetes Dashboard, we should see the following Istio resources deployed and running. Below, note the three nodes are distributed across three zones within the GCP us-east-1 region, the correct version of GKE is employed, Stackdriver logging and monitoring are enabled, and the Alpha Clusters features are also enabled.

GKE_001

And here, we see the nodes that comprise the GKE cluster.

GKE_001_1

GKE_001_2.PNG

Below, note the four components that comprise Istio: istio-ca, istio-ingress, istio-mixer, and istio-pilot. Additionally, note the five components that comprise the Istio add-ons.

GKE_002

Below, observe the Istio Ingress has automatically been assigned a public IP address by GCP, accessible on ports 80 and 443. This IP address is how we will communicate with applications running on our GKE cluster, behind the Istio Ingress Load Balancer. Later, we will see how the Istio Ingress Load Balancer knows how to route incoming traffic to those application endpoints, using the Voter API’s Ingress configuration.

GKE_003.PNG

Istio makes ample use of Kubernetes Config Maps and Secrets, to store configuration, and to store certificates for mutual TLS.

GKE_004

Creation of the GKE cluster and deployed Istio to the cluster is complete. Following, I will demonstrate the deployment of the Voter API to the cluster. This will be used to demonstrate the capabilities of Istio on GKE.

Kubernetes Dashboard

In addition to the GCP Cloud Console, the native Kubernetes Dashboard is also available. To open, use the kubectl proxy command and connect to the Kubernetes Dashboard at https://127.0.0.1:8001/ui. You should now be able to view and edit all resources, from within the Kubernetes Dashboard.

GKE_005_5

Sample Application

To demonstrate the functionality of Istio and GKE, I will deploy the Voter API. I have used variations of the sample Voter API application in several previous posts, including Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB and Eventual Consistency: Decoupling Microservices with Spring AMQP and RabbitMQ. I suggest reading these two post to better understand the Voter API’s design.

AKS

For this post, I have reconfigured the Voter API to use MongoDB’s Atlas Database-as-a-Service (DBaaS) as the NoSQL data-source for each microservice. The Voter API is connected to a MongoDB Atlas 3-node M10 instance cluster in GCP’s us-east1 (South Carolina) region. With Atlas, you have the choice of deploying clusters to GCP or AWS.

GKE_014

The Voter API will use CloudAMQP’s RabbitMQ-as-a-Service for its decoupled, eventually consistent, message-based architecture. For this post, the Voter API is configured to use a RabbitMQ cluster in GCP’s us-east1 (South Carolina) region; I chose a minimally-configured free version of RabbitMQ. CloudAMQP allows you to provide a much more robust multi-node clusters for Production, on GCP or AWS.

GKE_015_1.PNG

CloudAMQP provides access to their own Management UI, in addition to access to RabbitMQ’s Management UI.

GKE_015B

With the Voter API running and taking traffic, we can see each Voter API microservice instance, nine replicas in total, connected to RabbitMQ. They are each publishing and consuming messages off the two queues.

GKE_016

The GKE, MongoDB Atlas, and RabbitMQ clusters are all running in the same GCP Region. Optimizing the Voter API cloud architecture on GCP, within a single Region, greatly reduces network latency, increases API performance, and improves end-to-end application and infrastructure observability and traceability.

Installing the Voter API

For simplicity, I have divided the Voter API deployment into three parts. First, we create the new voter-api Kubernetes Namespace, followed by creating a series of Voter API Kuberentes Secrets (gist).

There are a total of five secrets, one secret for each of the three microservice’s MongoDB databases, one secret for the RabbitMQ connection string (shown below), and one secret containing a Let’s Encrypt SSL/TLS certificate chain and private key for the Voter API’s domain, api.voter-demo.com (shown below).

GKE_011

GKE_006.PNG

GKE_007.PNG

Next, we create the microservice pods, using the Kubernetes Deployment files, create three ClusterIP-type Kubernetes Services, and a Kubernetes Ingress. The Ingress contains the service endpoint configuration, which Istio Ingress will use to correctly route incoming external API traffic (gist).

Three Kubernetes Pods for each of the three microservice should be created, for a total of nine pods. In the GCP Cloud UI’s Workloads (Kubernetes Deployments), you should see the following three resources. Note each Workload has three pods, each containing one replica of the microservice.

GKE_010

In the GCP Cloud UI’s Discovery and Load Balancing tab, you should observe the following four resources. Note the Voter API Ingress endpoints for the three microservices, which are used by the Istio Proxy, discussed below.

GKE_009.PNG

Istio Proxy

Examining the Voter API deployment more closely, you will observe that each of the nine Voter API microservice pods have two containers running within them (gist).

Along with the microservice container, there is an Istio Proxy container, commonly referred to as a sidecar container. Istio Proxy is an extended version of the Envoy proxy, Lyfts well-known, highly performant edge and service proxy. The proxy sidecar container is injected automatically when the Voter API pods are created. This is possible because we deployed the Istio Initializer (istio-initializer.yaml). The Istio Initializer guarantees that Istio Proxy will be automatically injected into every microservice Pod. This is referred to as automatic sidecar injection. Below we see an example of one of three Candidate pods running the istio-proxy sidecar.

GKE_012

In the example above, all traffic to and from the Candidate microservice now passes through the Istio Proxy sidecar. With Istio Proxy, we gain several enterprise-grade features, including enhanced observability, service discovery and load balancing, credential injection, and connection management.

Manual Sidecar Injection

What if we have application components we do not want automatically managed with Istio Proxy. In that case, manual sidecar injection might be preferable to automatic sidecar injection with Istio Initializer. For manual sidecar injection, we execute a istioctl kube-inject command for each of the Kubernetes Deployments. The command manually injects the Istio Proxy container configuration into the Deployment resource file, alongside each Voter API microservice container. On Mac and Linux, this command is similar to the following. Proxy injection is discussed in detail, here (gist).

External Service Egress

Whether you choose automatic or manual sidecar injection of the Istio Proxy, Istio’s egress rules currently only support HTTP and HTTPS requests. The Voter API makes external calls to its backend services, using two alternate protocols, MongoDB Wire Protocol (mongodb://) and RabbitMQ AMQP (amqps://). Since we cannot use an Istio egress rule for either protocol, we will use the includeIPRanges option with the istioctl kube-inject command to open egress to the two backend services. This will completely bypass Istio for a specific IP range. You can read more about calling external services directly, on Istio’s website.

You will need to modify the includeIPRanges argument within the create-voter-api-part3.sh script, adding your own GKE cluster’s IP ranges to the IP_RANGES variable. The two IP ranges can be found using the following GCP CLI command (gist).

The create-voter-api-part3.sh script also contains a modified version the istioctl kube-inject command for each Voter API Deployment. Using the modified command, the original Deployment files are not altered, instead, a temporary copy of the Deployment file is created into which Istio injects the required modifications. The temporary Deployment file is then used for the deployment, and then immediately deleted (gist).

Some would argue not having the actual deployed version of the file checked into in source code control is an anti-pattern; in this case, I would disagree. If I need to redeploy, I would just run the istioctl kube-inject command again. You can always view, edit, and import the deployed YAML file, from the GCP CLI or GKE Management UI.

The amount of Istio configuration injected into each microservice Pod’s Deployment resource file is considerable. The Candidate Deployment file swelled from 68 lines to 276 lines of code! This hints at the power, as well as the complexity of Istio. Shown below is a snippet of the Candidate Deployment YAML, after Istio injection.

GKE_025

Confirming Voter API

Installation of the Voter API is now complete. We can validate the Voter API is working, and that traffic is being routed through Istio, using Postman. Below, we see a list of candidates successfully returned from the Voter microservice, through the Voter API. This means, not only us the API running, but that messages have been successfully passed between the services, using RabbitMQ, and saved to the microservice’s corresponding MongoDB databases.

GKE_030

Below, note the server and x-envoy-upstream-service-time response headers. They both confirm the Voter API HTTPS traffic is being managed by Istio.

GKE_031.PNG

Observability

Observability is certainly one of the primary advantages of implementing Istio. For anyone like myself, who has spent many long and often frustrating hours installing, configuring, and managing monitoring systems for distributed platforms, Istio’s observability features are most welcome. Istio provides Prometheus, Grafana, ZipkinService Graph, and Zipkin-to-Stackdriver add-ons. Combined with the monitoring capabilities of Backend-as-a-Service providers, such as MongoDB Altas and CloudAMQP RabvbitMQ, you get considerable visibility into your application, out-of-the-box.

Prometheus
First, we will look at Prometheus, a leading open-source monitoring solution. The easiest way to access the Prometheus UI, or any of the other add-ons, including Prometheus, is using port-forwarding. For example with Prometheus, we use the following command (gist).

Alternatively, you could securely expose any of the Istio add-ons through the Istio Ingress, similar to how the Voter API microservice endpoints are exposed.

Prometheus collects time series metrics from both the Istio and Voter API components. Below we see two examples of typical metrics being collected; they include the 201 responses generated by the Candidate microservice, and the outflow of bytes by the Election microservice, over a given period of time.

GKE_022

GKE_022_1

Grafana
Although Prometheus is an excellent monitoring solution, Grafana, the leading open source software for time series analytics, provides a much easier way to visualize the metrics collected by Prometheus. Conveniently, Istio provides a dynamically-configured Grafana Dashboard, which will automatically display metrics for components deployed to GKE.

GKE_020B.PNG

Below, note the metrics collected for the Candidate and Election microservice replicas. Out-of-the-box, Grafana displays common HTTP KPIs, such as request rate, success rate, response codes, response time, and response size. Based on the version label included in the Deployment resource files, we can delineate metrics collected by the version of the Voter API microservices, in this case, v1 of the Candidate and Election microservices.

GKE_021B

Zipkin
Next, we have Zipkin, a leading distributed tracing system.

GKE_018

Since the Voter API application uses RabbitMQ to decouple communications between services, versus direct HTTP-based IPC, we won’t see any complex multi-segment traces. We will only see traces representing traffic to and from the microservices, which passes through the Istio Ingress.

GKE_019

Service Graph
Similar to Zipkin, Service Graph is not as valuable with the Voter API application as it could be with more complex applications. Below is a Service Graph view of the Voter API showing microservice version and requests/second to each microservice.

GKE_024

Stackdriver

One last tool we have to monitor our GKE cluster is Stackdriver. Stackdriver provides fine-grain monitoring, logging, and diagnostics. If you recall, we enabled Stackdriver logging and monitoring when we first provisioned the GKE cluster. Stackdrive allows us to examine the performance of the GKE cluster’s resources, review logs, and set alerts.

GKE_028

GKE_029

GKE_027

Zipkin-to-Stackdriver

When we installed Istio, we also installed the Zipkin-to-Stackdriver add-on. The Stackdriver Trace Zipkin Collector is a drop-in replacement for the standard Zipkin HTTP collector that writes to Google’s free Stackdriver Trace distributed tracing service. To use Stackdriver for traces originating from Zipkin, there is additional configuration required, which is commented out of the current version of the zipkin-to-stackdriver.yaml file (gist).

Instructions to configure the Zipkin-to-Stackdriver feature can be found here. Below is an example of how you might add the necessary configuration using a Kubernetes ConfigMap to inject the required user credentials JSON file (zipkin-to-stackdriver-creds.json) into the zipkin-to-stackdriver container. The new configuration can be seen on lines 27-44 (gist).

Conclusion

Istio provides a significant amount of fine-grained management control to Kubernetes. Managed Kubernetes CaaS offerings like GKE, coupled with tools like Istio, will soon make running reliable and secure containerized applications in Production, commonplace.

References

All opinions in this post are my own, and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , ,

1 Comment

Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB

An earlier post, Eventual Consistency: Decoupling Microservices with Spring AMQP and RabbitMQ, demonstrated the use of a message-based, event-driven, decoupled architectural approach for communications between microservices, using Spring AMQP and RabbitMQ. This distributed computing model is known as eventual consistency. To paraphrase microservices.io, ‘using an event-driven, eventually consistent approach, each service publishes an event whenever it updates its data. Other services subscribe to events. When an event is received, a service (subscriber) updates its data.

That earlier post illustrated a fairly simple example application, the Voter API, consisting of a set of three Spring Boot microservices backed by MongoDB and RabbitMQ, and fronted by an API Gateway built with HAProxy. All API components were containerized using Docker and designed for use with Docker CE for AWS as the Container-as-a-Service (CaaS) platform.

Optimizing for Kubernetes on Azure

This post will demonstrate how a modern application, such as the Voter API, is optimized for Kubernetes in the Cloud (Kubernetes-as-a-Service), in this case, AKS, Azure’s new public preview of Managed Kubernetes for Azure Container Service. According to Microsoft, the goal of AKS is to simplify the deployment, management, and operations of Kubernetes. I wrote about AKS in detail, in my last post, First Impressions of AKS, Azure’s New Managed Kubernetes Container Service.

In addition to migrating to AKS, the Voter API will take advantage of additional enterprise-grade Azure’s resources, including Azure’s Service Bus and Cosmos DB, replacements for the Voter API’s RabbitMQ and MongoDB. There are several architectural options for the Voter API’s messaging and NoSQL data source requirements when moving to Azure.

  1. Keep Dockerized RabbitMQ and MongoDB – Easy to deploy to Kubernetes, but not easily scalable, highly-available, or manageable. Would require storage optimized Azure VMs for nodes, node affinity, and persistent storage for data.
  2. Replace with Cloud-based Non-Azure Equivalents – Use SaaS-based equivalents, such as CloudAMQP (RabbitMQ-as-a-Service) and MongoDB Atlas, which will provide scalability, high-availability, and manageability.
  3. Replace with Azure Service Bus and Cosmos DB – Provides all the advantages of  SaaS-based equivalents, and additionally as Azure resources, benefits from being in the Azure Cloud alongside AKS.

Source Code

The Kubernetes resource files and deployment scripts used in this post are all available on GitHub. This is the only project you need to clone to reproduce the AKS example in this post.

git clone \
  --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/azure-aks-sb-cosmosdb-demo.git

The Docker images for the three Spring microservices deployed to AKS, Voter, Candidate, and Election, are available on Docker Hub. Optionally, the source code, including Dockerfiles, for the Voter, Candidate, and Election microservices, as well as the Voter Client are available on GitHub, in the kub-aks branch.

git clone \
  --branch kub-aks --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/candidate-service.git

git clone \
  --branch kub-aks --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/election-service.git

git clone \
  --branch kub-aks --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/voter-service.git

git clone \
  --branch kub-aks --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/voter-client.git

Azure Service Bus

To demonstrate the capabilities of Azure’s Service Bus, the Voter API’s Spring microservice’s source code has been re-written to work with Azure Service Bus instead of RabbitMQ. A future post will explore the microservice’s messaging code. It is more likely that a large application, written specifically for a technology that is easily portable such as RabbitMQ or MongoDB, would likely remain on that technology, even if the application was lifted and shifted to the Cloud or moved between Cloud Service Providers (CSPs). Something important to keep in mind when choosing modern technologies – portability.

Service Bus is Azure’s reliable cloud Messaging-as-a-Service (MaaS). Service Bus is an original Azure resource offering, available for several years. The core components of the Service Bus messaging infrastructure are queues, topics, and subscriptions. According to Microsoft, ‘the primary difference is that topics support publish/subscribe capabilities that can be used for sophisticated content-based routing and delivery logic, including sending to multiple recipients.

Since the three Voter API’s microservices are not required to produce messages for more than one other service consumer, Service Bus queues are sufficient, as opposed to a pub/sub model using Service Bus topics.

Cosmos DB

Cosmos DB, Microsoft’s globally distributed, multi-model database, offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements (SLAs). Ideal for the Voter API, Cosmos DB supports MongoDB’s data models through the MongoDB API, a MongoDB database service built on top of Cosmos DB. The MongoDB API is compatible with existing MongoDB libraries, drivers, tools, and applications. Therefore, there are no code changes required to convert the Voter API from MongoDB to Cosmos DB. I simply had to change the database connection string.

NGINX Ingress Controller

Although the Voter API’s HAProxy-based API Gateway could be deployed to AKS, it is not optimal for Kubernetes. Instead, the Voter API will use an NGINX-based Ingress Controller. NGINX will serve as an API Gateway, as HAProxy did, previously.

According to NGINX, ‘an Ingress is a Kubernetes resource that lets you configure an HTTP load balancer for your Kubernetes services. Such a load balancer usually exposes your services to clients outside of your Kubernetes cluster.

An Ingress resource requires an Ingress Controller to function. Continuing from NGINX, ‘an Ingress Controller is an application that monitors Ingress resources via the Kubernetes API and updates the configuration of a load balancer in case of any changes. Different load balancers require different Ingress controller implementations. In the case of software load balancers, such as NGINX, an Ingress controller is deployed in a pod along with a load balancer.

There are currently two NGINX-based Ingress Controllers available, one from Kubernetes and one directly from NGINX. Both being equal, for this post, I chose the Kubernetes version, without RBAC (Kubernetes offers a version with and without RBAC). RBAC should always be used for actual cluster security. There are several advantages of using either version of the NGINX Ingress Controller for Kubernetes, including Layer 4 TCP and UDP and Layer 7 HTTP load balancing, reverse proxying, ease of SSL termination, dynamically-configurable path-based rules, and support for multiple hostnames.

Azure Web App

Lastly, the Voter Client application, not really part of the Voter API, but useful for demonstration purposes, will be converted from a containerized application to an Azure Web App. Since it is not part of the Voter API, separating the Client application from AKS makes better architectural sense. Web Apps are a powerful, richly-featured, yet incredibly simple way to host applications and services on Azure. For more information on using Azure Web Apps, read my recent post, Developing Applications for the Cloud with Azure App Services and MongoDB Atlas.

Revised Component Architecture

Below is a simplified component diagram of the new architecture, including Azure Service Bus, Cosmos DB, and the NGINX Ingress Controller. The new architecture looks similar to the previous architecture, but as you will see, it is actually very different.

AKS-r8.png

Process Flow

To understand the role of each API component, let’s look at one of the event-driven, decoupled process flows, the creation of a new election candidate. In the simplified flow diagram below, an API consumer executes an HTTP POST request containing the new candidate object as JSON. The Candidate microservice receives the HTTP request and creates a new document in the Cosmos DB Voter database. A Spring RepositoryEventHandler within the Candidate microservice responds to the document creation and publishes a Create Candidate event message, containing the new candidate object as JSON, to the Azure Service Bus Candidate Queue.

Candidate_Producer

Independently, the Voter microservice is listening to the Candidate Queue. Whenever a new message is produced by the Candidate microservice, the Voter microservice retrieves the message off the queue. The Voter microservice then transforms the new candidate object contained in the incoming message to its own candidate data model and creates a new document in its own Voter database.

Voter_Consumer

The same process flows exist between the Election and the Candidate microservices. The Candidate microservice maintains current elections in its database, which are retrieved from the Election queue.

Data Models

It is useful to understand, the Candidate microservice’s candidate domain model is not necessarily identical to the Voter microservice’s candidate domain model. Each microservice may choose to maintain its own representation of a vote, a candidate, and an election. The Voter service transforms the new candidate object in the incoming message based on its own needs. In this case, the Voter microservice is only interested in a subset of the total fields in the Candidate microservice’s model. This is the beauty of decoupling microservices, their domain models, and their datastores.

Other Events

The versions of the Voter API microservices used for this post only support Election Created events and Candidate Created events. They do not handle Delete or Update events, which would be necessary to be fully functional. For example, if a candidate withdraws from an election, the Voter service would need to be notified so no one places votes for that candidate. This would normally happen through a Candidate Delete or Candidate Update event.

Provisioning Azure Service Bus

First, the Azure Service Bus is provisioned. Provisioning the Service Bus may be accomplished using several different methods, including manually using the Azure Portal or programmatically using Azure Resource Manager (ARM) with PowerShell or Terraform. I chose to provision the Azure Service Bus and the two queues using the Azure Portal for expediency. I chose the Basic Service Bus Tier of service, of which there are three tiers, Basic, Standard, and Premium.

Azure_006_Full_ServiceBus

The application requires two queues, the candidate.queue, and the election.queue.

Azure_008_ServiceBus

Provisioning Cosmos DB

Next, Cosmos DB is provisioned. Like Azure Service Bus, Cosmos DB may be provisioned using several methods, including manually using the Azure Portal, programmatically using Azure Resource Manager (ARM) with PowerShell or Terraform, or using the Azure CLI, which was my choice.

az cosmosdb create \
  --name cosmosdb_instance_name_goes_here \
  --resource-group resource_group_name_goes_here \
  --location "East US=0" \
  --kind MongoDB

The post’s Cosmos DB instance exists within the single East US Region, with no failover. In a real Production environment, you would configure Cosmos DB with multi-region failover. I chose MongoDB as the type of Cosmos DB database account to create. The allowed values are GlobalDocumentDB, MongoDB, Parse. All other settings were left to the default values.

The three Spring microservices each have their own database. You do not have to create the databases in advance of consuming the Voter API. The databases and the database collections will be automatically created when new documents are first inserted by the microservices. Below, the three databases and their collections have been created and populated with documents.

Azure_001_CosmosDB

The GitHub project repository also contains three shell scripts to generate sample vote, candidate, and election documents. The scripts will delete any previous documents from the database collections and generate new sets of sample documents. To use, you will have to update the scripts with your own Voter API URL.

AKS_012_AddVotes2

MongoDB Aggregation Pipeline

Each of the three Spring microservices uses Spring Data MongoDB, which takes advantage of MongoDB’s Aggregation Framework. According to MongoDB, ‘the aggregation framework is modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.’ Below is an example of aggregation from the Candidate microservice’s VoterContoller class.

Aggregation aggregation = Aggregation.newAggregation(
    match(Criteria.where("election").is(election)),
    group("candidate").count().as("votes"),
    project("votes").and("candidate").previousOperation(),
    sort(Sort.Direction.DESC, "votes")
);

To use MongoDB’s aggregation framework with Cosmos DB, it is currently necessary to activate the MongoDB Aggregation Pipeline Preview Feature of Cosmos DB. The feature can be activated from the Azure Portal, as shown below.

Azure_002_CosmosDB

Cosmos DB Emulator

Be warned, Cosmos DB can be very expensive, even without database traffic or any Production-grade bells and whistles. Be careful when spinning up instances on Azure for learning purposes, the cost adds up quickly! In less than ten days, while writing this post, my cost was almost US$100 for the Voter API’s Cosmos DB instance.

I strongly recommend downloading the free Azure Cosmos DB Emulator to develop and test applications from your local machine. Although certainly not as convenient, it will save you the cost of developing for Cosmos DB directly on Azure.

With Cosmos DB, you pay for reserved throughput provisioned and data stored in containers (a collection of documents or a table or a graph). Yes, that’s right, Azure charges you per MongoDB collection, not even per database. Azure Cosmos DB’s current pricing model seems less than ideal for microservice architectures, each with their own database instance.

By default the reserved throughput, billed as Request Units (RU) per second or RU/s, is set to 1,000 RU/s per collection. For development and testing, you can reduce each collection to a minimum of 400 RU/s. The Voter API creates five collections at 1,000 RU/s or 5,000 RU/s total. Reducing this to a total of 2,000 RU/s makes Cosmos DB marginally more affordable to explore.

Azure_003_CosmosDB.PNG

Building the AKS Cluster

An existing Azure Resource Group is required for AKS. I chose to use the latest available version of Kubernetes, 1.8.2.

# login to azure
az login \
  --username your_username  \
  --password your_password

# create resource group
az group create \
  --resource-group resource_group_name_goes_here \
  --location eastus

# create aks cluster
az aks create \
  --name cluser_name_goes_here \
  --resource-group resource_group_name_goes_here \
  --ssh-key-value path_to_your_public_key \
  --kubernetes-version 1.8.2

# get credentials to access aks cluster
az aks get-credentials \
  --name cluser_name_goes_here \
  --resource-group resource_group_name_goes_here

# display cluster's worker nodes
kubectl get nodes --output=wide

By default, AKS will provision a three-node Kubernetes cluster using Azure’s Standard D1 v2 Virtual Machines. According to Microsoft, ‘D series VMs are general purpose VM sizes provide balanced CPU-to-memory ratio. They are ideal for testing and development, small to medium databases, and low to medium traffic web servers.’ Azure D1 v2 VM’s are based on Linux OS images, currently Debian 8 (Jessie), with 1 vCPU and 3.5 GB of memory. By default with AKS, each VM receives 30 GiB of Standard HDD attached storage.

AKS_002_CreateCluster
You should always select the type and quantity of the cluster’s VMs and their attached storage, optimized for estimated traffic volumes and the specific workloads you are running. This can be done using the --node-count--node-vm-size, and --node-osdisk-size arguments with the az aks create command.

Deployment

The Voter API resources are deployed to its own Kubernetes Namespace, voter-api. The NGINX Ingress Controller resources are deployed to a different namespace, ingress-nginx. Separate namespaces help organize individual Kubernetes resources and separate different concerns.

Voter API

First, the voter-api namespace is created. Then, five required Kubernetes Secrets are created within the namespace. These secrets all contain sensitive information, such as passwords, that should not be shared. There is one secret for each of the three Cosmos DB database connection strings, one secret for the Azure Service Bus connection string, and one secret for the Let’s Encrypt SSL/TLS certificate and private key, used for secure HTTPS access to the Voter API.

AKS_004_CreateVoterAPI

Secrets

The Voter API’s secrets are used to populate environment variables within the pod’s containers. The environment variables are then available for use within the containers. Below is a snippet of the Voter pods resource file showing how the Cosmos DB and Service Bus connection strings secrets are used to populate environment variables.

env:
  - name: AZURE_SERVICE_BUS_CONNECTION_STRING
    valueFrom:
      secretKeyRef:
        name: azure-service-bus
        key: connection-string
  - name: SPRING_DATA_MONGODB_URI
    valueFrom:
      secretKeyRef:
        name: azure-cosmosdb-voter
        key: connection-string

Shown below, the Cosmos DB and Service Bus connection strings secrets have been injected into the Voter container and made available as environment variables to the microservice’s executable JAR file on start-up. As environment variables, the secrets are visible in plain text. Access to containers should be tightly controlled through Kubernetes RBAC and Azure AD, to ensure sensitive information, such as secrets, remain confidential.

AKS_014_EnvVars

Next, the three Kubernetes ReplicaSet resources, corresponding to the three Spring microservices, are created using Deployment controllers. According to Kubernetes, a Deployment that configures a ReplicaSet is now the recommended way to set up replication. The Deployments specify three replicas of each of the three Spring Services, resulting in a total of nine Kubernetes Pods.

Each pod, by default, will be scheduled on a different node if possible. According to Kubernetes, ‘the scheduler will automatically do a reasonable placement (e.g. spread your pods across nodes, not place the pod on a node with insufficient free resources, etc.).’ Note below how each of the three microservice’s three replicas has been scheduled on a different node in the three-node AKS cluster.

AKS_005B_CreateVoterPods

Next, the three corresponding Kubernetes ClusterIP-type Services are created. And lastly, the Kubernetes Ingress is created. According to Kubernetes, the Ingress resource is an API object that manages external access to the services in a cluster, typically HTTP. Ingress provides load balancing, SSL termination, and name-based virtual hosting.

The Ingress configuration contains the routing rules used with the NGINX Ingress Controller. Shown below are the routing rules for each of the three microservices within the Voter API. Incoming API requests are routed to the appropriate pod and service port by NGINX.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: voter-ingress
  namespace: voter-api
  annotations:
    ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
  - hosts:
    - api.voter-demo.com
    secretName: api-voter-demo-secret
  rules:
  - http:
      paths:
      - path: /candidate
        backend:
          serviceName: candidate
          servicePort: 8080
      - path: /election
        backend:
          serviceName: election
          servicePort: 8080
      - path: /voter
        backend:
          serviceName: voter
          servicePort: 8080

The screengrab below shows all of the Voter API resources created on AKS.

AKS_005B_CreateVoterAPI

NGINX Ingress Controller

After completing the deployment of the Voter API, the NGINX Ingress Controller is created. It starts with creating the ingress-nginx namespace. Next, the NGINX Ingress Controller is created, consisting of the NGINX Ingress Controller, three Kubernetes ConfigMap resources, and a default back-end application. The Controller and backend each have their own Service resources. Like the Voter API, each has three replicas, for a total of six pods. Together, the Ingress resource and NGINX Ingress Controller manage traffic to the Spring microservices.

The screengrab below shows all of the NGINX Ingress Controller resources created on AKS.

AKS_015_NGINX_Resources.PNG

The  NGINX Ingress Controller Service, shown above, has an external public IP address associated with itself.  This is because that Service is of the type, Load Balancer. External requests to the Voter API will be routed through the NGINX Ingress Controller, on this IP address.

kind: Service
apiVersion: v1
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
  labels:
    app: ingress-nginx
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  selector:
    app: ingress-nginx
  ports:
  - name: http
    port: 80
    targetPort: http
  - name: https
    port: 443
    targetPort: https

If you are only using HTTPS, not HTTP, then the references to HTTP and port 80 in the Ingress configuration are unnecessary. The NGINX Ingress Controller’s resources are explained in detail in the GitHub documentation, along with further configuration instructions.

DNS

To provide convenient access to the Voter API and the Voter Client, my domain, voter-demo.com, is associated with the public IP address associated with the Voter API Ingress Controller and with the public IP address associated with the Voter Client Azure Web App. DNS configuration is done through Azure’s DNS Zone resource.

AKS_008_FinalDNS

The two TXT type records might not look as familiar as the SOA, NS, and A type records. The TXT records are required to associate the domain entries with the Voter Client Azure Web App. Browsing to http://www.voter-demo.com or simply http://voter-demo.com brings up the Voter Client.

VoterClientAngular5_small.png

The Client sends and receives data via the Voter API, available securely at https://api.voter-demo.com.

Routing API Requests

With the Pods, Services, Ingress, and NGINX Ingress Controller created and configured, as well as the Azure Layer 4 Load Balancer and DNS Zone, HTTP requests from API consumers are properly and securely routed to the appropriate microservices. In the example below, three back-to-back requests are made to the voter/info API endpoint. HTTP requests are properly routed to one of the three Voter pod replicas using the default round-robin algorithm, as proven by the observing the different hostnames (pod names) and the IP addresses (private pod IPs) in each HTTP response.

AKS_013B_LoadBalancingReplicaSets.PNG

Final Architecture

Shown below is the final Voter API Azure architecture. To simplify the diagram, I have deliberately left out the three microservice’s ClusterIP-type Services, the three default back-end application pods, and the default back-end application’s ClusterIP-type Service. All resources shown below are within the single East US Azure region, except DNS, which is a global resource.

kub-aks-v10-small.png

Shown below is the new Azure Resource Group created by Azure during the AKS provisioning process. The Resource Group contains the various resources required to support the AKS cluster, NGINX Ingress Controller, and the Voter API. Necessary Azure resources were automatically provisioned when I provisioned AKS and when I created the new Voter API and NGINX resources.

AKS_006_FinalRG

In addition to the Resource Group above, the original Resource Group contains the AKS Container Service resource itself, Service Bus, Cosmos DB, and the DNS Zone resource.

AKS_007_FinalRG

The Voter Client Web App, consisting of the Azure App Service and App Service plan resource, is located in a third, separate Resource Group, not shown here.

Cleaning Up AKS

A nice feature of AKS, running a az aks delete command will delete all the Azure resources created as part of provisioning AKS, the API, and the Ingress Controller. You will have to delete the Cosmos DB, Service Bus, and DNS Zone resources, separately.

az aks delete \
  --name cluser_name_goes_here \
  --resource-group resource_group_name_goes_here

Conclusion

Taking advantage of Kubernetes with AKS, and the array of Azure’s enterprise-grade resources, the Voter API was shifted from a simple Docker architecture to a production-ready solution. The Voter API is now easier to manage, horizontally scalable, fault-tolerant, and marginally more secure. It is capable of reliably supporting dozens more microservices, with multiple replicas. The Voter API will handle a high volume of data transactions and event messages.

There is much more that needs to be done to productionalize the Voter API on AKS, including:

  • Add multi-region failover of Cosmos DB
  • Upgrade to Service Bus Standard or Premium Tier
  • Optimized Azure VMs and storage for anticipated traffic volumes and application-specific workloads
  • Implement Kubernetes RBAC
  • Add Monitoring, logging, and alerting with Envoy or similar
  • Secure end-to-end TLS communications with Itsio or similar
  • Secure the API with OAuth and Azure AD
  • Automate everything with DevOps – AKS provisioning, testing code, creating resources, updating microservices, and managing data

All opinions in this post are my own, and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , , ,

3 Comments