Archive for category Enterprise Software Development

Deploying Spring Boot Apps to AWS with Netflix Nebula and Spinnaker: Part 1 of 2

Listening to DevOps industry pundits, you might be convinced everyone is running containers in Production (or by now, serverless). Although containerization is growing at a phenomenal rate, several recent surveys¹ indicate less than 50% of enterprises are deploying containers in Production. Filter those results further with the fact that, of those enterprises, only a small percentage of their total application portfolios are containerized, let alone running in Production.

As a DevOps Consultant, I regularly work with corporations whose global portfolios are in the thousands of applications. Indeed, some percentage of their applications are containerized, with fewer running in Production. However, a majority of those applications, even those built on modern, lightweight, distributed architectures, are still being deployed to bare-metal and virtualized public cloud and private data center infrastructure, for a variety of reasons.

Enterprise Deployment

Due to the scale and complexity of application portfolios, many organizations have invested in enterprise deployment tools, either commercially available or developed in-house. The enterprise deployment tool’s primary objective is to standardize the process of securely, reliably, and repeatably packaging, publishing, and deploying both containerized and non-containerized applications to large fleets of virtual machines and bare-metal servers, across multiple, geographically dispersed data centers and cloud providers. Enterprise deployment tools are particularly common in tightly regulated and compliance-driven organizations, as well as organizations that have undertaken large amounts of M&A, resulting in vastly different application technology stacks.

Enterprise CI/CD/Release Workflow

Better-known examples of commercially available enterprise deployment tools include IBM UrbanCode Deploy (aka uDeploy), XebiaLabs XL Deploy, CA Automic Release Automation, Octopus Deploy, and Electric Cloud ElectricFlow. While commercial tools continue to gain market share³, many organizations are tightly coupled to their in-house solutions through years of use and fear of widespread process disruption, given current economic, security, compliance, and skills-gap sensitivities.

Deployment Tool Anatomy

Most Enterprise deployment tools are compatible with standard binary package types, including Debian (.deb) and Red Hat Package Manager (.rpm) packages for Linux, NuGet (.nupkg) packages for Windows, and Node Package Manager (.npm) and Bower packages for JavaScript. There are equivalent package types for other popular languages and formats, such as Go, Python, Ruby, SQL, Android, Objective-C, Swift, and Docker. Packages usually contain application metadata, a signature to ensure integrity and/or authenticity², and a compressed payload.

Enterprise deployment tools are normally integrated with open-source packaging and publishing tools, such as Apache Maven, Apache Ivy/Ant, Gradle, NPM, NuGet, Bundler, PIP, and Docker.

Binary packages (and images), built with enterprise deployment tools, are typically stored in private, open-source or commercial binary (artifact) repositories, such as Spacewalk, JFrog Artifactory, and Sonatype Nexus Repository. The latter two, Artifactory and Nexus, support a multitude of modern package types and repository structures, including Maven, NuGet, PyPI, NPM, Bower, Ruby Gems, CocoaPods, Puppet, Chef, and Docker.

Mature binary repositories provide many features in addition to package management, including role-based access control, vulnerability scanning, rich APIs, DevOps integration, and fault-tolerant, high-availability architectures.

Lastly, enterprise deployment tools generally rely on standard package management systems to retrieve and install cryptographically verifiable packages and images. These include YUM (Yellowdog Updater, Modified), APT (Advanced Package Tool), APK (Alpine Linux), NuGet, Chocolatey, NPM, PIP, Bundler, and Docker. Packages are deployed directly to running infrastructure, or indirectly to intermediate deployable components, such as Amazon Machine Images (AMI), Google Compute Engine machine images, VMware machines, Docker Images, or CoreOS rkt.

Open-Source Alternative

One such enterprise with an extensive portfolio of both containerized and non-containerized applications is Netflix. To standardize their deployments to multiple types of cloud infrastructure, Netflix has developed several well-known open-source software (OSS) tools, including the Nebula Gradle plugins and Spinnaker. I discussed Spinnaker in my previous post, Managing Applications Across Multiple Kubernetes Environments with Istio, as an alternative to Jenkins for deploying container workloads to Kubernetes on Google Kubernetes Engine (GKE).

As a leader in OSS, Netflix has documented their deployment process in several articles and presentations, including a post from 2016, ‘How We Build Code at Netflix.’ According to the article, the high-level process for deployment to Amazon EC2 instances involves the following steps:

  • Code is built and tested locally using Nebula
  • Changes are committed to a central git repository
  • Jenkins job executes Nebula, which builds, tests, and packages the application for deployment
  • Builds are “baked” into Amazon Machine Images (using Spinnaker)
  • Spinnaker pipelines are used to deploy and promote the code change

The Nebula plugins and Spinnaker leverage many underlying, open-source technologies, including Pivotal Spring, Java, Groovy, Gradle, Maven, Apache Commons, Redline RPM, HashiCorp Packer, Redis, HashiCorp Consul, Cassandra, and Apache Thrift.

Both the Nebula plugins and Spinnaker have been battle tested in Production by Netflix, as well as by many other industry leaders after Netflix open-sourced the tools in 2014 (Nebula) and 2015 (Spinnaker). Currently, there are approximately 20 Nebula Gradle plugins available on GitHub. Notable core contributors to the development of Spinnaker include Google, Microsoft, Pivotal, Target, Veritas, and Oracle, to name a few. A sign of its success, Spinnaker currently has over 4,600 Stars on GitHub!

Part Two: Demonstration

In Part Two, we will deploy a production-ready Spring Boot application, the Election microservice, to multiple Amazon EC2 instances, behind an Elastic Load Balancer (ELB). We will use a fully automated DevOps workflow. The build, test, package, bake, deploy process will be handled by the Netflix Nebula Gradle Linux Packaging Plugin, Jenkins, and Spinnaker. The high-level process will involve the following steps:

  • Configure Gradle to build a production-ready, fully executable application for Unix systems (executable JAR)
  • Using deb-s3 and GPG Suite, create a secure, signed APT (Debian) repository on Amazon S3
  • Using Jenkins and the Netflix Nebula plugin, build a Debian package, containing the executable JAR and configuration files
  • Using Jenkins and deb-s3, publish the package to the S3-based APT repository (a rough sketch of the deb-s3 usage follows this list)
  • Using Spinnaker (HashiCorp Packer under the covers), bake an Ubuntu Amazon Machine Image (AMI), replete with the executable JAR installed from the Debian package
  • Deploy an auto-scaling set of Amazon EC2 instances from the baked AMI, behind an ELB, running the Spring Boot application using both the Red/Black and Highlander deployment strategies
  • Be able to repeat the entire automated build, test, package, bake, deploy process, triggered by a new code push to GitHub
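
To make steps two and four above more concrete, below is a hedged sketch of the deb-s3 commands involved. The bucket name, GPG key ID, codename, and package file are placeholders, not values from the demonstration, and the flags should be confirmed against the deb-s3 documentation.

# Hypothetical example only: bucket, key ID, codename, and package name are placeholders.
# On first upload, deb-s3 creates the repository structure and metadata in the bucket;
# subsequent uploads update it. The --sign flag signs the repository metadata with a
# local GPG key (managed with GPG Suite on macOS).
deb-s3 upload \
  --bucket my-apt-repo-bucket \
  --codename xenial \
  --preserve-versions \
  --sign=3A9E5B2C \
  election-service_0.1.0_all.deb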

The overall build, test, package, bake, deploy process will look as follows.

DebianPackageWorkflow12

References

 

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

¹ Recent Surveys: Forrester, Portworx, Cloud Foundry Survey
² Courtesy Wikipedia – rpm
³ XebiaLabs Kicks Off 2017 with Triple-Digit Growth in Enterprise DevOps



Updating and Maintaining Gradle Project Dependencies

As a DevOps Consultant, I often encounter codebases that have not been properly kept up-to-date. Likewise, I’ve authored many open-source projects on GitHub, which I use for training, presentations, and articles. Those projects often sit dormant for months at a time, #myabandonware.

Poorly maintained and dormant projects often become brittle or break, as their dependencies and indirect dependencies continue to be updated. However, blindly updating project dependencies is often the quickest way to break, or further break, an application. Ask me, I’ve given in to temptation and broken my fair share of applications as a result. Nonetheless, it is helpful to be able to quickly analyze a project’s dependencies and discover available updates. Defects, performance issues, and most importantly, security vulnerabilities, are often fixed with dependency updates.

For Node.js projects, I prefer David to discover dependency updates. I have other favorites for Ruby, .NET, and Python, including OWASP Dependency-Check, great for vulnerabilities. In a similar vein, for Gradle-based Java Spring projects, I recently discovered Ben Manes’ Gradle Versions Plugin, gradle-versions-plugin. The plugin is described as a ‘Gradle plugin to discover dependency updates’. The plugin’s GitHub project has over 1,350 stars! According to the plugin project’s README file, this plugin is similar to the Versions Maven Plugin. The project further indicates there are similar Gradle plugins available, including gradle-use-latest-versions, gradle-libraries-plugin, and gradle-update-notifier.

To try the Gradle Versions Plugin, I chose a recent Gradle-based Java Spring Boot API project. I added the plugin to the build.gradle file with a single line of code.

plugins {
  id 'com.github.ben-manes.versions' version '0.17.0'
}

By executing the single Gradle task, dependencyUpdates, the plugin generates a report detailing the status of all the project’s dependencies, including plugins. The plugin includes a revision task property, which controls the resolution strategy for determining what constitutes the latest version of a dependency. The property supports three strategies: release, milestone (default), and integration (i.e. SNAPSHOT), which are detailed in the plugin project’s README file.
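
For example, the task can be run from the Gradle wrapper, optionally overriding the default revision strategy with the revision system property, as documented in the plugin’s README.

# Run the dependency report with the default 'milestone' strategy
./gradlew dependencyUpdates

# Or restrict the report to final releases only
./gradlew dependencyUpdates -Drevision=release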

As expected, the plugin will properly resolve any variables. Using a variable is an efficient practice for setting the Spring Boot version for multiple dependencies (i.e. springBootVersion).

ext {
    springBootVersion = '2.0.1.RELEASE'
}

dependencies {
    compile('com.h2database:h2:1.4.197')
    compile("io.springfox:springfox-swagger-ui:2.8.0")
    compile("io.springfox:springfox-swagger2:2.8.0")
    compile("org.liquibase:liquibase-core:3.5.5")
    compile("org.sonarsource.scanner.gradle:sonarqube-gradle-plugin:2.6.2")
    compile("org.springframework.boot:spring-boot-starter-actuator:${springBootVersion}")
    compile("org.springframework.boot:spring-boot-starter-data-jpa:${springBootVersion}")
    compile("org.springframework.boot:spring-boot-starter-data-rest:${springBootVersion}")
    compile("org.springframework.boot:spring-boot-starter-hateoas:${springBootVersion}")
    compile("org.springframework.boot:spring-boot-starter-web:${springBootVersion}")
    compileOnly('org.projectlombok:lombok:1.16.20')
    runtime("org.postgresql:postgresql:42.2.2")
    testCompile("org.springframework.boot:spring-boot-starter-test:${springBootVersion}")
}

My first run, using the default revision level, resulted in the following output. The report indicated three of my project’s dependencies were slightly out of date:

> Configure project :
Inferred project: spring-postgresql-demo, version: 4.3.0-dev.2.uncommitted+929c56e

> Task :dependencyUpdates
Failed to resolve ::apiElements
Failed to resolve ::implementation
Failed to resolve ::runtimeElements
Failed to resolve ::runtimeOnly
Failed to resolve ::testImplementation
Failed to resolve ::testRuntimeOnly

------------------------------------------------------------
: Project Dependency Updates (report to plain text file)
------------------------------------------------------------

The following dependencies are using the latest milestone version:
- com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin:0.17.0
- com.netflix.nebula:gradle-ospackage-plugin:4.9.0-rc.1
- com.h2database:h2:1.4.197
- io.spring.dependency-management:io.spring.dependency-management.gradle.plugin:1.0.5.RELEASE
- org.projectlombok:lombok:1.16.20
- com.netflix.nebula:nebula-release-plugin:6.3.3
- org.sonarqube:org.sonarqube.gradle.plugin:2.6.2
- org.springframework.boot:org.springframework.boot.gradle.plugin:2.0.1.RELEASE
- org.postgresql:postgresql:42.2.2
- org.sonarsource.scanner.gradle:sonarqube-gradle-plugin:2.6.2
- org.springframework.boot:spring-boot-starter-actuator:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-data-jpa:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-data-rest:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-hateoas:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-test:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-web:2.0.1.RELEASE

The following dependencies have later milestone versions:
- org.liquibase:liquibase-core [3.5.5 -> 3.6.1]
- io.springfox:springfox-swagger-ui [2.8.0 -> 2.9.0]
- io.springfox:springfox-swagger2 [2.8.0 -> 2.9.0]

Generated report file build/dependencyUpdates/report.txt

After reading the release notes for the three available updates, and confident I had sufficient unit, smoke, and integration tests to validate any project changes, I manually updated the dependencies. Re-running the Gradle task generated the following abridged output.

------------------------------------------------------------
: Project Dependency Updates (report to plain text file)
------------------------------------------------------------

The following dependencies are using the latest milestone version:
- com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin:0.17.0
- com.netflix.nebula:gradle-ospackage-plugin:4.9.0-rc.1
- com.h2database:h2:1.4.197
- io.spring.dependency-management:io.spring.dependency-management.gradle.plugin:1.0.5.RELEASE
- org.liquibase:liquibase-core:3.6.1
- org.projectlombok:lombok:1.16.20
- com.netflix.nebula:nebula-release-plugin:6.3.3
- org.sonarqube:org.sonarqube.gradle.plugin:2.6.2
- org.springframework.boot:org.springframework.boot.gradle.plugin:2.0.1.RELEASE
- org.postgresql:postgresql:42.2.2
- org.sonarsource.scanner.gradle:sonarqube-gradle-plugin:2.6.2
- org.springframework.boot:spring-boot-starter-actuator:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-data-jpa:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-data-rest:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-hateoas:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-test:2.0.1.RELEASE
- org.springframework.boot:spring-boot-starter-web:2.0.1.RELEASE
- io.springfox:springfox-swagger-ui:2.9.0
- io.springfox:springfox-swagger2:2.9.0

Generated report file build/dependencyUpdates/report.txt

BUILD SUCCESSFUL in 3s
1 actionable task: 1 executed

After running a series of automated unit, smoke, and integration tests, to confirm no conflicts with the updates, I committed my changes to GitHub. The Gradle Versions Plugin is a simple and effective solution to Gradle dependency management.

All opinions expressed in this post are my own, and not necessarily the views of my current or past employers, or their clients.

Gradle logo courtesy Gradle.org, © Gradle Inc. 



Managing Applications Across Multiple Kubernetes Environments with Istio: Part 2

In this two-part post, we are exploring the creation of a GKE cluster, replete with the latest version of Istio, often referred to as IoK (Istio on Kubernetes). We will then deploy, perform integration testing, and promote an application across multiple environments within the cluster.

Part Two

In Part One of this post, we created a Kubernetes cluster on the Google Cloud Platform, installed Istio, provisioned a PostgreSQL database, and configured DNS for routing. Under the assumption that v1 of the Election microservice had already been released to Production, we deployed v1 to each of the three namespaces.

In Part Two of this post, we will learn how to utilize the advanced API testing capabilities of Postman and Newman to ensure v2 is ready for UAT and release to Production. We will deploy and perform integration testing of a new v2 of the Election microservice, locally on Kubernetes Minikube. Once confident v2 is functioning as intended, we will promote and test v2 across the dev, test, and uat namespaces.

Source Code

As a reminder, all source code for this post can be found on GitHub. The project’s README file contains a list of the Election microservice’s endpoints. To get started quickly, use one of the two following options (gist).

Code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

This project includes a kubernetes sub-directory, containing all the Kubernetes resource files and scripts necessary to recreate the example shown in the post.

Testing Locally with Minikube

Deploying to GKE, no matter how automated, takes time and resources, whether those resources are team members or just compute and system resources. Before deploying v2 of the Election service to the non-prod GKE cluster, we should ensure that it has been thoroughly tested locally. Local testing should include the following test criteria:

  1. Source code builds successfully
  2. All unit-tests pass
  3. A new Docker Image can be created from the build artifact
  4. The Service can be deployed to Kubernetes (Minikube)
  5. The deployed instance can connect to the database and execute the Liquibase changesets
  6. The deployed instance passes a minimal set of integration tests

Minikube gives us the ability to quickly iterate and test an application, as well as the Kubernetes and Istio resources required for its operation, before promoting to GKE. These resources include Kubernetes Namespaces, Secrets, Deployments, Services, Route Rules, and Istio Ingresses. Since Minikube is just that, a miniature version of our GKE cluster, we should be able to have a nearly one-to-one parity between the Kubernetes resources we apply locally and those applied to GKE. This post assumes you have the latest version of Minikube installed, and are familiar with its operation.

This project includes a minikube sub-directory, containing all the Kubernetes resource files and scripts necessary to recreate the Minikube deployment example shown in this post. The three included scripts are designed to be easily adapted to a CI/CD DevOps workflow. You may need to modify the scripts to match your environment’s configuration. Note this Minikube-deployed version of the Election service relies on the external Amazon RDS database instance.

Local Database Version

To eliminate the AWS costs, I have included a second, alternate version of the Minikube Kubernetes resource files, minikube_db_local. This version deploys a single containerized PostgreSQL database instance to Minikube, as opposed to relying on the external Amazon RDS instance. Be aware, the database does not have persistent storage or an Istio sidecar proxy.

istio_100.png

Minikube Cluster

If you do not have a running Minikube cluster, create one with the minikube start command.

istio_081

Minikube allows you to use normal kubectl CLI commands to interact with the Minikube cluster. Using the kubectl get nodes command, we should see a single Minikube node running the latest Kubernetes v1.10.0.

istio_082
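
If you are creating the cluster from scratch, something like the following should work; the memory, CPU, and Kubernetes version flags shown are only suggestions and can be omitted or adjusted for your machine.

# Start a local cluster (resource values are suggestions, not requirements)
minikube start --memory 4096 --cpus 2 --kubernetes-version v1.10.0

# Confirm the single Minikube node is ready
kubectl get nodes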

Istio on Minikube

Next, install Istio following Istio’s online installation instructions. A basic Istio installation on Minikube, without the additional add-ons, should only require a single Istio install script.

istio_083

If successful, you should observe a new istio-system namespace, containing the four main Istio components: istio-ca, istio-ingress, istio-mixer, and istio-pilot.

istio_084
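
For reference, a minimal, non-authoritative sketch of that installation, assuming the Istio 0.7.x release archive has already been downloaded and extracted into the current directory:

# Install the core Istio components (no add-ons) into the cluster
kubectl apply -f install/kubernetes/istio.yaml

# Verify the istio-system namespace and its components are running
kubectl get pods -n istio-system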

Deploy v2 to Minikube

Next, create a Minikube Development environment, consisting of a dev Namespace, Istio Ingress, and Secret, using the part1-create-environment.sh script. Then, deploy v2 of the Election service to the dev Namespace, along with an associated Route Rule, using the part2-deploy-v2.sh script. One v2 instance should be sufficient to satisfy the testing requirements.

istio_085

Access to v2 of the Election service on Minikube is a bit different than with GKE. When routing external HTTP requests, there is no load balancer, no external public IP address, and no public DNS or subdomains. To access the single instance of v2 running on Minikube, we use the local IP address of the Minikube cluster, obtained with the minikube ip command. The access port required is the Node Port (nodePort) of the istio-ingress Service. The command is shown below (gist) and included in the part3-smoke-test.sh script.

The second part of our HTTP request routing is the same as with GKE, relying on an Istio Route Rule. The /v2/ sub-collection resource in the HTTP request URL is rewritten and routed to the v2 election Pod by the Route Rule. To confirm v2 of the Election service is running and addressable, curl the /v2/actuator/health endpoint. Spring Actuator’s /health endpoint is frequently used at the end of a CI/CD server’s deployment pipeline to confirm success. The Spring Boot application can take a few minutes to fully start up and be responsive to requests, depending on the speed of your local machine.
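
This is not the post’s actual gist, but a request of this general shape should work, assuming the Istio ingress Service is named istio-ingress in the istio-system namespace and its first exposed port is the HTTP NodePort.

# Assumptions: Service name, namespace, and port index may differ in your setup
MINIKUBE_IP=$(minikube ip)
NODE_PORT=$(kubectl get svc istio-ingress -n istio-system \
  -o jsonpath='{.spec.ports[0].nodePort}')

# Confirm v2 of the Election service is up via Spring Boot Actuator
curl -s "http://${MINIKUBE_IP}:${NODE_PORT}/v2/actuator/health"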

istio_093.png

Using the Kubernetes Dashboard, we should see our deployment of the single Election service Pod is running successfully in Minikube’s dev namespace.

istio_087

Once deployed, we run a battery of integration tests to confirm that the new v2 functionality is working as intended before deploying to GKE. In the next section of this post, we will explore the process of creating and managing Postman Collections and Postman Environments, and how to automate those Collections of tests with Newman and Jenkins.

istio_088

Integration Testing

The typical reason an application is deployed to lower environments, prior to Production, is to perform application testing. Although definitions vary across organizations, testing commonly includes some or all of the following types: Integration Testing, Functional Testing, System Testing, Stress or Load Testing, Performance Testing, Security Testing, Usability Testing, Acceptance Testing, Regression Testing, Alpha and Beta Testing, and End-to-End Testing. Test teams may also refer to other testing forms, such as Whitebox (Glassbox), Blackbox Testing, Smoke, Validation, or Sanity Testing, and Happy Path Testing.

The site, softwaretestinghelp.com, defines integration testing as, ‘testing of all integrated modules to verify the combined functionality after integration is termed so. Modules are typically code modules, individual applications, client and server applications on a network, etc. This type of testing is especially relevant to client/server and distributed systems.’

In this post, we are concerned that our integrated modules are functioning cohesively, primarily the Election service, Amazon RDS database, DNS, Istio Ingress, Route Rules, and the Istio sidecar Proxy. Unlike Unit Testing and Static Code Analysis (SCA), which are done pre-deployment, integration testing requires an application to be deployed and running in an environment.

Postman

I have chosen Postman, along with Newman, to execute a Collection of integration tests before promoting to the next environment. The integration tests confirm the deployed application’s name and version. The integration tests then perform a series of HTTP GET, POST, PUT, PATCH, and DELETE actions against the service’s resources. The integration tests verify a successful HTTP response code is returned, based on the type of request made.

istio_055

Postman tests are written in JavaScript, similar to other popular, modern testing frameworks. Postman offers advanced features such as test-chaining. Tests can be chained together through the use of environment variables to store response values and pass them on to other tests. Values shared between tests are also stored in the Postman Environments. Below, we store the ID of the new candidate, the result of an HTTP POST to the /candidates endpoint. We then use the stored candidate ID in subsequent HTTP GET, PUT, and PATCH test requests to the same /candidates endpoint.

istio_056

Environment-specific variables, such as the resource host, port, and environment sub-collection resource, are abstracted and stored as key/value pairs within Postman Environments, and called through variables in the request URL and within the tests. Thus, the same Postman Collection of tests may be run against multiple environments using different Postman Environments.

istio_057

Postman Runner allows us to run multiple iterations of our Collection. We also have the option to build in delays between tests. Lastly, Postman Runner can load external JSON and CSV formatted test data, which is beyond the scope of this post.

istio_058

Postman contains a simple Run Summary UI for viewing test results.

istio_060

Test Automation

To support running tests from the command line, Postman provides Newman. According to Postman, Newman is a command-line collection runner for Postman. Newman offers the same functionality as Postman’s Collection Runner, all part of the newman CLI. Newman is a Node.js module, installed globally as an npm package, npm install newman --global.
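
A typical invocation looks something like the following; the Collection and Environment file names are placeholders, and the JUnit reporter flags anticipate the Jenkins integration discussed later in this post.

# Install Newman globally (Node.js required)
npm install newman --global

# Run a Collection against a specific Postman Environment,
# emitting both console output and a JUnit-style XML report
newman run election-service.postman_collection.json \
  -e dev.postman_environment.json \
  --reporters cli,junit \
  --reporter-junit-export newman-results.xml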

Typically, Development and Testing teams compose Postman Collections and define Postman Environments, locally. Teams run their tests locally in Postman, during their development cycle. Then, those same Postman Collections are executed from the command line, or more commonly as part of a CI/CD pipeline, such as with Jenkins.

Below, the same Collection of integration tests run in the Postman Runner UI is run from the command line, using Newman.

istio_061

Jenkins

Without a doubt, Jenkins is the leading open-source CI/CD automation server. The building, testing, publishing, and deployment of microservices to Kubernetes is relatively easy with Jenkins. Generally, you would build, unit-test, push a new Docker image, and then deploy your application to Kubernetes using a series of CI/CD pipelines. Below, we see examples of these pipelines using Jenkins Blue Ocean, starting with a continuous integration pipeline, which includes unit-testing and Static Code Analysis (SCA) with SonarQube.

istio_108

This is followed by a second pipeline, which builds the Docker Image, using the build artifact from the above pipeline, and pushes the Image to Docker Hub.

istio_109

The third pipeline demonstrates building the three Kubernetes environments and deploying v1 of the Election service to the dev namespace. This pipeline is just for demonstration purposes; typically, you would separate these functions.

istio_110

Spinnaker

An alternative to Jenkins for the deployment of microservices is Spinnaker, created by Netflix. According to Netflix, ‘Spinnaker is an open source, multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence.’ Spinnaker is designed to integrate easily with Jenkins, dividing responsibilities for continuous integration and delivery, with deployment. Below are two sample Spinnaker deployment pipelines, similar to the Jenkins pipelines, for deploying v1 and v2 of the Election service to the non-prod GKE cluster.

spin_07

Below, Spinnaker has deployed v2 of the Election service to dev using a Highlander deployment strategy. Subsequently, Spinnaker has deployed v2 to test using a Red/Black deployment strategy, leaving the previously released v1 Server Group in place, in case a rollback is required.

spin_08

Once Spinnaker has completed the deployment tasks, the Postman Collections of smoke and integration tests are executed by Newman, as part of another Jenkins CI/CD pipeline.

istio_101B.png

In this pipeline, a set of basic smoke tests is run first to ensure the new deployment is running properly, and then the integration tests are executed.

istio_102

In this simple example, we have a three-stage pipeline created from a Jenkinsfile (gist).

Test Results

Newman offers several options for displaying test results. For easy integration with Jenkins, Newman results can be delivered in a format that can be displayed as JUnit test reports. The JUnit test report format, XML, is a popular method of standardizing test results from different testing tools. Below is a truncated example of a test report file (gist).

Translating Newman test results to JUnit reports allows the percentage of test cases successfully executed to be tracked over multiple deployments, a universal testing metric. Below we see the JUnit Test Reports Test Result Trend graph for a series of test runs.

istio_103

Deploying to Development

Development environments typically have a rapid turnover of application versions. Many teams use their Development environment as a continuous integration environment, where every commit that successfully builds and passes all unit tests, is deployed. The purpose of the CI deployments is to ensure build artifacts will successfully deploy through the CI/CD pipeline, start properly, and pass a basic set of smoke tests.

Other teams use the Development environments as an extension of their local Minikube environment. The Development environment will possess some or all of the required external integration points, which the Developer’s local Minikube environment may not. The goal of the Development environment is to help Developers ensure their application is functioning correctly and is ready for the Test teams to evaluate, prior to promotion to the Test environment.

Some external integration points, such as external payment gateways, customer relationship management (CRM) systems, content management systems (CMS), or data analytics engines, are often stubbed-out in lower environments. Generally, third-party providers only offer a limited number of parallel non-Production integration environments. While an application may pass through several non-prod environments, testing against all external integration points will only occur in one or two of those environments.

With v2 of the Election service ready for testing on GKE, we deploy it to the GKE cluster’s dev namespace using the part4a-deploy-v2-dev.sh script. We will also delete the previous v1 version of the Election service. Similar to the v1 deployment script, the v2 scripts perform a kube-inject command, which manually injects the Istio sidecar proxy alongside the Election service, into each election v2 Pod. The deployment script also deploys an alternate Istio Route Rule, which routes requests to the api.dev.voter-demo.com/v2/* resource of v2 of the Election service.

istio_054.png

Once deployed, we run our Postman Collection of integration tests with Newman or as part of a CI/CD pipeline. In the Development environment, we may choose to run a limited set of tests for the sake of expediency, or because not all external integration points are accessible.

Promotion to Test

With local Minikube and Development environment testing complete, we promote and deploy v2 of the Election service to the Test environment, using the part4b-deploy-v2-test.sh script. In Test, we will not delete v1 of the Election service.

istio_062

Often, an organization will maintain a running copy of all versions of an application currently deployed to Production, in a lower environment. Let’s look at two scenarios where this is common. First, v1 of the Election service has an issue in Production, which needs to be confirmed and may require a hot-fix by the Development team. Validation of the v1 Production bug is often done in a lower environment. The second scenario for having both versions running in an environment is when v1 and v2 both need to co-exist in Production. Organizations frequently support multiple API versions. Cutting over an entire API user-base to a new API version is often completed over a series of releases, and requires careful coordination with API consumers.

Testing All Versions

An essential role of integration testing should be to confirm that both versions of the Election service are functioning correctly, while simultaneously running in the same namespace. For example, we want to verify traffic is routed correctly, based on the HTTP request URL, to the correct version. Another common test scenario is database schema changes. Suppose we make what we believe are backward-compatible database changes to v2 of the Election service. We should be able to prove, through testing, that both the old and new versions function correctly against the latest version of the database schema.

There are different automation strategies that could be employed to test multiple versions of an application without creating separate Collections and Environments. A simple solution would be to templatize the Environments file, and then programmatically change the Postman Environment’s version variable injected from a pipeline parameter (abridged environment file shown below).

istio_095.png
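
A minimal sketch of that idea, using nothing more than sed in a pipeline step, is shown below; the template token and file names are hypothetical and not taken from the post’s pipeline.

# API_VERSION would be supplied as a pipeline parameter (e.g. v1 or v2)
API_VERSION=v2

# Substitute the token in a templated Postman Environment file,
# then run the same Collection against the generated Environment
sed "s/__API_VERSION__/${API_VERSION}/g" \
  election.postman_environment.template.json > election.postman_environment.json

newman run election-service.postman_collection.json \
  -e election.postman_environment.json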

Once initial automated integration testing is complete, Test teams will typically execute additional forms of application testing if necessary, before signing off for UAT and Performance Testing to begin.

User-Acceptance Testing

With testing in the Test environments completed, we continue on to UAT. The term UAT suggests that a set of actual end-users (API consumers) of the Election service will perform their own testing. Frequently, UAT is only done for a short, fixed period of time, often with a specialized team of Testers. Issues experienced during UAT can be expensive and can impact the ability to release an application to Production on-time if sign-off is delayed.

After deploying v2 of the Election service to UAT, and before opening it up to the UAT team, we would naturally want to repeat the same integration testing process we conducted in the previous Test environment. We must ensure that v2 is functioning as expected before our end-users begin their testing. This is where leveraging a tool like Jenkins makes automated integration testing more manageable and repeatable. One strategy would be to duplicate our existing Development and Test pipelines, and re-target the new pipeline to call v2 of the Election service in UAT.

istio_104.png

Again, in a JUnit report format, we can examine individual results through the Jenkins Console.

istio_105.png

We can also examine individual results from each test run using a specific build’s Console Output.

istio_106.png

Testing and Instrumentation

To fully evaluate the integration test results, you must look beyond just the percentage of test cases executed successfully. It makes little sense to release a new version of an application if it passes all functional tests, but significantly increases client response times, unnecessarily increases memory consumption or wastes other compute resources, or is grossly inefficient in the number of calls it makes to the database or third-party dependencies. Oftentimes, integration testing uncovers potential performance bottlenecks that are incorporated into performance test plans.

Critical intelligence about the performance of the application can only be obtained through the use of logging and metrics collection and instrumentation. Istio provides this telemetry out-of-the-box with Zipkin, Jaeger, Service Graph, Fluentd, Prometheus, and Grafana. In the included Grafana Istio Dashboard below, we see the performance of v1 of the Election service, under test, in the Test environment. We can compare request and response payload size and timing, as well as request and response times to external integration points, such as our Amazon RDS database. We are able to observe the impact of individual test requests on the application and all its integration points.

istio_067

As part of integration testing, we should monitor the Amazon RDS CloudWatch metrics. CloudWatch allows us to evaluate critical database performance metrics, such as the number of concurrent database connections, CPU utilization, read and write IOPS, Memory consumption, and disk storage requirements.

istio_043
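
These metrics can also be pulled programmatically. Below is a hedged sketch using the AWS CLI, where the DB instance identifier and time window are placeholders.

# Average number of database connections during a test run (5-minute periods)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=election-postgresql \
  --statistics Average \
  --period 300 \
  --start-time 2018-04-01T00:00:00Z \
  --end-time 2018-04-01T01:00:00Z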

A discussion of metrics starts moving us toward load and performance testing against Production service-level agreements (SLAs). Using a similar approach to integration testing, with load and performance testing, we should be able to accurately estimate the sizing requirements of our new application for Production. Load and Performance Testing help answer questions such as what type and size of compute resources are required for our GKE Production cluster and our Amazon RDS database, and how many compute nodes and instances (Pods) are necessary to support the expected user load.

All opinions expressed in this post are my own, and not necessarily the views of my current or past employers, or their clients.



Managing Applications Across Multiple Kubernetes Environments with Istio: Part 1

In the following two-part post, we will explore the creation of a GKE cluster, replete with the latest version of Istio, often referred to as IoK (Istio on Kubernetes). We will then deploy, perform integration testing, and promote an application across multiple environments within the cluster.

Application Environment Management

Container orchestration engines, such as Kubernetes, have revolutionized the deployment and management of microservice-based architectures. Combined with a Service Mesh, such as Istio, Kubernetes provides a secure, instrumented, enterprise-grade platform for modern, distributed applications.

One of many challenges with any platform, even one built on Kubernetes, is managing multiple application environments. Whether applications run on bare-metal, virtual machines, or within containers, deploying to and managing multiple application environments increases operational complexity.

As Agile software development practices continue to increase within organizations, the need for multiple, ephemeral, on-demand environments also grows. Traditional environments that were once only composed of Development, Test, and Production have expanded in enterprises to include a dozen or more environments, to support the many stages of the modern software development lifecycle. Current application environments often include Continuous Integration and Delivery (CI), Sandbox, Development, Integration Testing (QA), User Acceptance Testing (UAT), Staging, Performance, Production, Disaster Recovery (DR), and Hotfix. Each environment requires its own compute, security, networking, configuration, and corresponding dependencies, such as databases and message queues.

Environments and Kubernetes

There are various infrastructure architectural patterns employed by Operations and DevOps teams to provide Kubernetes-based application environments to Development teams. One pattern consists of separate physical Kubernetes clusters. Separate clusters provide a high level of isolation. Isolation offers many advantages, including increased performance and security, the ability to tune each cluster’s compute resources to meet differing SLAs, and ensuring a reduced blast radius when things go terribly wrong. Conversely, separate clusters often result in increased infrastructure costs and operational overhead, and complex deployment strategies. This pattern is often seen in heavily regulated, compliance-driven organizations, where security, auditability, and separation of duties are paramount.

Kube Clusters Diagram F15

Namespaces

An alternative to separate physical Kubernetes clusters is virtual clusters. Virtual clusters are created using Kubernetes Namespaces. According to Kubernetes documentation, ‘Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces’.

In most enterprises, Operations and DevOps teams deliver a combination of both virtual and physical Kubernetes clusters. For example, lower environments, such as those used for Development, Test, and UAT, often reside on the same physical cluster, each in a separate virtual cluster (namespace). At the same time, environments such as Performance, Staging, Production, and DR, often require the level of isolation only achievable with physical Kubernetes clusters.

In the Cloud, physical clusters may be further isolated and secured using separate cloud accounts. For example, with AWS you might have a Non-Production AWS account and a Production AWS account, both managed by an AWS Organization.

Kube Clusters Diagram v2 F3

In a multi-environment scenario, a single physical cluster would contain multiple namespaces, into which separate versions of an application or applications are independently deployed, accessed, and tested. Below we see a simple example of a single Kubernetes non-prod cluster on the left, containing multiple versions of different microservices, deployed across three namespaces. You would likely see this type of deployment pattern as applications are deployed, tested, and promoted across lower environments, before being released to Production.

Kube Clusters Diagram v2 F5.png

Example Application

To demonstrate the promotion and testing of an application across multiple environments, we will use a simple election-themed microservice, developed for a previous post, Developing Cloud-Native Data-Centric Spring Boot Applications for Pivotal Cloud Foundry. The Spring Boot-based application allows API consumers to create, read, update, and delete candidates, elections, and votes, through an exposed set of resources, accessed via RESTful endpoints.

Source Code

All source code for this post can be found on GitHub. The project’s README file contains a list of the Election microservice’s endpoints. To get started quickly, use one of the two following options (gist).

Code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

This project includes a kubernetes sub-directory, containing all the Kubernetes resource files and scripts necessary to recreate the example shown in the post. The scripts are designed to be easily adapted to a CI/CD DevOps workflow. You will need to modify the script’s variables to match your own environment’s configuration.

istio_107small

Database

The post’s Spring Boot application relies on a PostgreSQL database. In the previous post, ElephantSQL was used to host the PostgreSQL instance. This time, I have used Amazon RDS for PostgreSQL. Amazon RDS for PostgreSQL and ElephantSQL are equivalent choices. For simplicity, you might also consider a containerized version of PostgreSQL, managed as part of your Kubernetes environment.

Ideally, each environment should have a separate database instance. Separate database instances provide better isolation, fine-grained RBAC, easier test data lifecycle management, and improved performance. However, for this post, a single, shared, minimally-sized RDS instance will suffice.

The PostgreSQL database’s sensitive connection information, including database URL, username, and password, are stored as Kubernetes Secrets, one secret for each namespace, and accessed by the Kubernetes Deployment controllers.

istio_043.png

Istio

Although not required, Istio makes the task of managing multiple virtual and physical clusters significantly easier. Following Istio’s online installation instructions, download and install Istio 0.7.1.

To create a Google Kubernetes Engine (GKE) cluster with Istio, you could use gcloud CLI’s container clusters create command, followed by installing Istio manually using Istio’s supplied Kubernetes resource files. This was the method used in the previous post, Deploying and Configuring Istio on Google Kubernetes Engine (GKE).

Alternatively, you could use Istio’s Google Cloud Platform (GCP) Deployment Manager files, along with the gcloud CLI’s deployment-manager deployments create command to create a Kubernetes cluster, replete with Istio, in a single step. Although arguably simpler, the deployment-manager method does not provide the same level of fine-grain control over cluster configuration as the container clusters create method. For this post, the deployment-manager method will suffice.

istio_001

The latest version of the Google Kubernetes Engine, available at the time of this post, is 1.9.6-gke.0. However, installing this version of Kubernetes Engine using Istio’s supplied Deployment Manager Jinja template requires updating the hardcoded value in the istio-cluster.jinja file from 1.9.2-gke.1. This has been updated in the next release of Istio.

istio_002

Another change concerns the latest version of Istio offered as an option in the istio-cluster-jinja.schema file. Specifically, the installIstioRelease configuration variable only offers 0.6.0; the template does not include 0.7.1 as an option. Modify the istio-cluster-jinja.schema file to include the choice of 0.7.1. Optionally, I also set 0.7.1 as the default. This change should also be included in the next version of Istio.

istio_075.png

There are a limited number of GKE and Istio configuration defaults defined in the istio-cluster.yaml file, all of which can be overridden from the command line.

istio_002B.png

To optimize the cluster, and keep compute costs to a minimum, I have overridden several of the default configuration values using the properties flag with the gcloud CLI’s deployment-manager deployments create command. The README file provided by Istio explains how to use this feature. Configuration changes include the name of the cluster, the version of Istio (0.7.1), the number of nodes (2), the GCP zone (us-east1-b), and the node instance type (n1-standard-1). I also disabled automatic sidecar injection and chose not to install the Istio sample book application onto the cluster (gist).
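
The referenced gist is not reproduced here, but the general shape of the command is sketched below. The property keys are illustrative guesses and must be checked against the istio-cluster-jinja.schema file, which defines the actual names and allowed values.

# Illustrative only: property keys below are assumptions; consult
# istio-cluster-jinja.schema for the real names and allowed values.
gcloud deployment-manager deployments create istio-non-prod \
  --template=istio-cluster.jinja \
  --properties "zone:us-east1-b,initialNodeCount:2,instanceType:n1-standard-1,installIstioRelease:0.7.1"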

Cluster Provisioning

To provision the GKE cluster and deploy Istio, first modify the variables in the part1-create-gke-cluster.sh file (shown above), then execute the script. The script also retrieves your cluster’s credentials, to enable command line interaction with the cluster using the kubectl CLI.

istio_002C.png

Once complete, validate the version of Istio by examining Istio’s Docker image versions, using the following command (gist).

The result should be a list of Istio 0.7.1 Docker images.

istio_076.png

The new cluster should be running GKE version 1.9.6-gke.0. This can be confirmed using the following command (gist).

Or, from the GCP Cloud Console.

istio_037

The new GKE cluster should be composed of (2) n1-standard-1 nodes, running in the us-east1-b zone.

istio_038

As part of the deployment, all of the separate Istio components should be running within the istio-system namespace.

istio_040

As part of the deployment, an external IP address and a load balancer were provisioned by GCP and associated with the Istio Ingress. GCP’s Deployment Manager should have also created the necessary firewall rules for cluster ingress and egress.

istio_010.png

Building the Environments

Next, we will create three namespaces, dev, test, and uat, which represent three non-production environments. Each environment consists of a Kubernetes Namespace, Istio Ingress, and Secret. The three environments are deployed using the part2-create-environments.sh script.

istio_048.png

Deploying Election v1

For this demonstration, we will assume v1 of the Election service has been previously promoted, tested, and released to Production. Hence, we would expect v1 to be deployed to each of the lower environments. Additionally, a new v2 of the Election service has been developed and tested locally using Minikube. It is ready for deployment to the three environments and will undergo integration testing (detailed in Part Two of the post).

If you recall from our GKE/Istio configuration, we chose manual sidecar injection of the Istio proxy. Therefore, all election deployment scripts perform a kube-inject command. To connect to our external Amazon RDS database, this kube-inject command requires the includeIPRanges flag, which contains two cluster configuration values, the cluster’s IPv4 CIDR (clusterIpv4Cidr) and the service’s IPv4 CIDR (servicesIpv4Cidr).

Before deployment, we export the includeIPRanges value as an environment variable, which will be used by the deployment scripts, using the following command, export IP_RANGES=$(sh ./get-cluster-ip-ranges.sh). The get-cluster-ip-ranges.sh script is shown below (gist).

Using this method with manual sidecar injection is discussed in the previous post, Deploying and Configuring Istio on Google Kubernetes Engine (GKE).
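
A rough sketch of the pattern the deployment scripts wrap is shown below; the Deployment file name is a placeholder, and the flag spelling should be verified against the istioctl 0.7.x help.

# Export the cluster and service CIDR ranges needed by kube-inject
export IP_RANGES=$(sh ./get-cluster-ip-ranges.sh)

# Manually inject the Istio sidecar proxy into the Deployment spec,
# excluding traffic to the external Amazon RDS database from interception,
# then apply the result to the target namespace (file name is hypothetical)
istioctl kube-inject --includeIPRanges="${IP_RANGES}" \
  -f election-deployment-v1.yaml | kubectl apply -n dev -f -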

To deploy v1 of the Election service to all three namespaces, execute the part3-deploy-v1-all-envs.sh script.

istio_051.png

We should now have two instances of v1 of the Election service running in each of the dev, test, and uat namespaces, for a total of six election-v1 Kubernetes Pods.

istio_052

HTTP Request Routing

Before deploying additional versions of the Election service in Part Two of this post, we should understand how external HTTP requests will be routed to different versions of the Election service, in multiple namespaces. In the post’s simple example, we have a matrix of three namespaces and two versions of the Election service. That means we need a method to route external traffic to up to six different election versions. There are multiple ways to solve this problem, each with its own pros and cons. For this post, I found a combination of DNS and HTTP request rewriting to be most effective.

DNS

First, to route external HTTP requests to the correct namespace, we will use subdomains. Using my current DNS management solution, Azure DNS, I create three new A records for my registered domain, voter-demo.com. There is one A record for each namespace, including api.dev, api.test, and api.uat.

istio_077.png
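
For reference, an equivalent record can be created from the Azure CLI; the resource group name and IP address below are placeholders.

# Create the api.dev A record, pointing at the cluster's external IP
# (resource group name and IP address are placeholders)
az network dns record-set a add-record \
  --resource-group dns-resource-group \
  --zone-name voter-demo.com \
  --record-set-name api.dev \
  --ipv4-address <external-ip>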

All three subdomains should resolve to the single external IP address assigned to the cluster’s load balancer.

istio_010.png

As part of the environments’ creation, the script deployed an Istio Ingress to each environment. The Ingress accepts traffic based on a match to the Request URL (gist).

The istio-ingress service load balancer, running in the istio-system namespace, routes inbound external traffic, based on the Request URL, to the Istio Ingress in the appropriate namespace.

istio_053.png

The Istio Ingress in the namespace then directs the traffic to one of the Kubernetes Pods, containing the Election service and the Istio sidecar proxy.

istio_068.png

HTTP Rewrite

To direct the HTTP request to v1 or v2 of the Election service, an Istio Route Rule is used. As part of the environment creation, along with Namespace and Ingress resources, we also deployed an Istio Route Rule to each environment. This particular route rule examines the HTTP request URL for a /v1/ or /v2/ sub-collection resource. If it finds the sub-collection resource, it performs an HTTPRewrite, removing the sub-collection resource from the HTTP request. The Route Rule then directs the HTTP request to the appropriate version of the Election service, v1 or v2 (gist).

According to Istio, ‘if there are multiple registered instances with the specified tag(s), they will be routed to based on the load balancing policy (algorithm) configured for the service (round-robin by default).’ We are using the default load balancing algorithm to distribute requests across multiple copies of each Election service.

The final external HTTP request routing for the Election service in the Non-Production GKE cluster is shown on the left, in the diagram, below. Every Election service Pod also contains an Istio sidecar proxy instance.

Kube Clusters Diagram F14

Below are some examples of HTTP GET requests that would be successfully routed to our Election service, using the above-described routing strategy (gist).
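
The referenced gist is not reproduced here, but requests of roughly the following shape would match the routing strategy described above; the scheme and resource paths are illustrative.

# v1 of the Election service in the dev namespace
curl -s http://api.dev.voter-demo.com/v1/actuator/health

# v1 of the Election service in the test namespace
curl -s http://api.test.voter-demo.com/v1/candidates

# v2 of the Election service in the uat namespace (once deployed)
curl -s http://api.uat.voter-demo.com/v2/candidates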

Part Two

In Part One of this post, we created the Kubernetes cluster on the Google Cloud Platform, installed Istio, provisioned a PostgreSQL database, and configured DNS for routing. Under the assumption that v1 of the Election microservice had already been released to Production, we deployed v1 to each of the three namespaces.

In Part Two of this post, we will learn how to utilize the sophisticated API testing capabilities of Postman and Newman to ensure v2 is ready for UAT and release to Production. We will deploy and perform integration testing of a new v2 of the Election microservice, locally, on Kubernetes Minikube. Once we are confident v2 is functioning as intended, we will promote and test v2 across the dev, test, and uat namespaces.

All opinions expressed in this post are my own, and not necessarily the views of my current or past employers, or their clients.



Deploying and Configuring Istio on Google Kubernetes Engine (GKE)

GKE_021B

Introduction

Unquestionably, Kubernetes has quickly become the leading Container-as-a-Service (CaaS) platform. In late September 2017, Rancher Labs announced the release of Rancher 2.0, based on Kubernetes. In mid-October, at DockerCon Europe 2017, Docker announced they were integrating Kubernetes into the Docker platform. In late October, Microsoft released the public preview of Managed Kubernetes for Azure Container Service (AKS). In November, Google officially renamed its Google Container Engine to Google Kubernetes Engine. Most recently, at AWS re:Invent 2017, Amazon announced its own managed version of Kubernetes, Amazon Elastic Container Service for Kubernetes (Amazon EKS).

The recent abundance of Kubernetes-based CaaS offerings makes deploying, scaling, and managing modern distributed applications increasingly easier. However, as Craig McLuckie, CEO of Heptio, recently stated, “…it doesn’t matter who is delivering Kubernetes, what matters is how it runs.” Making Kubernetes run better is the goal of a new generation of tools, such as Istio, Envoy, Project Calico, Helm, and Ambassador.

What is Istio?

One of those new tools and the subject of this post is Istio. Released in Alpha by Google, IBM and Lyft, in May 2017, Istio is an open platform to connect, manage, and secure microservices. Istio describes itself as, “…an easy way to create a network of deployed services with load balancing, service-to-service authentication, monitoring, and more, without requiring any changes in service code. You add Istio support to services by deploying a special sidecar proxy throughout your environment that intercepts all network communication between microservices, configured and managed using Istio’s control plane functionality.”

Istio contains several components, split between the data plane and a control plane. The data plane includes the Istio Proxy (an extended version of Envoy proxy). The control plane includes the Istio Mixer, Istio Pilot, and Istio-Auth. The Istio components work together to provide behavioral insights and operational control over a microservice-based service mesh. Istio describes a service mesh as a “transparently injected layer of infrastructure between a service and the network that gives operators the controls they need while freeing developers from having to bake solutions to distributed system problems into their code.”

In this post, we will deploy the latest version of Istio, v0.4.0, on Google Cloud Platform, using the latest version of Google Kubernetes Engine (GKE), 1.8.4-gke.1. Both versions were just released in mid-December, as this post is being written. Google, as you probably know, was the creator of Kubernetes, now an open-source CNCF project. Google was the first Cloud Service Provider (CSP) to offer managed Kubernetes in the Cloud, starting in 2014, with Google Container Engine (GKE). This post will outline the installation of Istio on GKE, as well as the deployment of a sample application, integrated with Istio, to demonstrate Istio’s observability features.

Getting Started

All code from this post is available on GitHub. You will need to change some variables within the code, to meet your own project’s needs (gist).

The scripts used in this post are as follows, in order of execution (gist).

Code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.

Creating GKE Cluster

First, we create the Google Kubernetes Engine (GKE) cluster. GKE cluster creation is highly configurable from either the GCP Cloud Console or from the command line, using the Google Cloud Platform gcloud CLI. The CLI will be used throughout the post. I have chosen to create a highly-available, 3-node cluster (1 node/zone) in GCP’s South Carolina us-east1 region (gist).
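
For reference, a minimal sketch of the cluster-creation command is shown below. The cluster name, project, zones, machine type, and node count are assumptions for illustration only; the actual values live in the project’s shell script.

  # create a 3-node (1 node/zone) GKE Alpha cluster in us-east1, pinned to 1.8.4-gke.1
  gcloud container clusters create voter-api-cluster \
    --project my-gcp-project \
    --zone us-east1-b \
    --additional-zones us-east1-c,us-east1-d \
    --num-nodes 1 \
    --machine-type n1-standard-1 \
    --cluster-version 1.8.4-gke.1 \
    --enable-kubernetes-alpha \
    --enable-cloud-logging \
    --enable-cloud-monitoring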

Once built, we need to retrieve the cluster’s credentials.
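
With the gcloud CLI, retrieving the credentials and configuring kubectl’s context is a single command; the cluster name, zone, and project below are the same assumed values used in the sketch above.

  gcloud container clusters get-credentials voter-api-cluster \
    --zone us-east1-b --project my-gcp-project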

Having chosen to use Kubernetes’ Alpha Clusters feature, the following warning is displayed, stating that the Alpha cluster will be deleted in 30 days (gist).

The resulting GKE cluster will have the following characteristics (gist).

Installing Istio

With the GKE cluster created, we can now deploy Istio. There are at least two options for deploying Istio on GCP. You may choose to manually install and configure Istio in a GKE cluster, as I will do in this post, following these instructions. Alternatively, you may choose to use the Istio GKE Deployment Manager. This all-in-one GCP service will create your GKE cluster, and install and configure Istio and the Istio add-ons, including their Book Info sample application.

G002_DeployCluster

There were a few reasons I chose not to use the Istio GKE Deployment Manager option. First, until very recently, you could not install the latest versions of Istio with this option (as of 12/21, you can now deploy v0.3.0 and v0.4.0). Second, you currently only have the choice of GKE version 1.7.8-gke.0; I wanted to test the latest v1.8.4 release with a stable GA version of RBAC. Third, three of my four initial attempts to use the Istio GKE Deployment Manager failed during provisioning, for unknown reasons. Lastly, you will learn more about GKE, Kubernetes, and Istio by doing it yourself, at least the first time.

Istio Code Changes

Before installing Istio, I had to make several minor code changes to my existing Kubernetes resource files. The requirements are detailed in Istio’s Pod Spec Requirements. These changes are minor, but if missed, cause errors during deployment, which can be hard to identify and resolve.

First, you need to name the ports in your Service resource files. More specifically, Istio requires the service port to be named http, as shown in the Candidate microservice’s Service resource file, below (note line 10) (gist).
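
If the gist does not render, the relevant change looks similar to the following sketch; the service name, labels, and port numbers are assumptions for illustration.

  apiVersion: v1
  kind: Service
  metadata:
    name: candidate
    labels:
      app: candidate
  spec:
    ports:
    - name: http          # Istio requires the port to be named
      port: 8080
      targetPort: 8080
    selector:
      app: candidate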

Second, an app label is required for Istio. I added an app label to each Deployment and Service resource file, as shown below in the Candidate microservice’s Deployment resource files (note lines 5 and 6) (gist).

The next set of code changes were to my existing Ingress resource file. The requirements for an Ingress resource using Istio are explained here. The first change: Istio ignores all annotations other than kubernetes.io/ingress.class: istio (note line 7, below). The next change: if using HTTPS, the secret containing your TLS/SSL certificate and private key must be named istio-ingress-certs; all other names will be ignored (note line 10, below). Related and critically important, that secret must be deployed to the istio-system namespace, not the application’s namespace. The last change: for my prefix-match routing rules, I needed to change the rules from /{service_name} to /{service_name}/.*. The /.* is a special Istio notation used to indicate a prefix match (note lines 14, 18, and 22, below) (gist).
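
Put together, an Istio-compatible Ingress looks roughly like the sketch below. The resource name, service names, and ports are assumptions for illustration; the annotation, the istio-ingress-certs secret name, and the /.* prefix notation are the Istio requirements described above.

  apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    name: voter-ingress
    annotations:
      kubernetes.io/ingress.class: istio   # the only annotation Istio honors
  spec:
    tls:
    - secretName: istio-ingress-certs      # must use this exact name, in istio-system
    rules:
    - http:
        paths:
        - path: /candidate/.*              # /.* denotes an Istio prefix match
          backend:
            serviceName: candidate
            servicePort: 8080
        - path: /voter/.*
          backend:
            serviceName: voter
            servicePort: 8080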

Installing Istio

To install Istio, you first will need to download and uncompress the correct distribution of Istio for your OS. Istio provides instructions for installation on various platforms.

My install-istio.sh script contains a variable, ISTIO_HOME, which should point to the root of your local Istio directory. We will also deploy all the current Istio add-ons, including Prometheus, Grafana, Zipkin, Service Graph, and Zipkin-to-Stackdriver (gist).
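
The heart of the script is little more than a series of kubectl apply commands. The sketch below assumes the file layout of the Istio 0.4 distribution; the exact file names may differ in your download.

  # core Istio components, with mutual TLS enabled
  kubectl apply -f $ISTIO_HOME/install/kubernetes/istio-auth.yaml

  # automatic sidecar injection (requires the Kubernetes initializers alpha feature)
  kubectl apply -f $ISTIO_HOME/install/kubernetes/istio-initializer.yaml

  # observability add-ons
  kubectl apply -f $ISTIO_HOME/install/kubernetes/addons/prometheus.yaml
  kubectl apply -f $ISTIO_HOME/install/kubernetes/addons/grafana.yaml
  kubectl apply -f $ISTIO_HOME/install/kubernetes/addons/servicegraph.yaml
  kubectl apply -f $ISTIO_HOME/install/kubernetes/addons/zipkin.yaml
  kubectl apply -f $ISTIO_HOME/install/kubernetes/addons/zipkin-to-stackdriver.yaml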

Once Istio is installed, from the GCP Cloud Console, an alternative to the native Kubernetes Dashboard, we should see the following Istio resources deployed and running. Below, note the three nodes are distributed across three zones within the GCP us-east1 region, the correct version of GKE is employed, Stackdriver logging and monitoring are enabled, and the Alpha Clusters feature is also enabled.

GKE_001

And here, we see the nodes that comprise the GKE cluster.

GKE_001_1

GKE_001_2.PNG

Below, note the four components that comprise Istio: istio-ca, istio-ingress, istio-mixer, and istio-pilot. Additionally, note the five components that comprise the Istio add-ons.

GKE_002

Below, observe the Istio Ingress has automatically been assigned a public IP address by GCP, accessible on ports 80 and 443. This IP address is how we will communicate with applications running on our GKE cluster, behind the Istio Ingress Load Balancer. Later, we will see how the Istio Ingress Load Balancer knows how to route incoming traffic to those application endpoints, using the Voter API’s Ingress configuration.

GKE_003.PNG

Istio makes ample use of Kubernetes Config Maps and Secrets, to store configuration, and to store certificates for mutual TLS.

GKE_004

Creation of the GKE cluster and deployment of Istio to the cluster are now complete. Next, I will demonstrate the deployment of the Voter API to the cluster, which will be used to demonstrate the capabilities of Istio on GKE.

Kubernetes Dashboard

In addition to the GCP Cloud Console, the native Kubernetes Dashboard is also available. To open it, run the kubectl proxy command and connect to the Kubernetes Dashboard at http://127.0.0.1:8001/ui. You should then be able to view and edit all resources from within the Kubernetes Dashboard.

GKE_005_5

Sample Application

To demonstrate the functionality of Istio and GKE, I will deploy the Voter API. I have used variations of the sample Voter API application in several previous posts, including Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB and Eventual Consistency: Decoupling Microservices with Spring AMQP and RabbitMQ. I suggest reading these two posts to better understand the Voter API’s design.

AKS

For this post, I have reconfigured the Voter API to use MongoDB’s Atlas Database-as-a-Service (DBaaS) as the NoSQL data-source for each microservice. The Voter API is connected to a MongoDB Atlas 3-node M10 instance cluster in GCP’s us-east1 (South Carolina) region. With Atlas, you have the choice of deploying clusters to GCP or AWS.

GKE_014

The Voter API will use CloudAMQP’s RabbitMQ-as-a-Service for its decoupled, eventually consistent, message-based architecture. For this post, the Voter API is configured to use a RabbitMQ cluster in GCP’s us-east1 (South Carolina) region; I chose a minimally-configured free version of RabbitMQ. CloudAMQP allows you to provision much more robust, multi-node clusters for Production, on GCP or AWS.

GKE_015_1.PNG

CloudAMQP provides access to their own Management UI, in addition to access to RabbitMQ’s Management UI.

GKE_015B

With the Voter API running and taking traffic, we can see each Voter API microservice instance, nine replicas in total, connected to RabbitMQ. They are each publishing and consuming messages off the two queues.

GKE_016

The GKE, MongoDB Atlas, and RabbitMQ clusters are all running in the same GCP Region. Optimizing the Voter API cloud architecture on GCP, within a single Region, greatly reduces network latency, increases API performance, and improves end-to-end application and infrastructure observability and traceability.

Installing the Voter API

For simplicity, I have divided the Voter API deployment into three parts. First, we create the new voter-api Kubernetes Namespace, followed by creating a series of Voter API Kubernetes Secrets (gist).
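
A minimal sketch of this first part is shown below; the secret names, keys, and file paths are assumptions for illustration. Note the TLS secret is created in the istio-system namespace, as required by the Istio Ingress (see the Ingress discussion above).

  # namespace for the Voter API
  kubectl create namespace voter-api

  # one secret per microservice MongoDB Atlas connection string (three in total)
  kubectl create secret generic candidate-mongodb-atlas \
    --from-literal=connection-string='<mongodb connection string>' \
    -n voter-api

  # RabbitMQ (CloudAMQP) connection string
  kubectl create secret generic rabbitmq-cloudamqp \
    --from-literal=connection-string='<amqps connection string>' \
    -n voter-api

  # Let's Encrypt certificate chain and private key, deployed to istio-system
  kubectl create secret tls istio-ingress-certs \
    --cert=fullchain.pem --key=privkey.pem \
    -n istio-system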

There are a total of five secrets: one for each of the three microservices’ MongoDB databases, one for the RabbitMQ connection string (shown below), and one containing a Let’s Encrypt SSL/TLS certificate chain and private key for the Voter API’s domain, api.voter-demo.com (shown below).

GKE_011

GKE_006.PNG

GKE_007.PNG

Next, we create the microservice pods, using the Kubernetes Deployment files, create three ClusterIP-type Kubernetes Services, and a Kubernetes Ingress. The Ingress contains the service endpoint configuration, which Istio Ingress will use to correctly route incoming external API traffic (gist).

Three Kubernetes Pods for each of the three microservices should be created, for a total of nine pods. In the GCP Cloud UI’s Workloads (Kubernetes Deployments), you should see the following three resources. Note each Workload has three pods, each containing one replica of the microservice.

GKE_010

In the GCP Cloud UI’s Discovery and Load Balancing tab, you should observe the following four resources. Note the Voter API Ingress endpoints for the three microservices, which are used by the Istio Proxy, discussed below.

GKE_009.PNG

Istio Proxy

Examining the Voter API deployment more closely, you will observe that each of the nine Voter API microservice pods has two containers running within it (gist).

Along with the microservice container, there is an Istio Proxy container, commonly referred to as a sidecar container. Istio Proxy is an extended version of the Envoy proxy, Lyft’s well-known, highly performant edge and service proxy. The proxy sidecar container is injected automatically when the Voter API pods are created. This is possible because we deployed the Istio Initializer (istio-initializer.yaml). The Istio Initializer guarantees that Istio Proxy will be automatically injected into every microservice Pod. This is referred to as automatic sidecar injection. Below we see an example of one of three Candidate pods running the istio-proxy sidecar.

GKE_012

In the example above, all traffic to and from the Candidate microservice now passes through the Istio Proxy sidecar. With Istio Proxy, we gain several enterprise-grade features, including enhanced observability, service discovery and load balancing, credential injection, and connection management.

Manual Sidecar Injection

What if we have application components we do not want automatically managed with Istio Proxy? In that case, manual sidecar injection might be preferable to automatic sidecar injection with Istio Initializer. For manual sidecar injection, we execute an istioctl kube-inject command for each of the Kubernetes Deployments. The command manually injects the Istio Proxy container configuration into the Deployment resource file, alongside each Voter API microservice container. On Mac and Linux, this command is similar to the following. Proxy injection is discussed in detail, here (gist).
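
If the gist does not render, a minimal sketch of manual injection for a single Deployment is shown below; the file name and namespace are assumptions for illustration.

  # inject the Istio Proxy sidecar into the Deployment spec, then apply it
  istioctl kube-inject -f candidate-deployment.yaml | kubectl apply -n voter-api -f -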

External Service Egress

Whether you choose automatic or manual sidecar injection of the Istio Proxy, Istio’s egress rules currently only support HTTP and HTTPS requests. The Voter API makes external calls to its backend services, using two alternate protocols, MongoDB Wire Protocol (mongodb://) and RabbitMQ AMQP (amqps://). Since we cannot use an Istio egress rule for either protocol, we will use the includeIPRanges option with the istioctl kube-inject command to open egress to the two backend services. This will completely bypass Istio for a specific IP range. You can read more about calling external services directly, on Istio’s website.

You will need to modify the includeIPRanges argument within the create-voter-api-part3.sh script, adding your own GKE cluster’s IP ranges to the IP_RANGES variable. The two IP ranges can be found using the following GCP CLI command (gist).
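
Assuming the same cluster name and zone used earlier, the cluster and services CIDR blocks can be retrieved similarly to the following.

  gcloud container clusters describe voter-api-cluster --zone us-east1-b \
    | grep -e clusterIpv4Cidr -e servicesIpv4Cidr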

The create-voter-api-part3.sh script also contains a modified version of the istioctl kube-inject command for each Voter API Deployment. Using the modified command, the original Deployment files are not altered; instead, a temporary copy of the Deployment file is created, into which Istio injects the required modifications. The temporary Deployment file is then used for the deployment and immediately deleted afterward (gist).
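
The pattern is similar to the sketch below; the file names are assumptions, and the IP ranges are placeholders for your own cluster’s CIDR blocks found with the gcloud command above.

  IP_RANGES="<clusterIpv4Cidr>,<servicesIpv4Cidr>"

  # write the injected manifest to a temporary file, deploy it, then remove it
  istioctl kube-inject -f candidate-deployment.yaml \
    --includeIPRanges=$IP_RANGES > candidate-deployment-injected.yaml
  kubectl apply -n voter-api -f candidate-deployment-injected.yaml
  rm candidate-deployment-injected.yaml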

Some would argue not having the actual deployed version of the file checked into source code control is an anti-pattern; in this case, I would disagree. If I need to redeploy, I would just run the istioctl kube-inject command again. You can always view, edit, and import the deployed YAML file, from the GCP CLI or GKE Management UI.

The amount of Istio configuration injected into each microservice Pod’s Deployment resource file is considerable. The Candidate Deployment file swelled from 68 lines to 276 lines of code! This hints at the power, as well as the complexity of Istio. Shown below is a snippet of the Candidate Deployment YAML, after Istio injection.

GKE_025

Confirming Voter API

Installation of the Voter API is now complete. We can validate the Voter API is working, and that traffic is being routed through Istio, using Postman. Below, we see a list of candidates successfully returned from the Voter microservice, through the Voter API. This means not only is the API running, but that messages have been successfully passed between the services, using RabbitMQ, and saved to the microservices’ corresponding MongoDB databases.

GKE_030

Below, note the server and x-envoy-upstream-service-time response headers. They both confirm the Voter API HTTPS traffic is being managed by Istio.

GKE_031.PNG

Observability

Observability is certainly one of the primary advantages of implementing Istio. For anyone like myself, who has spent many long and often frustrating hours installing, configuring, and managing monitoring systems for distributed platforms, Istio’s observability features are most welcome. Istio provides Prometheus, Grafana, Zipkin, Service Graph, and Zipkin-to-Stackdriver add-ons. Combined with the monitoring capabilities of Backend-as-a-Service providers, such as MongoDB Atlas and CloudAMQP RabbitMQ, you get considerable visibility into your application, out-of-the-box.

Prometheus
First, we will look at Prometheus, a leading open-source monitoring solution. The easiest way to access the Prometheus UI, or any of the other add-on UIs, is using port-forwarding. For example, with Prometheus, we use the following command (gist).
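
The port-forward command, taken roughly from Istio’s add-on documentation, is similar to the following; it forwards local port 9090 to the Prometheus pod running in the istio-system namespace. Prometheus should then be reachable at http://localhost:9090.

  kubectl -n istio-system port-forward \
    $(kubectl -n istio-system get pod -l app=prometheus -o jsonpath='{.items[0].metadata.name}') \
    9090:9090 &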

Alternatively, you could securely expose any of the Istio add-ons through the Istio Ingress, similar to how the Voter API microservice endpoints are exposed.

Prometheus collects time series metrics from both the Istio and Voter API components. Below we see two examples of typical metrics being collected; they include the 201 responses generated by the Candidate microservice, and the outflow of bytes by the Election microservice, over a given period of time.

GKE_022

GKE_022_1

Grafana
Although Prometheus is an excellent monitoring solution, Grafana, the leading open source software for time series analytics, provides a much easier way to visualize the metrics collected by Prometheus. Conveniently, Istio provides a dynamically-configured Grafana Dashboard, which will automatically display metrics for components deployed to GKE.

GKE_020B.PNG

Below, note the metrics collected for the Candidate and Election microservice replicas. Out-of-the-box, Grafana displays common HTTP KPIs, such as request rate, success rate, response codes, response time, and response size. Based on the version label included in the Deployment resource files, we can delineate metrics collected by the version of the Voter API microservices, in this case, v1 of the Candidate and Election microservices.

GKE_021B

Zipkin
Next, we have Zipkin, a leading distributed tracing system.

GKE_018

Since the Voter API application uses RabbitMQ to decouple communications between services, versus direct HTTP-based IPC, we won’t see any complex multi-segment traces. We will only see traces representing traffic to and from the microservices, which passes through the Istio Ingress.

GKE_019

Service Graph
Similar to Zipkin, Service Graph is not as valuable with the Voter API application as it could be with more complex applications. Below is a Service Graph view of the Voter API showing microservice version and requests/second to each microservice.

GKE_024

Stackdriver

One last tool we have to monitor our GKE cluster is Stackdriver. Stackdriver provides fine-grained monitoring, logging, and diagnostics. If you recall, we enabled Stackdriver logging and monitoring when we first provisioned the GKE cluster. Stackdriver allows us to examine the performance of the GKE cluster’s resources, review logs, and set alerts.

GKE_028

GKE_029

GKE_027

Zipkin-to-Stackdriver

When we installed Istio, we also installed the Zipkin-to-Stackdriver add-on. The Stackdriver Trace Zipkin Collector is a drop-in replacement for the standard Zipkin HTTP collector that writes to Google’s free Stackdriver Trace distributed tracing service. To use Stackdriver for traces originating from Zipkin, there is additional configuration required, which is commented out of the current version of the zipkin-to-stackdriver.yaml file (gist).

Instructions to configure the Zipkin-to-Stackdriver feature can be found here. Below is an example of how you might add the necessary configuration using a Kubernetes ConfigMap to inject the required user credentials JSON file (zipkin-to-stackdriver-creds.json) into the zipkin-to-stackdriver container. The new configuration can be seen on lines 27-44 (gist).

Conclusion

Istio provides a significant amount of fine-grained management control to Kubernetes. Managed Kubernetes CaaS offerings like GKE, coupled with tools like Istio, will soon make running reliable and secure containerized applications in Production, commonplace.


All opinions in this post are my own, and not necessarily the views of my current or past employers or their clients.


Docker Enterprise Edition: Multi-Environment, Single Control Plane Architecture for AWS

Final_DockerEE_21 (1)

Designing a successful, cloud-based containerized application platform requires a balance of performance and security with cost, reliability, and manageability. Ensuring that a platform meets all functional and non-functional requirements, while remaining within budget and easily maintainable, can be challenging.

As Cloud Architect and DevOps Team Lead, I recently participated in the development of two architecturally similar, lightweight, cloud-based containerized application platforms. From the start, both platforms were architected to maximize security and performance, while minimizing cost and operational complexity. The latter platform was built on AWS with Docker Enterprise Edition.

Docker Enterprise Edition

Released in March of this year, Docker Enterprise Edition (Docker EE) is a secure, full-featured container-based management platform. There are currently eight versions of Docker EE, available for Windows Server, Azure, AWS, and multiple Linux distros, including RHEL, CentOS, Ubuntu, SUSE, and Oracle.

Docker EE is one of several production-grade container orchestration Platforms as a Service (PaaS). Other container platforms in this category include Docker Community Edition (CE) with Swarm, Kubernetes, Apache Mesos and Mesosphere DC/OS, Rancher, and Red Hat OpenShift.

Docker Community Edition (CE), Kubernetes, and Apache Mesos are free and open-source. Some providers, such as Rancher Labs, offer enterprise support for an additional fee. Cloud-based services, such as Red Hat OpenShift Online, AWS, GCE, and ACS, charge typical monthly usage fees. Docker EE, similar to Mesosphere Enterprise DC/OS and Red Hat OpenShift, is priced on a per-node, per-year subscription model.

Docker EE is currently offered in three subscription tiers: Basic, Standard, and Advanced. Additionally, Docker offers Business Day and Business Critical support. Docker EE’s Advanced tier adds several significant features, including secure multi-tenancy with node-based isolation, as well as image security scanning and continuous vulnerability scanning as part of Docker Trusted Registry.

Architecting for Affordability and Maintainability

An enterprise-scale application platform, built on public cloud infrastructure, such as AWS, and a licensed Containers-as-a-Service (CaaS) platform, such as Docker EE, can quickly become complex and costly to build and maintain. Based on current list pricing, the cost of a single Linux node ranges from USD 75 per month for basic support, up to USD 300 per month for Docker Enterprise Edition Advanced with Business Critical support. Although cost is relative to the value generated by the application platform, nonetheless, architects should always strive to avoid unnecessary complexity and cost.

Recurring operational costs, such as licensed software subscriptions, support contracts, and monthly cloud-infrastructure charges, are often overlooked by project teams during the build phase. Accurately forecasting the recurring costs of a fully functional Production platform, under expected normal load, is essential. Teams often overlook how Docker image registries, databases, data lakes, and data warehouses quickly swell, inflating the monthly cloud-infrastructure charges to maintain the platform. The need to control cloud costs has led to the growth of third-party cloud management solutions, such as CloudCheckr Cloud Management Platform (CMP).

Shared Docker Environment Model

Most software development projects require multiple environments in which to continuously develop, test, demonstrate, stage, and release code. Creating separate environments, replete with their own Docker EE Universal Control Plane (aka Control Plane or UCP), Docker Trusted Registry (DTR), AWS infrastructure, and third-party components, would guarantee a high-level of isolation and performance. However, replicating all elements in each environment would add considerable build and run costs, as well as unnecessary complexity.

On both recent projects, we chose to create a single AWS Virtual Private Cloud (VPC), which contained all of the non-production environments required by our project teams. In parallel, we built an entirely separate Production VPC for the Production environment. I’ve seen this same pattern repeated with Red Hat OpenStack and Microsoft Azure.

Production

Isolating Production from the lower environments is essential to ensure security, and to eliminate non-production traffic from impacting the performance of Production. Corporate compliance and regulatory policies often dictate complete Production isolation. Having separate infrastructure, security appliances, role-based access controls (RBAC), configuration and secret management, and encryption keys and SSL certificates, are all required.

For complete separation of Production, different AWS accounts are frequently used. Separate AWS accounts provide separate billing, usage reporting, and AWS Identity and Access Management (IAM), amongst other advantages.

Performance and Staging

Unlike Production, there are few reasons to completely isolate lower-environments from one another. The exception I’ve encountered is Performance and Staging. These two environments are frequently separated from other environments to ensure the accuracy of performance testing and release staging activities. Performance testing, in particular, can generate enormous load on systems, which if not isolated, will impair adjacent environments, applications, and monitoring systems.

On a few recent projects, to reduce cost and complexity, we repurposed the UAT environment for performance testing, once user-acceptance testing was complete. Performance testing was conducted during off-peak development and testing periods, with access to adjacent environments blocked.

The multi-purpose UAT environment further served as a Staging environment. Applications were deployed and released to the UAT and Performance environments, following a nearly-identical process used for Production. Hotfixes to Production were also tested in this environment.

Example of Shared Environments

To demonstrate how to architect a shared non-production Docker EE environment, which minimizes cost and complexity, let’s examine the example shown below. In the example, built on AWS with Docker EE, there are four typical non-production environments, CI/CD, Development, Test, and UAT, and one Production environment.

Docker_EE_AWS_Diagram_01

In the example, there are two separate VPCs, the Production VPC, and the Non-Production VPC. There is no reason to configure VPC Peering between the two VPCs, as there is no need for direct communication between the two. Within the Non-Production VPC, to the left in the diagram, there is a cluster of three Docker EE UCP Manager EC2 nodes, a cluster of three DTR Worker EC2 nodes, and the four environments, consisting of varying numbers of EC2 Worker nodes. Production, to the right of the diagram, has its own cluster of three UCP Manager EC2 nodes and a cluster of six EC2 Worker nodes.

Single Non-Production UCP

As a primary means of reducing cost and complexity, in the example, a single minimally-sized Docker EE UCP cluster of three Manager nodes orchestrates activities across all four non-production environments. Alternatively, you would have to create a separate UCP cluster for each environment; that means nine more Manager nodes to configure and maintain.

The UCP users, teams, organizations, access controls, Docker Secrets, overlay networks, and other UCP features, for all non-production environments, are managed through the single Control Plane. All deployments to all the non-production environments, from the CI/CD server, are performed through the single Control Plane. Each UCP Manager node is deployed to a different AWS Availability Zone (AZ) to ensure high-availability.

Shared DTR

As another means of reducing cost and complexity, in the example, a Docker EE DTR cluster of three Worker nodes contains all Docker image repositories. Both the non-production and the Production environments use this DTR as a secure source of all Docker images. Not having to replicate image repositories, access controls, and infrastructure, or figure out how to migrate images between two separate DTR clusters, is a significant time, cost, and complexity savings. Each DTR Worker node is also deployed to a different AZ to ensure high-availability.

Whether to use a shared DTR between non-production and Production is an important decision your project team needs to consider. A single DTR, shared between non-production and Production, comes with inherent availability and security risks, which should be understood in advance.

Separate Non-Production Worker Nodes

In the shared non-production environments example, each environment has dedicated AWS EC2 instances configured as Docker EE Worker nodes. The number of Worker nodes is determined by the requirements for each environment, as dictated by the project’s Development, Testing, Security, and DevOps teams. Like the UCP and DTR clusters, each Worker node, within an individual environment, is deployed to a different AZ to ensure high-availability and mimic the Production architecture.

Minimizing the number of Worker nodes in each environment, as well as the type and size of each EC2 node, offers a significant potential cost and administrative savings.

Separate Environment Ingress

In the example, the UCP, DTR, and each of the four environments are accessed through separate URLs, using AWS Hosted Zone CNAME records (subdomains). Encrypted HTTPS traffic is routed through a series of security appliances, depending on traffic type, to individual private AWS Elastic Load Balancers (ELB), one for both UCPs, the DTR, and each of the environments. Each ELB load-balances traffic to the Docker EE nodes associated with the specific traffic. All firewalls, ELBs, and the UCP and DTR are secured with a high-grade wildcard SSL certificate.

AWS_ELB

Separate Data Sources

In the shared non-production environments example, there is one Amazon Relational Database Service (RDS) instance in non-Production and one in Production. Both RDS instances are replicated across multiple Availability Zones. Within the single shared non-production RDS instance, there are four separate databases, one per non-production environment. This architecture sacrifices the potential database performance of separate RDS instances to reduce cost and complexity.

Maintaining Environment Separation

Node Labels

To obtain sufficient environment separation while using a single UCP, each Docker EE Worker node is tagged with an environment node label. The node label indicates which environment the Worker node is associated with. For example, in the screenshot below, a Worker node is assigned to the Development environment by tagging it with the key of environment and the value of dev.
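
Labels can be applied from the UCP UI or from the Docker CLI. A sketch of the CLI equivalent is shown below, with a hypothetical node name.

  # tag a Worker node as belonging to the Development environment
  docker node update --label-add environment=dev worker-node-dev-1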

Node_Label

* The Docker EE screens shown here are from UCP 2.1.5, not the recently released 2.2.x, which has an updated UI appearance.

Each service’s Docker Compose file uses deployment placement constraints, which indicate where Docker should or should not deploy services. In the hello-world Docker Compose file example below, the node.labels.environment constraint is set to the ENVIRONMENT variable, which is set during container deployment by the CI/CD server. This constraint directs Docker to only deploy the hello-world service to nodes which contain the placement constraint of node.labels.environment, whose value matches the ENVIRONMENT variable value.
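
If the gist is unavailable, the relevant portion of the hello-world Compose file looks similar to the sketch below; the image name and replica count are assumptions for illustration.

  version: '3.2'

  services:
    hello-world:
      image: hello-world:latest
      deploy:
        replicas: 2
        placement:
          constraints:
            # only schedule onto Worker nodes labeled for the target environment
            - node.labels.environment == ${ENVIRONMENT}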

Deploying from CI/CD Server

The ENVIRONMENT value is set as an environment variable, which is then used by the CI/CD server, running a docker stack deploy or a docker service update command, within a deployment pipeline. Below is an example of how to use the environment variable as part of a Jenkins pipeline as code Jenkinsfile.
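
A minimal, declarative-pipeline sketch is shown below; the stack name, Compose file, and the way ENVIRONMENT is derived (hard-coded here) are assumptions for illustration.

  pipeline {
    agent any
    environment {
      ENVIRONMENT = 'dev'   // typically derived from the branch or a build parameter
    }
    stages {
      stage('Deploy') {
        steps {
          // ENVIRONMENT is substituted into the Compose file's placement constraint
          sh 'docker stack deploy --compose-file docker-compose.yml hello-world'
        }
      }
    }
  }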

Centralized Logging and Metrics Collection

Centralized logging and metrics collection systems are used for application and infrastructure dashboards, monitoring, and alerting. In the shared non-production environment examples, the centralized logging and metrics collection systems are internal to each VPC, but reside on separate EC2 instances and are not registered with the Control Plane. In this way, the logging and metrics collection systems should not impact the reliability, performance, and security of the applications running within Docker EE. In the example, Worker nodes run a containerized copy of fluentd, which collects and pushes logs to ELK’s Elasticsearch.

Logging and metrics collection systems could also be supplied by external cloud-based SaaS providers, such as Loggly, Sysdig, and Datadog, or by the platform’s cloud-provider, such as Amazon CloudWatch.

With four environments running multiple containerized copies of each service, figuring out which log entry came from which service instance, requires multiple data points. As shown in the example Kibana UI below, the environment value, along with the service name and container ID, as well as the git commit hash and branch, are added to each log entry for easier troubleshooting. To include the environment, the value of the ENVIRONMENT variable is passed to Docker’s fluentd log driver as an env option. This same labeling method is used to tag metrics.
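
In a Compose file, passing the environment value to the fluentd log driver looks similar to the following sketch; the fluentd address and tag format are assumptions.

  services:
    hello-world:
      logging:
        driver: fluentd
        options:
          fluentd-address: localhost:24224
          tag: "{{.Name}}.{{.ID}}"   # service name and container ID
          env: ENVIRONMENT           # adds the ENVIRONMENT variable to each log entry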

ELK

Separate Docker Service Stacks

For further environment separation within the single Control Plane, the services belonging to an environment are deployed as part of their own Docker service stack. Each service stack contains all services that comprise an application running within a single environment. Multiple stacks may be required to support multiple, distinct applications within the same environment.

For example, in the screenshot below, a hello-world service container, built with a Docker image, tagged with build 59 of the Jenkins continuous integration pipeline, is deployed as part of both the Development (dev) and Test service stacks. The CD and UAT service stacks each contain different versions of the hello-world service.

Hello-World-UCP

Separate Docker Overlay Networks

For additional environment separation within the single non-production UCP, all Docker service stacks associated with an environment, reside on the same Docker overlay network. Overlay networks manage communications among the Docker Worker nodes, enabling service-to-service communication for all services on the same overlay network while isolating services running on one network from services running on another network.

In the example screenshot below, the hello-world service, a member of the test service stack, is running on the test_default overlay network.

Network

Cleaning Up

Having distinct environment-centric Docker service stacks and overlay networks makes it easy to clean up an environment, without impacting adjacent environments. Both service stacks and overlay networks can be removed to clear an environment’s contents.

Separate Performance Environment

In the alternative example below, a Performance environment has been added to the Non-Production VPC. To ensure a higher level of isolation, the Performance environment has its own UCP, RDS, and ELBs. The Performance environment shares the DTR, as well as the security, logging, and monitoring components, with the rest of the non-production environments.

Below, the Performance environment has half the number of Worker nodes as Production. Performance results can be scaled for expected Production performance, given more nodes. Alternately, the number of nodes can be scaled up temporarily to match Production, then scaled back down to a minimum after testing is complete.

Docker_EE_AWS_Diagram_02

Shared DevOps Tooling

All environments leverage shared Development and DevOps resources, deployed to a separate VPC. Resources include Agile Application Lifecycle Management (ALM), such as JIRA or CA Agile Central, source control repository management (SCM), such as GitLab or Bitbucket, binary repository management, such as Artifactory or Nexus, and a CI/CD solution, such as Jenkins, TeamCity, or Bamboo.

From the DevOps VPC, Docker images are pushed and pulled from the DTR in the Non-Production VPC. Deployments of container-based applications are executed from the DevOps VPC CI/CD server to the non-production, Performance, and Production UCPs. Separate DevOps CI/CD pipelines and access controls are essential in maintaining the separation of the non-production and Production environments.

Docker_EE_AWS_Diagram_03

Complete Platform

Several common components found in a Docker EE cloud-based AWS platform were discussed in the post. However, a complete AWS application platform has many more moving parts. Below is a comprehensive list of components, including DevOps tooling, organized into two categories: 1) common components that can be potentially shared across the non-production environments to save cost and complexity, and 2) components that should be replicated in each non-production environment for security and performance.

Shared Non-Production Components:

  • AWS
    • Virtual Private Cloud (VPC), Region, Availability Zones
    • Route Tables, Network ACLs, Internet Gateways
    • Subnets
    • Some Security Groups
    • IAM Groups, Users, Roles, Policies (RBAC)
    • Relational Database Service (RDS)
    • ElastiCache
    • API Gateway, Lambdas
    • S3 Buckets
    • Bastion Servers, NAT Gateways
    • Route 53 Hosted Zone (Registered Domain)
    • EC2 Key Pairs
    • Hardened Linux AMI
  • Docker EE
    • UCP and EC2 Manager Nodes
    • DTR and EC2 Worker Nodes
    • UCP and DTR Users, Teams, Organizations
    • DTR Image Repositories
    • Secret Management
  • Third-Party Components/Products
    • SSL Certificates
    • Security Components: Firewalls, Virus Scanning, VPN Servers
    • Container Security
    • End-User IAM
    • Directory Service
    • Log Aggregation
    • Metric Collection
    • Monitoring, Alerting
    • Configuration and Secret Management
  • DevOps
    • CI/CD Pipelines as Code
    • Infrastructure as Code
    • Source Code Repositories
    • Binary Artifact Repositories

Isolated Non-Production Components:

  • AWS
    • Route 53 Hosted Zones and Associated Records
    • Elastic Load Balancers (ELB)
    • Elastic Compute Cloud (EC2) Worker Nodes
    • Elastic IPs
    • ELB and EC2 Security Groups
    • RDS Databases (Single RDS Instance with Separate Databases)

All opinions in this post are my own and not necessarily the views of my current employer or their clients.


Decoupling Microservices using Message-based RPC IPC, with Spring, RabbitMQ, and AMQP

RabbitMQ_Screen_3

Introduction

There has been a considerable growth in modern, highly scalable, distributed application platforms, built around fine-grained RESTful microservices. Microservices generally use lightweight protocols to communicate with each other, such as HTTP, TCP, UDP, WebSockets, MQTT, and AMQP. Microservices commonly communicate with each other directly using REST-based HTTP, or indirectly, using messaging brokers.

There are several well-known, production-tested messaging queues, such as Apache Kafka, Apache ActiveMQ, Amazon Simple Queue Service (SQS), and Pivotal’s RabbitMQ. According to Pivotal, of these messaging brokers, RabbitMQ is the most widely deployed open source message broker.

RabbitMQ supports multiple messaging protocols. RabbitMQ’s primary protocol, the Advanced Message Queuing Protocol (AMQP), is an open standard wire-level protocol and semantic framework for high-performance enterprise messaging. According to Spring, ‘AMQP has exchanges, routes, and queues. Messages are first published to exchanges. Routes define on which queue(s) to pipe the message. Consumers subscribing to that queue then receive a copy of the message.’

Pivotal’s Spring AMQP project applies core Spring concepts to the development of AMQP-based messaging solutions. The project’s libraries facilitate management of AMQP resources while promoting the use of dependency injection and declarative configuration. The project provides a ‘template’ (RabbitTemplate) as a high-level abstraction for sending and receiving messages.

In this post, we will explore how to start moving Spring Boot Java services away from using synchronous REST HTTP for inter-process communications (IPC), and toward message-based IPC. Moving from synchronous IPC to messaging queues and asynchronous IPC decouples services from one another, allowing us to more easily build, test, and release individual microservices.

Message-Based RPC IPC

Decoupling services using asynchronous IPC is considered optimal by many enterprise software architects when developing modern distributed platforms. However, sometimes it is not easy or possible to get away from synchronous communications. Rightly or wrongly, oftentimes services are architected such that one service needs to retrieve data from another service or services in order to process its own requests. It can be said that such a service has a direct dependency on the other services. Many would argue that services, especially RESTful microservices, should not be coupled in this way.

There are several ways to break direct service-to-service dependencies using asynchronous IPC. We might implement request/async response REST HTTP-based IPC. We could also use publish/subscribe or publish/async response messaging queue-based IPC. These are all described by NGINX, in their article, Building Microservices: Inter-Process Communication in a Microservices Architecture; a must-read for anyone working with microservices. We might also implement an architecture which supports eventual consistency, eliminating the need for one service to obtain data from another service.

So what if we cannot implement asynchronous methods to break direct service dependencies, but we want to move toward message-based IPC? One answer is message-based Remote Procedure Call (RPC) IPC. I realize the mention of RPC might send cold shivers down the spine of many seasoned architects. Traditional RPC has several challenges, many of which have been overcome with more modern architectural patterns.

According to Wikipedia, ‘in distributed computing, a remote procedure call (RPC) is when a computer program causes a procedure (subroutine) to execute in another address space (commonly on another computer on a shared network), which is coded as if it were a normal (local) procedure call, without the programmer explicitly coding the details for the remote interaction.’

Although still a form of RPC, and not asynchronous, it is possible to replace REST HTTP IPC with message-based RPC IPC. Using message-based RPC, services have no direct dependencies on other services. A service only depends on a response to a message request it places on a queue. The services are now decoupled from one another. The requestor service (the client) has no direct knowledge of the respondent service (the server).

RPC with RabbitMQ and AMQP

RabbitMQ has an excellent set of six tutorials, which cover the basics of creating messaging applications, applying different architectural patterns, using RabbitMQ, in several different programming languages. The sixth and final tutorial covers using RabbitMQ for RPC-based IPC, with the request/reply architectural pattern.

Pivotal recently added Spring AMQP implementations to each RabbitMQ tutorial, based on their Spring AMQP project. If you recall, the Spring AMQP project applies core Spring concepts to the development of AMQP-based messaging solutions.

This post’s RPC IPC example is closely based on the architectural pattern found in the Spring AMQP RabbitMQ tutorial.

Sample Code

To demonstrate Spring AMQP-based RPC IPC messaging with RabbitMQ, we will use a pair of simple Spring Boot microservices. These services, the Voter and Candidate services, have been used in several previous posts, and for training and testing DevOps engineers. Both services are backed by MongoDB. The services and MongoDB, along with RabbitMQ, are all part of the Voter API project. The Voter API project also contains an HAProxy-based API Gateway, which provides indirect, load-balanced access to the two services.

All code necessary to build this post’s example is available on GitHub, within three projects. The Voter Service project repository contains the Voter service source code, along with the scripts and Docker Compose files required to deploy the project. The Candidate Service project repository and the Voter API Gateway project repository are also available on GitHub. For this post, you need only clone the Voter Service project repository.

Deploying Voter API

All components, including the two Spring services, MongoDB, RabbitMQ, and the API Gateway, are individually deployed using Docker. Each component is publicly available as a Docker Image, on Docker Hub.

The Voter Service repository contains scripts to deploy the entire set of Dockerized components, locally. The repository also contains optional scripts to provision a Docker Swarm, using Docker’s newer swarm mode, and deploy the components. We will only deploy the services locally for this post.

To clone and deploy the components locally, including the two Spring services, MongoDB, RabbitMQ, and the API Gateway, execute the following commands. If this is your first time running the commands, it may take a few minutes for your system to download all the required Docker Images from Docker Hub.

If everything was deployed successfully, you should see the following output. You should observe five running Docker containers.

Using Voter API

The Voter Service and Candidate Service GitHub repositories both contain README files, which detail all the API endpoints each service exposes, and how to call them.

In addition to casting votes for candidates, the Voter service has the ability to simulate election results. By calling a /simulation endpoint, and indicating the desired election, the Voter service will randomly generate a number of votes for each candidate in that election. This will save us the burden of casting votes for this demonstration. However, the Voter service has no knowledge of elections or candidates. To obtain a list of candidates, the Voter service depends on the Candidate service.

The Candidate service manages electoral candidates, their political affiliation, and the election in which they are running. Like the Voter service, the Candidate service also has a /simulation endpoint. The service will create a list of candidates based on the 2012 and 2016 US Presidential Elections. The simulation capability of the service saves us the burden of inputting candidates for this demonstration.

REST HTTP Endpoint

The Voter service exposes two almost identical endpoints. Both endpoints generate random votes. However, below the covers, the two endpoints are very different. Calling the /voter/simulation/http/{election} endpoint, prompts the Voter service to request a list of candidates from the Candidate service, based on the election parameter you input. This request is done using synchronous REST HTTP. The Voter service uses the HTTP GET method to request the data from the Candidate service. The Voter service then waits for a response.

The HTTP request is received by the Candidate service. The Candidate service responds to the Voter service with a list of candidates, in JSON format. The Voter service receives the response containing the list of candidates. The Voter service then proceeds to generate a random number of votes for each candidate. Finally, each new vote object (MongoDB document) is written back to the vote collection in the Voter service’s voters database.

Message Queue Diagram 1D

Message-based RPC Endpoint

Similarly, calling the /voter/simulation/rpc/{election} endpoint with a specific election prompts the Voter service to request the same list of candidates. However, this time, the Voter service (the client) produces a request message and places it in RabbitMQ’s voter.rpc.requests queue. The Voter service then waits for a response. The Voter service has no direct dependency on the Candidate service; it only depends on a response to its message request. In this way, it is still a form of synchronous IPC, but the Voter service is now decoupled from the Candidate service.

The request message is consumed by the Candidate service (the server), who is listening to that queue. In response, the Candidate service produces a message containing the list of candidates, serialized to JSON. The Candidate service (the server) sends a response back to the Voter service (the client), through RabbitMQ. This is done using the Direct reply-to feature of RabbitMQ or using a unique response queue, specified in the reply-to header of the request message, sent by the Voter Service.

According to RabbitMQ, ‘the direct reply-to feature allows RPC clients to receive replies directly from their RPC server, without going through a reply queue. (“Directly” here still means going through AMQP and the RabbitMQ server; there is no separate network connection between RPC client and RPC server.)’

According to Spring, ‘starting with version 3.4.0, the RabbitMQ server now supports Direct reply-to; this eliminates the main reason for a fixed reply queue (to avoid the need to create a temporary queue for each request). Starting with Spring AMQP version 1.4.1 Direct reply-to will be used by default (if supported by the server) instead of creating temporary reply queues. When no replyQueue is provided (or it is set with the name amq.rabbitmq.reply-to), the RabbitTemplate will automatically detect whether Direct reply-to is supported and use it, or fall back to using a temporary reply queue. When using Direct reply-to, a reply-listener is not required and should not be configured.’ We are using the latest versions of both RabbitMQ and Spring AMQP, which should support Direct reply-to.

The Voter service receives the message containing the list of candidates. The Voter service deserializes the JSON payload to Candidate objects and proceeds to generate a random number of votes for each candidate in the list. Finally, each new vote object (MongoDB document) is written back to the vote collection in the Voter service’s voters database.

Message Queue Diagram 2D

Exploring the RPC Code

We will not examine the REST HTTP IPC code in this post. Instead, we will explore the RPC code. You are welcome to download the source code and explore the REST HTTP code pattern; it uses some advanced features of Spring Boot and Spring Data.

Spring Dependencies

In order to use RabbitMQ, we need to add a project dependency on org.springframework.boot.spring-boot-starter-amqp. Below is a snippet from the Candidate service’s build.gradle file, showing project dependencies. The Voter service’s dependencies are identical.
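
If the gist does not render, the dependency itself is a single line in build.gradle; the other project dependencies are omitted here for brevity.

  dependencies {
    compile('org.springframework.boot:spring-boot-starter-amqp')
  }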

AMQP Configuration

Next, we need to add a small amount of RabbitMQ AMQP configuration to both services. We accomplish this by using Spring’s @Configuration annotation on our configuration classes. Below is the configuration class for the Voter service.

And here, the configuration class for the Candidate service.
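
As a reference, here is a minimal sketch of what the Candidate service’s configuration class might contain, based on the exchange (voter.rpc), queue (voter.rpc.requests), and routing key (rpc) discussed later in this post; the class name is an assumption.

  import org.springframework.amqp.core.Binding;
  import org.springframework.amqp.core.BindingBuilder;
  import org.springframework.amqp.core.DirectExchange;
  import org.springframework.amqp.core.Queue;
  import org.springframework.context.annotation.Bean;
  import org.springframework.context.annotation.Configuration;

  @Configuration
  public class CandidateAmqpConfig {

      // queue the Candidate service listens to for RPC requests
      @Bean
      public Queue queue() {
          return new Queue("voter.rpc.requests");
      }

      // direct exchange shared by the Voter and Candidate services
      @Bean
      public DirectExchange exchange() {
          return new DirectExchange("voter.rpc");
      }

      // binds the queue to the exchange with the 'rpc' routing key
      @Bean
      public Binding binding(DirectExchange exchange, Queue queue) {
          return BindingBuilder.bind(queue).to(exchange).with("rpc");
      }
  }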

Voter Service Code

With the dependencies and configuration in place, we define the method in the Voter service, which will request the candidates from the Candidate service, using RabbitMQ. Below is an abridged version of the Voter service’s CandidateListService class, containing the getCandidatesMessageRpc method. This method calls the rabbitTemplate.convertSendAndReceive method (see line 5, below).
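
A minimal sketch of that call is shown below. The exchange and routing key values match the configuration described above; the surrounding class structure and the handling of the returned JSON are assumptions, abridged for illustration.

  import org.springframework.amqp.rabbit.core.RabbitTemplate;
  import org.springframework.beans.factory.annotation.Autowired;
  import org.springframework.stereotype.Service;

  @Service
  public class CandidateListService {

      private final RabbitTemplate rabbitTemplate;

      @Autowired
      public CandidateListService(RabbitTemplate rabbitTemplate) {
          this.rabbitTemplate = rabbitTemplate;
      }

      // Publishes the election name to the 'voter.rpc' exchange with routing key
      // 'rpc', then blocks until the Candidate service replies (via RabbitMQ's
      // Direct reply-to, when supported by the broker). The reply is the JSON
      // list of candidates, which the real service deserializes to objects.
      public String getCandidatesMessageRpc(String election) {
          return (String) rabbitTemplate.convertSendAndReceive("voter.rpc", "rpc", election);
      }
  }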

Candidate Service Code

Next, we define a method in the Candidate service, which will process the Voter service’s request. Below is an abridged version of the CandidateController class, containing the getCandidatesMessageRpc method. This method is decorated with Spring’s @RabbitListener annotation (see line 1, below). This annotation marks the method as the target of a Rabbit message listener on the voter.rpc.requests queue.

Also shown, are the getCandidatesMessageRpc method’s two helper methods, getByElection and serializeToJson. These methods query MongoDB for the list of candidates and serialize the list to JSON.
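
A simplified sketch of the listener side follows; the class name and the trivial helper bodies below are placeholders, since the real methods query MongoDB and serialize the results with Jackson.

  import java.util.Collections;
  import java.util.List;
  import org.springframework.amqp.rabbit.annotation.RabbitListener;
  import org.springframework.stereotype.Component;

  @Component
  public class CandidateRpcListener {

      // Receives the election name from the voter.rpc.requests queue; the return
      // value is automatically sent back to the requester's reply-to address.
      @RabbitListener(queues = "voter.rpc.requests")
      public String getCandidatesMessageRpc(String election) {
          return serializeToJson(getByElection(election));
      }

      // placeholder: the real method queries MongoDB for candidates in the election
      private List<String> getByElection(String election) {
          return Collections.emptyList();
      }

      // placeholder: the real method serializes the candidate list with Jackson
      private String serializeToJson(List<String> candidates) {
          return "[]";
      }
  }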

Demonstration

To demonstrate both the synchronous REST HTTP IPC code and the Spring AMQP-based RPC IPC code, we will make a few REST HTTP calls to the Voter API Gateway. For convenience, I have provided a shell script, demostrate_ipc.sh, which executes all the API calls necessary. I have added sleep commands to slow the output to the terminal down a bit, for easier analysis. The script requires HTTPie, a great time saver when working with RESTful services.

The demostrate_ipc.sh script does three things. First, it calls the Candidate service to generate a group of sample candidates. Next, the script calls the Voter service to simulate votes, using synchronous REST HTTP. Lastly, the script repeats the voter simulation, this time using message-based RPC IPC. All API calls are done through the Voter API Gateway on port 8080. To understand the API calls, examine the script, below.

Below is the list of candidates for the 2016 Presidential Election, generated by the Candidate service. The JSON payload was retrieved using the Voter service’s /voter/candidates/rpc/{election} endpoint. This endpoint uses the same RPC IPC method as the Voter service’s /voter/simulation/rpc/{election} endpoint.

Based on the list of candidates, below are the simulated election results. This JSON payload was retrieved using the Voter service’s /voter/results endpoint.

RabbitMQ Management Console

The easiest way to observe what is happening with our messages is using the RabbitMQ Management Console. To access the console, point your web-browser to localhost, on port 15672. The default login credentials for the console are guest/guest.

As you successfully send and receive messages between the services through RabbitMQ, you should see activity on the Overview tab. In addition, you should see a number of Connections, Channels, Exchanges, Queues, and Consumers.

RabbitMQ_Screen_3

In the Queues tab, you should find a single queue, the voter.rpc.requests queue. This queue was configured in the Candidate service’s configuration class, shown previously.

RabbitMQ_Screen_2

In the Exchanges tab, you should see one exchange, voter.rpc, which we configured in both the Voter and the Candidate service’s configuration classes (aka DirectExchange). Also, visible in the Exchanges tab, should be the routing key rpc, which we configured in the Candidate service’s configuration class (aka Binding).

The route binds the exchange to the voter.rpc.requests queue. If you recall Spring’s description, AMQP has exchanges (DirectExchange), routes (Binding), and queues (Queue). Messages are first published to exchanges. Routes define on which queue(s) to pipe the message. Consumers subscribing to that queue then receive a copy of the message.

RabbitMQ_Screen_1

In the Channels tab, you should note two connections, the single instances of the Voter and Candidate services. Likewise, there are two channels, one for each service. You can differentiate the channels by the presence of the consumer tag. The consumer tag, in this example, amq.ctag-Anv7GXs7ZWVoznO64euyjQ, uniquely identifies the consumer. In this example, the Voter service is the consumer. For a more complete explanation of the consumer tag, check out RabbitMQ’s AMQP documentation.

RabbitMQ_Screen_4.png

Message Structure

Messages cannot be viewed directly in the RabbitMQ Management Console. One way I have found to view messages is using your IDE’s debugger. Below, I have added a breakpoint on the Candidate service’s getCandidatesMessageRpc method, using IntelliJ IDEA. You can view the Voter service’s request message, as it is received by the Candidate service.

Debug_RPC_Message.png

Note the message payload, the requested election. Note the twelve message header elements. The headers include the AMQP exchange, queue, and binding. The message headers also include the consumer tag. The message also uniquely identifies the reply-to queue to use, if the server does not support Direct reply-to (see earlier explanation).

Service Logs

In addition to the RabbitMQ Management Console, we may observe communications between the two services by looking at the Voter and Candidate services’ logs. I have grabbed a snippet of both services’ logs and added a few comments to show where different processes are being executed. First, the Voter service logs.

Next, the Candidate service logs.

Performance

What about the performance of Spring AMQP RPC IPC versus REST HTTP IPC? RabbitMQ has proven to be very performant, having been clocked at one million messages per second on GCE. I performed a series of fairly ‘unscientific’ performance tests, completing 250, 500, and then 1,000 requests. The tests were performed on a six-node Docker Swarm cluster with three instances of each service in a round-robin load-balanced configuration, and a single instance of RabbitMQ. The scripts to create the swarm cluster can be found in the Voter service GitHub project.

Based on consistent test results, the speed of the two methods was almost identical. Both methods performed between 3.1 to 3.2 responses per second. For example, the Spring AMQP RPC IPC method successfully completed 1,000 requests in 5 minutes and 11 seconds, while the REST HTTP IPC method successfully completed 1,000 requests in 5 minutes and 18 seconds, 7 seconds slower than the RPC method.

RabbitMQ on Docker Swarm

There are many variables to consider, which could dramatically impact IPC performance. For example, RabbitMQ was not clustered. Also, we did not use any type of caching, such as Varnish, Memcached, or Redis. Both these could dramatically increase IPC performance.

There are also several notable differences between the two methods from a code perspective. The REST HTTP method relies on Spring Data Projection combined with Spring Data MongoDB Repository, to obtain the candidate list from MongoDB. Somewhat differently, the RPC method makes use of Spring Data MongoDB Aggregation to return a list of candidates. Therefore, the test results should be taken with a grain of salt.

Production Considerations

The post demonstrated a simple example of RPC communications between two services using Spring AMQP. In an actual Production environment, there are a few things that must be considered, as Pivotal points out:

  • How should either service react on startup if RabbitMQ is not available? What if RabbitMQ fails after the services have started?
  • How should the Voter service (the client) react if there are no Candidate service instances (the server) running?
  • Should the Voter service have a timeout for the RPC response to return? What should happen if the request times out?
  • If the Candidate service malfunctions and raises an exception, should it be forwarded to the Voter service?
  • How does the Voter service protect against invalid incoming messages (e.g., checking the bounds of the candidate list) before processing?
  • In all of the above scenarios, what, if any, response is returned to the API end user?

Conclusion

Although in this post we did not achieve asynchronous inter-process communications, we did achieve a higher level of service decoupling, using message-based RPC IPC. Adopting a message-based, loosely-coupled architecture, whether asynchronous or synchronous, wherever possible, will improve the overall functionality and deliverability of a microservices-based platform.


All opinions in this post are my own and not necessarily the views of my current employer or their clients.
