Designing a successful, cloud-based containerized application platform requires a balance of performance and security with cost, reliability, and manageability. Ensuring that a platform meets all functional and non-functional requirements, while remaining within budget and is easily maintainable, can be challenging.
As Cloud Architect and DevOps Team Lead, I recently participated in the development of two architecturally similar, lightweight, cloud-based containerized application platforms. From the start, both platforms were architected to maximize security and performance, while minimizing cost and operational complexity. The later platform was built on AWS with Docker Enterprise Edition.
Docker Enterprise Edition
Released in March of this year, Docker Enterprise Edition (Docker EE) is a secure, full-featured container-based management platform. There are currently eight versions of Docker EE, available for Windows Server, Azure, AWS, and multiple Linux distros, including RHEL, CentOS, Ubuntu, SUSE, and Oracle.
Docker EE is one of several production-grade container orchestration Platforms as a Service (PaaS). Some of the other container platforms in this category include:
- Google Container Engine (GCE, based on Google’s Kubernetes)
- AWS EC2 Container Service (ECS)
- Microsoft Azure Container Service (ACS)
- Mesosphere Enterprise DC/OS (based on Apache Mesos)
- Red Hat OpenShift (based on Kubernetes)
- Rancher Labs (based on Kubernetes, Cattle, Mesos, or Swarm)
Docker Community Edition (CE), Kubernetes, and Apache Mesos are free and open-source. Some providers, such as Rancher Labs, offer enterprise support for an additional fee. Cloud-based services, such as Red Hat Openshift Online, AWS, GCE, and ACS, charge the typical usage monthly fee. Docker EE, similar to Mesosphere Enterprise DC/OS and Red Hat OpenShift, is priced on a per node/per year annual subscription model.
Docker EE is currently offered in three subscription tiers, including Basic, Standard, and Advanced. Additionally, Docker offers Business Day and Business Critical support. Docker EE’s Advanced Tier adds several significant features, including secure multi-tenancy with node-based isolation, and image security scanning and continuous vulnerability scanning, as part of Docker EE’s Docker Trusted Registry.
Architecting for Affordability and Maintainability
Building an enterprise-scale application platform, using public cloud infrastructure, such as AWS, and a licensed Containers-as-a-Service (CaaS) platform, such as Docker EE, can quickly become complex and costly to build and maintain. Based on current list pricing, the cost of a single Linux node ranges from USD 75 per month for basic support, up to USD 300 per month for Docker Enterprise Edition Advanced with Business Critical support. Although cost is relative to the value generated by the application platform, none the less, architects should always strive to avoid unnecessary complexity and cost.
Reoccurring operational costs, such as licensed software subscriptions, support contracts, and monthly cloud-infrastructure charges, are often overlooked by project teams during the build phase. Accurately forecasting reoccurring costs of a fully functional Production platform, under expected normal load, is essential. Teams often overlook how Docker image registries, databases, data lakes, and data warehouses, quickly swell, inflating monthly cloud-infrastructure charges to maintain the platform. The need to control cloud costs have led to the growth of third-party cloud management solutions, such as CloudCheckr Cloud Management Platform (CMP).
Shared Docker Environment Model
Most software development projects require multiple environments in which to continuously develop, test, demonstrate, stage, and release code. Creating separate environments, replete with their own Docker EE Universal Control Plane (aka Control Plane or UCP), Docker Trusted Registry (DTR), AWS infrastructure, and third-party components, would guarantee a high-level of isolation and performance. However, replicating all elements in each environment would add considerable build and run costs, as well as unnecessary complexity.
On both recent projects, we choose to create a single AWS Virtual Private Cloud (VPC), which contained all of the non-production environments required by our project teams. In parallel, we built an entirely separate Production VPC for the Production environment. I’ve seen this same pattern repeated with Red Hat OpenStack and Microsoft Azure.
Isolating Production from the lower environments is essential to ensure security, and to eliminate non-production traffic from impacting the performance of Production. Corporate compliance and regulatory policies often dictate complete Production isolation. Having separate infrastructure, security appliances, role-based access controls (RBAC), configuration and secret management, and encryption keys and SSL certificates, are all required.
For complete separation of Production, different AWS accounts are frequently used. Separate AWS accounts provide separate billing, usage reporting, and AWS Identity and Access Management (IAM), amongst other advantages.
Performance and Staging
Unlike Production, there are few reasons to completely isolate lower-environments from one another. The exception I’ve encountered is Performance and Staging. These two environments are frequently separated from other environments to ensure the accuracy of performance testing and release staging activities. Performance testing, in particular, can generate enormous load on systems, which if not isolated, will impair adjacent environments, applications, and monitoring systems.
On a few recent projects, to reduce cost and complexity, we repurposed the UAT environment for performance testing, once user-acceptance testing was complete. Performance testing was conducted during off-peak development and testing periods, with access to adjacent environments blocked.
The multi-purpose UAT environment further served as a Staging environment. Applications were deployed and released to the UAT and Performance environments, following a nearly-identical process used for Production. Hotfixes to Production were also tested in this environment.
Example of Shared Environments
To demonstrate how to architect a shared non-production Docker EE environment, which minimizes cost and complexity, let’s examine the example shown below. In the example, built on AWS with Docker EE, there are four typical non-production environments, CI/CD, Development, Test, and UAT, and one Production environment.
In the example, there are two separate VPCs, the Production VPC, and the Non-Production VPC. There is no reason to configure VPC Peering between the two VPCs, as there is no need for direct communication between the two. Within the Non-Production VPC, to the left in the diagram, there is a cluster of three Docker EE UCP Manager EC2 nodes, a cluster of three DTR Worker EC2 nodes, and the four environments, consisting of varying numbers of EC2 Worker nodes. Production, to the right of the diagram, has its own cluster of three UCP Manager EC2 nodes and a cluster of six EC2 Worker nodes.
Single Non-Production UCP
As a primary means of reducing cost and complexity, in the example, a single minimally-sized Docker EE UCP cluster of three Manager nodes orchestrate activities across all four non-production environments. Alternately, you would have to create a UCP cluster for each environment; that means nine more Worker Nodes to configure and maintain.
The UCP users, teams, organizations, access controls, Docker Secrets, overlay networks, and other UCP features, for all non-production environments, are managed through the single Control Plane. All deployments to all the non-production environments, from the CI/CD server, are performed through the single Control Plane. Each UCP Manager node is deployed to a different AWS Availability Zone (AZ) to ensure high-availability.
As another means of reducing cost and complexity, in the example, a Docker EE DTR cluster of three Worker nodes contain all Docker image repositories. Both the non-production and the Production environments use this DTR as a secure source of all Docker images. Not having to replicate image repositories, access controls, infrastructure, and figuring out how to migrate images between two separate DTR clusters, is a significant time, cost, and complexity savings. Each DTR Worker node is also deployed to a different AZ to ensure high-availability.
Using a shared DTR between non-production and Production is an important security consideration your project team needs to consider. A single DTR, shared between non-production and Production, comes with inherent availability and security risks, which should be understood in advance.
Separate Non-Production Worker Nodes
In the shared non-production environments example, each environment has dedicated AWS EC2 instances configured as Docker EE Worker nodes. The number of Worker nodes is determined by the requirements for each environment, as dictated by the project’s Development, Testing, Security, and DevOps teams. Like the UCP and DTR clusters, each Worker node, within an individual environment, is deployed to a different AZ to ensure high-availability and mimic the Production architecture.
Minimizing the number of Worker nodes in each environment, as well as the type and size of each EC2 node, offers a significant potential cost and administrative savings.
Separate Environment Ingress
In the example, the UCP, DTR, and each of the four environments are accessed through separate URLs, using AWS Hosted Zone CNAME records (subdomains). Encrypted HTTPS traffic is routed through a series of security appliances, depending on traffic type, to individual private AWS Elastic Load Balancers (ELB), one for both UCPs, the DTR, and each of the environments. Each ELB load-balances traffic to the Docker EE nodes associated the specific traffic. All firewalls, ELBs, and the UCP and DTR are secured with a high-grade wildcard SSL certificate.
Separate Data Sources
In the shared non-production environments example, there is one Amazon Relational Database Service (RDS) instance in non-Production and one Production. Both RDS instances are replicated across multiple Availability Zones. Within the single shared non-production RDS instance, there are four separate databases, one per non-production environment. This architecture sacrifices the potential database performance of separate RDS instances for additional cost and complexity.
Maintaining Environment Separation
To obtain sufficient environment separation while using a single UCP, each Docker EE Worker node is tagged with an
environment node label. The node label indicates which environment the Worker node is associated with. For example, in the screenshot below, a Worker node is assigned to the Development environment by tagging it with the key of
environment and the value of
* The Docker EE screens shown here are from UCP 2.1.5, not the recently released 2.2.x, which has an updated UI appearance.Each service’s Docker Compose file uses deployment placement constraints, which indicate where Docker should or should not deploy services. In the hello-world Docker Compose file example below, the
node.labels.environment constraint is set to the
ENVIRONMENT variable, which is set during container deployment by the CI/CD server. This constraint directs Docker to only deploy the hello-world service to nodes which contain the placement constraint of
node.labels.environment, whose value matches the
ENVIRONMENT variable value.
Deploying from CI/CD Server
ENVIRONMENT value is set as an environment variable, which is then used by the CI/CD server, running a
docker stack deploy or a
docker service update command, within a deployment pipeline. Below is an example of how to use the environment variable as part of a Jenkins pipeline as code Jenkinsfile.
Centralized Logging and Metrics Collection
Centralized logging and metrics collection systems are used for application and infrastructure dashboards, monitoring, and alerting. In the shared non-production environment examples, the centralized logging and metrics collection systems are internal to each VPC, but reside on separate EC2 instances and are not registered with the Control Plane. In this way, the logging and metrics collection systems should not impact the reliability, performance, and security of the applications running within Docker EE. In the example, Worker nodes run a containerized copy of fluentd, which collects and pushes logs to ELK’s Elasticsearch.
Logging and metrics collection systems could also be supplied by external cloud-based SaaS providers, such as Loggly, Sysdig and Datadog, or by the platform’s cloud-provider, such as Amazon CloudWatch.
With four environments running multiple containerized copies of each service, figuring out which log entry came from which service instance, requires multiple data points. As shown in the example Kibana UI below, the environment value, along with the service name and container ID, as well as the git commit hash and branch, are added to each log entry for easier troubleshooting. To include the environment, the value of the
ENVIRONMENT variable is passed to Docker’s fluentd log driver as an
env option. This same labeling method is used to tag metrics.
Separate Docker Service Stacks
For further environment separation within the single Control Plane, services are deployed as part of the same Docker service stack. Each service stack contains all services that comprise an application running within a single environment. Multiple stacks may be required to support multiple, distinct applications within the same environment.
For example, in the screenshot below, a hello-world service container, built with a Docker image, tagged with build 59 of the Jenkins continuous integration pipeline, is deployed as part of both the Development (dev) and Test service stacks. The CD and UAT service stacks each contain different versions of the hello-world service.
Separate Docker Overlay Networks
For additional environment separation within the single non-production UCP, all Docker service stacks associated with an environment, reside on the same Docker overlay network. Overlay networks manage communications among the Docker Worker nodes, enabling service-to-service communication for all services on the same overlay network while isolating services running on one network from services running on another network.
in the example screenshot below, the hello-world service, a member of the test service stack, is running on the
test_default overlay network.
Having distinct environment-centric Docker service stacks and overlay networks makes it easy to clean up an environment, without impacting adjacent environments. Both service stacks and overlay networks can be removed to clear an environment’s contents.
Separate Performance Environment
In the alternative example below, a Performance environment has been added to the Non-Production VPC. To ensure a higher level of isolation, the Performance environment has its own UPC, RDS, and ELBs. The Performance environment shares the DTR, as well as the security, logging, and monitoring components, with the rest of the non-production environments.
Below, the Performance environment has half the number of Worker nodes as Production. Performance results can be scaled for expected Production performance, given more nodes. Alternately, the number of nodes can be scaled up temporarily to match Production, then scaled back down to a minimum after testing is complete.
Shared DevOps Tooling
All environments leverage shared Development and DevOps resources, deployed to a separate VPC. Resources include Agile Application Lifecycle Management (ALM), such as JIRA or CA Agile Central, source control repository management (SCM), such as GitLab or Bitbucket, binary repository management, such as Artifactory or Nexus, and a CI/CD solution, such as Jenkins, TeamCity, or Bamboo.
From the DevOps VPC, Docker images are pushed and pulled from the DTR in the Non-Production VPC. Deployments of container-based application are executed from the DevOps VPC CI/CD server to the non-production, Performance, and Production UCPs. Separate DevOps CI/CD pipelines and access controls are essential in maintaining the separation of the non-production and Production environments.
Several common components found in a Docker EE cloud-based AWS platform were discussed in the post. However, a complete AWS application platform has many more moving parts. Below is a comprehensive list of components, including DevOps tooling, organized into two categories: 1) common components that can be potentially shared across the non-production environments to save cost and complexity, and 2) components that should be replicated in each non-environment for security and performance.
Shared Non-Production Components:
- Virtual Private Cloud (VPC), Region, Availability Zones
- Route Tables, Network ACLs, Internet Gateways
- Some Security Groups
- IAM Groups, User, Roles, Policies (RBAC)
- Relational Database Service (RDS)
- API Gateway, Lambdas
- S3 Buckets
- Bastion Servers, NAT Gateways
- Route 53 Hosted Zone (Registered Domain)
- EC2 Key Pairs
- Hardened Linux AMI
- Docker EE
- UCP and EC2 Manager Nodes
- DTR and EC2 Worker Nodes
- UCP and DTR Users, Teams, Organizations
- DTR Image Repositories
- Secret Management
- Third-Party Components/Products
- SSL Certificates
- Security Components: Firewalls, Virus Scanning, VPN Servers
- Container Security
- End-User IAM
- Directory Service
- Log Aggregation
- Metric Collection
- Monitoring, Alerting
- Configuration and Secret Management
- CI/CD Pipelines as Code
- Infrastructure as Code
- Source Code Repositories
- Binary Artifact Repositories
Isolated Non-Production Components:
- Route 53 Hosted Zones and Associated Records
- Elastic Load Balancers (ELB)
- Elastic Compute Cloud (EC2) Worker Nodes
- Elastic IPs
- ELB and EC2 Security Groups
- RDS Databases (Single RDS Instance with Separate Databases)
All opinions in this post are my own and not necessarily the views of my current employer or their clients.