Considerations for Architecting Resilient Multi-Region Workloads

What to consider when evaluating a ‘multi-region’ strategy as part of business continuity and disaster recovery planning

Introduction

Increasingly, I hear the term ‘multi-region’ used within the IT community and in conversations with peers and customers, most often within the context of disaster recovery. In my experience, ‘multi-region’ is a cloud provider-agnostic phrase that can mean different things to different organizations. A few examples:

  • Multiple, independent, regionally-deployed application instances that better serve a geographically-diverse customer base, for regulated ‘locality-restricted’ workloads, to ensure data sovereignty, distribute system load, or minimize the blast radius of a regional disaster event. Although a disaster recovery plan may be required, the primary driver of this architecture is often not disaster recovery.
  • An active-passive failover strategy in which a second DR Region hosts a mixture of cold, warm, and hot copies of workloads and serves as a failover in response to a disaster event in the Primary Region. In my experience, this is probably the most common use case when someone refers to ‘multi-region.’
  • An active-active architecture in which data is continually replicated and traffic can be seamlessly routed based on geolocation between all services within two or more geo-redundant regions, making it resilient to the impact of a regional disaster event. Some might describe this architecture as having both intra-regional and inter-regional high availability.
Copyright: peshkov

Terminology

The following terminology is commonly used when discussing Business Continuity and Disaster Recovery Planning. Teams should be familiar with these concepts before undertaking planning activities:

  • Fault Tolerance (FT), High Availability (HA), Disaster Recovery (DR), and Business Continuity (BC), and the distinct differences between the four concepts
  • Business Continuity Plan/Planning (BCP) and Disaster Recovery Plan/Planning (DRP), and the differences between the two types of plans (source)
  • Business Continuity and Disaster Recovery (BCDR or BC/DR) (source)
  • Business Impact Analysis (BIA) and Risk Assessment (source)
  • Categories of Disaster: Natural Disasters, Technical Failures, and Human Actions, both intentional and unintentional (source)
  • Resiliency, which includes both Disaster Recovery (service restoration) and Availability (preventing loss of service) (source)
  • Crisis Management: Critical vs. Non-Critical Systems and Mission Critical vs. Business Critical Systems (source)
  • Regions vs. Availability Zones (aka Zones), constructs common to all major Cloud Service Providers (CSPs): AWS, Google Cloud, Microsoft Azure, IBM Cloud, and Oracle Cloud
  • Primary (aka Active) Region vs. DR (aka Passive or Standby) Region (source)
  • Active-Active vs. Active-Passive DR Strategies (source)
  • SHARE’s 7 Tiers of Disaster Recovery (source)
  • Disaster Recovery Site Types: Cold, Warm, and Hot (source)
  • AWS Multi-Region Disaster Recovery Strategies: 1) Backup and restore, 2) active-passive Pilot light, 3) active-passive Warm standby, or 4) Multi-region (multi-site) active-active (source)
  • Recovery Time Objective (RTO) and Recovery Point Objective (RPO), and the methods and costs to achieve varying levels of each SLA (source)
  • Failover and Failback Operations (source)
  • Partial vs. Complete Regional Outage, and the implications to Disaster Recovery Planning (source)
  • Single Points of Failure (SPOF) (source)

BCDR Planning Considerations

When developing BCDR Plans that include multi-region, there are several technical aspects of your workloads that need to be considered. The following list is not designed to be exhaustive, nor is it intended to suggest that multi-region DR is an unattainable task. On the contrary, this list is meant to encourage thorough planning and suggest ways to continually improve an organization’s plan.

  • Configuration Data Management
  • Secret Management
  • Cryptographic Key Management
  • Hardware Security Module (HSM)
  • Credential Management
  • SSL/TLS Certificate Management
  • Authentication (AuthN) and Authorization (AuthZ)
  • Domain Name System (DNS), DNS Failover and Failback, Global Traffic Management (GTM)
  • Content delivery network (CDN)
  • Specialized Workloads, such as SAP, VMware, SharePoint, Citrix, Oracle, SQL Server, SAP HANA, and IBM Db2
  • On-premises workload dependencies, and wide-area network (WAN) connectivity between on-premises data centers and the Cloud
  • Remote access from on-premises and remote employees to cloud-based backend and enterprise systems
  • Edge compute, such as connected devices, IoT, storage gateways, and remotely-managed local cloud infrastructure (e.g., AWS Outposts)
  • DevOps, CI/CD, Release Management
  • Infrastructure as Code (IaC)
  • Public and private artifact repositories, including Docker and Virtual Machine (VM) Image repositories
  • Source code in Version Control Systems (VCS), also known as Source Control Management (SCM)
  • Software licensing for self-managed and hosted services
  • Observability, monitoring, logging, alerting, and notification
  • Regional differences of a Cloud Provider’s service offerings, cost, performance, and support
  • Latency, including latency between the Primary and DR Regions, and between end-users or partners and the DR Region
  • Data residency and data sovereignty requirements, which will impact choice of DR Region
  • Automated, event-driven Failover process vs. manual processes
  • Failback process
  • Playbooks, documentation, training, and regular testing
  • Support, help desk, and call center coordination (and potential impact of disaster event on Cloud-based call center technologies)
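
Two of the considerations above, regional differences in service offerings and data residency requirements, lend themselves to a mechanical first pass. The following is a minimal sketch in Python, not a definitive implementation. The `Region` type, the region names, the jurisdiction labels, and the service names are all hypothetical; in practice, this data would come from your Cloud Provider's published regional service lists and your compliance team.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical description of a Cloud Provider Region."""
    name: str
    jurisdiction: str          # e.g., "EU" or "US"; used for the residency check
    services: frozenset[str]   # services offered in this Region


def shortlist_dr_regions(required_services, allowed_jurisdictions, primary, candidates):
    """Return the names of candidate Regions that offer every required
    service and satisfy the residency constraint, excluding the Primary."""
    required = set(required_services)
    shortlist = []
    for region in candidates:
        if region.name == primary:
            continue  # the DR Region must differ from the Primary Region
        if region.jurisdiction not in allowed_jurisdictions:
            continue  # would violate data residency or sovereignty rules
        if required - region.services:
            continue  # at least one required service is not offered here
        shortlist.append(region.name)
    return shortlist


# Illustrative data only; none of these names refer to real Regions.
candidates = [
    Region("region-a", "EU", frozenset({"compute", "nosql", "hsm"})),
    Region("region-b", "EU", frozenset({"compute", "nosql", "hsm"})),
    Region("region-c", "EU", frozenset({"compute", "nosql"})),         # no HSM
    Region("region-d", "US", frozenset({"compute", "nosql", "hsm"})),  # wrong jurisdiction
]

print(shortlist_dr_regions({"compute", "nosql", "hsm"}, {"EU"}, "region-a", candidates))
# prints ['region-b']
```

A shortlist like this only narrows the field. Cost, latency, performance, and support still need to be compared for each Region that passes.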

Disaster Recovery Planning Process

In my opinion, many of the disaster planning discussions I’m involved in begin by focusing on the wrong things. Understandably, engineering teams often jump right to questions about specific service capabilities, such as “is my database capable of cross-region replication?” or “how do I support multi-region cryptographic keys for data encryption and decryption?” Yet, higher-level business continuity planning and workload assessments often haven’t been conducted. Based on my experience, I suggest the following approach to get started with disaster recovery planning (again, not an exhaustive list):

  1. Workload Portfolio: Identify the organization’s complete workload portfolio, including all distinct applications and their associated infrastructure, datastores, and other dependencies.
  2. DR Workloads: From the portfolio, identify which workloads are considered business-critical or mission-critical systems and must be part of the disaster recovery planning.
  3. Classification: Classify each DR workload based on Business Impact Analysis, Risk Assessment, and SLAs such as availability, RTO, and RPO. Do the requirements demand an active-passive or active-active DR strategy? In AWS terms, do the requirements dictate Backup and Restore, Pilot Light, Warm Standby, or Multi-Site Active-Active?
  4. Documentation: Obtain current documentation and architectural and process-flow diagrams showing all components and dependencies, including cross-workload and third-party dependencies such as SaaS vendors. Review and verify accuracy of documentation and diagrams.
  5. Current Regions: Identify the Regions into which the existing workload is deployed.
  6. Service-level Review: Review each workload’s individual components to ensure they can meet the DR requirements, such as compute, storage, databases, security, networking, edge, CDN, mobile, frontend web, and end-user compute (e.g., “is the workload’s specific NoSQL database capable of cross-region replication and automatic failover?”).
  7. Third-party Dependencies: Identify and review each workload’s third-party dependencies, such as SaaS partners. Is their service essential to a critical workload’s functionality? What is your partner’s Disaster Recovery Plan?
  8. DR-capable Workload: Determine how much re-engineering is required to deploy and operate the workload to the DR Region.
  9. Data Residency and Data Sovereignty: Review data residency and sovereignty requirements for the workload, which could impact the choice of DR Regions.
  10. Choose DR Region: Not all of a Cloud Provider’s Regions offer the same services. Therefore, choose a DR Region(s) that can support all services utilized by the workload.
  11. Disaster Planning Considerations: Review all items shown in the previous ‘BCDR Planning Considerations’ section for each workload.
  12. Prepare for Partial Failures: Decide how you will handle partial versus complete regional outages. Regional disruptions of specific services are the most common type of Cloud outage, often resulting in partial impairment of a workload.
  13. Cost: Calculate the cost of the workload based on the required DR Service Level and DR Region. Investigate the Cloud Provider’s volume pricing agreements to reduce costs.
  14. Budget: Adjust DR Service Level requirements to meet budgetary constraints if necessary.
  15. Re-engineer Workloads: Construct timelines and budgets to re-engineer workloads for DR if required.
  16. DR Proof of Concept: Build out a Proof of Concept (POC) DR Region to validate the plan’s major assumptions and adjust if necessary; include failover and failback operations.
  17. DR Buildout: Construct timelines and budgets to build out the DR environment.
  18. Workload Deployment: Construct timelines and budgets to provision, deploy, configure, test, and monitor workloads in the DR Region.
  19. Documentation, Training, and Testing: Ensure all playbooks, documentation, training, and testing procedures are completed and regularly reviewed, updated, and tested, including failover and failback operations.
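
The Classification step above can be sketched in code to make the idea concrete. The following Python example maps a workload’s RTO and RPO to one of the four AWS disaster recovery strategies. The thresholds are my own illustrative assumptions, not AWS guidance; actual cut-offs should come from your Business Impact Analysis and budget. The function name `classify_dr_strategy` is hypothetical as well.

```python
from datetime import timedelta

# Assumed, illustrative thresholds only. Replace them with values derived
# from your own Business Impact Analysis and Risk Assessment.
ACTIVE_ACTIVE_MAX = timedelta(minutes=1)
WARM_STANDBY_MAX = timedelta(minutes=10)
PILOT_LIGHT_MAX = timedelta(hours=1)


def classify_dr_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Pick a DR strategy from the tighter of the two objectives."""
    tightest = min(rto, rpo)  # the stricter objective drives the choice
    if tightest <= ACTIVE_ACTIVE_MAX:
        return "Multi-Site Active-Active"
    if tightest <= WARM_STANDBY_MAX:
        return "Warm Standby"
    if tightest <= PILOT_LIGHT_MAX:
        return "Pilot Light"
    return "Backup and Restore"


# Hypothetical workloads with (RTO, RPO) requirements.
workloads = {
    "payments-api": (timedelta(seconds=30), timedelta(seconds=5)),
    "order-history": (timedelta(minutes=5), timedelta(minutes=30)),
    "reporting": (timedelta(hours=24), timedelta(hours=12)),
}

for name, (rto, rpo) in workloads.items():
    print(f"{name}: {classify_dr_strategy(rto, rpo)}")
```

Even a rough classification like this is useful early on, since it shows which workloads drive most of the DR cost before any service-level review begins.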

Before Considering Multi-Region

Workloads built to be resilient, fault-tolerant, highly available, easily deployable and configurable, backed-up, and monitored will help an organization withstand the most common disruptions in the Cloud. Before considering a multi-region disaster recovery strategy, I strongly recommend ensuring the following aspects of your workloads are adequately addressed:

  • Fault Tolerance: Workloads are architected to be fault-tolerant such that they can withstand the failure of individual components and operate in a degraded state. Eliminate any single point of failure (SPOF).
  • High Availability: Workloads are designed to be highly available, which with most cloud providers means resources are spread across multiple, discrete, regionally dispersed data centers or Availability Zones (AZ) and can tolerate the loss of a data center or AZ.
  • Backup: All workload components, source code, data, and configuration are regularly backed up using automated processes. All backups are verified. Backups are periodically restored to test restore procedures. As the most basic form of disaster recovery, developing and testing a backup and restore strategy will help teams to think more deeply about disaster planning.
  • Observability: Workloads have adequate observability, monitoring, logging, alerting, and notification processes in place.
  • Automation: Workloads and all required infrastructure and configuration are codified, documented, and can be efficiently and consistently deployed and configured without requiring manual intervention, using mature DevOps and CI/CD practices. Ensuring workloads can be consistently deployed and re-deployed will help ensure they could be built out in a second region if multi-region is a potential goal.
  • Environment-agnostic: Workloads are environment-agnostic, with no hard-coded application or infrastructure dependencies or configurations. Confirming workloads are environment-agnostic will help to ensure they are portable across regions if multi-region is a potential goal.
  • Multi-environment: Workloads are deployed to one or more SDLC environments prior to Production, such as Development, Test, Staging, or UAT. These environments should reside in a different Cloud account than Production. A second environment will help to ensure workloads are portable across regions if multi-region is a potential goal.
  • Chaos Engineering: Workloads are regularly tested to ensure that they can withstand unexpected disruptions.

Conclusion

In this post, we explored some of the potential meanings of the term ‘multi-region’. We then reviewed Business Continuity and Disaster Recovery Planning terminology, considerations, and a recommended approach to get started. Lastly, we covered some best practices to adopt before considering a multi-region disaster recovery strategy. What does ‘multi-region’ mean to your organization? Do you have comprehensive Business Continuity and Disaster Recovery Plans for your Cloud-based workloads? I would value your feedback and thoughts.


This blog represents my own viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
