Which feature of Amazon RDS helps maintain the availability and resiliency of your database?

In our cloud-native world, enterprises are handling more data than ever. A survey by IDG showed that the total size of data under management averaged almost 348 terabytes. For app developers, databases are the de facto standard for data storage. The challenge is that building, maintaining and securing database services on top of an application is a task in itself: you need to install patches and updates, maintain backups according to your backup policies, and monitor the performance of the database management system that you choose to run. To make those critical operational tasks easier, Amazon provides a relational database service called Amazon RDS. In this article, we will discuss the features and benefits of Amazon RDS and compare it with some alternatives in the managed database sector.

What is Amazon RDS

Amazon RDS is a managed SQL database service that makes it easy to set up, operate, and scale a relational database in the cloud. It provides resizable capacity while managing time-consuming administration tasks such as hardware provisioning, backups, software patching, monitoring and crash recovery. Amazon RDS automatically detects and recovers from instance and disk failures with minimal impact on applications. 

With Amazon RDS, you can run a relational database on AWS without worrying about setting up, managing, and scaling a relational database management system. You can use the RDS console to manage your Amazon RDS resources and perform basic database administration tasks for your DB instances. You can also use the AWS CLI, the AWS SDKs, or APIs to manage your Amazon RDS resources.

Amazon RDS gives a choice of multiple types of database engines:

  1. RDS for MySQL
  2. RDS for MariaDB
  3. RDS for PostgreSQL
  4. RDS for Oracle Database
  5. RDS for SQL Server

Why Amazon RDS is the preferred choice of developers:

  • Manageability: One of the main benefits of Amazon RDS is helping enterprises to manage complex and large relational databases.
  • Ease of use: With RDS, you don’t need to worry about learning various database management tools. You can run multiple database instances using one unified management console.
  • Time-effective: As maintenance tasks such as backing up and patching are automated, time spent maintaining instances is reduced.
  • Availability: Amazon RDS keeps your data available through Multi-AZ (Availability Zone) deployment, which maintains a redundant copy of your data in a separate location.
  • Scalability: Amazon RDS splits compute and storage making it easier to scale them independently.

But is there a better alternative?

Although Amazon RDS has many benefits, it comes with some limitations, too. When you start using RDS, it may look cost-effective, but as database resource demands grow, the cost can rise steeply. With RDS, you don’t have full control of the runtime configuration of your databases: it can be difficult to extend the database engine with additional configuration or integrate non-Amazon plugins and extensions. The biggest limitation is in a disaster recovery scenario, where spinning up a new RDS cluster can take more than 10 minutes.

To overcome these challenges, enterprises are shifting to build their application using a Kubernetes-native database operator with container-native storage like Ondat. 

Let’s see how a Postgres operator like EDB and Ondat can work together:

  • Ondat provides dynamic volume provisioning to make the management of persistent storage easy.

  • EDB provides the automation and logic to enable easier management of the database and abstractions for common operational processes

  • EDB with Ondat is entirely configurable from the Kubernetes level, enabling developers to rapidly create, scale and terminate databases according to business requirements with the same interface they use to manage their applications.

  • EDB with Ondat enables you to use the same pool of resources for databases that you do for any other compute and storage workload and to control scaling yourself. The EDB operator can create and run databases tailored to your use case that are entirely within your own environment and under your full control. 

  • Ondat features enable EDB to make the best use of any type of storage, on-premises or in-cloud enabling a consistent configuration to be used to provide high availability with disaster recovery at the data and application levels in any environment and on any platform. On AWS, storage can be chosen that best fits the use case whether high-performing EC2 Instance Store or IO2 or cost-efficient GP3.

  • Ondat adds per-volume encryption, enabling fine-grained control of database access. Each database can be controlled individually with this configuration and is within a single application failure domain, greatly decreasing operational risks versus a common 'all-in-one' RDS topology.

  • Ondat provides best-in-class replication and resilience, enabling workloads to continue running even in disasters that would impact RDS availability such as availability zone failures.

  • EDB with Ondat can help you reduce your operational cost by up to 50% - instead of paying for an abstracted ‘vertical’, you pay for the raw storage and compute you require. This pool of storage and compute can be shared across your other workloads.

Conclusion:

RDS makes it easy to set up, operate, and scale a relational database in the cloud, but if cost is an important factor in your organization, using a database operator like EDB with Ondat is often the best option to reduce operational costs. With Ondat's recently launched community edition, you can run your application with no limits on clusters and nodes and with all the features like industry-leading performance, snapshots, and volume encryption.

Sign up for Ondat's new community edition providing unlimited nodes and up to 1 TiB per cluster for free. For more information on how to get started, visit our docs site.

Achieving global availability for our services in Transaction Banking (TxB) at Goldman Sachs is essential to our business and a major differentiator for our platform. We also have a responsibility to demonstrate to regulators as well as external and internal auditors that we can continue to run our business in the event of major geographical outages or major service level outages from our cloud providers.

When thinking about complex challenges like availability at scale, it's tempting to assume that moving to the cloud will automatically solve all of our problems. Unfortunately, this is not the case. There are a number of reasons why resiliency on a global scale is challenging, particularly for a stateful service like a relational database. We have to ensure we can fail over that state without a significant service interruption, and maintain the integrity of that state so that business can resume after the failover event has taken place. We also have to ensure that any mechanism we have for replicating state between geographical locations takes latency into account.

  • Should we replicate data synchronously and pay a large performance penalty whilst we wait for global consistency? Or should we replicate asynchronously knowing that means accepting the risk that transactions at the primary may not have replicated to the secondary when disaster strikes?
  • How do we ensure our state is replicated securely with the appropriate controls in place?
  • How do we start taking regional application and provider components and deploying them in a global configuration? 
  • When we need to trigger a failover event, how do we do that? What is the impact to running applications in the primary region? In the secondary region?
  • Most importantly - how do we accomplish all of this at scale in a timely manner with critical business components on the line?

In this blog post, we will outline how Transaction Banking Engineering built on native AWS-provided resiliency options by designing and building a mechanism for failing over our relational databases in a scalable and secure way between AWS regions, along with the database credentials which control application access.

Technology Components

Our database footprint spans multiple platforms. We utilize many platforms designed for the cloud, such as Amazon DynamoDB and Amazon ElastiCache, as well as other third-party and SaaS platforms. We also have a significant footprint deployed on relational database platforms. There are a number of reasons for this, including dependencies on third-party software providers which only support certain technologies. Our relational database footprint comprises Amazon RDS Oracle as well as the PostgreSQL-compatible edition of Amazon Aurora (Amazon Aurora PostgreSQL). The focus of this blog post will be our engineering efforts to take the resiliency functionality provided by these platforms and extend it to provide a complete solution that met our requirements.

We deploy all of our components using the Infrastructure as code tool Terraform, which means we have state files to manage the resources in our AWS accounts. We must ensure that the resources in our AWS account match those in our Terraform state files to ensure ongoing deployment success. When adding new regions to our deployment, we add these as separate Terraform environments, with their own pipelines and state files independent of existing regions.

Resiliency Strategy

Our primary resiliency strategy for RDS Oracle and Amazon Aurora PostgreSQL is to utilize Multi-AZ resiliency provided by AWS.

For RDS Oracle, this involves a duplicate instance and storage being created in a secondary AZ, with synchronous storage level replication keeping the secondary storage in sync. In the event of an outage on the primary (either failure or maintenance activity), a failover is automatically triggered to the secondary, and the database is brought online and DNS updated with minimal downtime, and so potentially minimal application impact. In addition, in the event the original primary is not recovered, a new secondary is automatically created asynchronously, so full resiliency is re-established a short time later.

For Amazon Aurora PostgreSQL, the storage layer is always deployed and replicated across 3 AZs, with 2 copies of the data per AZ. This gives us 6 copies of the data for redundancy. Write quorum is achieved when a write has reached 4 of these 6 copies. For the compute layer, we deploy 1 writer node and 2 reader nodes across 3 AZs. Applications should be configured such that write connections are sent to the writer node via the read-write cluster endpoint, and read connections are sent to the reader nodes via the read-only cluster endpoint. These endpoints are provided by Amazon Aurora PostgreSQL, and the read-only endpoint utilizes round-robin load balancing to split load between reader nodes. In the event of failure of a reader node, any applications connected to that node can reconnect and be routed to the remaining healthy reader node. In the event of a failover of the writer node, Amazon Aurora PostgreSQL will promote one of the reader nodes to writer, fail over the writes and update the Domain Name System (DNS) with minimal downtime, and so potentially minimal application impact.
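The quorum arithmetic in this paragraph can be checked with a short sketch. The three constants come from the text above; the function names are illustrative, and the model does nothing more than count reachable copies, so it says nothing about how Aurora's storage protocol actually behaves internally.

```python
# Illustrative model only: 3 AZs x 2 copies per AZ, and a write is
# acknowledged once 4 of the 6 copies have received it.
AZ_COUNT = 3
COPIES_PER_AZ = 2
WRITE_QUORUM = 4

TOTAL_COPIES = AZ_COUNT * COPIES_PER_AZ


def write_succeeds(healthy_copies: int) -> bool:
    """Return True if enough copies are reachable to reach write quorum."""
    return healthy_copies >= WRITE_QUORUM


def copies_after_loss(lost_azs: int, extra_lost_copies: int = 0) -> int:
    """Count copies still reachable after losing whole AZs plus single copies."""
    remaining = TOTAL_COPIES - lost_azs * COPIES_PER_AZ - extra_lost_copies
    return max(remaining, 0)


if __name__ == "__main__":
    for lost_azs, extra in [(0, 0), (1, 0), (1, 1), (2, 0)]:
        healthy = copies_after_loss(lost_azs, extra)
        print(
            f"AZs lost={lost_azs}, extra copies lost={extra}: "
            f"{healthy}/{TOTAL_COPIES} reachable, writable={write_succeeds(healthy)}"
        )
```

Under this counting model, losing one entire AZ leaves 4 copies, which still satisfies the write quorum; losing one further copy drops the count to 3 and writes can no longer reach quorum.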

This takes care of our primary availability. However, as described above, we still need a resiliency strategy for region or service outages.

For RDS Oracle, AWS launched cross-region read replicas in late 2019. In early 2020, global database support followed for Amazon Aurora PostgreSQL. These features were the initial building blocks that allowed us to deploy a read-only replica of our data in a secondary region with asynchronous replication. 

This helped shape our application deployment strategy. Our applications are primarily deployed using containers on AWS Fargate. Given we now have a process for having an active primary database in one region and a read-only replica database in one or more secondary regions, we made the decision to deploy our applications in active/passive mode.

A full deployment of the application exists in the primary region with the active database. Pre-provisioning the components in the secondary region simplifies the failover process, making it less prone to outages, elevated traffic rates or capacity constraints. Due to this, we have a full deployment in the secondary region with the replica database. Only one instance of the application is active at any one time. The infrastructure cost of secondary region components is relatively low, which made us choose to optimize the reliability of the failover process.

We do not support cross-region database connectivity (i.e. an application running active in one region cannot connect to a database which is running primary in a different region).

As mentioned, the replication for cross-region read replicas and global database is asynchronous. In many failure scenarios we would stop processing transactions prior to invoking the failover, allowing all databases in the secondary region to sync up with their primaries. There is, however, always the potential for data loss in the event of a failure, and this cannot be ignored. Our primary focus with this design was on availability, rather than durability. Thus, our focus during a failover is on reestablishing our business-critical components and being able to process new requests. Transactions which were in-flight at the time of the failover may require manual intervention to be completed in the secondary region.

The challenge was to build a mechanism for multi-region RDS Oracle/Amazon Aurora PostgreSQL resiliency that met the following criteria:

  • No reliance on any service or infrastructure running in the original primary region, in case of a complete regional outage.
  • No reliance on any Infrastructure as Code tools. It should be possible to failover a database without updating and deploying new code versions.
  • Ability to re-establish resiliency post-failover to the original primary region without this step becoming a hard dependency, again in case of complete regional outage.
  • As little impact on the application as possible as a result of failover. The application should not have to redeploy or reconfigure as a result of the failover action.
  • Must be able to scale such that multiple databases can be failed over simultaneously.
  • Must maintain the highest levels of security standards at all stages (Encryption at rest, TLS, credential management, etc).

Multi-Region RDS Oracle/Amazon Aurora PostgreSQL Failover Orchestration

 

Challenges

There were significant challenges we had to solve when building this solution. Some of the main ones included:

  • Native AWS service features: For example, at the time of design, neither RDS Oracle nor Amazon Aurora PostgreSQL supported re-establishing resiliency to the original primary region using the old primary database. Instead, a brand new replica database had to be created by the orchestration engine. Since this was a new database resource not created via Terraform, the process needed to know what configuration variables to set on the new database.
  • Complexity of RDS Oracle/Amazon Aurora PostgreSQL replica management: For example, RDS Oracle replica database option group configuration was not possible via Terraform. Instead, AWS allocates system generated resources to manage replica databases. This means our failover orchestration must be aware of these system generated resources, and remove them as necessary when a replica database is promoted to a primary database, and synchronize the configuration options appropriately.  
  • Terraform state vs AWS state: As mentioned, the failover orchestration process has to make significant changes to the AWS state, such as changing database roles from replica to primary, and even creating new database resources. This means that after a failover exercise, our Terraform state and AWS state differ, sometimes significantly. This disparity has to be remediated. 
  • Failover duration: Lambdas have a maximum runtime of 15 minutes. Given the complex nature of failovers, any individual execution may require longer than 15 minutes to complete.
  • Downstream dependencies: Once a failover is executed, how do we track the progress? How do downstream applications know that a failover has completed and their replica read-only database is now available as a read-write primary?

 

Solution

In addition to the databases, the basic building blocks of our RDS Oracle/Amazon Aurora PostgreSQL failover orchestration engine are:

  • AWS Systems Manager Automation: Triggered by a maker/checker request process, this initiates the failover process.
  • AWS Lambda: The main failover orchestration engine.
  • AWS Systems Manager Parameter Store: Stores database-specific configuration details required by the failover lambda.
  • Amazon Route 53: Provides the application with a region-specific DB endpoint, allowing us to update the database instance in that region during failover without the application having to update its configuration.
  • AWS Secrets Manager: Stores database credentials. More details later in this post.
  • Amazon Simple Notification Service (SNS): Publishes failover events that downstream systems can take action on.
  • Amazon CloudWatch: Receives failover process metrics and allows mission control and application teams to monitor failover progress across multiple databases.

Within our Terraform projects, we use a black/red deployment model to deploy a second database (red) alongside the existing database (black) in each region. However, whilst we create all the resources the red database requires (e.g. AWS Key Management Service (KMS) key, Secrets, etc.), we do not create the database itself. The database is created later, during the failover orchestration process. For each of these databases, we store all configuration information in AWS Systems Manager Parameter Store, so the failover process can get all the information it needs to carry out the failover steps.

We have created a lambda which breaks down the entire failover process into two steps. Step 1 promotes the replica database to be read write, and Step 2 re-establishes resiliency by creating a new replica in the original primary region. These steps can be triggered together as a single action or individually. In the case that the primary region has a hard outage, it's possible to do the promotion step only, and run the re-establish at a later time when the original primary region is available again.

In order to deal with the physical database in a particular region switching between black and red during a failover lifecycle, we deploy a region specific Route53 Canonical Name (CNAME) for each database set. The failover lambda updates this CNAME so that the application will always connect to the correct regional database, regardless of what step in the failover we are in.
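As a rough illustration of the CNAME update, the sketch below builds the change batch that the Route 53 `ChangeResourceRecordSets` API expects for an upsert. The record name, target endpoint and TTL are hypothetical values chosen for the example and do not come from the deployment described in this post.

```python
import json


def build_cname_upsert(record_name: str, target_endpoint: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points record_name at target_endpoint."""
    return {
        "Comment": "Repoint regional database CNAME during failover",
        "Changes": [
            {
                # UPSERT creates the record if it is missing, otherwise replaces it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical names: a stable regional alias and the newly built red database.
    batch = build_cname_upsert(
        "orders-db.us-east-1.example.internal",
        "east-db-red.example.us-east-1.rds.amazonaws.com",
    )
    print(json.dumps(batch, indent=2))
    # With boto3, the batch would be submitted roughly as:
    #   boto3.client("route53").change_resource_record_sets(
    #       HostedZoneId="<hosted zone id>", ChangeBatch=batch)
```

Because the application only ever resolves the stable regional alias, repointing the record is enough to move it between the black and red databases; a short TTL (an assumption here) limits how long clients keep resolving the old target.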

To deal with the 15-minute lambda execution limit, we built recursion logic into the lambda: it validates the state of the databases after every API call we make and then proceeds with the next steps accordingly.
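The post does not spell out how the recursion is implemented, so the following is only a sketch of one possible shape, under the assumption that the function re-invokes itself with its progress before the runtime limit is reached. The step names, the event fields and the injected callables (`start_step`, `is_step_done`, `reinvoke`) are all hypothetical; only `context.get_remaining_time_in_millis()` is part of the real Lambda context object.

```python
import time
from typing import Callable

# Hypothetical step names; the real orchestration steps are not listed in the post.
STEPS = ["promote_replica", "update_cname", "create_new_replica"]

# Hand over to a fresh invocation when less than a minute of runtime remains.
SAFETY_MARGIN_MS = 60_000


def handler(event: dict, context, *,
            start_step: Callable[[str], None],
            is_step_done: Callable[[str], bool],
            reinvoke: Callable[[dict], None],
            poll_seconds: float = 0.0) -> dict:
    """Advance the failover as far as time allows, then continue in a new invocation."""
    index = event.get("step_index", 0)
    started = event.get("step_started", False)

    while index < len(STEPS):
        step = STEPS[index]
        if not started:
            start_step(step)  # issue the API call for this step
            started = True
        # Validate the resulting state before moving on to the next step.
        while not is_step_done(step):
            if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
                next_event = {"step_index": index, "step_started": True}
                # In a real Lambda this could be an asynchronous self-invoke, e.g.
                # boto3.client("lambda").invoke(FunctionName=context.invoked_function_arn,
                #     InvocationType="Event", Payload=json.dumps(next_event))
                reinvoke(next_event)
                return {"status": "continued", **next_event}
            time.sleep(poll_seconds)
        index += 1
        started = False
    return {"status": "complete", "step_index": index}


class FakeContext:
    """Stand-in for the Lambda context object, for local experimentation."""

    def __init__(self, remaining_ms: int):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self) -> int:
        return self.remaining_ms


if __name__ == "__main__":
    result = handler({}, FakeContext(900_000), start_step=print,
                     is_step_done=lambda step: True, reinvoke=print)
    print(result)
```

Carrying `step_index` and `step_started` in the event means a continued invocation resumes polling the in-flight step rather than issuing the same API call twice.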

The lambda publishes failover events to an SNS topic in each account, and the application team can subscribe to those events and take necessary action (e.g. restart their tasks) after a failover has taken place. 
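To make the event flow concrete, here is a minimal sketch of what a failover event and a subscriber-side check might look like. The message fields (`db_set`, `step`, `status`, `new_primary`) and the status values are assumptions made for illustration; the post does not describe the actual event schema.

```python
import json
from datetime import datetime, timezone


def build_failover_event(db_set: str, step: str, status: str, new_primary: str) -> str:
    """Serialise a failover event (hypothetical schema) as an SNS message body."""
    return json.dumps(
        {
            "db_set": db_set,            # logical database the event refers to
            "step": step,                # e.g. "promote" or "reestablish"
            "status": status,            # e.g. "SUCCEEDED" or "FAILED"
            "new_primary": new_primary,  # database now accepting writes
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
        sort_keys=True,
    )


def should_restart_tasks(message: str, my_db_set: str) -> bool:
    """Subscriber-side check: restart only after our own database was promoted."""
    event = json.loads(message)
    return (
        event["db_set"] == my_db_set
        and event["step"] == "promote"
        and event["status"] == "SUCCEEDED"
    )


if __name__ == "__main__":
    message = build_failover_event("orders", "promote", "SUCCEEDED", "west-db-black")
    # With boto3, the publisher side would be roughly:
    #   boto3.client("sns").publish(TopicArn="<topic arn>", Message=message)
    print(message)
    print(should_restart_tasks(message, "orders"))
```

Including the database set and step in each event lets a subscriber ignore failovers of unrelated databases and react only once the promotion it depends on has succeeded.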

The lambda publishes metrics for each step in the process to CloudWatch, so we can monitor the progress of each failover activity.

The failover execution flow can best be demonstrated by going through an example.

Example

We can follow the lifecycle of a multi-region failover and failback example by examining five states:

State 1: We have a primary (master) database east-db-black in the us-east-1 region, replicating to a replica database west-db-black in the us-west-2 region. As previously mentioned, we have deployed the necessary infrastructure for red databases in both east and west, but the red databases themselves do not yet exist.

At this point, we trigger the complete multi-region failover process.

State 2: We have now promoted the west-db-black replica database to be primary, and the failover lambda has re-established replication back to the us-east-1 region by creating the east-db-red database as a replica database. The lambda looked up AWS Systems Manager Parameter Store in order to know how to configure east-db-red when it was being built. The old primary database east-db-black still exists; however, it is no longer part of the replication flow. It exists in case it's required for later data reconciliation or checkouts. At this point, the us-east-1 region Route53 CNAME has been updated to point to east-db-red, so the application is always connecting to the correct active database in us-east-1.

At this point we are fully resilient across regions once again.

State 3: We have now removed the east-db-black database since it is no longer required.

At this point, we trigger the complete multi-region failback process.

State 4: We have now promoted the east-db-red replica database to be primary, and the failover lambda has re-established replication back to the us-west-2 region by creating the west-db-red database as a replica database. Just as in State 2, the failover lambda used AWS Systems Manager Parameter Store and Route53 to configure the new replica database and update the us-west-2 CNAME so the application is connecting to the correct active database in us-west-2. At this point, we are fully resilient across regions once again.

State 5: We have now removed the west-db-black database since it is no longer required, and now we are in the same resilience setup as when we started.
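The five states can be traced with a small in-memory simulation. This is a toy model of the roles and CNAME targets described in the example, with hypothetical class and function names; it makes no AWS calls and ignores timing, replication lag and error handling.

```python
from dataclasses import dataclass, field

OTHER_COLOUR = {"black": "red", "red": "black"}


@dataclass
class Region:
    name: str                                  # short label, e.g. "east"
    roles: dict = field(default_factory=dict)  # colour -> "primary" | "replica" | "detached"
    cname_target: str = ""                     # database the regional CNAME points at

    def db(self, colour: str) -> str:
        return f"{self.name}-db-{colour}"

    def colour_with_role(self, role: str) -> str:
        return next(c for c, r in self.roles.items() if r == role)


def fail_over(old_primary: Region, new_primary: Region) -> None:
    """Step 1: promote the replica. Step 2: build a fresh replica in the old region."""
    promoted = new_primary.colour_with_role("replica")
    new_primary.roles[promoted] = "primary"
    new_primary.cname_target = new_primary.db(promoted)

    old_colour = old_primary.colour_with_role("primary")
    old_primary.roles[old_colour] = "detached"  # kept for reconciliation
    new_colour = OTHER_COLOUR[old_colour]
    old_primary.roles[new_colour] = "replica"
    old_primary.cname_target = old_primary.db(new_colour)


def remove_detached(region: Region) -> None:
    """Drop databases that are no longer part of the replication flow."""
    region.roles = {c: r for c, r in region.roles.items() if r != "detached"}


def run_lifecycle():
    east = Region("east", {"black": "primary"}, "east-db-black")  # State 1
    west = Region("west", {"black": "replica"}, "west-db-black")
    fail_over(old_primary=east, new_primary=west)                 # State 2
    remove_detached(east)                                         # State 3
    fail_over(old_primary=west, new_primary=east)                 # State 4
    remove_detached(west)                                         # State 5
    return east, west


if __name__ == "__main__":
    for region in run_lifecycle():
        print(region)
```

In this model the lifecycle ends with east-db-red as primary and west-db-red as replica, so the topology matches State 1 while the database colours have swapped from black to red.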