In our cloud native world, enterprises are handling more data than ever. A survey by IDG showed that the total size of data under management averaged almost 348 terabytes. For app developers, databases are the de facto standard for data storage. The challenge is that building, maintaining, and securing database services on top of an application is a task in itself: you need to install patches and updates, maintain backups according to your backup policies, and monitor the performance of the database management system you choose to run. To make these critical operational tasks easier, Amazon provides a relational database service called Amazon RDS. In this article, we will discuss the features and benefits of Amazon RDS and compare it with some alternatives in the managed database sector.

What is Amazon RDS?

Amazon RDS is a managed SQL database service that makes it easy to set up, operate, and scale a relational database in the cloud. It provides resizable capacity while managing time-consuming administration tasks such as hardware provisioning, backups, software patching, monitoring, and crash recovery. Amazon RDS automatically detects and recovers from instance and disk failures with minimal impact on applications. With Amazon RDS, you can run a relational database on AWS without worrying about setting up, managing, and scaling a relational database management system yourself. You can use the RDS console to manage your Amazon RDS resources and perform basic database administration tasks on your DB instances, or you can use the AWS CLI, the AWS SDKs, or the RDS API. Amazon RDS gives you a choice of several database engines: MySQL, MariaDB, PostgreSQL, Oracle, Microsoft SQL Server, and Amazon Aurora.
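As a hedged illustration of managing RDS through the SDK (the instance identifier, instance class, and sizes below are illustrative, not from the article), here is a minimal boto3 sketch that requests a small PostgreSQL instance. The client is passed in as a parameter so the call can be exercised without live AWS credentials:

```python
# Minimal sketch: provisioning an RDS instance via the AWS SDK (boto3).
# The identifier, instance class, and credentials below are illustrative.

def create_postgres_instance(rds_client, identifier, master_password):
    """Request a small Multi-AZ PostgreSQL RDS instance and return the response."""
    return rds_client.create_db_instance(
        DBInstanceIdentifier=identifier,
        Engine="postgres",
        DBInstanceClass="db.t3.micro",
        AllocatedStorage=20,            # GiB
        MasterUsername="app_admin",
        MasterUserPassword=master_password,
        MultiAZ=True,                   # AWS keeps a standby in a second AZ
    )

# In real use: create_postgres_instance(boto3.client("rds"), "demo-db", password)
```

Because the client is injected, the same function can be pointed at a real `boto3.client("rds")` in production or a stub in tests.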
But is there a better alternative?

Although Amazon RDS has many benefits, it comes with some limitations, too. When you start using RDS, it may look cost-effective and inexpensive, but as database resource demands grow, the cost can increase quickly. With RDS, you don't have full control over the runtime configuration of your databases: it can be difficult to extend the database engine with additional configuration or to integrate non-Amazon plugins and extensions. The biggest limitation shows up in a disaster recovery scenario, where spinning up a new RDS cluster can take more than 10 minutes. To overcome these challenges, enterprises are shifting to building their applications on a Kubernetes-native database operator with container-native storage like Ondat. Let's see how a Postgres operator like EDB and Ondat can work together:
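The article's original walkthrough is not reproduced here, but as a hedged sketch of the pairing: an EDB/CloudNativePG-style `Cluster` manifest can request its data volumes from an Ondat-backed `StorageClass`. The API version, storage class name, and sizes below are illustrative assumptions:

```yaml
# Illustrative sketch: a CloudNativePG-style Postgres cluster whose data
# volumes are provisioned by an Ondat StorageClass. Names are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
spec:
  instances: 3            # one primary, two replicas
  storage:
    size: 10Gi
    storageClass: ondat   # container-native storage backing each instance
```

With this arrangement, the operator handles Postgres lifecycle tasks (provisioning, failover, backups) while the storage layer handles replication and encryption of the volumes themselves.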
Conclusion

RDS makes it easy to set up, operate, and scale a relational database in the cloud, but if cost and budget are important factors in your organization, using a database operator like EDB with Ondat is often the better option for reducing operational costs. With Ondat's recently launched community edition, you can run your application with no limits on clusters and nodes, and with all the features, like industry-leading performance, snapshots, and volume encryption. Sign up for Ondat's new community edition, providing unlimited nodes and up to 1 TiB per cluster for free. For more information on how to get started, visit our docs site.
Achieving global availability for our services in Transaction Banking (TxB) at Goldman Sachs is essential to our business and a major differentiator for our platform. We also have a responsibility to demonstrate to regulators, as well as external and internal auditors, that we can continue to run our business in the event of major geographical outages or major service-level outages from our cloud providers. When thinking about complex challenges like availability at scale, it's tempting to assume that moving to the cloud will automatically solve all of our problems. Unfortunately, this is not the case. There are a number of reasons why resiliency on a global scale is challenging, particularly for a stateful service like a relational database. We have to ensure we can fail over that state without a significant service interruption, and maintain the integrity of that state so that business can resume after the failover event has taken place. We also have to ensure that any mechanism we have for replicating state between geographical locations takes latency into account.
In this blog post, we will outline how Transaction Banking Engineering built on native AWS-provided resiliency options by designing and building a mechanism for failing over our relational databases, along with the database credentials which control application access, in a scalable and secure way between AWS regions.

Technology Components

Our database footprint spans multiple different platforms. We utilize many platforms designed for the cloud, such as Amazon DynamoDB and Amazon ElastiCache, as well as other third-party and SaaS platforms. We also have a significant footprint deployed on relational database platforms. There are a number of reasons for this, including dependencies on third-party software providers which only support certain technologies. Our relational database footprint comprises Amazon RDS Oracle as well as the PostgreSQL-compatible edition of Amazon Aurora (Amazon Aurora PostgreSQL). The focus of this blog post will be our engineering efforts to take the resiliency functionality provided by these platforms and extend it into a complete solution that met our requirements. We deploy all of our components using the infrastructure-as-code tool Terraform, which means we have state files to manage the resources in our AWS accounts. We must ensure that the resources in our AWS accounts match those in our Terraform state files to ensure ongoing deployment success. When adding new regions to our deployment, we add them as separate Terraform environments, with their own pipelines and state files, independent of existing regions.

Resiliency Strategy

Our primary resiliency strategy for RDS Oracle and Amazon Aurora PostgreSQL is to utilize the Multi-AZ resiliency provided by AWS. For RDS Oracle, this involves a duplicate instance and storage being created in a secondary AZ, with synchronous storage-level replication keeping the secondary storage in sync.
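In Terraform terms, the Multi-AZ option described above is a single attribute on the database resource. A hedged fragment (the identifier, engine edition, and sizes here are illustrative, not taken from our actual configuration):

```hcl
# Illustrative Terraform fragment: an RDS Oracle instance with Multi-AZ
# enabled, so AWS maintains a synchronously replicated standby in a
# second Availability Zone. Identifiers and sizes are assumptions.
resource "aws_db_instance" "txb_oracle" {
  identifier        = "txb-oracle-primary"
  engine            = "oracle-ee"
  instance_class    = "db.m5.xlarge"
  allocated_storage = 100
  multi_az          = true   # duplicate instance + storage in a secondary AZ
}
```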
In the event of an outage on the primary (either a failure or maintenance activity), a failover is automatically triggered to the secondary, the database is brought online, and DNS is updated with minimal downtime, and so potentially minimal application impact. In addition, in the event the original primary is not recovered, a new secondary is automatically created asynchronously, so full resiliency is re-established a short time later.

For Amazon Aurora PostgreSQL, the storage layer is always deployed and replicated across 3 AZs, with 2 copies of the data per AZ. This gives us 6 copies of the data for redundancy. Write quorum is achieved when a write has reached 4 of these 6 copies. For the compute layer, we deploy 1 writer node and 2 reader nodes across 3 AZs. Applications should be configured so that write connections are sent to the writer node via the read-write cluster endpoint, and read connections are sent to the reader nodes via the read-only cluster endpoint. These endpoints are provided by Amazon Aurora PostgreSQL, and the read-only endpoint uses round-robin load balancing to split load between the reader nodes. In the event of a reader node failure, any applications connected to that node can reconnect and be routed to the remaining healthy reader node. In the event of a failover of the writer node, Amazon Aurora PostgreSQL will promote one of the reader nodes to writer, fail over the writes, and update the Domain Name System (DNS) with minimal downtime, and so potentially minimal application impact.

This takes care of our primary availability. However, as described above, we still need a resiliency strategy for region or service outages. For RDS Oracle, AWS launched cross-region read replicas in late 2019. In early 2020, global database support followed for Amazon Aurora PostgreSQL. These features were the initial building blocks that allowed us to deploy a read-only replica of our data in a secondary region with asynchronous replication.
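The 4-of-6 write-quorum arithmetic described above can be sketched directly. One figure below is an assumption not stated in the post: Aurora's documented read quorum of 3 of 6 copies.

```python
# Aurora storage: 6 copies of each data block (2 per AZ across 3 AZs).
COPIES = 6
WRITE_QUORUM = 4   # a write must reach 4 of the 6 copies (from the post)
READ_QUORUM = 3    # documented Aurora read quorum; not stated in the post

def write_acknowledged(copies_reached):
    """True once a write has reached the 4/6 write quorum."""
    return copies_reached >= WRITE_QUORUM

def survives_az_loss(copies_per_az=2, azs=3):
    """Losing one full AZ removes 2 copies; 4 remain, so writes can
    still reach quorum and reads comfortably exceed theirs."""
    remaining = (azs - 1) * copies_per_az
    return remaining >= WRITE_QUORUM
```

This is why a single-AZ outage is survivable for both reads and writes: even with two copies gone, the four remaining copies still satisfy the write quorum.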
This helped shape our application deployment strategy. Our applications are primarily deployed as containers on AWS Fargate. Given we now had a process for running an active primary database in one region and a read-only replica database in one or more secondary regions, we made the decision to deploy our applications in active/passive mode. A full deployment of the application exists in the primary region with the active database. Pre-provisioning the components in the secondary region simplifies the failover process, making it less vulnerable to outages, elevated traffic rates, or capacity constraints, so we also maintain a full deployment in the secondary region with the replica database. Only one instance of the application is active at any one time. The infrastructure cost of the secondary region components is relatively low, which led us to optimize for the reliability of the failover process. We do not support cross-region database connectivity (i.e. an application running active in one region cannot connect to a database running primary in a different region).

As mentioned, the replication for cross-region read replicas and global database is asynchronous. In many failure scenarios we would stop processing transactions prior to invoking the failover, allowing all databases in the secondary region to sync up with their primaries. There is, however, always the potential for data loss in the event of a failure, and this cannot be ignored. Our primary focus with this design was on availability, rather than durability. Thus, our focus during a failover is on re-establishing our business-critical components and being able to process new requests. Transactions which were in flight at the time of the failover may require manual intervention to be completed in the secondary region. The challenge was to build a mechanism for multi-region RDS Oracle/Amazon Aurora PostgreSQL resiliency that met the following criteria:
Multi-Region RDS Oracle/Amazon Aurora PostgreSQL Failover Orchestration

Challenges

There were significant challenges we had to solve when building this solution. Some of the main ones included:
Solution

In addition to the databases, the basic building blocks of our RDS Oracle/Amazon Aurora PostgreSQL failover orchestration engine are:
Within our Terraform projects, we use a black/red deployment model to deploy a second database (red) alongside the existing database (black) in each region. However, whilst we create all the resources the red database requires (e.g. AWS Key Management Service (KMS) key, secrets, etc.), we do not create the database itself. The database is created later, during the failover orchestration process. For each of these databases, we store all configuration information in AWS Systems Manager Parameter Store, so the failover process can get all the information it needs to carry out the failover steps.

We have created a Lambda function which breaks the entire failover process down into two steps. Step 1 promotes the replica database to read-write, and Step 2 re-establishes resiliency by creating a new replica in the original primary region. These steps can be triggered together as a single action or individually. In the case that the primary region has a hard outage, it is possible to run the promotion step only, and run the re-establish step later, when the original primary region is available again.

To deal with the physical database in a particular region switching between black and red during a failover lifecycle, we deploy a region-specific Route 53 Canonical Name (CNAME) record for each database set. The failover Lambda updates this CNAME so that the application always connects to the correct regional database, regardless of which step of the failover we are in. To deal with the 15-minute Lambda execution limit, we came up with a recursive Lambda design: after every API call we make, we validate the state of the databases and then proceed with the next steps accordingly, re-invoking the function as needed. The Lambda publishes failover events to an SNS topic in each account, and application teams can subscribe to those events and take any necessary action (e.g. restarting their tasks) after a failover has taken place.
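A hedged sketch of how one promotion step might be wired together, not the actual Goldman Sachs implementation: all parameter paths, hosted-zone IDs, and topic ARNs are illustrative, and the AWS clients are injected so the flow can be exercised with stubs. The step reads the database set's configuration from Parameter Store, repoints the regional CNAME, and announces the failover on SNS:

```python
# Hedged sketch of one promotion step of a failover Lambda. Parameter
# paths, zone IDs, and topic ARNs are illustrative assumptions; clients
# (SSM, Route 53, SNS) are injected so the flow can be tested offline.

def promote_step(ssm, route53, sns, db_set):
    """Promote the secondary region: repoint the regional CNAME at the
    replica endpoint and announce the failover on SNS."""
    cfg = {
        key: ssm.get_parameter(Name=f"/failover/{db_set}/{key}")["Parameter"]["Value"]
        for key in ("cname", "replica_endpoint", "hosted_zone_id", "topic_arn")
    }
    # Repoint the region-specific CNAME at the promoted database, so the
    # application keeps connecting by the same name.
    route53.change_resource_record_sets(
        HostedZoneId=cfg["hosted_zone_id"],
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": cfg["cname"],
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": cfg["replica_endpoint"]}],
            },
        }]},
    )
    # Tell subscribed application teams that the failover happened.
    sns.publish(
        TopicArn=cfg["topic_arn"],
        Message=f"{db_set}: promoted, CNAME now points at {cfg['replica_endpoint']}",
    )
    return cfg["replica_endpoint"]
```

In real use the three clients would be `boto3.client("ssm")`, `boto3.client("route53")`, and `boto3.client("sns")`; the recursion and per-step state validation described above would wrap calls like this one.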
The Lambda publishes metrics for each step in the process to CloudWatch, so we can monitor the progress of each failover activity. The failover execution flow can best be demonstrated by going through an example.