This is a guest post by Mircea Bud, Alexandra Munteanu, and Daniel Urzica from Tradeshift.
In 2023, Tradeshift migrated one of its core PostgreSQL databases from self-managed Amazon Elastic Compute Cloud (Amazon EC2) instances to Amazon Relational Database Service (Amazon RDS) for PostgreSQL. The decision followed mounting operational risks and performance limits that made the existing setup increasingly unsustainable.
The database had grown to 18 TB and supported key backend services. It ran on aging infrastructure, with high storage utilization and time-consuming recovery procedures. Performance degradation, patching delays, and architectural drift from the rest of our platform made continued investment in the EC2 setup unviable.
Tradeshift needed a managed solution that would reduce downtime risk, improve observability, and simplify ongoing operations. Amazon RDS met these requirements. In this post, we explain why we migrated to Amazon RDS, how we executed the migration, and highlight the benefits it delivered in terms of safety, flexibility, and audit compliance.
The challenges of a self-managed PostgreSQL database on EC2
The self-managed EC2-based PostgreSQL setup had become increasingly difficult to maintain because of operational overhead. Recovery workflows were slow and largely manual. Recovery Time Objective (RTO) had risen to nearly 48 hours. Recovery Point Objective (RPO) hovered around one hour.
The cluster was provisioned on i3.metal instances with fixed NVMe storage. Capacity utilization consistently exceeded 90 percent, and increasing storage required downtime and reconfiguration. Because the EC2 i3.metal instance type uses local, fixed-size storage by design (to deliver the highest IOPS performance), capacity could not simply be extended in place. Alternative solutions were needed, such as attaching Amazon EBS volumes and changing the database schema to use adjacent tablespaces, where the least critical tables could eventually be relocated.
Patch compliance was also a concern. Operating system and database patches weren’t applied frequently enough to meet audit standards. This was because of the complex and time-consuming manual process required for upgrades:
- In the case of an OS upgrade, the legacy solution required spawning a new read replica, allowing it to sync (which took hours), promoting it to primary, and then replacing the old replica with another one. This workflow required a minimum of 20 minutes of downtime if all steps went well.
- For a PostgreSQL version upgrade, downtime was even longer because the application service wouldn’t start without a read replica present.
This cluster was the last in our fleet still managed with Puppet (customized manifests based on the public puppetlabs-postgresql repo), which increased risk and reduced our ability to standardize operational practices.
Why we chose Amazon RDS for PostgreSQL
After reviewing options, we selected Amazon RDS for PostgreSQL. It provided a fully managed PostgreSQL environment with high compatibility and mature tooling.
Improvements in availability and recovery
Amazon RDS delivered substantial improvements in availability and recovery capabilities. A point-in-time recovery (PITR) after a data corruption event may require a database restore involving snapshot restoration (RTO in minutes) and WAL replay from the archive (RTO in hours). For hardware failure, RTO is in the minutes range. RDS also delivered much shorter RPO intervals, improving our data protection capabilities. Automated failover, snapshotting, and backups (RDS backup) made recovery scenarios more predictable and less reliant on manual steps.
Simplified patching and audit readiness
Amazon RDS handles regular updates to both the PostgreSQL engine and the underlying operating system. This simplified our audit workflows and eliminated patch drift.
Better visibility into query behavior
The built-in Performance Insights dashboard helped us monitor workload patterns in real time. Our teams can identify and resolve slow queries more quickly with Amazon CloudWatch alarms and metrics. We also relied on several PostgreSQL extensions supported by RDS:
- log_fdw and postgres_fdw for collecting OS-level logs within the database
- pg_cron for scheduling internal database maintenance tasks
- aws_s3 for interacting with Amazon Simple Storage Service (Amazon S3) as a long-term storage solution for audit logs and aging data
These tools improved our ability to detect and respond to performance issues without external automation or third-party agents.
Executing the migration with minimal downtime
Migrating a production system of this size required careful planning. The 18 TB dataset had a steady stream of write traffic, and downtime had to be kept to a minimum.
We selected native PostgreSQL logical replication as our migration strategy. It provided full compatibility with our workloads and didn’t require OS-level access. With logical replication we synchronized data incrementally without blocking application writes.
Our team designed the initial load to minimize the performance impact on the source database by spreading that activity across a two-week period. This involved dividing the work over 20 publications (using various table grouping criteria) and limiting the number of replication workers for each active logical replication slot.
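As an illustration of splitting tables across publications, here is a minimal Python sketch. It assumes one possible grouping criterion, balancing groups by table size, which may differ from the criteria we actually used. The function names, the `mig_pub` prefix, and the sample table names are hypothetical, and the table names are assumed to be already schema-qualified and safe to embed in SQL.

```python
from typing import Dict, List


def plan_publications(table_sizes: Dict[str, int], n_publications: int) -> List[List[str]]:
    """Greedy balancing: place the largest remaining table into the group
    with the smallest running total. Size-based grouping is an assumption."""
    groups: List[List[str]] = [[] for _ in range(n_publications)]
    totals = [0] * n_publications
    # Sort by size descending, then by name so the plan is deterministic.
    ordered = sorted(table_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for table, size in ordered:
        idx = totals.index(min(totals))
        groups[idx].append(table)
        totals[idx] += size
    return groups


def publication_sql(groups: List[List[str]], prefix: str = "mig_pub") -> List[str]:
    """Render one CREATE PUBLICATION statement per non-empty group."""
    statements = []
    for number, tables in enumerate(groups, start=1):
        if not tables:
            continue
        table_list = ", ".join(tables)
        statements.append(
            f"CREATE PUBLICATION {prefix}_{number:02d} FOR TABLE {table_list};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical table sizes in GB.
    sizes = {
        "public.invoices": 900,
        "public.events": 500,
        "public.users": 400,
        "public.audit": 100,
    }
    for statement in publication_sql(plan_publications(sizes, 2)):
        print(statement)
```

Each publication can then be subscribed to at a different time, which spreads the initial copy. On the subscriber side, PostgreSQL's `max_sync_workers_per_subscription` setting is one way to cap how many tables a subscription copies in parallel.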
We created a new Amazon RDS cluster, then set up replication slots to mirror changes from the self-managed PostgreSQL cluster (running on EC2). Once the replication lag was under control and the data validated, we scheduled a short downtime window to perform the final cutover. Credentials and user configurations were migrated without changes. Kubernetes services were updated to point to the new database endpoints. IAM (database authentication) replaced manual credential handling for read access, which aligned with our existing standards for the rest of our Amazon RDS fleet.
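One way to judge whether replication lag is "under control" is to compare WAL positions. The sketch below is illustrative rather than the check we ran. It assumes the inputs come from `pg_current_wal_lsn()` on the source and `confirmed_flush_lsn` in `pg_replication_slots`. The slot names and the byte threshold are hypothetical.

```python
from typing import Dict


def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.
    The text form is two hexadecimal halves: high 32 bits / low 32 bits."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(source_lsn: str, confirmed_lsn: str) -> int:
    """Bytes of WAL the subscriber has not yet confirmed for a slot."""
    return lsn_to_int(source_lsn) - lsn_to_int(confirmed_lsn)


def ready_for_cutover(lags: Dict[str, int], threshold_bytes: int) -> bool:
    """True only when every slot is within the (hypothetical) threshold."""
    return all(0 <= lag <= threshold_bytes for lag in lags.values())


if __name__ == "__main__":
    current = "16/B374D848"  # sample value of pg_current_wal_lsn()
    slots = {"mig_slot_01": "16/B3740000", "mig_slot_02": "16/B374D848"}
    lags = {name: lag_bytes(current, lsn) for name, lsn in slots.items()}
    print(lags, ready_for_cutover(lags, threshold_bytes=1_000_000))
```

Inside the database, `pg_wal_lsn_diff()` performs the same subtraction. Low lag is a necessary condition only, so data validation remains a separate step before cutover.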
Architectural improvements: A cleaner, more scalable architecture
The migration also helped us remove legacy components from our platform. We deprecated Consul-based service discovery, which had previously handled database endpoint resolution. Instead, Kubernetes-native service names now provide clean and consistent connectivity.
IAM authentication replaced manual user and password management for operational access. This improved security and simplified onboarding for new users and services.
We also introduced a new approach for querying databases across environments. By combining RDS IAM authentication with standard command-line tools (aws-cli, jq, psql) we were able to issue cross-instance queries within and across VPCs. These scripts replaced fragmented custom tooling and now provide visibility into the fleet in the form of a unified output record set.
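The general shape of such a fleet query can be sketched in Python. This is not our actual scripts, which are built on aws-cli, jq, and psql directly. The sketch parses the JSON returned by `aws rds describe-db-instances`, keeps the PostgreSQL instances that have IAM authentication enabled, and builds the `aws rds generate-db-auth-token` and `psql` command lines for each. The function names, the `fleet_reader` user, and the sample endpoint are hypothetical.

```python
import json
from typing import Dict, List


def postgres_targets(describe_output: str) -> List[Dict[str, object]]:
    """Select IAM-enabled PostgreSQL instances from describe-db-instances JSON."""
    targets = []
    for instance in json.loads(describe_output).get("DBInstances", []):
        if instance.get("Engine") != "postgres":
            continue
        if not instance.get("IAMDatabaseAuthenticationEnabled", False):
            continue
        endpoint = instance.get("Endpoint") or {}
        if "Address" not in endpoint:
            continue  # instance is still being created and has no endpoint yet
        targets.append({
            "id": instance["DBInstanceIdentifier"],
            "host": endpoint["Address"],
            "port": endpoint["Port"],
        })
    return targets


def token_command(target: Dict[str, object], db_user: str, region: str) -> List[str]:
    """Command that prints a short-lived IAM auth token for one instance."""
    return [
        "aws", "rds", "generate-db-auth-token",
        "--hostname", str(target["host"]),
        "--port", str(target["port"]),
        "--username", db_user,
        "--region", region,
    ]


def psql_command(target: Dict[str, object], db_user: str, database: str, query: str) -> List[str]:
    """psql invocation; the token is expected in the PGPASSWORD environment variable."""
    conninfo = (
        f"host={target['host']} port={target['port']} "
        f"dbname={database} user={db_user} sslmode=require"
    )
    return ["psql", conninfo, "--no-align", "--tuples-only", "--command", query]


if __name__ == "__main__":
    sample = json.dumps({"DBInstances": [{
        "DBInstanceIdentifier": "orders-db",
        "Engine": "postgres",
        "IAMDatabaseAuthenticationEnabled": True,
        "Endpoint": {"Address": "orders-db.example.internal", "Port": 5432},
    }]})
    for target in postgres_targets(sample):
        print(token_command(target, "fleet_reader", "eu-west-1"))
        print(psql_command(target, "fleet_reader", "postgres", "SELECT version();"))
```

A wrapper would run the first command with `subprocess.run`, pass its output to the second through `PGPASSWORD`, and prefix each result row with the instance identifier to produce a single record set.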
Results and business impact
The migration produced measurable improvements in several areas of our platform operations.
Improved availability and resilience
Recovery from failure is now faster and more predictable. Recovery times decreased considerably, and the frequency of backup points increased, improving our resilience posture. The ability to scale compute and IOPS independently means we can adapt the database to real-time platform needs, especially during incidents or traffic spikes.
Operational efficiency
Maintenance tasks such as partition management and archiving are now handled internally using pg_cron. While task scheduling could be done in the EC2 setup using bash and crontab, adopting the pg_cron extension means scheduling, monitoring, and maintaining scheduled tasks are now reduced to pure SQL, eliminating the need for OS interaction. Performance monitoring has improved because of deeper integration between database metrics and our observability stack.
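To make "pure SQL" scheduling concrete, the following Python sketch renders a call to pg_cron's `cron.schedule(job_name, schedule, command)` function. The helper name, the job name, and the `maintenance.create_next_partition()` routine are hypothetical and stand in for whatever maintenance SQL a team runs. The five-field check is a simplification, because pg_cron also accepts other schedule forms.

```python
def cron_schedule_sql(job_name: str, schedule: str, command: str) -> str:
    """Render a SELECT cron.schedule(...) statement for pg_cron.

    The command is wrapped in a dollar-quoted string so that it can
    contain single quotes without further escaping."""
    if len(schedule.split()) != 5:
        raise ValueError("expected a five-field cron expression")
    tag = "$job$"
    if tag in command:
        raise ValueError("command must not contain the dollar-quote tag")
    safe_name = job_name.replace("'", "''")
    safe_schedule = schedule.replace("'", "''")
    return f"SELECT cron.schedule('{safe_name}', '{safe_schedule}', {tag}{command}{tag});"


if __name__ == "__main__":
    # Hypothetical nightly partition maintenance at 03:00.
    print(cron_schedule_sql(
        "nightly-partitions",
        "0 3 * * *",
        "CALL maintenance.create_next_partition()",
    ))
```

Running the rendered statement registers the job in the database, and pg_cron records each execution in `cron.job_run_details`, so monitoring is also a SQL query.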
Better developer experience
IAM-based access control has replaced manual credentials, making developer onboarding and user provisioning simpler and more secure. Query tuning and issue diagnosis are faster because of the visibility provided by RDS Performance Insights and PostgreSQL telemetry extensions.
Platform-wide benefits
The RDS fleet now supports SSO-based cross-environment query execution. This allows our team to inspect and inventory database instances at scale. Previously, this level of access required separate tools and often multiple manual steps.
The migration has aligned one of our most critical backend services with the rest of our platform architecture and removed operational debt so we can scale more confidently as the business grows.
Conclusion
By migrating to Amazon RDS for PostgreSQL we replaced legacy infrastructure with a managed service that delivers higher availability, better observability, and easier scaling. The combination of logical replication and Amazon RDS tools gave us a low-risk path to modernization, even with a large, active dataset.
The project eliminated long-standing technical debt, improved compliance, and reduced the operational overhead of maintaining a custom PostgreSQL EC2-based solution. We now have a standardized foundation for PostgreSQL that can evolve with the needs of the business, without compromising reliability or performance.
Visit Amazon RDS to learn more.
About the authors
