Friday, January 23, 2026

How Tradeshift boosted operational efficiency and scalability with Amazon RDS


This is a guest post by Mircea Bud, Alexandra Munteanu, and Daniel Urzica from Tradeshift.

In 2023, Tradeshift migrated one of its core PostgreSQL databases from self-managed Amazon Elastic Compute Cloud (Amazon EC2) instances to Amazon Relational Database Service (Amazon RDS) for PostgreSQL. The decision followed mounting operational risks and performance limits that made the existing setup increasingly unsustainable.

The database had grown to 18 TB and supported key backend services. It ran on aging infrastructure, with high storage utilization and time-consuming recovery procedures. Performance degradation, patching delays, and architectural drift from the rest of our platform made continued investment in the EC2 setup unviable.

Tradeshift needed a managed solution that would reduce downtime risk, improve observability, and simplify ongoing operations. Amazon RDS met these requirements. In this post, we explain why we migrated to Amazon RDS, how we executed the migration, and what benefits it delivered in terms of safety, flexibility, and audit compliance.

The challenges of a self-managed PostgreSQL database on EC2

The self-managed EC2-based PostgreSQL setup had become increasingly difficult to maintain due to operational overhead. Recovery workflows were slow and largely manual. The Recovery Time Objective (RTO) had risen to nearly 48 hours, and the Recovery Point Objective (RPO) hovered around one hour.

The cluster was provisioned on i3.metal instances with fixed NVMe storage. Capacity utilization consistently exceeded 90 percent, and increasing storage required downtime and reconfiguration. Because EC2 i3.metal uses local, fixed-size storage by design (to deliver the highest IOPS performance), the volume could not simply be extended. Alternative solutions were needed, such as attaching EBS volumes and changing the database schema to use adjacent tablespaces, where the least critical tables could eventually be relocated.

Patch compliance was also a concern. Operating system and database patches weren't applied frequently enough to meet audit standards. This was due to the complex and time-consuming manual process required for upgrades:

  • For an OS upgrade, the legacy solution required spawning a new read replica, allowing it to sync (which took hours), promoting it to primary, and then replacing the old replica with another one. This workflow required a minimum of 20 minutes of downtime if all steps went well.
  • For a PostgreSQL version upgrade, downtime was even longer because the application service wouldn't start without a read replica present.

This cluster was the last in our fleet still managed with Puppet (customized manifests derived from the public puppetlabs-postgresql repo), which increased risk and reduced our ability to standardize operational practices.

Why we chose Amazon RDS for PostgreSQL

After reviewing options, we selected Amazon RDS for PostgreSQL. It provided a fully managed PostgreSQL environment with high compatibility and mature tooling.

Improvements in availability and recovery

Amazon RDS delivered substantial improvements in availability and recovery capabilities. A Point-in-Time Recovery (PITR) after a data corruption event may require a database restore that involves snapshot restoration (RTO in minutes) and WAL replay from the archive (RTO in hours). For hardware failure, RTO is in the range of minutes. RDS also delivered much shorter RPO intervals, improving our data protection capabilities. Automated failover, snapshotting, and backups (RDS backup) made recovery scenarios more predictable and less reliant on manual steps.
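
To make the PITR step concrete, here is a minimal sketch of how such a restore could be requested through the AWS CLI. It only builds the argument list for `aws rds restore-db-instance-to-point-in-time`; it does not run it. The instance identifiers are hypothetical, and the sketch assumes a single DB instance rather than a Multi-AZ DB cluster. The post does not describe the exact commands the team used.

```python
from datetime import datetime, timezone


def pitr_command(source_id, target_id, restore_time=None):
    """Build the AWS CLI argv for a point-in-time restore into a new instance.

    source_id / target_id are hypothetical DB instance identifiers.
    restore_time must be timezone-aware; None means "latest restorable time".
    """
    argv = [
        "aws", "rds", "restore-db-instance-to-point-in-time",
        "--source-db-instance-identifier", source_id,
        "--target-db-instance-identifier", target_id,
    ]
    if restore_time is None:
        argv.append("--use-latest-restorable-time")
        return argv
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware")
    # The CLI expects a UTC timestamp such as 2023-05-01T12:30:00Z.
    utc = restore_time.astimezone(timezone.utc)
    argv += ["--restore-time", utc.strftime("%Y-%m-%dT%H:%M:%SZ")]
    return argv


# Example: restore to just before a (hypothetical) corruption event.
cmd = pitr_command(
    "core-db", "core-db-restore",
    datetime(2023, 5, 1, 12, 30, tzinfo=timezone.utc),
)
```

The restore always creates a new instance, so the snapshot restore and the WAL replay mentioned above both happen before the target becomes available.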

Simplified patching and audit readiness

Amazon RDS handles regular updates to both the PostgreSQL engine and the underlying operating system. This simplified our audit workflows and eliminated patch drift.

Better visibility into query behavior

The built-in Performance Insights dashboard helped us monitor workload patterns in real time. Our teams can identify and resolve slow queries more quickly with Amazon CloudWatch alarms and metrics. We also relied on several PostgreSQL extensions supported by RDS:

  • log_fdw and postgres_fdw for collecting OS-level logs within the database
  • pg_cron for scheduling internal database maintenance tasks
  • aws_s3 for interacting with Amazon Simple Storage Service (Amazon S3) as a long-term storage solution for audit logs and aging data

These tools improved our ability to detect and respond to performance issues without external automation or third-party agents.

Executing the migration with minimal downtime

Migrating a production system of this size required careful planning. The 18 TB dataset had a steady stream of write traffic, and downtime had to be kept to a minimum.

We selected native PostgreSQL logical replication as our migration method. It offered full compatibility with our workloads and didn't require OS-level access. With logical replication, we synchronized data incrementally without blocking application writes.

Our team designed the initial load to minimize the performance impact on the source database by spreading that activity across a two-week period. This involved dividing the work over 20 publications (using various table grouping criteria) and limiting the number of replication workers per active logical replication slot.
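
The grouping step can be sketched as follows. The post does not say which criteria were used, so this example assumes one plausible criterion: balancing publications by table size with a greedy largest-first assignment. The table names, sizes, and the `mig_pub` prefix are all hypothetical, and identifiers are assumed to be already safe to embed in SQL.

```python
import heapq


def group_tables(table_sizes, n_groups):
    """Greedy balance: put each table (largest first) into the lightest group."""
    heap = [(0, i) for i in range(n_groups)]  # (total_bytes, group_index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    ordered = sorted(table_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for table, size in ordered:
        total, idx = heapq.heappop(heap)
        groups[idx].append(table)
        heapq.heappush(heap, (total + size, idx))
    return groups


def publication_sql(groups, prefix="mig_pub"):
    """Emit one CREATE PUBLICATION statement per non-empty group."""
    statements = []
    for number, tables in enumerate(groups, start=1):
        if not tables:
            continue
        table_list = ", ".join(tables)
        statements.append(
            f"CREATE PUBLICATION {prefix}_{number:02d} FOR TABLE {table_list};"
        )
    return statements


# Hypothetical sizes in GB, split over 2 publications instead of 20.
sizes = {
    "public.invoices": 500,
    "public.events": 300,
    "public.users": 200,
    "public.audit": 100,
}
statements = publication_sql(group_tables(sizes, 2))
```

Subscribing to the publications one at a time, rather than all at once, is what spreads the initial copy over a longer period. On the subscriber side, PostgreSQL settings such as max_sync_workers_per_subscription bound how many tables are copied in parallel; whether the team tuned that particular setting is not stated.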

We created a new Amazon RDS cluster, then set up replication slots to mirror changes from the self-managed PostgreSQL cluster (running on EC2). Once the replication lag was under control and the data validated, we scheduled a short downtime window to perform the final cutover. Credentials and user configurations were migrated without changes. Kubernetes services were updated to point to the new database endpoints. IAM database authentication replaced manual credential handling for read access, which aligned with our existing standards for the rest of our Amazon RDS fleet.
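
One way to judge whether replication lag is "under control" is to compare WAL positions. The sketch below assumes the current source position comes from `pg_current_wal_lsn()` and each slot's progress from `confirmed_flush_lsn` in `pg_replication_slots`. The slot names and the 1 MiB threshold are hypothetical; the post does not state what threshold or method the team used.

```python
def lsn_to_int(lsn):
    """Convert a pg_lsn string such as '16/B374D848' to a byte position.

    The text form is two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_lsn, confirmed_lsn):
    """Bytes of WAL the subscriber has not yet confirmed."""
    return max(0, lsn_to_int(current_lsn) - lsn_to_int(confirmed_lsn))


def ready_for_cutover(slot_lags, max_lag_bytes=1024 * 1024):
    """True when every slot is within the (assumed) lag threshold."""
    return all(lag <= max_lag_bytes for lag in slot_lags.values())


# Hypothetical readings for two replication slots.
current = "1/00000000"
lags = {
    "mig_slot_01": lag_bytes(current, "0/FFFFFFF0"),
    "mig_slot_02": lag_bytes(current, "0/FFF00000"),
}
ok = ready_for_cutover(lags)
```

A lag check like this only shows that the subscriber is caught up; it does not replace validating the data itself before cutover.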

Architectural improvements: A cleaner, more scalable architecture

The migration also helped us remove legacy components from our platform. We deprecated Consul-based service discovery, which had previously handled database endpoint resolution. Instead, Kubernetes-native service names now provide clear and consistent connectivity.

IAM authentication replaced manual user and password management for operational access. This improved security and simplified onboarding for new users and services.

We also introduced a new approach for querying databases across environments. By combining RDS IAM authentication with standard command-line tools (aws-cli, jq, psql), we were able to issue cross-instance queries within and across VPCs. These scripts replaced fragmented custom tooling and now provide visibility into the fleet in the form of a unified set of output files.
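
The team's scripts are not shown in the post, so the following is only a sketch of the general pattern. It parses the JSON that `aws rds describe-db-instances` returns (the role jq plays in a shell script), requests a short-lived IAM token with `aws rds generate-db-auth-token`, and passes it to psql through the `PGPASSWORD` environment variable. The user name, database name, region, and query are assumptions.

```python
import json
import os
import subprocess


def endpoints_from_describe(describe_json):
    """Extract PostgreSQL endpoints from describe-db-instances JSON output."""
    instances = json.loads(describe_json)["DBInstances"]
    return [
        {"host": i["Endpoint"]["Address"], "port": i["Endpoint"]["Port"]}
        for i in instances
        if i.get("Engine") == "postgres" and "Endpoint" in i
    ]


def token_command(host, port, user, region):
    """argv that asks the AWS CLI for an IAM database auth token."""
    return [
        "aws", "rds", "generate-db-auth-token",
        "--hostname", host, "--port", str(port),
        "--username", user, "--region", region,
    ]


def psql_command(host, port, user, dbname, sql):
    """argv that runs one query and prints bare, unaligned rows."""
    conninfo = (
        f"host={host} port={port} user={user} dbname={dbname} sslmode=require"
    )
    return [
        "psql", "--no-psqlrc", "--tuples-only", "--no-align",
        "--command", sql, "--dbname", conninfo,
    ]


def query_fleet(endpoints, user, dbname, region, sql, run=subprocess.run):
    """Run the same query on every endpoint; returns {host: output}."""
    results = {}
    for ep in endpoints:
        token = run(
            token_command(ep["host"], ep["port"], user, region),
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        env = {**os.environ, "PGPASSWORD": token}  # token acts as the password
        output = run(
            psql_command(ep["host"], ep["port"], user, dbname, sql),
            capture_output=True, text=True, check=True, env=env,
        ).stdout.strip()
        results[ep["host"]] = output
    return results
```

The `run` parameter exists so the function can be exercised without AWS access by passing a stub. A real invocation would look like `query_fleet(endpoints, "readonly_user", "postgres", "eu-west-1", "SELECT version()")`, where every argument value is a placeholder.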

Outcomes and business impact

The migration produced measurable improvements in several areas of our platform operations.

Improved availability and resilience

Recovery from failure is now faster and more predictable. Recovery times decreased considerably, and the frequency of backup points increased, improving our resilience posture. The ability to scale compute and IOPS independently means we can adapt the database to real-time platform needs, especially during incidents or traffic spikes.

Operational efficiency

Maintenance tasks such as partition management and archiving are now handled internally using pg_cron. While task scheduling could be done in the EC2 setup using bash scripts and crontab, adopting the pg_cron extension means that scheduling, monitoring, and maintaining scheduled tasks are now reduced to pure SQL, eliminating the need for OS interaction. Performance monitoring has improved due to deeper integration between database metrics and our observability stack.
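
As an illustration of what "pure SQL" scheduling looks like, the sketch below builds a call to pg_cron's `cron.schedule(job_name, schedule, command)` function. The job name, the schedule, and the `maintenance.create_next_partition` procedure are hypothetical; the post does not list the team's actual jobs.

```python
def sql_literal(value):
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def cron_schedule_sql(job_name, schedule, command):
    """Build a SELECT that registers (or updates) a named pg_cron job."""
    return (
        "SELECT cron.schedule("
        f"{sql_literal(job_name)}, {sql_literal(schedule)}, {sql_literal(command)}"
        ");"
    )


# Hypothetical job: create next month's partition at 03:00 on the 25th.
statement = cron_schedule_sql(
    "create-next-partition",
    "0 3 25 * *",
    "CALL maintenance.create_next_partition('events')",
)
```

The resulting statement can be run from any SQL client connected to the database where pg_cron is installed. Run history is available in the `cron.job_run_details` table, which is how monitoring can also stay inside SQL.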

Better developer experience

IAM-based access control has replaced manual credentials, making developer onboarding and user provisioning simpler and safer. Query tuning and issue diagnosis are faster thanks to the visibility provided by RDS Performance Insights and PostgreSQL telemetry extensions.

Platform-wide benefits

The RDS fleet now supports SSO-based cross-environment query execution. This allows our team to inspect and inventory database instances at scale. Previously, this level of access required separate tools and often multiple manual steps.

The migration has aligned one of our most critical backend services with the rest of our platform architecture and removed operational debt so we can scale more confidently as the business grows.

Conclusion

By migrating to Amazon RDS for PostgreSQL, we replaced legacy infrastructure with a managed service that delivers higher availability, better observability, and easier scaling. The combination of logical replication and Amazon RDS tools gave us a low-risk path to modernization, even with a large, active dataset.

The project eliminated long-standing technical debt, improved compliance, and reduced the operational overhead of maintaining a custom EC2-based PostgreSQL solution. We now have a standardized foundation for PostgreSQL that can evolve with the needs of the business, without compromising reliability or performance.

Visit Amazon RDS to learn more.

About the authors

Mircea Bud

Mircea is a Staff Software Engineer at Tradeshift, where he has managed the platform fleet of RDBMS instances since 2017, leading the DBRE (aka DBA) team. Besides continuous maintenance and monitoring activities, the DBRE team also works closely with development teams, supporting the implementation of features that continuously improve workload efficiency and scalability across various environments.

Alexandra Munteanu

Alexandra, a former Tradeshift Database Reliability Engineer (2021-2024), played a key role in the RDS migration project by designing innovative scripting automation solutions that made it possible to prepare, test, validate, and see through the actual migration of the databases to modernized infrastructure.

Daniel Urzica

Daniel, currently Director of Engineering at Tradeshift, played a key role during the project, offering support to the DBRE team and making sure all potential issues were dealt with ahead of time. He also made sure there was internal buy-in for the migration, emphasizing the advantages of moving away from databases running inside EC2 instances.
