Netflix operates a global streaming service that serves hundreds of millions of users through a distributed microservices architecture. To serve these customers effectively, the engineering organization relies on infrastructure teams that build internal tools and abstractions to accelerate developer productivity while maintaining operational excellence.
The Online Data Stores (ODS) team is one such group, managing persistent data store solutions across the organization. They evaluate developer requirements, assess production workloads, and provide subject matter expertise for data store options. In this post, we examine the technical and operational challenges the ODS team encountered with their existing self-managed distributed PostgreSQL-compatible database, the evaluation criteria used to select a database solution, and why they chose to migrate to Amazon Aurora PostgreSQL-Compatible Edition to meet their current and future performance needs. The migration to Aurora PostgreSQL improved their database infrastructure, achieving up to a 75% increase in performance and 28% cost savings across critical applications: Spinnaker achieved a 50% average latency reduction and Policy Engine reduced average latency by 75% (from 26.72 milliseconds to 6.51 milliseconds).
Business challenge
The ODS team's objective is to build data infrastructure solutions that accelerate developer productivity, reduce operational overhead for both infrastructure and application teams, deliver consistent and reliable performance under varying loads, and provide scalability to support growing data volumes and user bases. However, they recognized that their fragmented relational database strategy was undermining these objectives. Managing multiple PostgreSQL-compatible engines, including a licensed self-managed distributed PostgreSQL-compatible database as their primary relational database solution, created operational inefficiencies that impacted both infrastructure teams and the developer community.
The infrastructure team was burdened with self-managed databases on Amazon Elastic Compute Cloud (Amazon EC2), spending valuable time on deployments, patching, scaling, and maintenance while facing rising licensing costs. The developer experience suffered equally from this fragmented approach. Engineers faced inconsistent database deployment processes across multiple engines, which slowed application development. Additionally, manual scaling procedures during traffic spikes led to performance degradation and production incidents. The diverse database landscape also required teams to maintain expertise across multiple systems, making it difficult to establish unified best practices and specialization. Recognizing these challenges, the ODS team initiated an evaluation of database solutions to consolidate their infrastructure and improve both operational efficiency and developer experience.
Database evaluation criteria
To assess database options, Netflix established evaluation criteria aligned with their team principles across four key dimensions. First, for developer productivity, the solution needed to use existing developers' PostgreSQL expertise to minimize learning curves, maintain PostgreSQL compatibility to enable application portability with minimal code changes, and integrate with existing business intelligence (BI) and developer tools. Second, their operational efficiency requirements focused on reducing management complexity through simplified replica management that adapts to Netflix's unpredictable traffic patterns. Netflix needed full infrastructure abstraction that removes backup, failover, and infrastructure management concerns, so engineers can focus on innovation rather than maintenance.
Third, the team's performance reliability criteria emphasized high availability to support Netflix's stringent uptime requirements with near-zero downtime during upgrades, automated storage scaling for a better operational experience, performance consistency that matches or improves upon existing infrastructure, and multi-Region reader support through cross-Region read replicas that enable low-latency local reads. Finally, scalability considerations focused on cost-efficiency through lower total cost of ownership compared to legacy database licensing, combined with the ability to support expanding workloads and accommodate future use cases as Netflix's data ecosystem continues to grow.
After evaluating multiple database options against these criteria, Netflix selected Aurora PostgreSQL as the preferred database solution for relational workloads.
Why Netflix chose Aurora
In this section, we discuss the key reasons why Netflix chose Aurora for its database infrastructure.
Meeting data infrastructure performance requirements
The evaluation revealed that most use cases were single-Region workloads, while others required multi-Region support that could be served by Aurora Global Database, which uses asynchronous storage-based replication with typically less than 1 second of cross-Region replication lag. This enables low-latency read operations for applications connecting from geographically distributed locations. Among the single-Region workloads, they identified an optimization opportunity: simplify the replication model used by the self-managed distributed PostgreSQL-compatible database and remove the coordination of Raft-style consensus, resulting in lower write latencies, reduced operational cost, and improved overall performance. The Aurora architecture delivered the needed performance improvements through the following features:
- Log-based write operations – Unlike conventional database engines, Aurora uses a log-based approach that sends only redo log records to the distributed storage layer instead of writing full data pages. These log records are written in parallel to a quorum of storage nodes across three Availability Zones (requiring four of six nodes to acknowledge), enabling higher write throughput and lower latency while maintaining durability.
- Shared storage architecture – Aurora separates the compute and storage layers through a fault-tolerant distributed system that spans three Availability Zones in an AWS Region. This architecture addressed the data infrastructure's core requirements while maintaining high availability and durability.
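The four-of-six write quorum described in the list above can be illustrated with a small model. This is a minimal sketch, not Aurora's implementation: the `StorageNode` class, the `az-1` to `az-3` names, and the assumption of two storage nodes per Availability Zone are hypothetical choices made for illustration, and acknowledgement is reduced to a simple health flag.

```python
from dataclasses import dataclass

AZS = ("az-1", "az-2", "az-3")  # hypothetical Availability Zone names
NODES_PER_AZ = 2                # assumption: six nodes spread evenly over three AZs
WRITE_QUORUM = 4                # four of six acknowledgements, as stated above


@dataclass(frozen=True)
class StorageNode:
    az: str
    healthy: bool = True


def build_nodes(failed_azs=frozenset()):
    """Create six storage nodes, marking every node in a failed AZ as unhealthy."""
    return [
        StorageNode(az=az, healthy=az not in failed_azs)
        for az in AZS
        for _ in range(NODES_PER_AZ)
    ]


def write_redo_record(nodes, record):
    """Send one redo log record to all nodes and count acknowledgements.

    The record payload is not inspected in this model; only healthy nodes
    acknowledge. Returns (durable, acks), where durable means the quorum
    of WRITE_QUORUM acknowledgements was reached.
    """
    acks = sum(1 for node in nodes if node.healthy)
    return acks >= WRITE_QUORUM, acks


if __name__ == "__main__":
    print(write_redo_record(build_nodes(), b"redo-1"))                  # all AZs healthy
    print(write_redo_record(build_nodes({"az-2"}), b"redo-2"))          # one AZ lost
    print(write_redo_record(build_nodes({"az-2", "az-3"}), b"redo-3"))  # two AZs lost
```

Under these assumptions, losing one Availability Zone still leaves four healthy nodes, so writes remain durable; losing two leaves only two nodes and the quorum cannot be met.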
Eliminating operational overhead
Aurora is a fully managed relational database engine that alleviates the manual deployment, patching, scaling, and maintenance efforts previously required with self-managed PostgreSQL-compatible databases on Amazon EC2. Aurora read replicas serve a dual purpose as both read scaling solutions and automated failover targets, sharing the same storage volume as the writer instance while consuming the log stream asynchronously. Replica lag is typically much less than 100 milliseconds after the primary instance has written an update. When the primary writer instance experiences issues, Aurora automatically fails over to one of up to 15 read replicas within the same Region, which assumes the writer role without data loss by using the shared storage architecture. This automated failover capability avoids the complex failover scenarios and partial outage recovery procedures that previously required manual intervention, providing continuous availability without operational overhead. Moreover, with Aurora PostgreSQL, developers can use their existing PostgreSQL expertise without retraining. Applications required minimal or no code changes during migration because of PostgreSQL compatibility. This compatibility preserved development velocity, letting teams stay productive throughout the transition and benefit from performance improvements without disrupting existing workflows.
Ammar Khaku, Staff Software Engineer on the Netflix Online Data Stores team, stated:
“We no longer have to build and deploy custom binaries on EC2 with internal security and metrics-related patches. Switching to off-the-shelf managed Aurora PostgreSQL lets us focus on business logic and data access patterns.”
Improving application responsiveness and developer experience
Aurora significantly improved the developer experience by minimizing the cross-AZ latency overhead that had reduced application and development tool responsiveness in the self-managed distributed PostgreSQL-compatible database. The distributed nature of the prior solution required simple read queries to be redirected from coordinator nodes to other cluster nodes across different Availability Zones, creating multiple network hops and increased latency. The shared storage architecture in Aurora serves reads locally while maintaining data consistency. In addition, the database engine allocates 75% of instance memory to shared buffers by default. This allocation is higher than the typical 25–40% in standard PostgreSQL because Aurora avoids double buffering between PostgreSQL's shared buffers and the operating system page cache, which allows more queries to be served from memory rather than disk. These architectural improvements minimized network overhead and delivered up to 75% faster response times.
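To make the memory allocation figures concrete, the following sketch compares the buffer cache that results from the 75% default against the 25–40% range cited for standard PostgreSQL. The 64 GiB instance size and the function name `buffer_cache_gib` are assumptions chosen for illustration; real instances reserve some memory for other purposes, so treat the output as a rough estimate only.

```python
AURORA_DEFAULT_FRACTION = 0.75          # share of instance memory, as stated above
POSTGRES_TYPICAL_RANGE = (0.25, 0.40)   # typical range cited for standard PostgreSQL


def buffer_cache_gib(instance_memory_gib, fraction):
    """Return the approximate shared buffer size in GiB for a memory fraction."""
    if instance_memory_gib <= 0:
        raise ValueError("instance memory must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    return instance_memory_gib * fraction


if __name__ == "__main__":
    memory_gib = 64  # hypothetical instance size
    low, high = POSTGRES_TYPICAL_RANGE
    aurora = buffer_cache_gib(memory_gib, AURORA_DEFAULT_FRACTION)
    print(f"Aurora default:      {aurora:.1f} GiB")
    print(
        "Standard PostgreSQL: "
        f"{buffer_cache_gib(memory_gib, low):.1f}-"
        f"{buffer_cache_gib(memory_gib, high):.1f} GiB"
    )
```

For the assumed 64 GiB instance, this works out to roughly 48 GiB of shared buffers under the Aurora default versus about 16 to 25.6 GiB under the standard PostgreSQL range.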
Achieving cost-efficiency
Aurora's pay-as-you-go pricing model delivered 28% cost savings compared to license-based pricing. As a fully managed database service, Aurora reduces heavy lifting through capabilities such as storage auto-scaling up to 256 TB and continuous incremental backup to Amazon Simple Storage Service (Amazon S3) with up to 35 days of retention for automated backups, removing the manual capacity management and backup procedures that previously required dedicated operational resources. Additionally, read replicas incur no additional storage costs because instances share the underlying storage volume, further lowering cost while maintaining high availability and performance.
Migration results
As of October 2025, Netflix has migrated multiple applications from the self-managed distributed PostgreSQL-compatible database to Aurora PostgreSQL. In this section, we review the post-migration performance of two applications: Spinnaker (Front50) and Policy Engine.
Spinnaker (Front50)
Front50 is the metadata microservice for Spinnaker, the continuous delivery system used across Netflix. The workload involves storing and retrieving orchestration components such as pipelines. Faster querying of pipeline states directly improves Spinnaker UI responsiveness, making deployment management faster for nearly all Netflix developers. Front50 saw the following benefits from the migration:
- Average latency – Approximately 50% reduction (from 67.57 milliseconds to 41.70 milliseconds)
- Maximum latency – Approximately 70% reduction with fewer spikes
- Stability – Much more consistent performance patterns
The following graph shows the latency improvements for the Front50 microservice.
Policy Engine
Policy Engine is a rules engine and state machine for Netflix, providing a framework for implementing, evaluating, and enforcing data governance and efficiency policies across the data store systems used at Netflix. The workload involves flagging data assets (tables, databases, clusters) for policy violations, managing violation state machines that automatically execute remediation actions, and notifying stakeholders. Lower latency allows Policy Engine jobs to run faster, reducing the time it takes to triage alerts and ensuring compliance gaps are closed more swiftly. Policy Engine saw the following latency improvements from the migration:
- Overall improvement – Reduced latency across all endpoints from the July 4, 2025, migration date
- Specific endpoints – Notably reduced latency in the following endpoints:
  - countDatasets decreased from approximately 5.40 milliseconds to 1.90 milliseconds
  - findDatasets decreased from approximately 26.72 milliseconds to 6.51 milliseconds
  - getAggregatedFilterTerms decreased from approximately 12.11 milliseconds to 3.51 milliseconds
- Stability – Consistent latencies reduced to under 0.02 seconds compared to the previous 0.04–0.08 seconds with frequent spikes
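The Policy Engine endpoint figures above can be converted into percentage reductions with a short calculation. This is an illustrative sketch that uses only the before and after latencies quoted in this post; the dictionary and function names are hypothetical.

```python
# Approximate average latencies in milliseconds (before, after), as quoted above.
POLICY_ENGINE_LATENCY_MS = {
    "countDatasets": (5.40, 1.90),
    "findDatasets": (26.72, 6.51),
    "getAggregatedFilterTerms": (12.11, 3.51),
}


def percent_reduction(before_ms, after_ms):
    """Return the relative latency reduction as a percentage of the old value."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms * 100.0


if __name__ == "__main__":
    for endpoint, (before, after) in POLICY_ENGINE_LATENCY_MS.items():
        reduction = percent_reduction(before, after)
        print(f"{endpoint}: {before:.2f} ms -> {after:.2f} ms ({reduction:.1f}% lower)")
```

With these inputs, findDatasets comes out at roughly 75.6% lower, in line with the 75% figure cited for Policy Engine, while countDatasets and getAggregatedFilterTerms come out at roughly 64.8% and 71.0% lower.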
The following graph shows the latency improvements across key Policy Engine endpoints.

Conclusion
Netflix's strategic migration from a self-managed distributed PostgreSQL-compatible database to Aurora PostgreSQL demonstrates a comprehensive approach to database consolidation that delivered measurable results across performance, cost, and operational efficiency. The migration achieved up to 75% performance improvements and 28% cost savings while eliminating the operational overhead of self-managed databases. Following the migration to Aurora PostgreSQL, Netflix has reduced performance spikes and gained predictable latency patterns for the developer-focused systems that support its internal operations and development processes. These improvements, combined with Aurora's fully managed capabilities, have established Aurora PostgreSQL as the preferred relational database solution for Netflix's developer community.
To evaluate Aurora PostgreSQL for your own database workloads, refer to Getting started with Amazon Aurora and Best practices with Amazon Aurora for implementation guidance and optimization techniques.
