Wednesday, December 24, 2025

Unnesting Scalar Subqueries into Left Outer Joins in SQL in Aurora and PostgreSQL


Relational databases treat your query as a declarative description of the desired result and choose the most efficient execution plan. They may rewrite the query, for example by transforming subqueries into joins and vice versa, so that the database, not the developer, manages the complex optimization.

Traditionally, PostgreSQL has had fewer planner transformations than many commercial databases. As an open-source project, it favors simpler code that encourages contributions and good SQL design, whereas commercial vendors can justify more complex planning logic when it helps revenue-generating customers reduce their optimization effort. PostgreSQL doesn't maintain a global shared plan cache, so most queries are planned per execution. This encourages keeping planning overhead low. The only exceptions are prepared statements, functions, and other situations where a plan can be generic, retained, and reused. Transformations open more access paths, potentially improving execution plans, but at the cost of higher planning time.

AWS recently added this kind of transformation to Amazon Aurora PostgreSQL (APG) to optimize queries with a correlated subquery in the projection:

aws.amazon.com

Without this transformation, projection expressions are evaluated per row, which at best yields nested-loop-like performance. By pushing the subquery down so it runs before the projection, more efficient join algorithms can be used. Amazon Aurora is a proprietary fork of PostgreSQL, and its enhancements aren't contributed upstream, so this feature is not available in PostgreSQL. However, analyzing the transformation Aurora implements can inspire similar manual rewrites of queries. I'll use the example from the AWS blog post to show how to rewrite the query more efficiently. It is important to understand the transformation carefully to ensure it doesn't change the results.

Here is the DDL and DML to create the same example as AWS's article:

-- Clean up from previous runs

DROP TABLE IF EXISTS outer_table;
DROP TABLE IF EXISTS inner_table;

-- Outer table (like customers)

CREATE TABLE outer_table (
    id  SERIAL PRIMARY KEY,
    a   INT,
    b   INT
);

-- Inner table (like orders)

CREATE TABLE inner_table (
    id  SERIAL PRIMARY KEY,
    a   INT,
    b   INT
);

-- Insert data into inner_table:
--   10,000 rows, 'a' cycles from 1..100, 'b' random 0–999

INSERT INTO inner_table (a, b)
SELECT
    1 + mod(gs - 1, 100),
    floor(random() * 1000)::int
FROM generate_series(1, 10000) AS gs;

-- Insert into outer_table:
--   First 25K rows: a = 1..100 repeated

INSERT INTO outer_table (a, b)
SELECT
    1 + mod(gs - 1, 100),
    floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
--   Next 25K rows: unique a = 101..25100
INSERT INTO outer_table (a, b)
SELECT
    gs + 100,
    floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
-- Gather statistics
ANALYZE outer_table;
ANALYZE inner_table;

I set up an Aurora Serverless database to test this, enabling the transformation parameter and running the query to display the execution plan:

SET apg_enable_correlated_scalar_transform = on;

explain (analyze, verbose, costs off)
SELECT outer_table.a, outer_table.b
       , (SELECT AVG(inner_table.b)
          FROM inner_table
          WHERE inner_table.a = outer_table.a
) FROM outer_table
;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=4.904..15.740 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, (avg(inner_table.b))
   Inner Unique: true
   Hash Cond: (outer_table.a = inner_table.a)
   ->  Seq Scan on public.outer_table (actual time=0.016..2.968 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=2.985..2.986 rows=100 loops=1)
         Output: (avg(inner_table.b)), inner_table.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  HashAggregate (actual time=2.930..2.960 rows=100 loops=1)
               Output: avg(inner_table.b), inner_table.a
               Group Key: inner_table.a
               Batches: 1  Memory Usage: 32kB
               ->  Seq Scan on public.inner_table (actual time=0.016..0.637 rows=10000 loops=1)
                     Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: -2382945993278526738
 Planning Time: 2.439 ms
 Execution Time: 23.322 ms
(18 rows)

I've added the verbose option to explain because it is important to see all components of the rewritten query. Here, with the transformation, (SELECT AVG(inner_table.b) FROM ... WHERE ... ) has been transformed into (avg(inner_table.b)) over a Hash Left Join to the deduplicated (HashAggregate) inner table.

It is the equivalent of:

explain (analyze, verbose, costs off)
SELECT outer_table.a,  outer_table.b
       , agg.avg_b
FROM outer_table
LEFT JOIN (
    SELECT a, AVG(b) AS avg_b
    FROM inner_table
    GROUP BY a
) AS agg
ON outer_table.a = agg.a
;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=4.469..16.534 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, agg.avg_b
   Inner Unique: true
   Hash Cond: (outer_table.a = agg.a)
   ->  Seq Scan on public.outer_table (actual time=0.011..3.124 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=3.804..3.806 rows=100 loops=1)
         Output: agg.avg_b, agg.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Subquery Scan on agg (actual time=3.733..3.778 rows=100 loops=1)
               Output: agg.avg_b, agg.a
               ->  HashAggregate (actual time=3.732..3.765 rows=100 loops=1)
                     Output: inner_table.a, avg(inner_table.b)
                     Group Key: inner_table.a
                     Batches: 1  Memory Usage: 32kB
                     ->  Seq Scan on public.inner_table (actual time=0.004..0.668 rows=10000 loops=1)
                           Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: -3523129028670016640
 Planning Time: 1.361 ms
 Execution Time: 19.674 ms
(20 rows)

This looks straightforward, but SQL can be weird, and not all aggregate functions have the same semantics in the absence of values. If there are no rows from the inner table, AVG() returns NULL, just as the correlated subquery would in the absence of rows. However, with COUNT() instead of AVG() it is different: the join rewrite still finds no matching row, yielding NULL, but the correlated COUNT() subquery returns 0.
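This difference follows standard SQL semantics, so it can be verified outside PostgreSQL. A minimal sketch using Python's sqlite3 module (chosen here only for portability; the empty `inner_table` is illustrative) shows that over zero rows AVG() yields NULL while COUNT() yields 0:

```python
import sqlite3

# In-memory database with one empty table, to probe aggregate semantics.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inner_table (a INT, b INT)")  # no rows inserted

# Scalar aggregates over zero rows: AVG() is NULL, COUNT() is 0.
avg_b, cnt_b = con.execute(
    "SELECT AVG(b), COUNT(b) FROM inner_table"
).fetchone()

print(avg_b)  # None (SQL NULL: what the correlated AVG subquery returns)
print(cnt_b)  # 0    (COUNT never returns NULL, even with no rows)
```

Because a scalar subquery with an aggregate always produces exactly one row, the COUNT() version returns 0 for unmatched outer rows, which the plain left-join rewrite cannot reproduce on its own.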

I test the automatic transformation on Aurora with a COUNT():

explain (analyze, verbose, costs off)
SELECT outer_table.a, outer_table.b
       , (SELECT COUNT(inner_table.b)
          FROM inner_table
          WHERE inner_table.a = outer_table.a
) FROM outer_table
;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=2.319..13.332 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, COALESCE((count(inner_table.b)), 0)
   Inner Unique: true
   Hash Cond: (outer_table.a = inner_table.a)
   ->  Seq Scan on public.outer_table (actual time=0.012..3.003 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=2.302..2.304 rows=100 loops=1)
         Output: (count(inner_table.b)), inner_table.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  HashAggregate (actual time=2.255..2.268 rows=100 loops=1)
               Output: count(inner_table.b), inner_table.a
               Group Key: inner_table.a
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.inner_table (actual time=0.003..0.640 rows=10000 loops=1)
                     Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: 6903753335662751945
 Planning Time: 1.267 ms
 Execution Time: 15.219 ms
(18 rows)

Now the VERBOSE option shows COALESCE((count(inner_table.b)), 0), which transforms a NULL into a zero in order to match the COUNT semantics.

Here is the equivalent query if you want to do the transformation manually:

explain (analyze, verbose)
SELECT outer_table.a,  outer_table.b
       , COALESCE(agg.cnt_b, 0) AS cnt_b
FROM outer_table
LEFT JOIN (
    SELECT a, COUNT(b) AS cnt_b
    FROM inner_table
    GROUP BY a
) AS agg
ON outer_table.a = agg.a;

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=208.25..1110.53 rows=50000 width=16) (actual time=2.556..77.785 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, COALESCE(agg.cnt_b, '0'::bigint)
   Inner Unique: true
   Hash Cond: (outer_table.a = agg.a)
   ->  Seq Scan on public.outer_table  (cost=0.00..771.00 rows=50000 width=8) (actual time=0.012..3.183 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash  (cost=207.00..207.00 rows=100 width=12) (actual time=2.532..2.534 rows=100 loops=1)
         Output: agg.cnt_b, agg.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Subquery Scan on agg  (cost=205.00..207.00 rows=100 width=12) (actual time=2.485..2.509 rows=100 loops=1)
               Output: agg.cnt_b, agg.a
               ->  HashAggregate  (cost=205.00..206.00 rows=100 width=12) (actual time=2.484..2.497 rows=100 loops=1)
                     Output: inner_table.a, count(inner_table.b)
                     Group Key: inner_table.a
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on public.inner_table  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.006..0.679 rows=10000 loops=1)
                           Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: 4982770911819576582
 Planning Time: 0.151 ms
 Execution Time: 80.622 ms
(20 rows)

You need COALESCE with COUNT in the manual join rewrite because of how SQL aggregates behave: unlike other aggregates, COUNT returns 0, not NULL, when there are no rows.
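The pitfall is easy to reproduce with any SQL engine. A minimal sketch in Python's sqlite3 (assumed here only for portability; the tiny tables and the `agg` alias are illustrative) compares the naive join rewrite against the COALESCE-corrected one for an outer row with no match:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE outer_table (a INT)")
con.execute("CREATE TABLE inner_table (a INT, b INT)")
con.execute("INSERT INTO outer_table VALUES (1), (2)")        # a=2 has no match
con.execute("INSERT INTO inner_table VALUES (1, 10), (1, 20)")

# Original correlated subquery: COUNT() returns 0 for the unmatched row.
correlated = con.execute("""
    SELECT o.a, (SELECT COUNT(i.b) FROM inner_table i WHERE i.a = o.a)
    FROM outer_table o ORDER BY o.a
""").fetchall()

# Naive join rewrite: the unmatched row gets NULL, not 0.
naive = con.execute("""
    SELECT o.a, agg.cnt_b
    FROM outer_table o
    LEFT JOIN (SELECT a, COUNT(b) AS cnt_b FROM inner_table GROUP BY a) agg
      ON o.a = agg.a ORDER BY o.a
""").fetchall()

# COALESCE restores the original semantics.
fixed = con.execute("""
    SELECT o.a, COALESCE(agg.cnt_b, 0)
    FROM outer_table o
    LEFT JOIN (SELECT a, COUNT(b) AS cnt_b FROM inner_table GROUP BY a) agg
      ON o.a = agg.a ORDER BY o.a
""").fetchall()

print(correlated)  # [(1, 2), (2, 0)]
print(naive)       # [(1, 2), (2, None)]
print(fixed)       # [(1, 2), (2, 0)]
```

Only the COALESCE version matches the correlated subquery row for row, which is exactly the correction Aurora's transformation applies automatically.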

Aurora PostgreSQL's apg_enable_correlated_scalar_transform shows how the planner can rewrite correlated subqueries in the SELECT list into join-based aggregates, replacing per-row subqueries with set-based aggregation and hash joins for much better performance.

Even if you don't use Aurora in production, you can spin up a small Aurora Serverless instance to validate your manual rewrites. Enable the parameter and compare the execution plan to your PostgreSQL version. The plans should match, with one structural difference: the manual version's grouped subquery is wrapped in a Subquery Scan, which is expected in hand-written SQL.

For aggregates like AVG(), the rewrite preserves semantics with no extra changes. For COUNT(), wrap the join output in COALESCE(..., 0) so it behaves like the original correlated subquery when no rows are found.

By understanding these transformations and their semantics, you can reproduce Aurora's optimization in upstream PostgreSQL and gain similar performance benefits while keeping full control over correctness.
