Nested Loop and Hash Be a part of for MongoDB $lookup

SQL databases supply a number of be part of algorithms. The question planner selects essentially the most environment friendly one by evaluating cardinality and estimated value. For instance, a Nested Loop be part of is right when the outer desk has few rows and an index permits quick entry to the internal desk. In distinction, a Hash Be a part of is best suited to conditions the place the outer desk incorporates many rows and the internal desk should be totally scanned, leading to fewer pricey loops.

Whereas MongoDB offers comparable algorithms, tailored to versatile paperwork, being a NoSQL database means it shifts extra duty to the developer. Builders should design for optimum information entry, already within the information mannequin, however has the benefit of leading to extra predictable efficiency.

I am going to base my instance on a query on Reddit: Optimizing a MongoDB JOIN with $lookup and $restrict. I take advantage of a group of customers, the place every person references a profile. The profile has a standing. The question lists the customers with no profile or with a profile with a standing equal to 2.

In my demo, I arrange two profile keys: “_id,” which is robotically listed in MongoDB, and an “ID” area, which isn’t. This setup illustrates each conditions—an listed lookup desk versus a non-indexed one. Usually, you’d use only one technique, relying on which be part of algorithm you prefer.

db.profiles.drop()
db.customers.drop()

db.profiles.insertMany([
  { _id:102, ID: 102, status: 2 },
  { _id:201, ID: 201, status: 1 },
  { _id:302, ID: 302, status: 2 },
  { _id:403, ID: 403, status: 3 }
]);

db.customers.insertMany([
  { name: "Alice" , profileID: 403 },   // profile status = 3
  { name: "Bob", profileID: 102 },      // profile status = 2
  { name: "Charlie", profileID: 201 },  // profile status = 1
  { name: "Diana", profileID: 102 },    // profile status = 2
  { name: "Eve" },                      // no profile
  { name: "Franck" , profileID: 403 },  // profile status = 3
  { name: "Gaspar" , profileID: 403 },  // profile status = 3
  { name: "Hans" , profileID: 403 },    // profile status = 3
  { name: "Iona" , profileID: 403 },    // profile status = 3
  { name: "Jane" , profileID: 403 },    // profile status = 3
  { name: "Karl" , profileID: 403 },    // profile status = 3
  { name: "Lili" },                     // no profile
  { name: "Math" },                     // no profile
  { name: "Niall" },                    // no profile
  { name: "Oscar" , profileID: 403 },   // status = 3  
  { name: "Paula" , profileID: 102 },   // status = 2  
  { name: "Quentin" , profileID: 201 }, // status = 1  
  { name: "Ravi" , profileID: 102 },    // status = 2  
  { name: "Sofia" },                    // no profile  
  { name: "Takumi" , profileID: 403 },  // status = 3  
  { name: "Uma" , profileID: 403 },     // status = 3  
  { name: "Viktor" , profileID: 403 },  // status = 3  
  { name: "Wafa" , profileID: 403 },    // status = 3  
  { name: "Ximena" , profileID: 403 },  // status = 3  
  { name: "Yara" },                     // no profile  
  { name: "Zubair" },                   // no profile 
]);

Right here is my question on this small information set:

db.customers.combination([  
  {  
    $lookup: {  
      from: "profiles",  
      localField: "profileID",  
      foreignField: "ID",  
      as: "profile"  
    }  
  },  
  {  
    $match: {  
      $or: [  
        { profile: { $eq: [] } }, // no profile  
        { profile: { $elemMatch: { standing: 2 } } } // profile standing = 2  
      ]  
    }  
  },  
  {  
    $challenge: {  
      _id: 0,  
      title: 1,  
      "profile.standing": 1 // hold solely the standing area from joined profile  
    }  
  }  
]);

Notice that the primary optimization I did, is changing the be part of expression (let: { userId: "$_id" },pipeline: [ { $match: { $expr: { $eq: ["$userId", "$$userId"] } } } ]) with localField/foreignField to permit the push down of the be part of predicate (EQ_LOOKUP)

Right here is the outcome:

[
  { name: 'Bob', profile: [ { status: 2 } ] },
  { title: 'Diana', profile: [ { status: 2 } ] },
  { title: 'Eve', profile: [] },
  { title: 'Lili', profile: [] },
  { title: 'Math', profile: [] },
  { title: 'Niall', profile: [] },
  { title: 'Paula', profile: [ { status: 2 } ] },
  { title: 'Ravi', profile: [ { status: 2 } ] },
  { title: 'Sofia', profile: [] },
  { title: 'Yara', profile: [] },
  { title: 'Zubair', profile: [] },
]
Kind "it" for extra
take a look at>

To scale the variety of customers, I multiply every present person by 10,000 utilizing the next script:

const currentUsers = db.customers.discover({},{_id:0, title:1, profileID:1});
currentUsers.forEach(userDoc => {  
    print(`Inserting 10,000 paperwork for: ${JSON.stringify(userDoc)}`);  
    for (let i = 0; i < 10000; i++) { 
      const newUsers = [];
      const clone = { ...userDoc };
      clone.title=`${i}${clone.title}`
      newUsers.push(clone);
      db.customers.insertMany(newUsers);
    }
})

I’ve now 260,026 customers.

Listed nested loop be part of

I run my question with clarify("executionStats") and extract an important statistics concerning the time and the variety of paperwork examined:

x=db.customers.combination([
  { "$lookup" : {
    "from" : "profiles",
    "localField" : "profileID",
    "foreignField" : "_id",
    "as" : "profile" }},
  {
    $match: {
      $or: [
        { "profile": { $eq: [] } }, 
        { "profile": { $elemMatch: { standing: 2 } } },
      ]
    }
  },
]).clarify("executionStats")

xs=x["stages"][0]["$cursor"]["executionStats"];
xp=x["stages"][0]["$cursor"]["queryPlanner"]["winningPlan"]["queryPlan"];
print("nReturned:",           xs["nReturned"],
      "executionTimeMillis:", xs["executionTimeMillis"],
      "totalKeysExamined:",   xs["totalKeysExamined"],
      "totalDocsExamined:",   xs["totalDocsExamined"],
      xp["stage"]+" technique:",  xp["strategy"],
)

The lookup stage has returned all paperwork, because it should be joined earlier than filtering, in 2.5 seconds:

 nReturned: 260026
 executionTimeMillis: 2456
 totalKeysExamined: 190019
 totalDocsExamined: 450045
 EQ_LOOKUP technique: IndexedLoopJoin

The equality lookup used an listed loop be part of technique, with an index scan for every doc:

nReturned: 260026: All native paperwork with their joined profile arrays
executionTimeMillis: 2456: Complete execution time together with each be part of and filter phases
totalKeysExamined: 190019: Solely keys that discovered matches within the profiles assortment’s index on “_id” are accounted (lookup_query_stats). The index can decide “key not discovered” with out really inspecting a key entry.
totalDocsExamined: 450045: customers assortment scan (260,026) + profile paperwork fetched (190,019)

The variety of profiles examined is excessive in comparison with the variety of profiles within the assortment, then one other algorithm will be quicker by studying all profiles as soon as and lookup from a hash desk.

Hash Be a part of on small unindexed desk

MongoDB 8.0 chooses a hash be part of on the next circumstances:

allowDiskUse: true is ready (required for spilling if hash desk exceeds reminiscence)
Overseas assortment is sufficiently small – managed by these parameters: internalQueryCollectionMaxNoOfDocumentsToChooseHashJoin (default: 10,000 docs), internalQueryCollectionMaxDataSizeBytesToChooseHashJoin (default: 100 MB), and internalQueryCollectionMaxStorageSizeBytesToChooseHashJoin (default: 100 MB)
No suitable index exists, disk use is allowed and hash be part of is extra environment friendly

To point out that, I take advantage of the “ID” area, that’s not listed, for the lookup:

x=db.customers.combination([
  { "$lookup" : {
    "from" : "profiles",
    "localField" : "profileID",
    "foreignField" : "ID",
    "as" : "profile" }},
  {
    $match: {
      $or: [
        { "profile": { $eq: [] } }, 
        { "profile": { $elemMatch: { standing: 2 } } },
      ]
    }
  },
],{ allowDiskUse: true } ).clarify("executionStats")

xs=x["stages"][0]["$cursor"]["executionStats"];
xp=x["stages"][0]["$cursor"]["queryPlanner"]["winningPlan"]["queryPlan"];
print("nReturned:",           xs["nReturned"],
      "executionTimeMillis:", xs["executionTimeMillis"],
      "totalKeysExamined:",   xs["totalKeysExamined"],
      "totalDocsExamined:",   xs["totalDocsExamined"],
      xp["stage"]+" technique:",  xp["strategy"],
)

The HashJoin accomplished 3.3x quicker (750ms vs 2,456ms) with considerably totally different execution patterns:

 nReturned: 260026
 executionTimeMillis: 750
 totalKeysExamined: 0
 totalDocsExamined: 260030
 EQ_LOOKUP technique: HashJoin

HashJoin Works in two phases, with no index required (totalKeysExamined: 0):

Construct Part – Scans the overseas assortment as soon as to construct an in-memory hash desk keyed by foreignField values. It has learn the 4 profiles.
Probe Part – Scans the native assortment as soon as, probing the hash desk for every native key. It has learn 260,026 customers.

The whole is 260,030 paperwork examined.

Nested loop be part of with out index

A 3rd possibility is a nested loop be part of that scans the gathering in every loop, fairly than utilizing an index or constructing a hash desk. I disable disk utilization to keep away from hash be part of and use the non-indexed area. to keep away from listed nested loop:

x=db.customers.combination([
  { "$lookup" : {
    "from" : "profiles",
    "localField" : "profileID",
    "foreignField" : "ID",
    "as" : "profile" }},
  {
    $match: {
      $or: [
        { "profile": { $eq: [] } }, 
        { "profile": { $elemMatch: { standing: 2 } } },
      ]
    }
  },
],{ allowDiskUse: false } ).clarify("executionStats")

xs=x["stages"][0]["$cursor"]["executionStats"];
xp=x["stages"][0]["$cursor"]["queryPlanner"]["winningPlan"]["queryPlan"];
print("nReturned:",           xs["nReturned"],
      "executionTimeMillis:", xs["executionTimeMillis"],
      "totalKeysExamined:",   xs["totalKeysExamined"],
      "totalDocsExamined:",   xs["totalDocsExamined"],
      xp["stage"]+" technique:",  xp["strategy"],
)

Listed here are the execution statistics:

 nReturned: 260026
 executionTimeMillis: 1618
 totalKeysExamined: 0
 totalDocsExamined: 1300130
 EQ_LOOKUP technique: NestedLoopJoin

Like different algorithms, 260,026 customers have been learn. Since there was no index on the overseas area, threre’s no index scan in any respect (totalKeysExamined: 0). This prompted the system to scan the 4 profiles for every person, leading to a complete of 260,026 + 4 × 260,026 = 1300130 paperwork examined.

On this instance, this method is 5 instances extra pricey than IndexedLoopJoin when it comes to paperwork examined, and 3 times greater than HashJoin as a result of NestedLoopJoin requires repeatedly scanning the overseas assortment. Curiously, as a result of the lookup assortment may be very small and sequential scans are cache-friendly, the execution time surpasses that of the listed nested loop on this case, as index seeks incur further prices.

Listed nested loop with overlaying index

When you’ve got your key in one other area than “_id”, like right here with “ID”, you may create an index on it and the plan will use an listed nested loop be part of. This has the identical efficiency because the question utilizing “_id” as a result of, apart from clustered collections, all indexes are secondary in MongoDB and entry to the doc through an inner document identifier. Utilizing “_id” avoids creating another index. Nevertheless, one purpose to create one other index is to get it cowl all essential fields:

db.profiles.createIndex({ ID: 1, standing: 1 })

Now, the earlier question utilizing “ID” because the overseas area will use the index and get the identical efficiency because the listed nested loop be part of above when utilizing “_id”. Nevertheless, it isn’t a overlaying index as a result of there are not any projections outlined.

x=db.customers.combination([
  { "$lookup" : {
    "from" : "profiles",
    "localField" : "profileID",
    "foreignField" : "ID",
    "as" : "profile" }},
  {
    $match: {
      $or: [
        { "profile": { $eq: [] } }, 
        { "profile": { $elemMatch: { standing: 2 } } },
      ]
    }
  },
]).clarify("executionStats")

xs=x["stages"][0]["$cursor"]["executionStats"];
xp=x["stages"][0]["$cursor"]["queryPlanner"]["winningPlan"]["queryPlan"];
print("nReturned:",           xs["nReturned"],
      "executionTimeMillis:", xs["executionTimeMillis"],
      "totalKeysExamined:",   xs["totalKeysExamined"],
      "totalDocsExamined:",   xs["totalDocsExamined"],
      xp["stage"]+" technique:",  xp["strategy"],
)

Listed here are the execution statistics, the identical as with the index on “_id”, working in 2.5 seconds:

 nReturned: 260026
 executionTimeMillis: 2484
 totalKeysExamined: 190019
 totalDocsExamined: 450045
 EQ_LOOKUP technique: IndexedLoopJoin

I can add a projection to the lookup pipeline, however watch out, it is likely to be slower:

x=db.customers.combination([
  { "$lookup" : {
    "from" : "profiles",
    "localField" : "profileID",
    "foreignField" : "ID",
    pipeline: [ 
     { $project: { _id: 0, ID: 1, status: 1 } } 
    ],
    "as" : "profile" }},
  {
    $match: {
      $or: [
        { "profile": { $eq: [] } }, 
        { "profile": { $elemMatch: { standing: 2 } } },
      ]
    }
  },
] ).clarify("executionStats")

xs=x["stages"][1];

The index is used as a overlaying index however sadly, this does not use the Slot-Based mostly Question Execution Engine (SBE) and at last takes longer:

{
  '$lookup': {
    from: 'profiles',
    as: 'profile',
    localField: 'profileID',
    foreignField: 'ID',
    let: {},
    pipeline: [ { '$project': { _id: 0, ID: 1, status: 1 } } ]
  },
  totalDocsExamined: Lengthy('0'),
  totalKeysExamined: Lengthy('190019'),
  collectionScans: Lengthy('0'),
  indexesUsed: [ 'ID_1_status_1' ],
  nReturned: Lengthy('260026'),
  executionTimeMillisEstimate: Lengthy('21309')
}

The absence of assortment scan (collectionScans: 0) confirms that an index is used, and it is the index I’ve created (indexesUsed: [ 'ID_1_status_1' ]). The absence of doc examined (totalDocsExamined: 0) confirms that it makes use of a overlaying index. Nevertheless, it took 21 seconds to execute.

When a pipeline is added to the lookup stage, SBE can not be used, so the basic engine plans every lookup question at runtime, which explains the longer elapsed time. SBE’s optimized execution mannequin for joins requires the be part of to be fully specified at planning time, with solely localField/foreignField and no pipeline (determineSbeCompatibility()). In distinction, overlaying indexes want a pipeline for runtime projection management, making it unattainable to make use of each optimizations collectively. Moreover, the present model (findSbeCompatibleStagesForPushdown) limits lookup to totally native collections and considers whether or not an unwind follows. If you realize there might be just one ingredient, use { $arrayElemAt:["$profiles.status",0]} as a substitute of $unwind: "$profiles" for higher effectivity.

Nested Loop with materialization

There may be another chance that may be quicker than a nested loop, just like a hash be part of, however with out constructing a hash desk, simply an array for every doc and do the be part of within the projection. It may well examine to PostgreSQL’s Nested Loop with Materialize.

To learn the lookup solely as soon as, we use an empty pipeline with no be part of situation. As I need to take a look at the overlaying index, as a substitute of an empty pipeline (pipeline: [ ]) I add a filter and projection. This provides all profiles to every person. I outline the take part a projection after the lookup, the place I discover the array index for the profile, with $indexOfArray, and get the standing for this merchandise:

x=db.customers.combination([  
  {  
    $lookup: {  
      from: "profiles",  
      let: { pid: "$profileID" },  
      pipeline: [  
       { $match: { ID: { $ne: null } } },  
       { $project: { _id: 0, ID: 1, status: 1 } }  
      ],  
      as: "profiles"  
    }  
  },  
  {  
  $challenge: {  
    standing: {  
      $arrayElemAt: [  
        "$profiles.status",  
        { $indexOfArray: ["$profiles.ID", "$profileID"] }  
      ]  
    },  
    matchFound: {  
      $ne: [  
        { $indexOfArray: ["$profiles.ID", "$profileID"] },  
        -1  
      ]  
    }  
  }  
  },
  {  
    $match: {  
      $or: [  
        { matchFound: false },       // true “no profile”  
        { status: 2 }  
      ]  
    }  
  }  
]).clarify("executionStats")

The execution takes 3.5 seconds, not utilizing the SBE, and the lookup stage reads the profile solely as soon as, from the overlaying index:

take a look at> xs=x["stages"][1]

{
  '$lookup': {
    from: 'profiles',
    as: 'profiles',
    let: { pid: '$profileID' },
    pipeline: [
      { '$match': { ID: { '$ne': null } } },
      { '$project': { _id: 0, ID: 1, status: 1 } }
    ]
  },
  totalDocsExamined: Lengthy('0'),
  totalKeysExamined: Lengthy('4'),
  collectionScans: Lengthy('0'),
  indexesUsed: [ 'ID_1_status_1' ],
  nReturned: Lengthy('260026'),
  executionTimeMillisEstimate: Lengthy('3447')
}

Quite than being a local be part of algorithm, this method is just like executing two separate queries after which becoming a member of the ends in the appliance. Nevertheless, with this technique, the be part of is carried out utilizing aggregation pipeline operators. It reads much less paperwork, and makes use of the overlaying index, however does not profit from the SBE optimisations.

Abstract

Like with any database, you will need to perceive the be part of algorithms. Whenever you use a lookup within the aggregtion pipleline, MongoDB selects the be part of technique based mostly on the question, collections and present indexes. The three algorithms are:

Listed Loop Be a part of, used when suitable index exists on foreignField. It’s best when Low to medium match price or overseas assortment is giant
Hash Be a part of, used when allowDiskUse is ready and no suitable index, however solely when the overseas assortment is small (< 10,000 docs, < 100MB). It’s best with excessive match price on plenty of paperwork
Nested Loop Be a part of is a fallback when the earlier ones can’t be used. It’s acceptable just for tiny overseas collections

Not like SQL databases, the place the optimizer makes all selections however also can result in surprises, MongoDB shifts duty to builders. You should:

Design your schema with be part of efficiency in thoughts. For bounded relationships, embedding could also be the perfect answer.
Perceive your information (assortment sizes, match charges) to foretell which technique might be be the perfect.
Take a look at totally different methods by creating/dropping indexes and toggling allowDiskUse. Create an index solely if you need to use it.
Measure efficiency utilizing clarify("executionStats") to validate your selections.

This design favors predictability and management over counting on automated optimization. So if you hear statements like ‘joins are gradual’ or ‘lookups are quick’, take time to know the details, how these operations are literally executed, earlier than forming an opinion.

Nested Loop and Hash Be a part of for MongoDB $lookup

Listed nested loop be part of

Hash Be a part of on small unindexed desk

Nested loop be part of with out index

Listed nested loop with overlaying index

Nested Loop with materialization

Abstract

Related Articles

What does the longer term maintain for generative AI? | MIT Information

Finest Soundcore deal: Save $30.01 on Soundcore by Anker Q20i headphones at Amazon

How can we execute selenium automation script, with out Visible Studio?

LEAVE A REPLY Cancel reply

Latest Articles

What does the longer term maintain for generative AI? | MIT Information

Finest Soundcore deal: Save $30.01 on Soundcore by Anker Q20i headphones at Amazon

How can we execute selenium automation script, with out Visible Studio?

5 Kimi K2.5 Options for Builders [MUST TRY]

Honeywell Launches Guided Work Options to Streamline Workforce Operations