In a relational database, a one-to-many relationship is often implemented with two tables. The "one" side is the parent and has a primary key that's indexed to ensure uniqueness. The "many" side is the child and references the parent's primary key with a foreign key. An index is often added to the foreign key as well, so that operations on the parent can efficiently find its child rows.
While this design seems simple and ideal from a database administrator's standpoint, where data integrity is the primary concern, it can surprise developers. This normalized data model doesn't account for access patterns or cardinalities: the same structure is used whether the "many" side contains millions of rows and keeps growing, or just a few items.
MongoDB takes a different approach: its data model is optimized for a specific application, based on known access patterns and cardinalities. In a document model, a one-to-many relationship can be implemented either as multiple documents linked by references or as an embedded array or subdocument within a single document. In both cases, you can choose whether to embed or reference from the parent (the "one" side) or from the child (the "many" side).
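As a quick sketch of those options (illustrative shapes only, with made-up ids; the real dataset is introduced below):

```javascript
// References: parent and child live in separate collections, joined by a key.
const dept = { _id: "dept-it", Name: "IT" };
const empByRef = { _id: "emp-1", Full_Name: "Alice", Department: "dept-it" };

// Embedding from the child: the "one" side is copied into each "many" document.
const empEmbedded = { _id: "emp-1", Full_Name: "Alice", Department: { Name: "IT" } };

// Referencing from the parent: the "many" side is collected in the "one" document.
const deptWithEmps = { _id: "dept-it", Name: "IT", employees: ["emp-1", "emp-2"] };
```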
An example
I use an HR dataset of employees that I load into an "employees" collection. It has two million documents:
cd /var/tmp
## Download HR_Dataset from Kaggle (https://www.kaggle.com/datasets/kadirduran/hr-dataset)
## Unzip and import
curl -L -o hr-data-mnc.zip https://www.kaggle.com/api/v1/datasets/download/rohitgrewal/hr-data-mnc &&
unzip -o hr-data-mnc.zip &&
mongoimport -d mdb -c employees --type=csv --headerline --drop 'HR_Data_MNC_Data Science Lovers.csv'
Once imported, I connect with mongosh and update the performance rating to add random decimal digits, to recognize the employees better when comparing results:
use mdb;
db.employees.updateMany( {}, [
{
$set: {
Performance_Rating: {
$add: [ { $toDouble: "$Performance_Rating" }, { $rand: {} } ]
}
}
}
]);
This collection contains employees associated with a department name. I will add more details about each department, such as a description, and then explore alternative models for this one-to-many relationship, where one department has many employees and each employee belongs to one department.
For each model, I will test the performance of the following query:
Find the top 10 most outstanding active employees in the IT department and list their names together with their performance ratings.
Embed in the "many" side
The many-to-one relationship here is employee-to-department, as each employee has a department name:
{
_id: ObjectId('693f1d61e235ef0960ae2b53'),
'Unnamed: 0': 22,
Employee_ID: 'EMP0000023',
Full_Name: 'James Valdez',
Department: 'R&D',
Job_Title: 'Research Scientist',
Hire_Date: '2017-10-30',
Location: 'East Scott, Mauritius',
Performance_Rating: 5.882713091486423,
Experience_Years: 7,
Status: 'Retired',
Work_Mode: 'On-site',
Salary_INR: 789283
}
As we generally embed more than one field, I add a description and structure it as a sub-object:
db.employees.aggregate([
{
$addFields: {
Department: {
$switch: {
branches: [
{ case: { $eq: ["$Department", "Finance"] }, then: { Name: "Finance", Description: "Manages the company's budgets, expenses, and financial planning to ensure fiscal health." } },
{ case: { $eq: ["$Department", "HR"] }, then: { Name: "HR", Description: "Handles recruitment, employee relations, and organizational development initiatives." } },
{ case: { $eq: ["$Department", "IT"] }, then: { Name: "IT", Description: "Maintains technology infrastructure, software systems, and cybersecurity protections." } },
{ case: { $eq: ["$Department", "Marketing"] }, then: { Name: "Marketing", Description: "Promotes the company's products and services through strategic campaigns and market research." } },
{ case: { $eq: ["$Department", "Operations"] }, then: { Name: "Operations", Description: "Oversees daily business activities, logistics, and process optimization for efficiency." } },
{ case: { $eq: ["$Department", "R&D"] }, then: { Name: "R&D", Description: "Researches and develops innovative products and services to support future growth." } },
{ case: { $eq: ["$Department", "Sales"] }, then: { Name: "Sales", Description: "Builds customer relationships and drives revenue through product and service sales." } }
],
default: { Name: "$Department", Description: "No description available" }
}
}
}
},
{
$merge: {
into: "employees", // same collection
whenMatched: "merge", // update existing docs
whenNotMatched: "fail"
}
}
])
The result for the same employee is:
{
_id: ObjectId('693f1d61e235ef0960ae2b53'),
'Unnamed: 0': 22,
Employee_ID: 'EMP0000023',
Full_Name: 'James Valdez',
Department: {
Name: 'R&D',
Description: 'Researches and develops innovative products and services to support future growth.'
},
Job_Title: 'Research Scientist',
Hire_Date: '2017-10-30',
Location: 'East Scott, Mauritius',
Performance_Rating: 5.882713091486423,
Experience_Years: 7,
Status: 'Retired',
Work_Mode: 'On-site',
Salary_INR: 789283
}
I want to retrieve the top 10 best-performing active employees from the IT department and display their names and performance ratings:
db.employees.find(
{ "Status": "Active", "Department.Name": "IT" },
{ _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10)
// result:
[
{ Full_Name: 'Stuart Lopez', Performance_Rating: 5.999973276392604 },
{ Full_Name: 'Mr. Ethan Morton', Performance_Rating: 5.9999561502903065 },
{ Full_Name: 'Lee White', Performance_Rating: 5.999935393136708 },
{ Full_Name: 'Amber Coleman', Performance_Rating: 5.999919949194189 },
{ Full_Name: 'Eugene Brown', Performance_Rating: 5.999917240114123 },
{ Full_Name: 'Nicole Edwards', Performance_Rating: 5.999914413630196 },
{ Full_Name: 'Erika Stewart', Performance_Rating: 5.999902351452448 },
{ Full_Name: 'Jenna King', Performance_Rating: 5.999896490219257 },
{ Full_Name: 'Douglas Hill', Performance_Rating: 5.999886177014563 },
{ Full_Name: 'Richard Gonzalez', Performance_Rating: 5.999879794558417 }
]
Since I have no index, it reads all the documents, which takes 1.3 seconds:
x=db.employees.find(
{ "Status": "Active", "Department.Name": "IT" },
{ _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats");
printjson({
nReturned: x.executionStats.nReturned,
executionTimeMillis: x.executionStats.executionTimeMillis,
totalKeysExamined: x.executionStats.totalKeysExamined,
totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
nReturned: 10,
executionTimeMillis: 1367,
totalKeysExamined: 0,
totalDocsExamined: 2000000
}
One benefit of embedding on the "many" side is that you can use all the fields to create a compound index. For instance, I can build an index that supports my filter, sort, and projection needs:
db.employees.createIndex({
"Status":1, // for the equality predicate on the employee
"Department.Name":1, // for the equality predicate on the department
"Performance_Rating": 1, // for the sort and limit (pagination)
"Full_Name": 1, // for the projection (covering index)
})
The query now directly retrieves the top 10 documents from the index:
x=db.employees.find(
{ "Status": "Active", "Department.Name": "IT" },
{ _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats")
printjson({
nReturned: x.executionStats.nReturned,
executionTimeMillis: x.executionStats.executionTimeMillis,
totalKeysExamined: x.executionStats.totalKeysExamined,
totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
nReturned: 10,
executionTimeMillis: 0,
totalKeysExamined: 10,
totalDocsExamined: 0
}
Embedding data on the "many" side helps create optimized indexes and improves response times, but it involves duplicating data from the "one" side into the "many" side. In our example, every employee records the department name and description. This leads to two main effects:
- Increased storage usage, which can be reduced through compression. You can also choose not to embed all fields, such as storing only the department name, which is often queried with the employee, and keeping the description in a separate "departments" collection.
- Any update to the department information must be reflected across all related employee records. This is generally manageable for infrequent changes, like name updates. Often, a department name change coincides with a broader reorganization, requiring employee record updates anyway.
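For example, changing a department's description then means one multi-document write touching every employee that embeds it. A minimal sketch (the new description text is invented):

```javascript
// Filter and update for propagating a department change to all
// employee documents that embed this department.
const filter = { "Department.Name": "IT" };
const update = {
  $set: { "Department.Description": "Runs technology infrastructure, software systems, and security." }
};
// In mongosh, this is a single multi-document write:
// db.employees.updateMany(filter, update);
```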
Reference and lookup from the "many" side
To minimize duplication, I create a separate "departments" collection using the unique department names and descriptions I embedded, ensuring each department's information is stored only once:
db.employees.aggregate([
{
$group: {
_id: "$Department.Name",
Name: { $first: "$Department.Name" },
Description: { $first: "$Department.Description" }
}
},
{
$project: { _id: 0, Name: 1, Description: 1 }
},
{
$merge: {
into: "departments",
whenMatched: "keepExisting",
whenNotMatched: "insert"
}
}
]);
You might be surprised by the speed of this aggregation pipeline. Instead of scanning all the documents, MongoDB efficiently retrieved the unique departments by seeking to the distinct values in the index (a loose index scan).
Then, I can replace the "Department" sub-object with a reference to the "_id" from the "departments" collection:
db.employees.aggregate([
{
$lookup: {
from: "departments",
localField: "Department.Name",
foreignField: "Name",
as: "deptInfo"
}
},
{
$addFields: {
Department: { $arrayElemAt: ["$deptInfo._id", 0] }
}
},
{ $project: { deptInfo: 0 } },
{
$merge: {
into: "employees",
whenMatched: "merge",
whenNotMatched: "fail"
}
}
]);
Here is the shape of an employee document with a single field for the department:
{
_id: ObjectId('693f1d61e235ef0960ae2b53'),
'Unnamed: 0': 22,
Employee_ID: 'EMP0000023',
Full_Name: 'James Valdez',
Department: ObjectId('693f2e38c2dd5ab4fbfd73b8'),
Job_Title: 'Research Scientist',
Hire_Date: '2017-10-30',
Location: 'East Scott, Mauritius',
Performance_Rating: 5.882713091486423,
Experience_Years: 7,
Status: 'Retired',
Work_Mode: 'On-site',
Salary_INR: 789283
}
To find the top 10 highest-performing active employees in the IT department and display their names and ratings, I will join the "departments" collection using $lookup. Since $lookup incurs a cost, it is more efficient to filter the data beforehand. Therefore, I first apply $match to filter the employees by status, then perform the $lookup with the "departments" collection from the reference, and filter further with the department name fetched from the foreign collection:
x=db.employees.aggregate([
{ $match: { Status: "Active" } },
{
$lookup: {
from: "departments",
localField: "Department",
foreignField: "_id",
as: "deptInfo"
}
},
{ $unwind: "$deptInfo" },
{ $match: { "deptInfo.Name": "IT" } },
{
$project: {
_id: 0,
Full_Name: 1,
Performance_Rating: 1
}
},
{ $sort: { Performance_Rating: -1 } },
{ $limit: 10 }
]).explain("executionStats")
print(x.stages[1])
// Execution plan for the lookup stage:
{
'$lookup': {
from: 'departments',
as: 'deptInfo',
localField: 'Department',
foreignField: '_id',
let: {},
pipeline: [
{ '$match': { Name: { '$eq': 'IT' } } }
],
unwinding: { preserveNullAndEmptyArrays: false }
},
totalDocsExamined: Long('1401558'),
totalKeysExamined: Long('1401558'),
collectionScans: Long('0'),
indexesUsed: [ '_id_' ],
nReturned: Long('421333'),
executionTimeMillisEstimate: Long('94596')
}
I printed the statistics for the $lookup stage, which read 1,401,558 documents, one per active employee. Each access was fast, thanks to the index on "_id", but executing it a million times took over a minute. It returned only 421,333 documents because the department name filter was pushed down by the query planner, from the subsequent $unwind and $match stages into the $lookup pipeline. The main issue remains reading a million identical documents.
Two queries instead of a lookup
If you have a many-to-one relationship and numerous documents on the "many" side, it is better to join from the application rather than using an aggregation pipeline lookup.
I retrieved the "_id" of the department I am interested in:
var itDeptId = db.departments.findOne({ Name: "IT" })._id;
If the department names are unlikely to change, this can be run once and kept in the application's cache.
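A minimal sketch of such a cache (the helper name and factory shape are hypothetical):

```javascript
// Memoize department _id lookups so each department name is resolved
// against the database only once per application process.
// In mongosh you could wire it up as:
//   const getDeptId = makeDeptIdCache(n => db.departments.findOne({ Name: n }));
function makeDeptIdCache(findDeptByName) {
  const cache = new Map();
  return (name) => {
    if (!cache.has(name)) {
      cache.set(name, findDeptByName(name)._id);
    }
    return cache.get(name);
  };
}
```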
Then I can access the employees filtered by the reference "_id", so I create an index for it:
db.employees.createIndex(
{ Status: 1, Department: 1, Performance_Rating: -1, Full_Name: 1 }
)
Since I am querying only a single collection, I do not need an aggregation pipeline:
x=db.employees.find(
{
Status: "Active",
Department: itDeptId
},
{
_id: 0,
Full_Name: 1,
Performance_Rating: 1
}
).sort({ Performance_Rating: -1 }).limit(10).explain("executionStats");
printjson({
nReturned: x.executionStats.nReturned,
executionTimeMillis: x.executionStats.executionTimeMillis,
totalKeysExamined: x.executionStats.totalKeysExamined,
totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
nReturned: 10,
executionTimeMillis: 0,
totalKeysExamined: 10,
totalDocsExamined: 0
}
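The whole application-side pattern can be sketched as one helper (a hypothetical function mirroring the two queries above; the db handle is passed in for clarity):

```javascript
// Two-query join: resolve the department reference first, then run a
// single-collection query that the compound index can serve entirely.
function topActiveEmployees(db, deptName, n = 10) {
  const dept = db.departments.findOne({ Name: deptName });
  if (!dept) return [];
  return db.employees
    .find(
      { Status: "Active", Department: dept._id },
      { _id: 0, Full_Name: 1, Performance_Rating: 1 }
    )
    .sort({ Performance_Rating: -1 })
    .limit(n)
    .toArray();
}
```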
Finally, when the application accesses the lookup table first, I observe the same performance with a many-to-one reference as with embedding the single item into the "many" side.
Reference from the "many" side, but lookup from the "one" side
I can perform the same operation as above in the application, using an aggregation pipeline that starts with the departments and includes the employees via a lookup. I use a lookup pipeline to add the filter for active employees:
x=db.departments.aggregate([
{
$match: { Name: "IT" }
},
{
$lookup: {
from: "employees",
let: { deptId: "$_id" },
pipeline: [
{
$match: {
$expr: {
$and: [
{ $eq: ["$Department", "$$deptId"] },
{ $eq: ["$Status", "Active"] }
]
}
}
},
{
$sort: { Performance_Rating: -1 }
},
{
$limit: 10
},
{
$project: {
_id: 0,
Full_Name: 1,
Performance_Rating: 1
}
}
],
as: "employees"
}
},
{
$project: {
_id: 0,
Name: 1,
employees: 1
}
}
]).explain("executionStats")
print(x.stages[1])
// Execution plan for the lookup stage:
{
'$lookup': {
from: 'employees',
as: 'employees',
let: { deptId: '$_id' },
pipeline: [
{
'$match': { '$expr': { '$and': [ [Object], [Object] ] } }
},
{ '$sort': { Performance_Rating: -1 } },
{ '$limit': 10 },
{ '$project': { _id: 0, Full_Name: 1, Performance_Rating: 1 } }
]
},
totalDocsExamined: Long('10'),
totalKeysExamined: Long('10'),
collectionScans: Long('0'),
indexesUsed: [ 'Status_1_Department_1_Performance_Rating_-1_Full_Name_1' ],
nReturned: Long('1'),
executionTimeMillisEstimate: Long('1')
}
The result arrives in one millisecond, reading only what is necessary, because the lookup pipeline can use the index on the equality predicates and the sort field. It is evident from the execution plan that all operations on employees were pushed down to the lookup pipeline: $match, $sort, $limit, $project. When dealing with a one-to-many relationship with references, and the "many" side involves thousands of documents, it is better to start from the "one" side.
Reference from the "one" side
To explore all possibilities, I can store the list of employee references in the department documents:
db.employees.aggregate([
{
$group: {
_id: "$Department",
employees: { $push: "$_id" }
}
},
{
$merge: {
into: "departments",
on: "_id",
whenMatched: [
{ $set: { employees: "$$new.employees" } }
],
whenNotMatched: "discard"
}
}
]);
Here is what the documents look like:
{
"_id": ObjectId("693f2e38c2dd5ab4fbfd73b8"),
"Name": "IT",
"Description": "Maintains technology infrastructure, software systems, and cybersecurity protections.",
"employees": [
ObjectId("693f1d61e235ef0960ae2b52"),
ObjectId("693f1d61e235ef0960ae2b99"),
ObjectId("693f1d61e235ef0960ae2b23"),
... (601,042 in total)
]
}
In this article, I examine all the options. Generally, an array with millions of items that keeps growing is not advisable. However, a million items still fit within 16MB:
db.departments.aggregate([
{ $project: { _id: 0, Name: 1, docSize: { $bsonSize: "$$ROOT" }, EmployeesCount: { $size: "$employees" } } }
]);
// result:
[
{ Name: 'Finance', docSize: 3886514, EmployeesCount: 199873 },
{ Name: 'HR', docSize: 3071423, EmployeesCount: 159119 },
{ Name: 'IT', docSize: 11909884, EmployeesCount: 601042 },
{ Name: 'Marketing', docSize: 4690681, EmployeesCount: 240081 },
{ Name: 'Operations', docSize: 5890954, EmployeesCount: 300095 },
{ Name: 'R&D', docSize: 1884463, EmployeesCount: 99759 },
{ Name: 'Sales', docSize: 7889665, EmployeesCount: 400031 }
]
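These sizes can be sanity-checked by hand: each BSON array element holds a 12-byte ObjectId plus per-element overhead, i.e., one type byte and the array index serialized as a null-terminated string key. A quick estimate in plain JavaScript:

```javascript
// Approximate BSON size of the elements of an array of n ObjectIds
// (ignoring the array's own 4-byte length header and trailing null byte).
function objectIdArrayBytes(n) {
  let total = 0;
  for (let i = 0; i < n; i++) {
    // 1 type byte + key bytes ("0", "1", ... plus null terminator) + 12-byte value
    total += 1 + (String(i).length + 1) + 12;
  }
  return total;
}
// For IT's 601,042 employees this yields 11,909,730 bytes, within a few
// hundred bytes of the reported docSize of 11,909,884; the remainder is
// the _id, Name, and Description fields plus document framing.
```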
I create the right indexes for this lookup, starting with the departments:
db.employees.createIndex({ Status: 1, Performance_Rating: -1 });
db.departments.createIndex({ Name: 1 });
The query starts with the departments and performs a lookup using the array of employee identifiers:
x=db.departments.aggregate([
{ $match: { Name: "IT" } },
{
$lookup: {
from: "employees",
let: { empIds: "$employees" },
pipeline: [
{
$match: {
$expr: {
$and: [
{ $in: ["$_id", "$$empIds"] },
{ $eq: ["$Status", "Active"] }
]
}
}
},
{ $sort: { Performance_Rating: -1 } },
{ $limit: 10 },
{ $project: { _id: 0, Full_Name: 1, Performance_Rating: 1 } }
],
as: "employees"
}
},
{
$project: { _id: 0, Name: 1, employees: 1 }
}
]).explain("executionStats")
printjson(x.stages[1])
// Execution plan for the lookup stage:
{
'$lookup': {
from: 'employees',
as: 'employees',
let: { empIds: '$employees' },
pipeline: [
{
'$match': { '$expr': { '$and': [
{ '$in': [ '$_id', '$$empIds' ] },
{ '$eq': [ '$Status', 'Active' ] }
] } }
},
{ '$sort': { Performance_Rating: -1 } },
{ '$limit': 10 },
{ '$project': { _id: 0, Full_Name: 1, Performance_Rating: 1 } }
]
},
totalDocsExamined: Long('32'),
totalKeysExamined: Long('32'),
collectionScans: Long('0'),
indexesUsed: [ 'Status_1_Performance_Rating_-1' ],
nReturned: Long('1'),
executionTimeMillisEstimate: Long('460')
}
Performing a lookup from a million entries can be slow, but it improves considerably when all filters and pagination are pushed down to the lookup pipeline. In this case, it took half a second to read the department with its list of employees and examine 32 employee entries to retrieve the top performers from the "IT" department. Since the index is ordered by rating, the scan stopped after finding 10 active employees.
Embed from the "one" side
Embedding employees into departments is unlikely to be effective in this case. One reason is the size constraint: if the ObjectIds alone are already near the limit, adding extra data is not practical. Another reason is efficiency: embedding employees means that to access a department, the entire document must be read, even when only one employee's information is required.
This approach makes sense only if departments are accessed without needing individual employee details, allowing a covering index to skip reading the document, or if all employees are always required. Multikey indexes cannot prevent a document from being fetched.
Understanding use cases and expected cardinalities is essential for data modeling. I am not interested in all employees across the departments, only the top performers. I could keep an array of just the top ten active employees, or store the employees with a performance rating above 5.9, ensuring there are not too many but enough to identify the top ten. I build such an array to embed in the departments:
db.departments.aggregate([
{
$lookup: {
from: "employees",
let: { deptId: "$_id" },
pipeline: [
{
$match: {
$expr: {
$and: [
{ $eq: ["$Department", "$$deptId"] },
{ $gt: ["$Performance_Rating", 5.9] }
]
}
}
},
{ $sort: { Performance_Rating: -1 } },
{
$project: {
_id: 1,
Full_Name: 1,
Performance_Rating: 1,
Status: 1
}
}
],
as: "Top_Employees"
}
},
{
$project: {
_id: 1,
Name: 1,
Description: 1,
Top_Employees: 1
}
},
{
$merge: {
into: "departments",
whenMatched: "replace",
whenNotMatched: "insert"
}
}
]);
To query it, I do not depend on the array's order. Instead, I use sort and limit to get the top 10:
x=db.departments.mixture([
{ $match: { Name: "IT" } },
{
$project: {
_id: 0,
Department: { Name: "$Name" },
Employee: {
$slice: [
{
$sortArray: {
input: {
$filter: {
input: "$Top_Employees",
as: "emp",
cond: { $eq: ["$$emp.Status", "Active"] }
}
},
sortBy: { Performance_Rating: -1 }
}
},
10
]
}
}
}
]).explain("executionStats");
printjson({
nReturned: x.executionStats.nReturned,
executionTimeMillis: x.executionStats.executionTimeMillis,
totalKeysExamined: x.executionStats.totalKeysExamined,
totalDocsExamined: x.executionStats.totalDocsExamined,
});
// Execution statistics:
{
nReturned: 1,
executionTimeMillis: 47,
totalKeysExamined: 1,
totalDocsExamined: 1
}
This query is fast. Of course, it is useful only if the array is maintained: any change in an employee's rating must be reflected accordingly, pushing the employee into the array if the rating exceeds 5.9, or removing them if it falls below. This offers a good balance between embedding all the data and referencing everything.
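That maintenance can be sketched as a small helper that decides which array update to apply when a rating changes (the function is hypothetical; a production version would also guard against duplicate pushes, e.g. by pulling before pushing):

```javascript
// Choose the department-array update for an employee whose rating
// changed, using the 5.9 threshold from the text.
function topEmployeesUpdate(emp, threshold = 5.9) {
  if (emp.Performance_Rating > threshold) {
    return {
      $push: {
        Top_Employees: {
          _id: emp._id,
          Full_Name: emp.Full_Name,
          Performance_Rating: emp.Performance_Rating,
          Status: emp.Status
        }
      }
    };
  }
  // Below the threshold: ensure the employee is not in the array.
  return { $pull: { Top_Employees: { _id: emp._id } } };
}
// In mongosh: db.departments.updateOne({ _id: emp.Department }, topEmployeesUpdate(emp));
```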
Conclusion
By testing several schema designs against a 2-million-document HR dataset, we saw how MongoDB's flexibility in modeling one-to-many relationships directly impacts query speed, index usage, and document size.
What we learned:
- Review execution plans together with runtime statistics to understand the true cost of each query.
- Design indexes to match the query pattern, including the lookup predicates: start with equality filters, then add sort or range fields, and finally include the projected fields, aiming for a covering index when possible.
- Embed small, frequently co-accessed data, especially the fields used in selective filters. For large or volatile data, use references instead to prevent bloat and duplication, and fetch it per document when needed.
- Avoid using $lookup on huge sets. Instead, consider:
  - Starting the aggregation from the "one" side and pushing filters and limits into the $lookup pipeline.
  - Using multiple single-collection queries, applying filters on the "one" collection to retrieve the referenced identifiers, then using these identifiers when querying the "many" collection, where the referencing field is indexed.
- Be cautious with large arrays and deep embedding: they can still query fast with well-designed indexes, but they increase document size and update cost.
- Aim for covering indexes where possible, especially when embedding the "one" side into the "many" side, to enable index-only scans, or at least filtering before fetching.
Final thought:
In MongoDB, there is no single "best" way to model one-to-many relationships. The right choice depends on your read/write patterns, data volume, and how often your data changes. When your schema matches your queries and is backed by good indexes, both embedding and referencing can be fast. Check the execution plan to validate it. Performance is proactive with MongoDB, with a data model optimized for the application.
Relational databases may look simpler because they use a single normalized model for one-to-many relationships, but that changes once you use an ORM or write complex queries. You still have to tune the fetch strategy, choosing what to load with joins and what to fetch with separate queries. With SQL databases, you may also duplicate and denormalize data so that all filters and sorts can use a single compound index. Performance is often reactive with SQL databases, creating indexes when encountering slow queries.
Finally, SQL lets you start modeling before you fully understand how the application will use the data, at the cost of tuning the model later when performance issues arise. MongoDB, in contrast, encourages stronger schema design for relationships once your access patterns and cardinalities are well understood.
