In SQL databases, foreign keys act as immediate constraints that verify the correctness of relationships between tables before accepting a write. This was designed for scenarios in which end users can submit arbitrary queries directly to the database. Consequently, the database is responsible for protecting the data model using normalization, integrity constraints, stored procedures, and triggers, rather than relying on validation performed before the application interacts with the database. When relational integrity is violated, an error occurs, preventing the user from making the changes, and the application rolls back and raises an exception.
MongoDB's NoSQL approach differs from relational databases because it was designed for application developers. It relies on application code to enforce these rules. Use cases are clearly defined, validation happens at the application level, and business logic takes precedence over foreign key verification. Eliminating the additional serializable reads associated with foreign keys can significantly improve write performance and scalability.
Referential integrity can be verified asynchronously. Instead of raising an exception, an unexpected event the application might not be ready for, MongoDB allows the write to proceed and provides tools like the aggregation framework and change streams to detect and log violations. This approach enables error analysis, data correction, and application fixes without affecting the application's availability, and it incorporates the business logic.
Let's go through a common example of departments and employees, where all employees must belong to a department.
Two collections with a reference
Strong relationships, including one-to-many, do not necessarily require multiple collections with references, especially when the entities share the same lifecycle. Depending on the domain's context, we can embed a list of employees within each department document to ensure referential integrity and prevent orphans. Alternatively, we might embed department information into each employee's document, particularly when department updates are infrequent (such as a simple multi-document change to a department description) or when department changes are usually part of larger business reorganizations.
When the two entities are not always accessed together, have unbounded cardinality, or are updated independently, you can choose to reference another document instead of embedding all details. For example, store a "deptno" in each employee and maintain a separate collection of departments, each with a unique "deptno". I insert such data:
// Reset
db.departments.drop();
db.employees.drop();
// Departments
db.departments.createIndex(
{ deptno: 1 },   // deptno will be used as the referenced key
{ unique: true } // it must be unique for many-to-one relationships
);
db.departments.insertMany([
{ deptno: 10, dname: "ACCOUNTING", loc: "NEW YORK" },
{ deptno: 20, dname: "RESEARCH", loc: "DALLAS" },
{ deptno: 30, dname: "SALES", loc: "CHICAGO" },
{ deptno: 40, dname: "OPERATIONS", loc: "BOSTON" }
]);
// Employees in departments 10, 20, and 30
db.employees.createIndex(
{ deptno: 1 } // reference to departments
);
db.employees.insertMany([
{ empno: 7839, ename: "KING", job: "PRESIDENT", deptno: 10 },
{ empno: 7698, ename: "BLAKE", job: "MANAGER", deptno: 30 },
{ empno: 7782, ename: "CLARK", job: "MANAGER", deptno: 10 },
{ empno: 7566, ename: "JONES", job: "MANAGER", deptno: 20 },
{ empno: 7788, ename: "SCOTT", job: "ANALYST", deptno: 20 },
{ empno: 7902, ename: "FORD", job: "ANALYST", deptno: 20 },
{ empno: 7844, ename: "TURNER", job: "SALESMAN", deptno: 30 },
{ empno: 7900, ename: "JAMES", job: "CLERK", deptno: 30 },
{ empno: 7654, ename: "MARTIN", job: "SALESMAN", deptno: 30 },
{ empno: 7499, ename: "ALLEN", job: "SALESMAN", deptno: 30 },
{ empno: 7521, ename: "WARD", job: "SALESMAN", deptno: 30 },
{ empno: 7934, ename: "MILLER", job: "CLERK", deptno: 10 },
{ empno: 7369, ename: "SMITH", job: "CLERK", deptno: 20 },
{ empno: 7876, ename: "ADAMS", job: "CLERK", deptno: 20 }
]);
I did not declare the schema upfront, as it will come as-is from the application. For performance reasons, I declare indexes on both sides to enable fast navigation from employees to departments and from departments to employees.
Query examples
This schema supports all cardinalities, including millions of employees per department (something you wouldn't embed), and is normalized so that updates affect only a single document. It also allows bidirectional querying.
Here is an example of a query that joins all department information to employees as if it were embedded, but evaluated at read time:
db.employees.aggregate([
{
$lookup: { // add all department info in an array
from: "departments",
localField: "deptno",
foreignField: "deptno", // fast access by index on departments
as: "department"
}
},
{
$set: { // get first (and only) match (guaranteed by unique index)
department: { $arrayElemAt: ["$department", 0] }
}
}
]);
Here is an example of a query that joins all employee information to departments as if it were duplicated and embedded, but evaluated at read time:
db.departments.aggregate([
{
$lookup: { // add all employee info in an array
from: "employees",
localField: "deptno",
foreignField: "deptno", // fast access by index on employees
as: "employees"
}
}
]);
From a performance standpoint, performing a lookup is more costly than reading a single document with everything embedded. However, this overhead is not significant when browsing through tens or hundreds of documents. When you choose this model because a department might have a million employees, you do not retrieve all the data at once. Instead, a $match filters documents before the $lookup in the first query, or a filter is applied within the $lookup pipeline in the second query.
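As a sketch (the department 10 and MANAGER filters are made up for illustration), the two filtered variants could be written as pipeline definitions:

```javascript
// Variant 1: $match before the $lookup, so only department 10's employees are joined.
const employeesOfDept10 = [
{ $match: { deptno: 10 } }, // uses the index on employees.deptno
{ $lookup: {
from: "departments",
localField: "deptno",
foreignField: "deptno",
as: "department"
} },
{ $set: { department: { $arrayElemAt: ["$department", 0] } } }
];

// Variant 2: filter inside the $lookup sub-pipeline, so each department pulls
// only its managers (combining localField/foreignField with a pipeline
// requires MongoDB 5.0 or later).
const departmentsWithManagers = [
{ $lookup: {
from: "employees",
localField: "deptno",
foreignField: "deptno",
pipeline: [{ $match: { job: "MANAGER" } }], // applied per matched department
as: "managers"
} }
];
```

They would run as db.employees.aggregate(employeesOfDept10) and db.departments.aggregate(departmentsWithManagers).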
I have covered these variations in a previous post.
What about referential integrity for these queries? If an employee is inserted with a deptno that does not exist in departments, the lookup finds no match. The first query omits the department information, and the second query does not show the new employee because it lists only the known departments. That is expected behaviour for an application that did not insert the referenced department.
Relational database administrators often overstate how serious this is, and some even call it data corruption. Because SQL defaults to inner joins, that employee would be missing from the result of the first query, but with outer joins like $lookup in MongoDB, this does not happen. It is more like a NULL in SQL: the information is not yet known, so it is not shown. You can add the department later, and the queries will reflect the information as it becomes available.
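This outer-join behaviour is easy to see in a plain JavaScript sketch (outside MongoDB, with made-up minimal data): an employee referencing an unknown deptno still appears in the result, just with an empty dept array:

```javascript
// Minimal in-memory illustration of $lookup's left-outer-join behaviour.
const departments = [{ deptno: 10, dname: "ACCOUNTING" }];
const employees = [
{ empno: 7839, ename: "KING", deptno: 10 },
{ empno: 9999, ename: "NOBODY", deptno: 42 } // department not inserted yet
];

// Equivalent of $lookup: attach all matching departments as an array.
const joined = employees.map(e => ({
...e,
dept: departments.filter(d => d.deptno === e.deptno)
}));

console.log(joined.length);  // 2 - no employee is dropped, unlike an inner join
console.log(joined[1].dept); // [] - the information is not yet known, like a NULL
```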
You may still want to detect referenced objects that have not been inserted after some time, for example due to an application bug.
Foreign key definition as a $lookup stage
I define referential integrity using two stages: a lookup stage and a match stage that together verify whether the referenced document exists:
const lookupStage = {
$lookup: {
from: "departments",
localField: "deptno",
foreignField: "deptno",
as: "dept"
}
};
const matchStage = { $match: { dept: { $size: 0 } } };
The definition is simple and similar to an SQL foreign key. In practice, it can be more complex and more precise. Document databases are well suited to well-understood applications where business logic extends beyond what a foreign key can express. For example, some employees may temporarily have no department, such as new hires, or may belong to two departments during a transition. MongoDB's flexible schema supports these cases, and you define the referential integrity rules accordingly. You are not constrained to an application-unaware model as with SQL schemas. I will keep it simple for this example.
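For instance, here is a hedged sketch of one such rule (hypothetical, to show the idea): tolerate employees that have no deptno at all, such as new hires, while still flagging a deptno that points nowhere. Only the match stage changes:

```javascript
// Hypothetical variant of matchStage: an absent or null deptno is allowed,
// but a deptno that matched no department remains a violation.
const matchStageAllowNewHires = {
$match: {
deptno: { $exists: true, $ne: null }, // the rule applies only when deptno is set
dept: { $size: 0 }                    // and the lookup found no department
}
};
```

It would be used in place of matchStage in the validation pipelines.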
One-time validation with an aggregation pipeline
I insert a new employee, Eliot, into dept 42, which does not exist yet:
db.employees.insertOne({
empno: 9002,
ename: "Eliot",
job: "CTO",
deptno: 42 // Missing department
});
This does not raise any errors. In all queries, the employee is visible only by department number, with no other information about that department.
If you decide that such a situation must not persist and should be detected, you can use an aggregation pipeline to list the violations, with the lookup and match stages defined earlier:
db.employees.aggregate([ lookupStage, matchStage ]);
This shows the employees referencing a department that does not exist:
[
{
_id: ObjectId('694d8b6cd0e5c67212d4b14f'),
empno: 9002,
ename: 'Eliot',
job: 'CTO',
deptno: 42,
dept: []
}
]
We have caught the violation asynchronously and can decide what to do. Maybe the "deptno" was wrong, maybe we did not insert the department, or someone deleted it, or we missed a business case where employees can be assigned to a department number before any other information is known.
In SQL databases, the rules are basic and not driven by use cases. They check only for anomalies arising from normalization, and any violation is treated as an error without further consideration. In MongoDB, however, where you build a database for a known application, integrity is part of the business logic.
Whether you should run this validation depends on your database's size and the likelihood of integrity issues. After a major data refactoring, run it as an extra check. To avoid production impact, run it on a read replica, which is an advantage of asynchronous verification. You do not need a high isolation level: at worst, concurrent transactions may trigger a false warning, which can be checked later. If you restore backups for disaster recovery testing, it is sensible to run the validation on that replica to verify both the restore process and the data integrity of the primary database.
Real-time watcher with change streams
You may also decide to perform validations in near real time, checking changes shortly after they occur.
I start a change stream on employees, watching for inserts and updates, and apply the same $lookup + $match to just the changed document:
const cs = db.employees.watch([
{ $match: { operationType: { $in: ["insert", "update", "replace"] } } }
]);
print("👀 Watching employees for referential integrity violations...");
while (cs.hasNext()) {
const change = cs.next(); // Get the next change event
if (["insert", "update", "replace"].includes(change.operationType)) {
const result = db.employees.aggregate([
{ $match: { _id: change.documentKey._id } }, // check the new document
lookupStage, // lookup dept info by deptno
matchStage // keep only docs with NO matching dept
]).toArray();
if (result.length > 0) {
print("\n⚠️ Real-time Referential Integrity Violation Detected:");
printjson(result[0]);
}
}
}
In another session, I insert another employee, Dwight, in department 42, which is still missing:
db.employees.insertOne({
empno: 9001,
ename: "Dwight",
job: "CEO",
deptno: 42 // 🚨 Still missing
});
The loop on the change stream gets notified of the insert, applies the aggregation pipeline verification, and returns the following output:
⚠️ Real-time Referential Integrity Violation Detected:
{
_id: ObjectId('694da3aa8cd2fa3fe4d4b0c2'),
empno: 9001,
ename: 'Dwight',
job: 'CEO',
deptno: 42,
dept: []
}
Rather than raising an error that blocks the application, this approach lets the application handle the situation like any other alert. It can either create the department automatically or have a user analyze the situation.
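As a sketch of the automatic option (hypothetical: it assumes a placeholder department is acceptable in your domain), the watcher could turn the alert into an idempotent upsert command for the missing department:

```javascript
// Hypothetical remediation: build an idempotent upsert for the dangling deptno.
// "violation" is the document printed by the watcher (the employee with dept: []).
function buildDepartmentUpsert(violation) {
return {
updateOne: {
filter: { deptno: violation.deptno },
update: { $setOnInsert: { deptno: violation.deptno, dname: "UNKNOWN", loc: "UNKNOWN" } },
upsert: true // creates the placeholder only if it is still missing
}
};
}

// In mongosh, the watcher could then apply it with:
// db.departments.bulkWrite([ buildDepartmentUpsert(result[0]) ]);
```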
Fixing the Violations
I add the missing department:
db.departments.insertOne({
deptno: 42,
dname: "DEVELOPER EXPERIENCE",
loc: "REMOTE"
});
I re-run the batch check:
db.employees.aggregate([ lookupStage, matchStage ]);
It now returns no documents: the violation is resolved.
Conclusion
In SQL databases, foreign key constraints require reading the parent record before inserting or updating a child, and applying locks if necessary. When deleting or updating referenced keys, the database checks for existing children and waits if a child is being inserted. If users manually perform these operations on production, or if the database administrator doubts the development team, using a SQL database with declared foreign keys is advisable.
However, these operations often originate from an application that has already done the necessary checks: it reads the referenced table to get the key, verifies that no children exist before deleting a parent, and often prefers logical over physical deletions. Additional validations can identify bugs, but they can run asynchronously and integrate with business logic and alerting systems.
In MongoDB, data integrity is enforced by developers across the various development stages in a DevOps approach, rather than solely during write operations. This relies on not altering the production database outside structured development practices like peer reviews and testing. However, if the database administrator (DBA) does not control who accesses the database, or does not trust the developers, they will believe that all verifications should be performed within the database before each commit.
