Wednesday, February 11, 2026

{ w: 1 } Asynchronous Writes and Conflict Resolution in MongoDB


MongoDB ensures durability (the D in ACID) over the network with strong consistency (the C in the CAP theorem) by default. It still maintains high availability: in the event of a network partition, a majority of nodes continue to serve consistent reads and writes transparently, without raising errors to the application.



Cluster state + operation log replication

A Raft-inspired consensus (terms, elections, majority commit) is used to achieve this distributed consistency at two levels (a quick mongosh check of both is sketched after the list):

  • Writes are directed to the shard’s primary, which coordinates consistency between the collection and the indexes. A Raft-like election is used to elect one replica as the primary, with the others acting as secondaries.
  • Writes to the shard’s primary are replicated to the secondaries and acknowledged once a majority has guaranteed durability on persistent storage. The equivalent of the Raft log is the data itself: the transaction oplogs.
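
A minimal way to observe both levels from the shell, assuming a running replica set reachable from mongosh, is to look at the election term and member roles reported by the replica set status, and at the latest entry of the oplog in the local database:

mongosh --quiet --eval '
  const s = rs.status();
  print("term:", s.term);                            // Raft-like election term
  s.members.forEach(m => print(m.name, m.stateStr)); // PRIMARY / SECONDARY roles
  // The replicated "log" is the data change stream itself, stored in the oplog:
  printjson(db.getSiblingDB("local").getCollection("oplog.rs")
    .find().sort({ $natural: -1 }).limit(1).next());
'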



Comparison with other databases

It is important to distinguish between two kinds of consensus: consensus used to control replica roles (leader election) and consensus used to replicate data. In PostgreSQL, for example, failover automation tools such as Patroni use a consensus system (e.g., etcd) to elect a primary, but data replication via WAL streaming is not itself governed by a consensus protocol. Consequently, failures during replication can leave replicas in inconsistent states that must be resolved afterward (e.g., via pg_rewind).

The PostgreSQL community has discussed adding built-in replication on the hackers mailing list, using MongoDB as an example (Built-in Raft replication). PostgreSQL forks such as YugabyteDB replicate data using Raft.

Whether using consensus or not, all databases balance safety, availability, and performance. For example, in PostgreSQL, setting synchronous_commit to ON and synchronous_standby_names to ANY can ensure durability over the network while remaining available in case of failure, like MongoDB’s default (w:majority). Conversely, setting synchronous_commit to off in PostgreSQL, or using w:1 in MongoDB, favors performance with asynchronous replication.
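
As a rough sketch of the MongoDB side of that trade-off (hostnames and the replica set name are illustrative), the write concern can also be chosen directly in the connection string, applying to all operations on that connection:

# Synchronous, durable acknowledgment across the network (the default behavior)
mongosh "mongodb://m1,m2,m3/?replicaSet=rs0&w=majority"

# Acknowledge on the primary only, comparable to synchronous_commit=off
mongosh "mongodb://m1,m2,m3/?replicaSet=rs0&w=1"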



Trade-offs between performance and safety

Consensus on writes increases latency, especially in multi-region deployments, because it requires synchronous replication, including network latency, but it guarantees no data loss in disaster recovery scenarios (RPO = 0). Some workloads may prefer lower latency and accept limited data loss (for example, a few seconds of RPO when a data center burns). If you ingest data from IoT devices, you may prefer fast ingestion at the risk of losing some data in such a disaster. Similarly, when migrating from another database, you might prefer fast synchronization and, in case of an infrastructure failure, simply restart the migration from the point before the failure. In such cases, you can use the {w:1} write concern in MongoDB instead of the default {w:"majority"}.
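
When an entire deployment follows this pattern, the default can also be lowered cluster-wide instead of being repeated on every operation. A minimal sketch, assuming an admin connection in mongosh:

// Lower the cluster-wide default write concern from {w: "majority"} to {w: 1}
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: 1 }
})

A writeConcern passed on an individual operation still overrides this default.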

Most failures are not full-scale disasters in which an entire data center is lost, but rather transient issues involving short network disconnections. With {w:1}, the main risk is not data loss, since writes can eventually be synchronized, but split brain, where both sides of a network partition continue to accept writes. This is where the two levels of consensus matter:

  • A new primary is elected, and the old primary steps down, limiting the split-brain window to a few seconds.
  • With the default {w:"majority"}, writes that cannot reach a majority are not acknowledged on the side of the partition without a quorum. This prevents split brain. However, with {w:1}, these writes are acknowledged until the old primary steps down.

Because the failure is transient, when the old primary rejoins, no data is physically lost: writes from both sides still exist. However, these writes may conflict, resulting in a diverging database state with two branches. As with any asynchronous replication, this requires conflict resolution. MongoDB handles it as follows:

  • Writes from the new primary are preserved, as that is where the application has continued to make progress.
  • Writes that occurred on the old primary during the brief split-brain window are rolled back, and it pulls the newer writes from the new primary.

Thus, when you use {w:1}, you accept the possibility of limited data loss in the event of a failure. Once the node is back, these writes are not completely lost, but they cannot be merged automatically. MongoDB stores them as BSON files in a rollback directory so you can inspect them and perform manual conflict resolution if needed.

This conflict resolution is documented as Recover To a Timestamp (RTT).



Demo on a Docker lab

Let’s try it. I start 3 containers as a replica set:

docker network create lab
docker run --network lab --name m1 --hostname m1 -d mongo --replSet rs0
docker run --network lab --name m2 --hostname m2 -d mongo --replSet rs0
docker run --network lab --name m3 --hostname m3 -d mongo --replSet rs0
docker exec -it m1 mongosh --eval '
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "m1:27017", priority: 3 },
    { _id: 1, host: "m2:27017", priority: 2 },
    { _id: 2, host: "m3:27017", priority: 1 }
  ]
})
'
until
docker exec -it m1 mongosh --eval "rs.status().members.forEach(m => print(m.name, m.stateStr))" |
 grep -C3 "m1:27017 PRIMARY"
do sleep 1 ; done


The last command waits until m1 is the primary, as set by its priority. I do this to make the demo reproducible with simple copy-paste.

I insert “XXX-10” while connected to m1:

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-10" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-10' }


I disconnect the secondary m2:

docker network disconnect lab m2


With a replication factor of three, the cluster is resilient to one failure, and I insert “XXX-11” while connected to the primary:

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-11" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-11' }


I disconnect m1, the current primary, reconnect m2, and immediately insert “XXX-12”, still connected to m1:

docker network disconnect lab m1
docker network    connect lab m2

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-12" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-12' }

Here, m1 is still a primary for a short period before it detects that it cannot reach the majority of replicas and steps down. If the write concern were {w: "majority"}, it would have waited and failed, unable to sync to the quorum, but with {w: "1"} the replication is asynchronous and the write is acknowledged once written to local disk.
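
As a hedged illustration (not part of the recorded demo; the _id "XXX-12b" and the 5-second wtimeout are arbitrary), the same insert with a majority write concern would be expected to wait and then return a write concern timeout on this side of the partition, since the majority is unreachable:

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-12b" , date:new Date() },
   { writeConcern: {w: "majority", wtimeout: 5000}    }
)
'

Note that a write concern timeout only reports that the requested acknowledgment was not obtained; the document is still written locally and would later be rolled back like “XXX-12”.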

Two seconds later, a similar write fails because the primary stepped down:

sleep 2

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-13" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

MongoServerError: not primary


I wait until m2 is the new primary, as set by priority, and connect to it to insert “XXX-20”:

until
docker exec -it m2 mongosh --eval "rs.status().members.forEach(m => print(m.name, m.stateStr))" |
 grep -C3 "m2:27017 PRIMARY"
do sleep 1 ; done

docker exec -it m2 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-20" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-20' }


No nodes are down, it is only a network partition, and I can read from all nodes as long as I do not connect through the network. I query the collection on both sides:

docker exec -it m1 mongosh --eval 'db.demo.find()'
docker exec -it m2 mongosh --eval 'db.demo.find()'
docker exec -it m3 mongosh --eval 'db.demo.find()'

The inconsistency is visible: “XXX-12” is only in m1, and “XXX-20” only in m2 and m3:

I reconnect m1 so that all nodes can communicate and synchronize their state:

docker network    connect lab m1


I query again and all nodes show the same values:

“XXX-12” has disappeared and all nodes are now synchronized to the current state. When it rejoined, m1 rolled back the operations that occurred during the split-brain window. This is expected and acceptable, since the write used a { w: 1 } write concern, which explicitly allows limited data loss in case of failure in order to avoid cross-network latency on every write.

The rolled-back operations are not lost: MongoDB logged them in a rollback directory in BSON format, with the rolled-back documents as well as the related oplog entries.

I read and decode all BSON files in the rollback directory:


docker exec -i m1 bash -c '
for f in /data/db/rollback/*/removed.*.bson
do
 echo "$f"
 bsondump $f --pretty
done
' | egrep --color=auto '^|^/.*|.*("op":|"XXX-..").*'


The deleted document is in /data/db/rollback/0ae03154-0a51-4276-ac62-50d73ad31fe0/removed.2026-02-10T10-40-58.1.bson:

{
        "_id": "XXX-12",
        "date": {
                "$date": {
                        "$numberLong": "1770719868965"
                }
        }
}

The deleted oplog entry for the related insert is in /data/db/rollback/local.oplog.rs/removed.2026-02-10T10-40-58.0.bson:

{
        "lsid": {
                "id": {
                        "$binary": {
                                "base64": "erR2AoFXS3mbcX4BJSiWjw==",
                                "subType": "04"
                        }
                },
                "uid": {
                        "$binary": {
                                "base64": "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=",
                                "subType": "00"
                        }
                }
        },
        "txnNumber": {
                "$numberLong": "1"
        },
        "op": "i",
        "ns": "take a look at.demo",
        "ui": {
                "$binary": {
                        "base64": "CuAxVApRQnasYlDXOtMf4A==",
                        "subType": "04"
                }
        },
        "o": {
                "_id": "XXX-12",
                "date": {
                        "$date": {
                                "$numberLong": "1770719868965"
                        }
                }
        },
        "o2": {
                "_id": "XXX-12"
        },
        "stmtId": {
                "$numberInt": "0"
        },
        "ts": {
                "$timestamp": {
                        "t": 1770719868,
                        "i": 1
                }
        },
        "t": {
                "$numberLong": "1"
        },
        "v": {
                "$numberLong": "2"
        },
        "wall": {
                "$date": {
                        "$numberLong": "1770719868983"
                }
        },
        "prevOpTime": {
                "ts": {
                        "$timestamp": {
                                "t": 0,
                                "i": 0
                        }
                },
                "t": {
                        "$numberLong": "-1"
                }
        }
}

The disappeared value “XXX-12” is available here as both its after-image and its oplog entry.
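
If such writes must be preserved, manual conflict resolution can be as simple as re-applying the after-image to the current primary. A possible sketch, using the rollback file path shown above and assuming the _id does not conflict with a document written on the surviving branch:

# Re-insert the rolled-back document from the rollback file (review its content first)
docker exec -it m1 mongorestore \
  --host "rs0/m1:27017,m2:27017,m3:27017" \
  --db test --collection demo \
  /data/db/rollback/0ae03154-0a51-4276-ac62-50d73ad31fe0/removed.2026-02-10T10-40-58.1.bson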



Conclusion: beyond Raft

By default, MongoDB favors strong consistency and durability: writes use { w: "majority" }, are majority-committed, never rolled back, and reads with readConcern: "majority" never observe rolled-back data. In this mode, MongoDB behaves like a classic Raft system: once an operation is committed, it is final.
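
On the read side, a brief mongosh sketch of that default mode, reusing the demo collection, is that a majority read concern only returns documents that can no longer be rolled back:

// Reads with readConcern "majority" never observe writes that may be rolled back
db.demo.find().readConcern("majority")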

MongoDB also lets you explicitly relax that guarantee by choosing a weaker write concern such as { w: 1 }. In doing so, you tell the system: “Prioritize availability and latency over immediate global consistency”. The demo shows what that means:

  • During a transient network partition, two primaries can briefly accept writes.
  • Both branches of history are durably written to disk.
  • When the partition heals, MongoDB deterministically chooses the majority branch.
  • Operations from the losing branch are rolled back, but not discarded: they are preserved as BSON files with their oplog entries.
  • The node then recovers to a majority-committed timestamp (RTT) and rolls forward.

This rollback behavior is where MongoDB deliberately diverges from vanilla Raft.

In classic Raft, the replicated log is the source of truth, and committed log entries are never rolled back. Raft assumes a linearizable, strongly consistent state machine where the application does not expect divergence. MongoDB, in contrast, comes from a NoSQL and event-driven background, where asynchronous replication, eventual consistency, and application-level reconciliation are often acceptable trade-offs.

Consequently:

  • MongoDB still uses Raft semantics for leader election and terms, so two primaries are never elected in the same term.
  • For data replication, MongoDB extends the model with Recover To a Timestamp (RTT) rollback.
  • This allows MongoDB to safely support lower write concerns, fast ingestion, multi-region latency optimization, and migration workloads.

In short, MongoDB replication is based on Raft, but adds rollback semantics to support real-world distributed application patterns. Rollbacks happen only when you explicitly allow them, never with majority writes, and they are fully auditable and recoverable.
