A Nightmare No DBA Wants
One night, I had a nightmare, the kind every DBA fears.
My manager called me and said,
“Hey… Mr. Smart Guy. We don’t want you on the team anymore.”
I panicked.
“What did I do wrong?!”
He replied,
“You tell me. Think about your ‘great’ work yesterday.”
Trying to defend myself, I said proudly:
“Sir, one node was down. The PostgreSQL service was inactive, so I started it… and the node came back up!
Pretty smart, right?”
He sighed, paused for dramatic effect, and said,
“Get well soon.”
Then he hung up.
I woke up instantly, my heart racing.
Not because of the boss,
But because the real horror hit me:
We’re running a Patroni cluster…
…and I started PostgreSQL manually in my dream.
That’s not “clever work.”
That’s a database-level crime, punishable by
“Why-is-my-cluster-broken-now?”
and
“Why-are-my-replicas-on-a-different-timeline?”
and probably
“Why-is-ETCD-looking-at-me-like-that?”
Fortunately, it was only a nightmare.
This blog aims to help prevent anyone from living that nightmare. It explains clearly what happened, why it’s dangerous, and how to avoid it.
Understanding Patroni’s Role in the Cluster
Patroni is responsible for managing the following (a short sketch of the Patroni-first workflow follows this list):
- Startup and shutdown of PostgreSQL
- Replication configuration
- Leader election and failover/fencing decisions
- WAL (Write-Ahead Logging) timeline coordination
- Recovery decisions and cluster consistency
- Coordinating cluster state with the DCS (e.g., etcd)
- Preventing divergence and corruption
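In practice this means the service you interact with on each node is patroni, not postgresql. A minimal sketch of the Patroni-first workflow, assuming a systemd unit named patroni and the config path used later in this post:

# Check cluster state through Patroni
patronictl -c /etc/patroni/patroni.yml list

# Start or stop the database on a node by starting or stopping Patroni;
# Patroni itself starts and stops PostgreSQL underneath
systemctl start patroni
systemctl stop patroni

# Never do this on a Patroni-managed node:
# systemctl start postgresql-17.service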
What Actually Happens When PostgreSQL Starts Outside Patroni
In a Patroni-based high-availability (HA) cluster, PostgreSQL should never be started manually, whether via the postgres systemd service or pg_ctl.
If you start PostgreSQL manually while Patroni is down, the node is no longer under Patroni’s control and runs as a standalone instance.
However, this does not immediately create a conflicting timeline.
- A conflicting (divergent) timeline only occurs if a failover or switchover happens while the node is standalone.
- If the manually started node was previously a replica and no failover occurred, it may still continue replicating from the primary, but it is invisible to Patroni and cannot participate in coordinated HA.
- If the manually started node was the former leader and Patroni elects a new leader elsewhere, only then does a true timeline divergence occur.
- Standalone replica + no failover → replication may continue, no new timeline
- Standalone leader + failover occurs → timeline divergence, requiring pg_rewind or a full reinit
Takeaway: a manually started primary is the most dangerous scenario; it can immediately lead to a divergent timeline if a new leader is elected. A manually started replica is still risky, but divergence happens only if a leader change occurs while it runs standalone.
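A quick way to spot divergence early is to compare each node's timeline. A minimal sketch, assuming the default RHEL paths for PostgreSQL 17 (adjust the binary and data-directory locations to your install):

# Timeline recorded in the control file (works even if PostgreSQL is stopped)
/usr/pgsql-17/bin/pg_controldata /var/lib/pgsql/17/data | grep TimeLineID

# Or, on a running node, via SQL
psql -U postgres -c "SELECT timeline_id FROM pg_control_checkpoint();"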
Before diving deeper, I’d like to share a test with you to provide a better understanding of the potential outcomes.
Detailed Test Case:
Operating System: Red Hat Enterprise Linux release 9.6 (Plow)

Cluster Topology:

Node     IP Address         Role (Initial)
node1    192.168.183.131    Leader
node2    192.168.183.129    Replica
node3    192.168.183.130    Replica

PostgreSQL Version: 17.7
Patroni: Active on all nodes
DCS: etcd 3-node cluster

Note: verify etcd major version compatibility. Older Linux distributions may still ship legacy etcd releases that are incompatible with modern Patroni versions.

Replication: Streaming
pg_rewind: Enabled
Capture Baseline State
[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) --------+-----------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node1  | 192.168.183.131 | Leader  | running   |  4 |             |     |            |     |
| node2  | 192.168.183.129 | Replica | streaming |  4 | 0/5055240   |   0 | 0/5055240  |   0 |
| node3  | 192.168.183.130 | Replica | streaming |  4 | 0/5055240   |   0 | 0/5055240  |   0 |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
This confirms cluster health: the replicas have received and replayed WAL up to the leader’s current LSN, and all nodes are on the same timeline.
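If you want to cross-check this outside patronictl, the standard replication functions tell the same story (the psql connection details are an assumption about your setup):

# On the leader: the current write position
psql -U postgres -c "SELECT pg_current_wal_lsn();"

# On each replica: confirm recovery mode and how far WAL has been received and replayed
psql -U postgres -c "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"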
Crash Test & Failure Simulation:
Now, simulate a crash on node1:
1] Kill Patroni on node1 (the current leader)
2] Start PostgreSQL manually:

[root@node1 ~]# systemctl start postgresql-17.service

3] Node1 is now a standalone primary on timeline 4.
4] Meanwhile, Patroni promotes node3 → timeline 5.
Patroni now shows the changed cluster state:

[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) --------+-----------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5055380   |   0 | 0/5055380  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+

From the output above, we can see that a timeline divergence has occurred and that node1 is missing from the list because its Patroni service is down: instead of starting Patroni, I started the PostgreSQL service directly, so PostgreSQL is running on node1 but is invisible to the cluster. The output also shows that node3 is now the leader and node2 is acting as a replica, both on the same timeline. Node1 is no longer part of the Patroni cluster.
Divergence Demonstration
Writes on standalone node1
postgres=# SELECT timeline_id FROM pg_control_checkpoint();
 timeline_id
-------------
           4
(1 row)

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/50552F0
(1 row)
Now perform some transactions on the standalone node1:
postgres=# create table crash_patroni (id int);
CREATE TABLE

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/506EDF8
(1 row)

postgres=# insert into crash_patroni values (1), (2), (3);
INSERT 0 3

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/5070E68
(1 row)
Meanwhile, on the new leader (node3):
postgres=# create table crash_new_leader (id int);
CREATE TABLE

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/5070B80
(1 row)
Node2 replicates node3’s changes normally:
postgres=# select * from crash_new_leader;
 id
----
  1
  2
  3
(3 rows)
Patroni listing on node3:
[root@node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) --------+-----------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5070B80   |   0 | 0/5070B80  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
Rejoining node1 (start Patroni)
When Patroni is restarted on node1, it detects the divergence and runs pg_rewind to synchronize the data directory with the new leader:
[root@node1 ~]# patronictl list
node1 | Replica | TL 5
Node1 is rewound and joins as a replica.
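For reference, the rejoin is nothing more than starting Patroni again on node1 and letting it decide how to recover; a minimal sketch, assuming a systemd unit named patroni that logs to journald:

# Start Patroni again on node1; it will choose between a plain start as replica,
# pg_rewind, or a full reinit based on the divergence it detects
systemctl start patroni

# Watch that decision happen in the Patroni log
journalctl -u patroni -f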
The full listing now shows all three nodes back on timeline 5:

[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) --------+-----------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node1  | 192.168.183.131 | Replica | streaming |  5 | 0/5070CC0   |   0 | 0/5070CC0  |   0 |
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5070CC0   |   0 | 0/5070CC0  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
Now check the table created earlier on node1:
postgres=# select * from crash_patroni;
ERROR:  relation "crash_patroni" does not exist
The table created on node3 (crash_new_leader) is now present on node1:
postgres=# select * from crash_new_leader;
 id
----
  1
  2
  3
(3 rows)
What happened? Let me explain.
Patroni detected:
- Node1 was on timeline 4
- The cluster progressed to timeline 5 (through promotion of node3)
- Divergence occurred
Patroni invoked pg_rewind:
pg_rewind: servers diverged at WAL location ...
pg_rewind: rewinding ...
pg_rewind: Done!
As a result:
- Node1’s data directory was rewritten to match the new leader (node3)
- The table crash_patroni, created while node1 was standalone, was lost (data loss)
- Node1 successfully rejoined as a replica on timeline 5
- Writes performed during standalone mode were erased
pg_rewind: prerequisites and failure modes
pg_rewind is the safest recovery method for repairing a diverged node because it:
- Resets the diverging timeline
- Reconstructs the data directory to match the new leader
- Rejoins the node without requiring a full base backup
Important prerequisites:
- wal_log_hints = on OR data checksums must be enabled.
Patroni normally ensures wal_log_hints = on for you, but always verify the setting, especially on manually built clusters or upgraded nodes.
If pg_rewind:
- Is disabled
- Is misconfigured
- Fails due to missing WAL segments or a catalog mismatch
- Crashes unexpectedly
then reinitialization (a full base backup + restore) is the only recovery option.
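In a Patroni cluster you do not have to rebuild the node by hand in that case; patronictl can reinitialize the member from the current leader. A minimal sketch, using the cluster and member names from the test above (the config path is the same assumption as before):

# Wipe node1's data directory and rebuild it from the current leader
# (patronictl asks for confirmation before doing anything destructive)
patronictl -c /etc/patroni/patroni.yml reinit postgres node1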
Key Risks and Considerations
1. Patroni must be the sole controller
Manual PostgreSQL start = rogue instance.
2. Standalone writes WILL be lost
Every operation performed in standalone mode disappears.
3. Split-brain-like symptoms
Even if it is not a true split-brain, the symptoms include:
- Diverged timelines
- Orphaned tables
- Inconsistent LSNs
- Confusing recovery options
4. Operational confusion
Different nodes may disagree on whether a table exists.
Precautions: practical steps
Mask the OS service
Prevent an accidental systemctl start postgresql-XX.service by masking the unit:

sudo systemctl mask postgresql-XX.service
Monitor both the Patroni and PostgreSQL services
Create alerts for (a minimal check script follows this list):
- Unexpected postgresql-XX.service = active while patroni.service is down
- Missing Patroni heartbeat
- Timeline divergence detection
- Replicas not receiving WAL
- Unexpected PID or process-name mismatches
- pg_rewind events in Patroni logs
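A minimal check for the first alert, as a sketch only (the unit names postgresql-17.service and patroni are assumptions; wire the echo into whatever alerting you use):

#!/bin/bash
# Alert when PostgreSQL is active but Patroni is not: a likely rogue instance
if systemctl is-active --quiet postgresql-17.service && ! systemctl is-active --quiet patroni; then
    echo "ALERT: PostgreSQL is running outside Patroni on $(hostname)"
fi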
Validate pg_rewind prerequisites
Confirm that wal_log_hints = on or data checksums are enabled before taking cluster nodes into production.
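Both settings are easy to check (the psql connection details are an assumption):

# At least one of these must be on for pg_rewind to work
psql -U postgres -c "SHOW wal_log_hints;"
psql -U postgres -c "SHOW data_checksums;"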
Check etcd compatibility
Ensure your etcd major version is compatible with your Patroni release; older distributions may contain legacy etcd packages.
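A quick way to see what you are actually running, assuming etcd v3 tooling and the endpoints from the test topology above:

# Client/server versions of etcd and the installed Patroni
etcdctl version
patronictl version

# Health of the three-node etcd cluster
etcdctl --endpoints=192.168.183.129:2379,192.168.183.130:2379,192.168.183.131:2379 endpoint health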
Use port-based health checks
Check port 5432 with netstat or ss rather than matching process names across heterogeneous OS families.
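For example (5432 is the default PostgreSQL port; adjust if your cluster listens elsewhere):

# Is anything listening on the PostgreSQL port?
ss -ltn | grep ':5432'

# Or ask PostgreSQL directly whether it accepts connections
pg_isready -h 127.0.0.1 -p 5432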
Back up when in doubt
If you suspect standalone writes occurred, take a filesystem-level backup of the data directory before restarting Patroni if possible; this can help with forensic investigation.
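A crude but effective sketch, assuming the default RHEL data directory for PostgreSQL 17 and that the rogue instance is stopped first:

# Stop the rogue instance, then snapshot the data directory before Patroni rewinds it
systemctl stop postgresql-17.service
tar czf /root/node1-pgdata-$(date +%F).tar.gz -C /var/lib/pgsql/17 data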
Early detection is critical to prevent data loss.
Summary
Starting PostgreSQL manually in a Patroni-managed HA cluster is not a harmless mistake; it is a data-consistency risk. It can cause:
- Timeline divergence
- Split-brain-like behavior
- Lost transactions and DDL
- Cluster confusion and administrative overhead
- Forced pg_rewind or full reinitialization
Follow the precautions above, ensure the pg_rewind prerequisites are met, monitor both Patroni and PostgreSQL, and always start PostgreSQL via Patroni. Do that, and you’ll greatly reduce the chance of waking up to a DBA nightmare.
If you’re using a Patroni cluster, always start PostgreSQL through the Patroni service. Your database will appreciate it. Let’s keep everything running smoothly!
