I am beginning a series of blog posts to explore CloudNativePG (CNPG), a Kubernetes operator for PostgreSQL that automates high availability in containerized environments.
PostgreSQL itself supports physical streaming replication, but doesn’t provide orchestration logic — no automated promotion, scaling, or failover. Tools like Patroni fill that gap by implementing consensus (etcd, Consul, ZooKeeper, Kubernetes, or Raft) for cluster state management. In Kubernetes, databases are often deployed with StatefulSets, which provide stable network identities and persistent storage per instance. CloudNativePG instead defines PostgreSQL‑specific CustomResourceDefinitions (CRDs), which introduce the following resources:
- ImageCatalog: PostgreSQL image catalogs
- Cluster: PostgreSQL cluster definition
- Database: Declarative database management
- Pooler: PgBouncer connection pooling
- Backup: On-demand backup requests
- ScheduledBackup: Automated backup scheduling
- Publication: Logical replication publications
- Subscription: Logical replication subscriptions
Install: control plane for PostgreSQL
Here I’m using CNPG 1.28, the first release to support quorum-based failover. Prior versions promoted the most-recently-available standby without preventing data loss (good for disaster recovery but not strict high availability).
Install the operator’s components:
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.0.yaml
The CRDs and controller deploy into the cnpg-system namespace. Check the rollout status:
kubectl rollout status deployment -n cnpg-system cnpg-controller-manager
deployment "cnpg-controller-manager" efficiently rolled out
This Deployment defines the CloudNativePG Controller Manager — the control plane component — which runs as a single pod and continuously reconciles PostgreSQL cluster resources with their desired state via the Kubernetes API:
kubectl get deployments -n cnpg-system -o wide
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
cnpg-controller-manager 1/1 1 1 11d manager ghcr.io/cloudnative-pg/cloudnative-pg:1.28.0 app.kubernetes.io/name=cloudnative-pg
The pod’s containers listen on ports for metrics (8080/TCP) and webhook configuration (9443/TCP), and interact with CNPG’s CRDs during the reconciliation loop:
kubectl describe deploy -n cnpg-system cnpg-controller-manager
Name:                   cnpg-controller-manager
Namespace:              cnpg-system
CreationTimestamp:      Thu, 15 Jan 2026 21:04:25 +0100
Labels:                 app.kubernetes.io/name=cloudnative-pg
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app.kubernetes.io/name=cloudnative-pg
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/name=cloudnative-pg
  Service Account:  cnpg-manager
  Containers:
   manager:
    Image:       ghcr.io/cloudnative-pg/cloudnative-pg:1.28.0
    Ports:       8080/TCP (metrics), 9443/TCP (webhook-server)
    Host Ports:  0/TCP (metrics), 0/TCP (webhook-server)
    SeccompProfile:  RuntimeDefault
    Command:
      /manager
    Args:
      controller
      --leader-elect
      --max-concurrent-reconciles=10
      --config-map-name=cnpg-controller-manager-config
      --secret-name=cnpg-controller-manager-config
      --webhook-port=9443
    Limits:
      cpu:     100m
      memory:  200Mi
    Requests:
      cpu:      100m
      memory:   100Mi
    Liveness:   http-get https://:9443/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get https://:9443/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Startup:    http-get https://:9443/readyz delay=0s timeout=1s period=5s #success=1 #failure=6
    Environment:
      OPERATOR_IMAGE_NAME:           ghcr.io/cloudnative-pg/cloudnative-pg:1.28.0
      OPERATOR_NAMESPACE:             (v1:metadata.namespace)
      MONITORING_QUERIES_CONFIGMAP:  cnpg-default-monitoring
    Mounts:
      /controller from scratch-data (rw)
      /run/secrets/cnpg.io/webhook from webhook-certificates (rw)
  Volumes:
   scratch-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
   webhook-certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cnpg-webhook-cert
    Optional:    true
  Node-Selectors:
  Tolerations:
Conditions:
  Type         Status  Reason
  ----         ------  ------
  Progressing  True    NewReplicaSetAvailable
  Available    True    MinimumReplicasAvailable
OldReplicaSets:
NewReplicaSet:   cnpg-controller-manager-6b9f78f594 (1/1 replicas created)
Events:
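Since the manager exposes Prometheus metrics on 8080/TCP, a quick sanity check is a port-forward from a workstation; /metrics is the conventional controller-runtime endpoint, so treat this as a sketch rather than documented CNPG behavior:
kubectl -n cnpg-system port-forward deploy/cnpg-controller-manager 8080:8080 &
curl -s http://localhost:8080/metrics | head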
Deploy: data plane (PostgreSQL cluster)
The control plane handles orchestration logic. The actual PostgreSQL instances — the data plane — are managed via CNPG’s Cluster custom resource.
Create a dedicated namespace:
kubectl delete namespace lab
kubectl create namespace lab
namespace/lab created
Here’s a minimal high-availability cluster spec:
- 3 instances: 1 primary, 2 hot standby replicas
- Synchronous commit to 1 replica
- Quorum-based failover enabled
cat > lab-cluster-rf3.yaml <<'YAML'
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cnpg
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any
      number: 1
      failoverQuorum: true
  storage:
    size: 1Gi
YAML
kubectl -n lab apply -f lab-cluster-rf3.yaml
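The cluster takes a moment to initialize. Progress can be followed by watching the Cluster resource; the STATUS column moves through setup phases before reporting a healthy state:
kubectl -n lab get cluster -w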
CNPG provisions pods with stateful semantics, using PersistentVolumeClaims for storage:
kubectl -n lab get pvc -o wide
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
cnpg-1 Bound pvc-76754ba4-e8bd-4218-837f-36aa0010940f 1Gi RWO hostpath <unset> 42s Filesystem
cnpg-2 Bound pvc-3b231dcc-b973-43f8-a429-80222bd51420 1Gi RWO hostpath <unset> 26s Filesystem
cnpg-3 Bound pvc-b8e4c6a0-bbcb-445d-9267-ffe38a1a8685 1Gi RWO hostpath <unset> 10s Filesystem
These PVCs bind to PersistentVolumes provisioned by their storage class:
kubectl -n lab get pv -o wide
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE VOLUMEMODE
pvc-3b231dcc-b973-43f8-a429-80222bd51420 1Gi RWO Delete Bound lab/cnpg-2 hostpath <unset> 53s Filesystem
pvc-76754ba4-e8bd-4218-837f-36aa0010940f 1Gi RWO Delete Bound lab/cnpg-1 hostpath <unset> 69s Filesystem
pvc-b8e4c6a0-bbcb-445d-9267-ffe38a1a8685 1Gi RWO Delete Bound lab/cnpg-3 hostpath <unset> 37s Filesystem
The PostgreSQL instances run in pods:
kubectl -n lab get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cnpg-1 1/1 Running 0 3m46s 10.1.0.141 docker-desktop
cnpg-2 1/1 Running 0 3m29s 10.1.0.143 docker-desktop
cnpg-3 1/1 Running 0 3m13s 10.1.0.145 docker-desktop
In Kubernetes, pods are generally considered equal, but PostgreSQL uses a single primary node while the other pods serve as read replicas. CNPG identifies which pod is running the primary instance:
kubectl -n lab get cluster
NAME AGE INSTANCES READY STATUS PRIMARY
cnpg 4m 3 3 Cluster in healthy state cnpg-1
Since pod roles can change with a switchover or failover, applications access the database through services that expose the right instances:
kubectl -n lab get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
cnpg-r ClusterIP 10.97.182.192 5432/TCP 4m13s cnpg.io/cluster=cnpg,cnpg.io/podRole=instance
cnpg-ro ClusterIP 10.111.116.164 5432/TCP 4m13s cnpg.io/cluster=cnpg,cnpg.io/instanceRole=replica
cnpg-rw ClusterIP 10.108.19.85 5432/TCP 4m13s cnpg.io/cluster=cnpg,cnpg.io/instanceRole=primary
These are the endpoints used to connect to PostgreSQL:
- cnpg-rw connects to the primary for consistent reads and writes
- cnpg-ro connects to one standby for stale reads
- cnpg-r connects to the primary or a standby for stale reads
Load balancing of read workloads is round-robin, like a host list, so the same workload runs on all replicas.
Client access setup
CNPG generated credentials in a Kubernetes Secret named cnpg-app for the user app:
kubectl -n lab get secrets
NAME TYPE DATA AGE
cnpg-app kubernetes.io/basic-auth 11 8m48s
cnpg-ca Opaque 2 8m48s
cnpg-replication kubernetes.io/tls 2 8m48s
cnpg-server kubernetes.io/tls 2 8m48s
When needed, the password can be retrieved with kubectl -n lab get secret cnpg-app -o jsonpath="{.data.password}" | base64 -d.
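The same Secret carries more than the password: recent CNPG releases also store ready-made connection fields. For example, assuming a uri key (verify the key names for your version with kubectl describe secret):
kubectl -n lab get secret cnpg-app -o jsonpath='{.data.uri}' | base64 -d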
Define a shell alias to launch a PostgreSQL client pod with these credentials:
alias pgrw='kubectl -n lab run client --rm -it --restart=Never \
  --env PGHOST="cnpg-rw" \
  --env PGUSER="app" \
  --env PGPASSWORD="$(kubectl -n lab get secret cnpg-app -o jsonpath="{.data.password}" | base64 -d)" \
  --image=postgres:18 --'
Use the pgrw alias to run a PostgreSQL client connected to the primary.
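A matching alias for the read-only service can be defined the same way; pgro and client-ro are names I made up, not something CNPG provides:
alias pgro='kubectl -n lab run client-ro --rm -it --restart=Never \
  --env PGHOST="cnpg-ro" \
  --env PGUSER="app" \
  --env PGPASSWORD="$(kubectl -n lab get secret cnpg-app -o jsonpath="{.data.password}" | base64 -d)" \
  --image=postgres:18 --'
Running pgro psql -c 'select pg_is_in_recovery()' should return true, confirming the connection landed on a standby.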
PgBench default workload
With the pgrw alias defined, initialize the PgBench tables:
pgrw pgbench -i
dropping old tables...
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 0.10 s (drop tables 0.02 s, create tables 0.01 s, client-side generate 0.04 s, vacuum 0.01 s, primary keys 0.01 s).
pod "client" deleted from lab namespace
Run for 10 minutes with progress reported every 5 seconds:
pgrw pgbench -T 600 -P 5
progress: 5.0 s, 1541.4 tps, lat 0.648 ms stddev 0.358, 0 failed
progress: 10.0 s, 1648.6 tps, lat 0.606 ms stddev 0.154, 0 failed
progress: 15.0 s, 1432.7 tps, lat 0.698 ms stddev 0.218, 0 failed
progress: 20.0 s, 1581.3 tps, lat 0.632 ms stddev 0.169, 0 failed
progress: 25.0 s, 1448.2 tps, lat 0.690 ms stddev 0.315, 0 failed
progress: 30.0 s, 1640.6 tps, lat 0.609 ms stddev 0.155, 0 failed
progress: 35.0 s, 1609.9 tps, lat 0.621 ms stddev 0.223, 0 failed
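For comparison, the built-in select-only script can target the standbys through cnpg-ro, using the pgro alias sketched earlier (reads work on hot standbys since the pgbench tables are replicated):
pgro pgbench -S -T 60 -P 5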
Simulated failure
In another terminal, I checked which pod is the primary:
kubectl -n lab get cluster
NAME AGE INSTANCES READY STATUS PRIMARY
cnpg 40m 3 3 Cluster in healthy state cnpg-1
From the Docker Desktop GUI, I paused the container in the primary’s pod.
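The CLI equivalent is docker pause; the name filter below is a guess that depends on how the container runtime names pod containers, so verify it with docker ps first:
docker pause $(docker ps -q --filter "name=cnpg-1" | head -1)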
PgBench queries hang while the primary they are connected to doesn’t respond.
The pod was then recovered and PgBench continued without being disconnected.
Kubernetes monitors pod health with liveness/readiness probes and restarts containers when these probes fail. In this case, Kubernetes—not CNPG—restored the service.
Meanwhile, CNPG independently monitors PostgreSQL and triggered a failover before Kubernetes restarted the pod:
kubectl -n lab get cluster
NAME AGE INSTANCES READY STATUS PRIMARY
cnpg 3m6s 3 2 Failing over cnpg-1
Kubernetes brought the service back in about 30 seconds, but CNPG had already initiated a failover. A new outage was about to happen.
A few minutes later, cnpg-1 restarted and PgBench exited with:
WARNING: canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
pgbench: error: client 0 aborted in command 10 (SQL) of script 0; perhaps the backend died while processing
Because cnpg-1 was still there and healthy, it is still the primary, but all connections were terminated.
Observations
This test shows how PostgreSQL and Kubernetes interact under CloudNativePG. Kubernetes pod health checks and CloudNativePG’s failover logic each run their own control loop:
- Kubernetes restarts containers when liveness or readiness probes fail.
- CloudNativePG (CNPG) evaluates database health using replication state, quorum, and instance manager connectivity.
Briefly pausing the container triggered CNPG’s primary isolation check. When the primary loses contact with both the Kubernetes API and the other cluster members, CNPG shuts it down to prevent split-brain. Timeline:
- T+0s — Primary paused. CNPG detects isolation.
- T+30s — Kubernetes restarts the container.
- T+180s — CNPG triggers failover.
- T+275s — Primary shutdown terminates client connections.
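To follow such an event from CNPG’s perspective, the cnpg kubectl plugin, if installed, summarizes instance roles, replication, and quorum state:
kubectl cnpg status cnpg -n lab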
Because CNPG and Kubernetes act on different timelines, the original pod restarted as primary (“self-failover”) when no replica was a better promotion candidate. CNPG prioritizes data integrity over fast recovery and, without a consensus protocol like Raft, relies on:
- Kubernetes API state
- PostgreSQL streaming replication
- Instance manager health checks
This can cause false positives under transient faults but protects against split-brain. Reproducible steps:
https://github.com/cloudnative-pg/cloudnative-pg/discussions/9814
Cloud systems can fail in many ways. In this test, I used docker pause to freeze processes and simulate a primary that stops responding to clients and health checks. This mirrors a previous test I did with YugabyteDB: YugabyteDB Recovery Time Objective (RTO) with PgBench: continuous availability with max. 15s latency on infrastructure failure
This post starts a CNPG series in which I will also cover failures like network partitions and storage issues, as well as the connection pooler.


