Kubernetes DCS Postgres DB pod stuck in read-only state #2908

Open
@douglasawh

Description

Please answer some short questions which should help us to understand your problem / question better:

  • Which image of the operator are you using? postgres-operator:v1.14.0
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? Rancher Kubernetes
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug report

Some general remarks when posting a bug report:

  • Please, check the operator, pod (Patroni) and postgresql logs first. When copy-pasting many log lines please do it in a separate GitHub gist together with your Postgres CRD and configuration manifest.

We have a Zalando Operator Postgres DB pod that got stuck in a read-only state. While we were increasing memory on our Zalando Postgres cluster in Kubernetes today, the cluster promoted the replica to leader, but writes to the new leader failed because it remained read-only.

Here are the logs from that event:
postgres-1 (new leader):

2025-05-06 21:25:14,413 INFO: promoted self to leader by acquiring session lock
server promoting
2025-05-06 21:25:15,465 INFO: no action. I am (gitlab-code-postgres-1), the leader with the lock
ERROR: cannot execute ALTER ROLE in a read-only transaction
CONTEXT: SQL statement "ALTER ROLE admin WITH CREATEDB NOLOGIN NOCREATEROLE NOSUPERUSER NOREPLICATION INHERIT"
PL/pgSQL function inline_code_block line 5 at SQL statement

postgres-0 (new replica):

2025-05-06 21:25:15,429 INFO: Lock owner: gitlab-code-postgres-1; I am gitlab-code-postgres-0
2025-05-06 21:25:15,430 INFO: switchover: demote in progress
2025-05-06 21:25:16,382 INFO: Local timeline=51 lsn=6/9B000028
2025-05-06 21:25:16,396 ERROR: Exception when working with leader
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[rewind.py](https://rewind.py/)", line 80, in check_leader_is_not_in_recovery
with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
File "/usr/lib/python3.10/[contextlib.py](https://contextlib.py/)", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[connection.py](https://connection.py/)", line 158, in get_connection_cursor
conn = psycopg.connect(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/patroni/[psycopg.py](https://psycopg.py/)", line 123, in connect
ret = _connect(*args, **kwargs)
File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "****", port [****] failed: session is read-only

2025-05-06 21:25:16,397 INFO: closed patroni connections to postgres

Restarting the pods several times seems to have resolved the issue.
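
For triage, here is a minimal sketch (untested against this cluster; host, port and credentials are placeholders, not our real values) of the check we would use to tell "still in recovery" apart from "promoted but left read-only" on the node holding the leader lock:

# Minimal triage sketch (placeholders, not our real host/credentials): tells a node
# that is still in recovery apart from one that was promoted but left read-only.
import psycopg2

conn = psycopg2.connect(
    host="gitlab-code-postgres",  # placeholder service name
    port=5032,
    dbname="postgres",
    user="postgres",
    password="***",
    connect_timeout=3,
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    in_recovery = cur.fetchone()[0]
    cur.execute("SHOW default_transaction_read_only")
    default_ro = cur.fetchone()[0]
conn.close()

# in_recovery=True                    -> still a replica, promotion not finished
# in_recovery=False, default_ro='on'  -> promoted but stuck read-only (our symptom)
# in_recovery=False, default_ro='off' -> healthy read-write leader
print(f"pg_is_in_recovery={in_recovery} default_transaction_read_only={default_ro}")

If pg_is_in_recovery() is false but default_transaction_read_only is on, that would match both the ALTER ROLE failure on postgres-1 and the "session is read-only" error postgres-0 hit while probing the new leader.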

There seems to be a class of these issues in Patroni. Here is one with etcd: patroni/patroni#3314, and one with Consul: patroni/patroni#2972.

  • If you feel this issue might be more related to the Spilo docker image or Patroni, consider opening issues in the respective repos.

Honestly, I am not sure; this is my first time personally using the Zalando Operator (my previous experience is with generic Patroni). There is a similar class of bugs in Patroni, so perhaps the problem is with the Kubernetes DCS?

We have not reproduced it. However, it might be possible to reproduce with memory pressure.
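
In case it would help with reproduction, here is an untested sketch of a watcher that could be left running during a memory-pressure or rolling-restart test, to catch the window where Patroni already claims the leader lock but Postgres still refuses writes. The pod address, the port-8008 REST endpoint and the DSN are placeholders based on the restapi / connect_address values in the configuration further down:

# Untested watcher sketch: poll the cluster during a failover test and log any
# window where Patroni reports the node as leader while Postgres is read-only.
# Host names, ports and credentials are placeholders, not our real values.
import time
import urllib.request

import psycopg2

PATRONI_URL = "http://gitlab-code-postgres-1.gitlab-code:8008/read-write"  # placeholder pod address
PG_DSN = "host=gitlab-code-postgres port=5032 dbname=postgres user=postgres password=*** connect_timeout=2"


def patroni_claims_leader(url: str) -> bool:
    """200 from Patroni's /read-write health check means the node is considered the primary."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # includes HTTPError (e.g. 503 when not the leader) and connection errors
        return False


def postgres_is_writable(dsn: str) -> bool:
    """False while the server is in recovery or default_transaction_read_only is still 'on'."""
    try:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT pg_is_in_recovery(), current_setting('default_transaction_read_only')"
                )
                in_recovery, default_ro = cur.fetchone()
            return (not in_recovery) and default_ro == "off"
        finally:
            conn.close()
    except psycopg2.Error:
        return False


while True:
    leader = patroni_claims_leader(PATRONI_URL)
    writable = postgres_is_writable(PG_DSN)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if leader and not writable:
        print(f"{stamp} BUG WINDOW: Patroni reports leader, but Postgres is read-only")
    else:
        print(f"{stamp} ok: leader={leader} writable={writable}")
    time.sleep(1)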

bootstrap:
  dcs:
    failsafe_mode: false
    loop_wait: 10
    maximum_lag_on_failover: 33550432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: 500
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 50
        track_functions: all
        wal_compression: 'on'
        wal_keep_size: 4096MB
        wal_level: hot_standby
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  leader_label_value: master
  pod_ip: 50.50.50.50
  port: tcp://50.12.0.1:443
  port_443_tcp: tcp://50.12.0.1:443
  port_443_tcp_addr: 50.12.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5032
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 50.12.0.1
  service_port: '443'
  service_port_https: '443'
  standby_leader_label_value: master
  use_endpoints: true
namespace: gitlab-code
postgresql:
  authentication:
    replication:
      password:
      username: standby
    superuser:
      password:
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/14/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 50.50.50.50:5032
  create_replica_method:
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5032'
  name: gitlab-code-postgres-1
  parameters:
    archive_command: /bin/true
    bg_mon.history_buckets: 120
    bg_mon.listen_address: 0.0.0.0
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,timescaledb,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 8002MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_ca_file: /tlsca/caFile
    ssl_cert_file: /tls/tls.crt
    ssl_key_file: /tls/tls.key
  pg_hba:
  - local all all trust
  - hostssl all +zalandos 127.0.0.1/32 pam
  - host all all 127.0.0.1/32 md5
  - hostssl all +zalandos ::1/128 pam
  - host all all ::1/128 md5
  - local replication standby trust
  - hostssl replication standby all md5
  - hostnossl all all all reject
  - hostssl all +zalandos all pam
  - hostssl all all all md5
  pgpass: /run/postgresql/pgpass
  use_unix_socket: true
  use_unix_socket_repl: true
restapi:
  connect_address: 50.50.50.50:8008
  listen: :8008
scope: gitlab-code-postgres

  • Patroni version: 4.0.4

  • PostgreSQL version: 14

  • DCS (and its version): Kubernetes

Please let me know if any additional information would be useful, or if you think I should file against patroni or spilo.
