Kubernetes DCS Postgres DB pod stuck in read-only state #2908

Open
@douglasawh

Description

Please answer some short questions which should help us to understand your problem / question better:

  • Which image of the operator are you using? postgres-operator:v1.14.0
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? Rancher Kubernetes
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug report

Some general remarks when posting a bug report:

  • Please, check the operator, pod (Patroni) and postgresql logs first. When copy-pasting many log lines please do it in a separate GitHub gist together with your Postgres CRD and configuration manifest.

We have a Zalando Operator Postgres DB pod that got stuck in a read-only state. While we were increasing memory on our Zalando Postgres cluster in Kubernetes today, the cluster promoted the replica to leader, but writes to the new leader failed because it remained read-only.

Here are the logs from that event:
postgres-1 (new leader):

2025-05-06 21:25:14,413 INFO: promoted self to leader by acquiring session lock
server promoting
2025-05-06 21:25:15,465 INFO: no action. I am (gitlab-code-postgres-1), the leader with the lock
ERROR: cannot execute ALTER ROLE in a read-only transaction
CONTEXT: SQL statement "ALTER ROLE admin WITH CREATEDB NOLOGIN NOCREATEROLE NOSUPERUSER NOREPLICATION INHERIT"
PL/pgSQL function inline_code_block line 5 at SQL statement

postgres-0 (new replica):

2025-05-06 21:25:15,429 INFO: Lock owner: gitlab-code-postgres-1; I am gitlab-code-postgres-0
2025-05-06 21:25:15,430 INFO: switchover: demote in progress
2025-05-06 21:25:16,382 INFO: Local timeline=51 lsn=6/9B000028
2025-05-06 21:25:16,396 ERROR: Exception when working with leader
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[rewind.py](https://rewind.py/)", line 80, in check_leader_is_not_in_recovery
with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
File "/usr/lib/python3.10/[contextlib.py](https://contextlib.py/)", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[connection.py](https://connection.py/)", line 158, in get_connection_cursor
conn = psycopg.connect(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/patroni/[psycopg.py](https://psycopg.py/)", line 123, in connect
ret = _connect(*args, **kwargs)
File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "****", port [****] failed: session is read-only

2025-05-06 21:25:16,397 INFO: closed patroni connections to postgres

Restarting the pods several times seems to have resolved the issue.
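
For triage, here is a minimal sketch (untested against this cluster; host, port and credentials are placeholders, not our real values) of the check we would use to tell "still in recovery" apart from "promoted but left read-only" on the node holding the leader lock:

# Minimal triage sketch (placeholders, not our real host/credentials): tells a node
# that is still in recovery apart from one that was promoted but left read-only.
import psycopg2

conn = psycopg2.connect(
    host="gitlab-code-postgres",  # placeholder service name
    port=5032,
    dbname="postgres",
    user="postgres",
    password="***",
    connect_timeout=3,
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    in_recovery = cur.fetchone()[0]
    cur.execute("SHOW default_transaction_read_only")
    default_ro = cur.fetchone()[0]
conn.close()

# in_recovery=True                    -> still a replica, promotion not finished
# in_recovery=False, default_ro='on'  -> promoted but stuck read-only (our symptom)
# in_recovery=False, default_ro='off' -> healthy read-write leader
print(f"pg_is_in_recovery={in_recovery} default_transaction_read_only={default_ro}")

If pg_is_in_recovery() is false but default_transaction_read_only is on, that would match both the ALTER ROLE failure on postgres-1 and the "session is read-only" error postgres-0 hit while probing the new leader.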

There seems to be a class of these issues in Patroni. Here is one with etcd: patroni/patroni#3314, and one with Consul: patroni/patroni#2972.

  • If you feel this issue might be more related to the Spilo docker image or Patroni, consider opening issues in the respective repos.

Honestly, I am not sure; this is my first time personally using the Zalando Operator (my previous experience is with generic Patroni). There is a similar class of bugs in Patroni, so perhaps the problem is with the Kubernetes DCS?

We have not reproduced it. However, it might be possible to reproduce with memory pressure.
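
In case it would help with reproduction, here is an untested sketch of a watcher that could be left running during a memory-pressure or rolling-restart test, to catch the window where Patroni already claims the leader lock but Postgres still refuses writes. The pod address, the port-8008 REST endpoint and the DSN are placeholders based on the restapi / connect_address values in the configuration further down:

# Untested watcher sketch: poll the cluster during a failover test and log any
# window where Patroni reports the node as leader while Postgres is read-only.
# Host names, ports and credentials are placeholders, not our real values.
import time
import urllib.request

import psycopg2

PATRONI_URL = "http://gitlab-code-postgres-1.gitlab-code:8008/read-write"  # placeholder pod address
PG_DSN = "host=gitlab-code-postgres port=5032 dbname=postgres user=postgres password=*** connect_timeout=2"


def patroni_claims_leader(url: str) -> bool:
    """200 from Patroni's /read-write health check means the node is considered the primary."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # includes HTTPError (e.g. 503 when not the leader) and connection errors
        return False


def postgres_is_writable(dsn: str) -> bool:
    """False while the server is in recovery or default_transaction_read_only is still 'on'."""
    try:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT pg_is_in_recovery(), current_setting('default_transaction_read_only')"
                )
                in_recovery, default_ro = cur.fetchone()
            return (not in_recovery) and default_ro == "off"
        finally:
            conn.close()
    except psycopg2.Error:
        return False


while True:
    leader = patroni_claims_leader(PATRONI_URL)
    writable = postgres_is_writable(PG_DSN)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if leader and not writable:
        print(f"{stamp} BUG WINDOW: Patroni reports leader, but Postgres is read-only")
    else:
        print(f"{stamp} ok: leader={leader} writable={writable}")
    time.sleep(1)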

bootstrap:
  dcs:
    failsafe_mode: false
    loop_wait: 10
    maximum_lag_on_failover: 33550432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: 500
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 50
        track_functions: all
        wal_compression: 'on'
        wal_keep_size: 4096MB
        wal_level: hot_standby
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  leader_label_value: master
  pod_ip: 50.50.50.50
  port: tcp://50.12.0.1:443
  port_443_tcp: tcp://50.12.0.1:443
  port_443_tcp_addr: 50.12.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5032
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 50.12.0.1
  service_port: '443'
  service_port_https: '443'
  standby_leader_label_value: master
  use_endpoints: true
namespace: gitlab-code
postgresql:
  authentication:
    replication:
      password:
      username: standby
    superuser:
      password:
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/14/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 50.50.50.50:5032
  create_replica_method:
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5032'
  name: gitlab-code-postgres-1
  parameters:
    archive_command: /bin/true
    bg_mon.history_buckets: 120
    bg_mon.listen_address: 0.0.0.0
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,timescaledb,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 8002MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_ca_file: /tlsca/caFile
    ssl_cert_file: /tls/tls.crt
    ssl_key_file: /tls/tls.key
  pg_hba:
  - local all all trust
  - hostssl all +zalandos 127.0.0.1/32 pam
  - host all all 127.0.0.1/32 md5
  - hostssl all +zalandos ::1/128 pam
  - host all all ::1/128 md5
  - local replication standby trust
  - hostssl replication standby all md5
  - hostnossl all all all reject
  - hostssl all +zalandos all pam
  - hostssl all all all md5
  pgpass: /run/postgresql/pgpass
  use_unix_socket: true
  use_unix_socket_repl: true
restapi:
  connect_address: 50.50.50.50:8008
  listen: :8008
scope: gitlab-code-postgres

  • Patroni version: 4.0.4

  • PostgreSQL version: 14

  • DCS (and its version): Kubernetes

Please let me know if any additional information would be useful, or if you think I should file against patroni or spilo.
