Description
Please answer some short questions which should help us to understand your problem / question better:
- Which image of the operator are you using? postgres-operator:v1.14.0
- Where do you run it - cloud or metal? Kubernetes or OpenShift? Rancher Kubernetes
- Are you running Postgres Operator in production? yes
- Type of issue? Bug report
We had a Zalando-operator-managed Postgres pod get stuck in a read-only state. While we were increasing the memory of our Postgres cluster in Kubernetes today, the cluster promoted the replica to leader, but writes to the new leader failed because it remained read-only.
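For context, the state we mean by "read-only" can be confirmed on the pod by comparing `pg_is_in_recovery()` with `transaction_read_only`: a node that reports `false` for the first and `on` for the second is a promoted leader that still refuses writes. A minimal sketch (connection details are placeholders, not our real values):

```python
import psycopg2

# Placeholder connection details for the pod under suspicion.
conn = psycopg2.connect(host="localhost", port=5032, dbname="postgres",
                        user="postgres", connect_timeout=3)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery(), current_setting('transaction_read_only')")
        in_recovery, read_only = cur.fetchone()
    if not in_recovery and read_only == "on":
        # Promoted leader that still rejects writes -- the state described above.
        print("node is primary but still read-only")
    else:
        print(f"in_recovery={in_recovery}, transaction_read_only={read_only}")
finally:
    conn.close()
```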
Here are the logs from that event:
postgres-1 (new leader):
2025-05-06 21:25:14,413 INFO: promoted self to leader by acquiring session lock
server promoting
2025-05-06 21:25:15,465 INFO: no action. I am (gitlab-code-postgres-1), the leader with the lock
ERROR: cannot execute ALTER ROLE in a read-only transaction
CONTEXT: SQL statement "ALTER ROLE admin WITH CREATEDB NOLOGIN NOCREATEROLE NOSUPERUSER NOREPLICATION INHERIT"
PL/pgSQL function inline_code_block line 5 at SQL statement
postgres-0 (new replica):
2025-05-06 21:25:15,429 INFO: Lock owner: gitlab-code-postgres-1; I am gitlab-code-postgres-0
2025-05-06 21:25:15,430 INFO: switchover: demote in progress
2025-05-06 21:25:16,382 INFO: Local timeline=51 lsn=6/9B000028
2025-05-06 21:25:16,396 ERROR: Exception when working with leader
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[rewind.py](https://rewind.py/)", line 80, in check_leader_is_not_in_recovery
with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
File "/usr/lib/python3.10/[contextlib.py](https://contextlib.py/)", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/[connection.py](https://connection.py/)", line 158, in get_connection_cursor
conn = psycopg.connect(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/patroni/[psycopg.py](https://psycopg.py/)", line 123, in connect
ret = _connect(*args, **kwargs)
File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "****", port [****] failed: session is read-only
2025-05-06 21:25:16,397 INFO: closed patroni connections to postgres
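For what it's worth, the "session is read-only" text in that traceback looks like the libpq error returned when a connection requests `target_session_attrs=read-write` but the server reports the session as read-only (PostgreSQL 14+ wording). If that reading is right, Patroni's check against the promoted node was rejected because the node itself still claimed to be read-only. That part can be reproduced in isolation; the DSN below is a placeholder and assumes a reasonably recent libpq on the client:

```python
import psycopg2

# Placeholder DSN pointing at a server where default_transaction_read_only is on.
# Requesting a read-write session makes libpq itself refuse the connection.
dsn = ("host=10.0.0.1 port=5032 dbname=postgres user=postgres "
       "connect_timeout=3 target_session_attrs=read-write")
try:
    psycopg2.connect(dsn)
except psycopg2.OperationalError as exc:
    print("rejected:", exc)  # message ends with "session is read-only"
```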
Restarting the pods several times seems to have resolved the issue.
There seems to be a class of similar issues in Patroni. Here is one with etcd: patroni/patroni#3314, and one with Consul: patroni/patroni#2972.
- If you feel this issue might be more related to the Spilo docker image or Patroni, consider opening issues in the respective repos.
Honestly, I'm not sure; this is my first time using the Zalando operator (my previous experience is with plain Patroni). Since a similar class of bugs exists in Patroni itself, perhaps the problem is with the Kubernetes DCS?
We have not reproduced it. However, it might be possible to reproduce with memory pressure.
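If we do try to reproduce it under memory pressure, a simple write probe against the read-write service should catch the window in which the promoted leader rejects writes, mirroring the ALTER ROLE failure above. This is only a sketch: the DSN is a placeholder and `rw_probe_heartbeat` is a hypothetical scratch table that would have to be created beforehand.

```python
import time
import psycopg2

# Placeholder DSN for the cluster's read-write (leader) service.
DSN = "host=gitlab-code-postgres port=5032 dbname=postgres user=postgres connect_timeout=2"

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        conn = psycopg2.connect(DSN)
        try:
            with conn.cursor() as cur:
                # Fails with "cannot execute INSERT in a read-only transaction"
                # whenever the leader is stuck read-only.
                cur.execute("INSERT INTO rw_probe_heartbeat (ts) VALUES (now())")
            conn.commit()
            print(stamp, "write OK")
        finally:
            conn.close()
    except psycopg2.Error as exc:
        print(stamp, "write FAILED:", str(exc).strip())
    time.sleep(1)
```

For reference, the Patroni configuration from the affected cluster follows.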
bootstrap:
  dcs:
    failsafe_mode: false
    loop_wait: 10
    maximum_lag_on_failover: 33550432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: 500
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 50
        track_functions: all
        wal_compression: 'on'
        wal_keep_size: 4096MB
        wal_level: hot_standby
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
    - auth-host: md5
    - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  leader_label_value: master
  pod_ip: 50.50.50.50
  port: tcp://50.12.0.1:443
  port_443_tcp: tcp://50.12.0.1:443
  port_443_tcp_addr: 50.12.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
    - name: postgresql
      port: 5032
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 50.12.0.1
  service_port: '443'
  service_port_https: '443'
  standby_leader_label_value: master
  use_endpoints: true
namespace: gitlab-code
postgresql:
  authentication:
    replication:
      password:
      username: standby
    superuser:
      password:
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/14/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 50.50.50.50:5032
  create_replica_method:
    - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5032'
  name: gitlab-code-postgres-1
  parameters:
    archive_command: /bin/true
    bg_mon.history_buckets: 120
    bg_mon.listen_address: 0.0.0.0
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,timescaledb,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 8002MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_ca_file: /tlsca/caFile
    ssl_cert_file: /tls/tls.crt
    ssl_key_file: /tls/tls.key
  pg_hba:
    - local all all trust
    - hostssl all +zalandos 127.0.0.1/32 pam
    - host all all 127.0.0.1/32 md5
    - hostssl all +zalandos ::1/128 pam
    - host all all ::1/128 md5
    - local replication standby trust
    - hostssl replication standby all md5
    - hostnossl all all all reject
    - hostssl all +zalandos all pam
    - hostssl all all all md5
  pgpass: /run/postgresql/pgpass
  use_unix_socket: true
  use_unix_socket_repl: true
restapi:
  connect_address: 50.50.50.50:8008
  listen: :8008
scope: gitlab-code-postgres

- Patroni version: 4.0.4
- PostgreSQL version: 14
- DCS (and its version): Kubernetes
Please let me know if any additional information would be useful, or if you think I should file this against Patroni or Spilo instead.