Servers have been extensively tested under network partitions through the functional-tester, and we haven't seen any failures in the last 5+ months.
Client behavior under a partitioned cluster needs more integration tests (though it has at least been manually tested and we know that it works as expected). Add more tests to `network_partition_test.go` once the new health balancer passes all current CIs.
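For reference, such a client is constructed by listing all cluster endpoints in `clientv3.Config` (a minimal sketch; the endpoint URLs are placeholders, the import path depends on the etcd version, and the keepalive fields only matter for the blackhole scenarios below):

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Sketch: client configured with all 3 endpoints so the balancer can
	// re-pin to B or C when A fails. Endpoint URLs are placeholders.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:            []string{"http://A:2379", "http://B:2379", "http://C:2379"},
		DialTimeout:          5 * time.Second,
		DialKeepAliveTime:    2 * time.Second, // HTTP/2 keepalive ping interval
		DialKeepAliveTimeout: 5 * time.Second, // declare the connection dead after this
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```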
3-node cluster, with the client configured with all 3 endpoints (`A`, `B`, `C`):
- client balancer pins `A`, and member `A` fails (status: manually tested)
  - Watch without `WithRequireLeader(context.Context)`, without HTTP/2 keepalive ping
    - balancer automatically switches to either `B` or `C` with a new connection
    - expect no error
    - done via clientv3/integration: add TestBalancerUnderServerShutdownWatch #8758
  - Put/Delete/Txn
    - current request returns an error to the client
    - balancer switches to either `B` or `C` with a new connection
    - manual retry is needed; next request succeeds with the new endpoint (`B` or `C`)
    - done via clientv3/integration: add TestBalancerUnderServerShutdownMutable* #8772
  - Linearizable Get
    - balancer switches to either `B` or `C` with a new connection
    - expect internal retry logic to handle server errors
    - no manual retry is needed; request should succeed with automatic retry
    - done via clientv3/integration: add TestBalancerUnderServerShutdownImmutable #8779
  - Serializable Get
    - balancer switches to either `B` or `C` with a new connection
    - expect internal retry logic to handle server errors
    - no manual retry is needed; request should succeed with automatic retry
    - done via clientv3/integration: add TestBalancerUnderServerShutdownImmutable #8779
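The pin-and-failover behavior exercised in this scenario can be illustrated with a toy balancer (a pure-Go sketch with made-up names like `toyBalancer` and `markDown`; the real clientv3 health balancer is considerably more involved):

```go
package main

import (
	"errors"
	"fmt"
)

// toyBalancer pins one endpoint and re-pins to another healthy one on failure.
// Illustration only; not the actual clientv3 health balancer.
type toyBalancer struct {
	endpoints []string
	healthy   map[string]bool
	pinned    string
}

func newToyBalancer(eps []string) *toyBalancer {
	h := make(map[string]bool, len(eps))
	for _, ep := range eps {
		h[ep] = true
	}
	return &toyBalancer{endpoints: eps, healthy: h, pinned: eps[0]}
}

// markDown records a failed endpoint and re-pins to any remaining healthy one.
func (b *toyBalancer) markDown(ep string) error {
	b.healthy[ep] = false
	for _, cand := range b.endpoints {
		if b.healthy[cand] {
			b.pinned = cand
			return nil
		}
	}
	return errors.New("no healthy endpoints")
}

func main() {
	b := newToyBalancer([]string{"A", "B", "C"})
	fmt.Println("pinned:", b.pinned) // pinned: A
	// member A fails; balancer switches to the next healthy endpoint
	if err := b.markDown("A"); err != nil {
		panic(err)
	}
	fmt.Println("pinned:", b.pinned) // pinned: B
}
```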
- client balancer pins `A`, and member `A` is partitioned from members `B` and `C`, while the client can still talk to all 3 nodes (status: manually tested)

  ```sh
  iptables -A OUTPUT -p tcp --destination-port 2380 -j DROP
  iptables -A INPUT -p tcp --destination-port 2380 -j DROP
  ```

  - Watch with `WithRequireLeader(context.Context)`, without HTTP/2 keepalive ping
    - expect `etcdserver.ErrNoLeader` (watch channel should be closed)
    - done via clientv3/integration: add TestBalancerUnderNetworkPartitionWatch #8762
  - Put/Delete/Txn
    - current request returns an error to the client
    - balancer switches to either `B` or `C` with a new connection
    - manual retry is needed; next request succeeds with the new endpoint (`B` or `C`)
    - Put done via clientv3/balancer: handle network partition in health check #8669
    - Delete, Txn done via clientv3/integration: finish isolated node test cases #8785
  - Linearizable Get
    - balancer switches to either `B` or `C` with a new connection
    - expect internal retry logic to handle server errors
      - e.g. `etcdserver: request timed out` when context timeout > request timeout
    - no manual retry is needed; request should succeed with automatic retry
    - done via clientv3/integration: finish isolated node test cases #8785
  - Serializable Get
    - no manual retry is needed; request should succeed with automatic retry
    - done via clientv3/integration: finish isolated node test cases #8785
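The "manual retry" expectation for mutable operations can be sketched generically (a pure-Go toy; `putOnce` and `manualRetry` are hypothetical names standing in for a Put against the pinned endpoint, not clientv3 API):

```go
package main

import (
	"errors"
	"fmt"
)

// errPartitioned stands in for the error a request sees while the pinned
// member is isolated (e.g. "etcdserver: request timed out").
var errPartitioned = errors.New("etcdserver: request timed out")

// putOnce simulates a single mutable request: it fails while "A" is pinned
// and succeeds once the balancer has re-pinned to another endpoint.
func putOnce(pinned string) error {
	if pinned == "A" {
		return errPartitioned
	}
	return nil
}

// manualRetry mirrors what a caller must do for Put/Delete/Txn: observe the
// error, let the balancer re-pin, then issue the request again itself.
func manualRetry(pinned *string) error {
	if err := putOnce(*pinned); err != nil {
		*pinned = "B" // balancer switches to B or C with a new connection
		return putOnce(*pinned)
	}
	return nil
}

func main() {
	pinned := "A"
	if err := manualRetry(&pinned); err != nil {
		panic(err)
	}
	fmt.Println("succeeded via", pinned) // succeeded via B
}
```

Linearizable and serializable Gets, by contrast, are retried internally, so no equivalent caller-side loop is needed for them.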
- client balancer pins `A`, and member `A` is blackholed (status: manually tested)

  ```sh
  iptables -A INPUT -p tcp --destination-port 2379 -j DROP
  ```

  - Watch without `WithRequireLeader(context.Context)`, with HTTP/2 keepalive ping
    - balancer automatically switches to either `B` or `C` with a new connection
    - expect no error
    - done via *: add watch with client keepalive test #8626 and clientv3/integration: move to TestBalancerUnderBlackholeKeepAliveWatch #8792
  - Put/Delete/Txn
    - current request returns an error to the client
    - balancer switches to either `B` or `C` with a new connection
    - manual retry is needed; next request succeeds with the new endpoint (`B` or `C`)
    - Put done via clientv3/integration: add put blackhole test #8769
    - Delete, Txn done via clientv3/integration: add blackhole tests on mutable operations #8789
  - Linearizable Get
    - current request returns an error to the client (`context.DeadlineExceeded`)
    - manual retry is needed; next request succeeds with the new endpoint (`B` or `C`)
    - done via clientv3/integration: add blackhole tests for range RPCs #8790
  - Serializable Get
    - current request returns an error to the client (`context.DeadlineExceeded`)
    - manual retry is needed; next request succeeds with the new endpoint (`B` or `C`)
    - done via clientv3/integration: add blackhole tests for range RPCs #8790