Description
Hi,
I ran into a problem using the etcd API when the network is unstable. I use an etcd client (configured with 3 endpoints, one for each of the 3 etcd nodes) to get a value from the cluster. If one of the etcd nodes loses its network connection, the request hangs for at least 16 minutes and then succeeds.
It's easy to reproduce:
- Use Docker containers to start a 3-node etcd cluster.
- Create a clientv3 client with all nodes' endpoints and send one request.
- Break the network of one etcd node.
- Send the request again. (Whether it hangs is random: it depends on whether the disconnected node is the one your client is currently connected to.)
The code looks like this:
func() {
    // Start a 3-node etcd cluster with docker run.
    clu, err := etcd.NewCluster(3)
    if err != nil {
        logger.Infof("Error is %v", err)
        return
    }
    endpoints := []string{}
    for i := 0; i < 3; i++ {
        eps := clu.Client(i).Endpoints()
        endpoints = append(endpoints, eps...)
    }
    logger.Infof("Total endpoints %v\n", endpoints)
    client, err := clientv3.New(clientv3.Config{
        Endpoints:        endpoints,
        AutoSyncInterval: time.Second,
        DialTimeout:      5 * time.Second,
    })
    if err != nil {
        logger.Infof("Error is %v", err)
        return
    }
    logger.Infof("Begin to get things")
    // Send one request before breaking anything.
    if _, err = client.Get(context.Background(), "noexit"); err != nil {
        logger.Infof("Error is %v", err)
    }
    for _, mem := range clu.GetMembers() {
        logger.Infof("netdown mem %s", mem.GetName())
        // Disconnect one node from the network.
        if _, err = clu.DisconnectNet(mem.GetName()); err != nil {
            logger.Infof("Error is %v", err)
        }
        // This request may hang.
        if _, err = client.Get(context.Background(), "noexit"); err != nil {
            logger.Infof("Error is %v", err)
        }
        logger.Infof("recover mem: %s", mem.GetName())
        // Reconnect the node to the network.
        if _, err = clu.ConnectNet(mem.GetName()); err != nil {
            logger.Infof("Error is %v", err)
        }
    }
}
The etcd containers are started like this:
// create a network
docker network create --driver bridge %s --subnet %s
// create one etcd node
"docker run -d --net=%s --ip=%s --name %s " + EtcdImagePath +
" /usr/local/bin/etcd " +
"--data-dir=data.etcd --name %s " +
"--initial-advertise-peer-urls http://%s:2380 " +
"--listen-peer-urls http://%s:2380 " +
"--advertise-client-urls http://%s:2379 " +
"--listen-client-urls http://%s:2379 " +
"--initial-cluster " + cluster + " " +
"--initial-cluster-state " + clusterState + " " +
"--initial-cluster-token " + token
// disconnect a node from the network
"docker network disconnect -f %s %s", c.netName, containerID
// reconnect the node to the network
"docker network connect %s %s", c.netName, containerID
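The `cluster` value passed to `--initial-cluster` is not shown above. As a minimal sketch, assuming it has etcd's usual comma-separated `name=peerURL` form, it could be built as follows. The `node` type, the node names, and the IP addresses are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical description of one etcd container.
type node struct {
	Name string
	IP   string
}

// initialCluster builds the value for --initial-cluster:
// a comma-separated list of name=peerURL pairs on peer port 2380.
func initialCluster(nodes []node) string {
	parts := make([]string, 0, len(nodes))
	for _, n := range nodes {
		parts = append(parts, fmt.Sprintf("%s=http://%s:2380", n.Name, n.IP))
	}
	return strings.Join(parts, ",")
}

func main() {
	// Example values only; the real names and IPs come from the test setup.
	nodes := []node{
		{Name: "node-0", IP: "172.28.0.2"},
		{Name: "node-1", IP: "172.28.0.3"},
		{Name: "node-2", IP: "172.28.0.4"},
	}
	fmt.Println(initialCluster(nodes))
}
```

Each name in this string has to match the `--name` given to the corresponding container, and each URL has to match that container's `--initial-advertise-peer-urls`.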
When I change context.Background() to a context created with context.WithTimeout (5*time.Second), the result is:
- After 16 minutes, the Get operation succeeds.
- Not every request is terminated after 5s; the time each request takes to time out keeps increasing.
I0214 10:44:52.192850 2228 etcd_local.go:394] Disconnect the test container to netowrk. Container: etcd-s1fsqse-node-1. Network: etcd-s1fsqse. Output: . Err: <nil>
I0214 10:44:57.194600 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:02.195953 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:07.199983 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:12.201359 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:17.203196 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:22.206000 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:27.207338 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:32.209470 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:42.249128 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:53.651880 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:46:15.644939 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:46:34.199748 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:47:32.885118 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:49:01.816162 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:49:43.253304 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:51:00.247848 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:51:52.005616 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:53:01.580015 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:54:01.834655 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:55:17.626976 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:56:36.551973 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:57:45.628319 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:58:27.367759 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:59:36.572002 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 11:00:22.016167 2228 etcd_test.go:53] Error is context deadline exceeded
I0214 11:00:22.016486 2228 etcd_test.go:59] connect to other node
Questions:
- Why does the client request hang when there is no context cancellation or timeout? Where does it hang? Why does it hang for 16 minutes?
- The client has 3 endpoints. When one node is broken, shouldn't the client retry against another endpoint instead of hanging?
- When a request is canceled by its context and I simply retry with the same client, why can't the client switch to another endpoint even though I set "AutoSyncInterval"? Why does it switch only after 16 minutes?
etcd version: 3.1.0
In my scenario, I grant a lease and call KeepAliveOnce (with a 5s timeout context) for the lease. But the etcd client can't switch to another etcd node when I retry KeepAliveOnce, so the lease expires. It's horrible! How should I deal with it? Please help.