
Network glitch reducing etcd cluster availability seriously #7321

Closed
@shuting-yst


Hi,

I ran into a problem using the etcd API when the network is unstable. I use an etcd client (configured with 3 endpoints, one for each of the 3 etcd nodes) to get a value from the cluster. If one etcd node's network is broken, the request hangs for at least 16 minutes and then succeeds.

It's easy to reproduce:

  • Use Docker containers to start a 3-node etcd cluster.
  • Create a clientv3 client with all nodes' endpoints and issue one request.
  • Break the network of one etcd node.
  • Issue the request again. (This happens randomly, since the broken node may not be the one your client is currently connected to.)

The code looks like this:

func() {
	// Start a 3-node etcd cluster using docker run.
	// Errors are ignored throughout, as in the original test.
	clu, err := etcd.NewCluster(3)
	endpoints := []string{}
	for i := 0; i < 3; i++ {
		eps := clu.Client(i).Endpoints()
		endpoints = append(endpoints, eps...)
	}
	logger.Infof("Total endpoints %v\n", endpoints)
	client, err := clientv3.New(clientv3.Config{
		Endpoints:        endpoints,
		AutoSyncInterval: time.Second,
		DialTimeout:      5 * time.Second,
	})
	logger.Infof("Begin to get things")
	// Issue the first request while every node is reachable.
	_, err = client.Get(context.Background(), "noexit")

	for _, mem := range clu.GetMembers() {
		logger.Infof("netdown mem %s", mem.GetName())
		_, err = clu.DisconnectNet(mem.GetName()) // break this node's network

		// Request again; this is the call that can hang.
		_, err = client.Get(context.Background(), "noexit")
		logger.Infof("recover mem: %s", mem.GetName())
		_, err = clu.ConnectNet(mem.GetName()) // restore this node's network
	}
}

The etcd containers are started like this:

// create a network
docker network create --driver bridge %s --subnet %s
// create one etcd node  
"docker run -d --net=%s --ip=%s --name %s " + EtcdImagePath +
			" /usr/local/bin/etcd " +
			"--data-dir=data.etcd --name %s " +
			"--initial-advertise-peer-urls http://%s:2380 " +
			"--listen-peer-urls http://%s:2380 " +
			"--advertise-client-urls http://%s:2379 " +
			"--listen-client-urls http://%s:2379 " +
			"--initial-cluster " + cluster + " " +
			"--initial-cluster-state " + clusterState + " " +
			"--initial-cluster-token " + token
// break the network of one node
"docker network disconnect -f %s %s", c.netName, containerID
// restore the network of that node
"docker network connect %s %s", c.netName, containerID

I changed context.Background() to context.WithTimeout(context.Background(), 5*time.Second) and retried the request in a loop. The result:

  • After about 16 minutes, the Get operation succeeds.
  • Not every request is terminated after 5 seconds; the time each request takes to time out keeps increasing.

I0214 10:44:52.192850    2228 etcd_local.go:394] Disconnect the test container to netowrk. Container: etcd-s1fsqse-node-1. Network: etcd-s1fsqse. Output: . Err: <nil>
I0214 10:44:57.194600    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:02.195953    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:07.199983    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:12.201359    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:17.203196    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:22.206000    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:27.207338    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:32.209470    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:42.249128    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:45:53.651880    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:46:15.644939    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:46:34.199748    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:47:32.885118    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:49:01.816162    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:49:43.253304    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:51:00.247848    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:51:52.005616    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:53:01.580015    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:54:01.834655    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:55:17.626976    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:56:36.551973    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:57:45.628319    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:58:27.367759    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 10:59:36.572002    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 11:00:22.016167    2228 etcd_test.go:53] Error is context deadline exceeded
I0214 11:00:22.016486    2228 etcd_test.go:59] connect to other node

Questions:

  1. Why does the client request hang when the context has no cancellation or timeout? Where does it hang? Why does it hang for 16 minutes?
  2. The client has 3 endpoints. When one node is broken, shouldn't the client retry against another endpoint instead of hanging?
  3. When a request is canceled by its context and I simply retry with the same client, why can't the client switch to another endpoint even though I set "AutoSyncInterval" on the client? Why does it switch only after 16 minutes?

etcd version: 3.1.0

In my scenario, I grant a lease and call KeepAliveOnce (with a 5-second timeout context) for it, but the etcd client can't switch to another etcd node when I retry KeepAliveOnce, so the lease expires. This is a serious problem for me. How should I deal with it? Please help.
