Description
I have been seeing transient failures at https://github.com/coreos/flannel/blob/master/backend/vxlan/vxlan_network.go#L145, especially under heavy memory contention. The current code in vxlan_network.go has no facility for retries, so when such a failure occurs the newly observed subnet is never properly added to netlink, which gives rise to #958.
After discussion in the 10/29/2020 community meeting, Rajat proposed that, rather than adding retries within vxlan_network.go, we move to an architecture with a global reconciliation loop. The loop would run on a configurable interval, compare the etcd or k8s contents with the current set of iptables rules, and make any required changes. This would help reduce or eliminate all persistent error conditions, not just the one underlying #958.
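
To make the proposal concrete, here is a minimal sketch of what such a loop could look like. It is not flannel code: `DesiredSource`, `Dataplane`, `reconcileOnce`, `runReconciler`, and the fake implementations are all hypothetical names chosen for illustration, and the real comparison (etcd or k8s against the rules and netlink state on the host) would sit behind those two interfaces. The point it shows is that a failed operation needs no dedicated retry logic, because the next pass sees the same difference again and repairs it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
	"time"
)

// DesiredSource is a hypothetical view of the subnets recorded in etcd or k8s.
type DesiredSource interface{ Subnets() ([]string, error) }

// Dataplane is a hypothetical view of the state currently programmed on the host.
type Dataplane interface {
	Subnets() ([]string, error)
	Add(subnet string) error
	Remove(subnet string) error
}

// diff returns the subnets to add and to remove so that actual matches desired.
func diff(desired, actual []string) (toAdd, toRemove []string) {
	want, have := map[string]bool{}, map[string]bool{}
	for _, s := range desired {
		want[s] = true
	}
	for _, s := range actual {
		have[s] = true
	}
	for s := range want {
		if !have[s] {
			toAdd = append(toAdd, s)
		}
	}
	for s := range have {
		if !want[s] {
			toRemove = append(toRemove, s)
		}
	}
	sort.Strings(toAdd)
	sort.Strings(toRemove)
	return toAdd, toRemove
}

// reconcileOnce runs a single pass. Individual failures do not abort the pass;
// whatever is still wrong is picked up again by the next pass.
func reconcileOnce(src DesiredSource, dp Dataplane) error {
	desired, err := src.Subnets()
	if err != nil {
		return err
	}
	actual, err := dp.Subnets()
	if err != nil {
		return err
	}
	toAdd, toRemove := diff(desired, actual)
	failed := 0
	for _, s := range toAdd {
		if dp.Add(s) != nil {
			failed++
		}
	}
	for _, s := range toRemove {
		if dp.Remove(s) != nil {
			failed++
		}
	}
	if failed > 0 {
		return fmt.Errorf("%d operations failed; will retry on next pass", failed)
	}
	return nil
}

// runReconciler calls reconcileOnce on every tick until ctx is cancelled.
func runReconciler(ctx context.Context, interval time.Duration, src DesiredSource, dp Dataplane) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := reconcileOnce(src, dp); err != nil {
				fmt.Println("reconcile:", err)
			}
		}
	}
}

// staticSource and flakyDataplane are in-memory fakes used only for this demo.
type staticSource []string

func (s staticSource) Subnets() ([]string, error) { return s, nil }

type flakyDataplane struct {
	state    map[string]bool
	failNext bool // when true, the next Add fails once, simulating a transient error
}

func (d *flakyDataplane) Subnets() ([]string, error) {
	out := make([]string, 0, len(d.state))
	for s := range d.state {
		out = append(out, s)
	}
	sort.Strings(out)
	return out, nil
}

func (d *flakyDataplane) Add(s string) error {
	if d.failNext {
		d.failNext = false
		return errors.New("transient failure")
	}
	d.state[s] = true
	return nil
}

func (d *flakyDataplane) Remove(s string) error { delete(d.state, s); return nil }

func main() {
	src := staticSource{"10.1.1.0/24", "10.1.2.0/24"}
	dp := &flakyDataplane{state: map[string]bool{"10.1.9.0/24": true}, failNext: true}
	fmt.Println("first pass:", reconcileOnce(src, dp))  // one Add fails
	fmt.Println("second pass:", reconcileOnce(src, dp)) // the missing subnet is added
}
```

In a real daemon, `runReconciler` would be started in its own goroutine with the configurable interval, and event-driven updates could keep running alongside it for low latency, with the loop acting as the safety net.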