
Bump default API QPS limits for Kubelet #116121


Merged
merged 2 commits into kubernetes:master on Mar 7, 2023

Conversation

@wojtek-t (Member) commented Feb 28, 2023

Ref kubernetes/enhancements#1040

Based on different experiments with APF, we believe that we're ready to start our journey towards getting rid of client-side rate-limiting. Kubelet is the best first candidate, because:

  • it can be heavily throttled today (default QPS = 5)
  • in large clusters, kubelets can generate significant load anyway, simply because there are so many of them
  • kubelets have a dedicated PriorityLevel and FlowSchema that we can easily control
  • bumping the limit significantly helps e.g. pod startup time, as proven by our long-running scalability tests
Bump default API QPS limits for Kubelet.
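For background (not part of the PR text): the client-side limit under discussion is, roughly, the token-bucket rate limiter that client-go builds from the QPS/burst values the kubelet's kubeAPIQPS/kubeAPIBurst settings feed into its API client. A minimal sketch using client-go's public flowcontrol package, with the old default values purely for illustration:

    package main

    import (
        "context"
        "fmt"
        "time"

        "k8s.io/client-go/util/flowcontrol"
    )

    func main() {
        // Old kubelet defaults for illustration: 5 requests/second, burst of 10.
        limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)

        start := time.Now()
        for i := 0; i < 50; i++ {
            // Wait blocks until a token is available; this blocking is what
            // shows up as client-side throttling latency.
            if err := limiter.Wait(context.TODO()); err != nil {
                panic(err)
            }
        }
        // With QPS=5 and burst=10, 50 calls take roughly (50-10)/5 = 8 seconds.
        fmt.Printf("50 requests took %v at QPS=5, burst=10\n", time.Since(start))
    }

With the 50/100 values this PR ends up moving toward, the same 50-call loop would pass without blocking at all.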

/kind feature
/priority important-longterm
/sig node

/assign @lavalamp @deads2k
/cc @tkashem @MikeSpreitzer

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 28, 2023
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 28, 2023
@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 28, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 28, 2023
@@ -96,10 +96,10 @@ func SetDefaults_KubeletConfiguration(obj *kubeletconfigv1beta1.KubeletConfiguration) {
 		obj.RegistryBurst = 10
 	}
 	if obj.EventRecordQPS == nil {
-		obj.EventRecordQPS = utilpointer.Int32Ptr(5)
+		obj.EventRecordQPS = utilpointer.Int32Ptr(1000)
@wojtek-t (Member Author):

This is obviously not the right way of doing that, as we can't change defaults.

The question is how exactly we should do that:

Member:

I'm not sure this is so terrible in a config type?

Do you want to remove the limit or increase it a lot? At 1k you might as well remove it completely?

In that case an alternative could be to add a "UseRateLimit" field and default it to false?

I'd get kubelet authors to weigh in first.
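For illustration only, the alternative floated here might look roughly like the following; UseRateLimit is a hypothetical field that this PR did not add:

    // Hypothetical sketch of the alternative discussed above. UseRateLimit is
    // NOT a real KubeletConfiguration field; it only illustrates the idea of
    // a toggle that defaults to "no client-side rate limiting".
    package v1beta1

    type KubeletConfiguration struct {
        // ... existing fields such as KubeAPIQPS / KubeAPIBurst elided ...

        // UseRateLimit would turn the client-side rate limiter on or off as a whole.
        UseRateLimit *bool `json:"useRateLimit,omitempty"`
    }

    func SetDefaults_KubeletConfiguration(obj *KubeletConfiguration) {
        if obj.UseRateLimit == nil {
            off := false // default: client-side rate limiting disabled
            obj.UseRateLimit = &off
        }
    }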

@smarterclayton (Contributor):

I don't know why we wouldn't go to 25 or 50 as a first step, then jump another bit a release later, with soak time. I struggle to think of a kubelet, even on a 512-pod node, being able to saturate 50 QPS anyway.

@wojtek-t (Member Author):

> I struggle to think of kubelet even on 512 pod nodes being able to saturate 50qps anyway

For an extended period of time, I agree. As a spike, it actually can: say a large pod just finished on a node and there's now room to start 30 new small pods with 10 secrets each. That alone is on the order of 400 API calls...

That said, I'd be fine with doing this in steps, as long as it doesn't involve big overhead like creating a new config version to avoid changing defaults within a single (group, version) or something.
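A back-of-the-envelope sketch (mine, not from the thread) of how long such a ~400-call spike stays throttled under the old defaults versus the 50/100 values discussed below, assuming the token bucket is full when the spike starts:

    package main

    import "fmt"

    // drainSeconds roughly estimates the delay a token bucket adds to n
    // back-to-back requests: the first `burst` pass immediately, the rest
    // wait for tokens that refill at `qps` per second.
    func drainSeconds(n, burst int, qps float64) float64 {
        if n <= burst {
            return 0
        }
        return float64(n-burst) / qps
    }

    func main() {
        const spike = 400 // e.g. 30 new pods x 10 secrets each, plus status updates

        fmt.Printf("old defaults (QPS=5,  burst=10):  ~%.0fs of throttling\n", drainSeconds(spike, 10, 5))
        fmt.Printf("new defaults (QPS=50, burst=100): ~%.0fs of throttling\n", drainSeconds(spike, 100, 50))
    }

This prints roughly 78 seconds for the old defaults versus 6 seconds for the new ones, which is the pod-startup-time effect the scalability tests point at.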

@tzneal (Contributor) commented Mar 3, 2023

When I looked at this for EKS, I found that I got most of the benefit just by doubling the QPS/burst. We've used 10 QPS / 20 burst as the default kubelet values for EKS AMIs since K8s 1.22. I'm happy to see the defaults increased.

The graph below shows time to pod readiness for 3k pods across 30 nodes.

[image: time-to-pod-readiness comparison]

The other situation where it helps is auto-scaling, e.g. new pods are created, causing new nodes to be launched, and large numbers of pods then schedule onto the same node at roughly the same time it becomes ready.

Refs:

Member:

> At 1k you might as well remove it completely?

I think having a safeguard at a very high, but not extremely high, value would protect the API server from errors in the kubelet. Virtually unlimited normal operation sounds reasonable to me.

@wojtek-t when I looked yesterday I thought the limit was increased to 100/500. I might not remember correctly, since I don't see a force-push there =). For me 100/500 would be a good first step, as @smarterclayton suggested.

One risk of 1k is that the kubelet can potentially become a noisy neighbor at the moment of rescheduling (like @wojtek-t described in #116121 (comment) with ~400 requests). It may be a problem when there are other API server clients on the node, or when the network is very limited and close to saturation... But we are keeping the configuration settings, so setups with limited network connectivity can change the defaults back.

I'd vote for 100/500 as a first step in 1.27.

Contributor:

This seems better than a --use-rate-limit flag that is false by default, because at least if someone previously specified these flags they would continue to work as the author expected.

I think the question for sig-node is more about whether we can change the default at all, and for us (api-machinery) about how confident we are in disabling the client-side rate limiter. We (redhat) have not yet tried disabling the rate limiter for kubelets.

@wojtek-t (Member Author):

> @wojtek-t when I looked yesterday I thought the limit increased to 100/500. I might not remember correctly since I don't see force push there =). For me 100/500 was the good first step as @smarterclayton suggested.

I didn't change that in the meantime.
FWIW, Clayton suggested even smaller, but for now I switched to 50/100. If we're OK with changing the defaults at the config level, this isn't much work, so it's certainly fine to go gradually.

> One risk of 1k is potentially we can make kubelet to become a noisy neighbor

Agree, but the fact that we're allowing clients to send requests doesn't mean that kube-apiserver won't reject them anyway. It's on kube-apiserver to decide whether it has the capacity to process them, and APF should do the job here.

> This seems better than --use-rate-limit that is false by default because at least if someone previously specified these flags they would continue to work as the author expected.

Agree - this is why I also started with this WIP PR.

> And to us (api-machinery) about how confident we are in disabling the client-side rate limiter. We (redhat) have not yet tried disabling the rate limiter for kubelets.

While I don't have full confidence in other components, I think we're pretty confident, from the Google side, in bumping kubelets at this point.
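As a side note (not from the thread), the server-side protection referred to above is visible as APF objects. A minimal client-go sketch for inspecting them, assuming the built-in system-nodes FlowSchema is what kubelet traffic matches and that a kubeconfig is available at the default path:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Illustrative wiring only; error handling trimmed to the essentials.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // system-nodes is the built-in FlowSchema that node (kubelet) traffic is
        // expected to match; it references the PriorityLevelConfiguration that
        // caps how much apiserver concurrency that traffic may consume.
        fs, err := client.FlowcontrolV1beta3().FlowSchemas().Get(context.TODO(), "system-nodes", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }
        pl, err := client.FlowcontrolV1beta3().PriorityLevelConfigurations().Get(
            context.TODO(), fs.Spec.PriorityLevelConfiguration.Name, metav1.GetOptions{})
        if err != nil {
            panic(err)
        }
        fmt.Printf("FlowSchema %s -> PriorityLevel %s (type %s)\n", fs.Name, pl.Name, pl.Spec.Type)
    }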

@wojtek-t (Member Author):

@SergeyKanzhelev - whom from the node team should I talk to about this?

@tzneal (Contributor) commented Mar 1, 2023

Some related info: I doubled the default kubelet QPS/burst for both the EKS AL2 and Bottlerocket AMIs last year for the same reasons. There are some metrics/numbers in the BR PR at bottlerocket-os/bottlerocket#2436

@SergeyKanzhelev (Member):

I think it is a very good improvement. I cannot comment from the backend side on whether there will be scalability issues, but for the majority of clusters these new defaults should simply work.

Maybe the release notes on this PR could be expanded to note that action is required: review the new defaults and adjust if needed, with a link back to the docs.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 2, 2023
@SergeyKanzhelev (Member):

From PR mechanics, we also have these values documented in a field description:

Those need to be changed as well.

It is not a backward-compatible change, but I don't think it is breaking anything, and I'd suggest we take it with the appropriate release notes.
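For reference, the field documentation being referred to lives on the config type itself. A rough sketch (paraphrased, not the verbatim upstream comments) of the kind of doc text that has to be kept in sync with the new defaults, assuming the v1beta1 field names KubeAPIQPS/KubeAPIBurst and the 50/100 values from this thread:

    package v1beta1

    // Paraphrased sketch of the field documentation that needs updating
    // alongside the defaulting change; comments are illustrative.
    type KubeletConfiguration struct {
        // kubeAPIQPS is the QPS to use while talking with the kubernetes apiserver.
        // Default: 50 (previously 5)
        KubeAPIQPS *int32 `json:"kubeAPIQPS,omitempty"`

        // kubeAPIBurst is the burst to allow while talking with the kubernetes apiserver.
        // Default: 100 (previously 10)
        KubeAPIBurst int32 `json:"kubeAPIBurst,omitempty"`
    }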

@gjkim42 (Member) commented Mar 4, 2023

I have a question.

Is it OK to increase the default limit with a v-2 (v1.25) kube-apiserver?

@SergeyKanzhelev (Member):

> Is it ok to increase the default limit with v-2 (v1.25) kube-apiserver?

Backwards version skew is not supported; the API server must be the higher version.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 6, 2023
@wojtek-t wojtek-t changed the title [WIP] Bump default API QPS limits for Kubelet Bump default API QPS limits for Kubelet Mar 6, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2023
@wojtek-t (Member Author) left a comment

@SergeyKanzhelev - thanks for the comments, PTAL


@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@dchen1107 (Member):

I saw the originally proposed 1000 reset back to 50, and burst set to 100. I am OK to start from here, especially as there are no concerns from API Machinery. The original configuration was chosen largely due to limitations on the API Machinery side.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2023
@k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: ec365c9044cd8410100dbe00e66411174c01332e

@tzneal (Contributor) left a comment

Happy to see these increased.

/lgtm

@lavalamp (Member) commented Mar 7, 2023

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, tzneal, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 7, 2023
@wojtek-t (Member Author) commented Mar 7, 2023

Thanks Dawn and Daniel!

/retest

@fedebongio (Contributor):

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 7, 2023
@SergeyKanzhelev (Member):

/label api-review

We discussed the mechanics of it on SIG Node and agreed it is acceptable to change this default.

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Mar 7, 2023
@k8s-ci-robot k8s-ci-robot merged commit fe6a51e into kubernetes:master Mar 7, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Mar 7, 2023
@pacoxu (Member) commented Mar 8, 2023

This helps in our clusters; the first throttling case we hit was "timeout expired waiting for volumes to attach or mount for pod". The solution, mentioned in https://access.redhat.com/solutions/4040681, is to alter kube-api-burst and kube-api-qps.

Labels
  • api-review Categorizes an issue or PR as actively needing an API review.
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/code-generation
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API
  • kind/feature Categorizes issue or PR as related to a new feature.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • release-note Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/node Categorizes an issue or PR as relevant to SIG Node.
  • size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
  • triage/accepted Indicates an issue or PR is ready to be actively worked on.