[CONTP-675]Autoscaling Failover local workload store check: subcommand, flare support #37248
Go Package Import Differences
Baseline: ad0e415

Uncompressed package size comparison
Comparison with ancestor
Diff per package
Decision: ✅ Passed
60f468b to fcb95df
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: ad0e415
Optimization Goals: ✅ No significant changes detected
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | ddot_metrics | memory utilization | +0.47 | [+0.35, +0.59] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | +0.34 | [+0.26, +0.41] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | +0.30 | [+0.21, +0.39] | 1 | Logs bounds checks dashboard |
➖ | otlp_ingest_metrics | memory utilization | +0.14 | [-0.02, +0.31] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | +0.11 | [-0.49, +0.71] | 1 | Logs |
➖ | ddot_logs | memory utilization | +0.07 | [-0.07, +0.21] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.06 | [-0.47, +0.60] | 1 | Logs |
➖ | docker_containers_cpu | % cpu utilization | +0.05 | [-2.99, +3.08] | 1 | Logs |
➖ | otlp_ingest_logs | memory utilization | +0.04 | [-0.09, +0.17] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.02, +0.01] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.29, +0.28] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.01 | [-0.59, +0.56] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | -0.02 | [-0.08, +0.05] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_500ms_latency | egress throughput | -0.03 | [-0.65, +0.59] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | -0.06 | [-0.29, +0.18] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | -0.06 | [-0.59, +0.47] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.06 | [-0.62, +0.49] | 1 | Logs |
➖ | file_tree | memory utilization | -0.07 | [-0.23, +0.08] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.09 | [-0.68, +0.51] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | -0.32 | [-3.07, +2.42] | 1 | Logs bounds checks dashboard |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.47 | [-1.34, +0.40] | 1 | Logs |
➖ | docker_containers_memory | memory utilization | -0.80 | [-0.85, -0.74] | 1 | Logs |
➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.89 | [-0.93, -0.85] | 1 | Logs |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | docker_containers_cpu | simple_check_run | 10/10 | |
✅ | docker_containers_memory | memory_usage | 10/10 | |
✅ | docker_containers_memory | simple_check_run | 10/10 | |
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
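The three criteria above can be read as a single boolean rule. The following is a minimal sketch in Go, not the Regression Detector's actual code: the Experiment struct, its field names, and isRegression are hypothetical names introduced for illustration, and the two sample rows are copied from the results table above.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the values the decision rule needs.
// The type and field names are hypothetical, chosen for this sketch only.
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90% confidence interval
	CIHigh       float64 // upper bound of the 90% confidence interval
	Erratic      bool    // true when the configuration marks the experiment "erratic"
}

// effectSizeTolerance is the 5.00% threshold quoted in the report.
const effectSizeTolerance = 5.0

// isRegression returns true only when all three criteria hold.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	rows := []Experiment{
		{Name: "docker_containers_memory", DeltaMeanPct: -0.80, CILow: -0.85, CIHigh: -0.74},
		{Name: "docker_containers_cpu", DeltaMeanPct: 0.05, CILow: -2.99, CIHigh: 3.08},
	}
	for _, e := range rows {
		fmt.Printf("%s regression=%t\n", e.Name, isRegression(e))
	}
}
```

Note that docker_containers_memory has a confidence interval that excludes zero, yet it is not flagged because its |Δ mean %| stays below the 5.00% tolerance.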
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Static quality checks: ✅
Please find below the results from static quality gates.
Successful checks
e1a69ef to 8b98f20
c6f8cff to 2b74246
Pull Request Overview
This PR adds a subcommand to the cluster agent for checking local autoscaling workload store data including workload entities and their metadata. Key changes include:
- Creation and update of the local autoscaling workload store logic with tests.
- Addition of a new CLI flag and endpoint for retrieving local autoscaling debug information.
- Implementation of both active and noop versions based on build tags.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
pkg/clusteragent/autoscaling/workload/loadstore/workload_status_test.go | Adds tests for the workload status updates. |
pkg/clusteragent/autoscaling/workload/loadstore/workload_status_noop.go | Provides a noop implementation for non‑kubeapiserver builds. |
pkg/clusteragent/autoscaling/workload/loadstore/workload_status.go | Implements the workload status aggregator for local autoscaling. |
pkg/cli/subcommands/autoscalerlist/command.go | Updates CLI command to support a new localstore flag and debug endpoint. |
cmd/cluster-agent/api/agent/agent.go | Adds a new API handler for local autoscaling workload check. |
Comments suppressed due to low confidence (1)
pkg/cli/subcommands/autoscalerlist/command.go:71
- [nitpick] The shorthand flag 'v' is commonly associated with 'verbose', which may cause confusion given its usage here for enabling the localstore output. Consider using a different, more specific shorthand (e.g. 'l') if possible.
cmd.Flags().BoolVarP(&cliParams.localstore, "localstore", "v", false, "print autoscaling localstore debug info")
a508b46 to 539f16b
if !ok || datapoints == nil {
	continue
}
fmt.Fprintf(w, "Namespace: %s, PodOwner: %s, MetricName: %s, Datapoints: %v\n",
the output is quite dense, i'm wondering if it makes sense to add more newlines between entries to make it easier to read (but i guess we expect a lot of metric data, so maybe that would make the output too long and unreadable as well)
With thousands of entities, it is better to keep it short for each item.
@@ -122,3 +134,34 @@ func getAutoscalerList(w io.Writer, url string) error {
	autoscalerDump.Print(w)
	return nil
}

func getLocalAutoscalingWorkloadCheck(w io.Writer, config config.Component) error {
	c := util.GetClient()
as a headsup: i think this might have to change depending on if #37247 gets merged in first
607291c to c33dbac
bee398a to 3432d07
}
for _, statsResult := range lStoreInfo.StatsResults {
	// Skip the disabled namespaces
	if _, ok := defaultDisabledNamespaces()[statsResult.Namespace]; ok {
do we expect the value returned by defaultDisabledNamespaces() to change very frequently? or can we fetch the disabled namespaces once outside of the loop and then check against it? (to avoid so many reinitializations)
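The suggestion in this comment can be sketched as follows. This is an illustration only, not the PR's code: the statsResult type, the namespace values returned by defaultDisabledNamespaces, and the filterEnabled helper are all hypothetical stand-ins, and the sketch assumes the set of disabled namespaces does not change while the loop runs.

```go
package main

import "fmt"

// statsResult is a hypothetical stand-in for the entries being printed.
type statsResult struct {
	Namespace string
	PodOwner  string
}

// defaultDisabledNamespaces is a stand-in that builds a new map on every
// call; the namespace names here are assumptions made for illustration.
func defaultDisabledNamespaces() map[string]struct{} {
	return map[string]struct{}{
		"kube-system": {},
		"kube-public": {},
	}
}

// filterEnabled fetches the disabled set once, before the loop, and then
// only performs map lookups per entry.
func filterEnabled(results []statsResult) []statsResult {
	disabled := defaultDisabledNamespaces() // built once, reused below
	kept := make([]statsResult, 0, len(results))
	for _, r := range results {
		if _, ok := disabled[r.Namespace]; ok {
			continue // skip the disabled namespaces
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	results := []statsResult{
		{Namespace: "kube-system", PodOwner: "coredns"},
		{Namespace: "default", PodOwner: "web"},
	}
	for _, r := range filterEnabled(results) {
		fmt.Printf("Namespace: %s, PodOwner: %s\n", r.Namespace, r.PodOwner)
	}
}
```

With thousands of entities, hoisting the call replaces one map allocation per entry with a single allocation for the whole pass.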
LGTM for @DataDog/container-platform files
c488c7e to 8a3d2d1
/merge |
View all feedbacks in Devflow UI.
This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
devflow unqueued this merge request: It did not become mergeable within the expected time |
8a3d2d1 to 3df9385
/merge |
View all feedbacks in Devflow UI.
This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
devflow unqueued this merge request: It did not become mergeable within the expected time |
3df9385 to e399e26
e399e26 to 0abefa1
/merge |
View all feedbacks in Devflow UI.
The expected merge time in
|
What does this PR do?
This PR adds an agent subcommand and a flare report for retrieving the local autoscaling status from the cluster agent, including the number of store entities, their metadata, etc.
Motivation
Describe how you validated your changes
Enable autoscaling failover setting
Go to leader cluster-agent
agent autoscaler-list --localstore
curl -v -k https://localhost:5005/local-autoscaling-check -H "Authorization: Bearer $DD_CLUSTER_AGENT_AUTH_TOKEN"
agent flare, and find local-autoscaling-check.json, example output
Possible Drawbacks / Trade-offs
Additional Notes