Decrease dynamic provisioning time by 1.5 seconds #2021
Conversation
Code Coverage Diff: This PR does not change the code coverage.
/retest
Great work! Thank you for digging into this and for the efficiency improvement; this PR is very significant 💯
Force-pushed from 09e2507 to fa7ca3d
Force-pushed from fa7ca3d to 485db6a
/lgtm
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: torredil. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
Is this a bug fix or adding a new feature?
Improvement
What is this PR about? / Why do we need it?
Cut median dynamic provisioning latency in half (~3.7 to 2 seconds) by polling EBS volume creation more aggressively.
Today, when dynamically provisioning a volume via our CreateVolume RPC, we wait 3s after an EC2 CreateVolume call before starting to poll DescribeVolumes. Based on testing in us-west-2, ap-northeast-1, and us-east-1, the p10 and median time between calling EC2 CreateVolume and the volume being created is ~1.5s. Polling every 3 seconds means that a typical CreateVolume RPC takes ~3.5 seconds (with a worst case of 6+ seconds roughly once every hundred volumes).

In this PR, we use a more aggressive initial delay and polling interval. Finally, we switch to an exponential backoff in order to decrease the likelihood of being rate-limited on DescribeVolumes if volume creation time slows down.
What testing is done?
Measured seconds between external-provisioner CreateVolume RPC first start and first success for 100 PVCs launched across 100 pods on a 30-node cluster. Repeated 3 times for each combination of parameters. The final column is what we went with in this PR.
NOTE: Tested with a 500ms maxDelay (instead of 1s) for the describeVolume batcher, because we will decrease that value in a future parameter tuning PR. That maxDelay value mostly affected p90 and p95 (I presume those affected were the first DescribeVolume calls in every batch).
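For context on why maxDelay shows up in the tail percentiles, here is a hypothetical sketch of a request batcher; the type and field names are invented for illustration and are not the driver's internal batcher API. The first request to arrive in an empty batch starts the maxDelay timer and therefore waits the longest before its batch is flushed, which is why lowering maxDelay from 1s to 500ms mainly moves p90/p95.

```go
package example

import (
	"time"
)

// batcher coalesces volume IDs and flushes them as a single call (e.g. one
// DescribeVolumes request), either when the batch is full or when maxDelay
// has elapsed since the first ID arrived.
type batcher struct {
	input      chan string
	maxEntries int
	maxDelay   time.Duration      // e.g. 500ms in this PR's testing, 1s previously
	flush      func(ids []string) // e.g. wraps a DescribeVolumes call
}

func (b *batcher) run() {
	var (
		pending []string
		timer   *time.Timer
		timeout <-chan time.Time // nil channel blocks, so no flush while the batch is empty
	)
	for {
		select {
		case id := <-b.input:
			if len(pending) == 0 {
				// The first request in the batch starts the maxDelay clock;
				// it pays the largest coalescing penalty.
				timer = time.NewTimer(b.maxDelay)
				timeout = timer.C
			}
			pending = append(pending, id)
			if len(pending) >= b.maxEntries {
				timer.Stop()
				b.flush(pending)
				pending, timeout = nil, nil
			}
		case <-timeout:
			b.flush(pending)
			pending, timeout = nil, nil
		}
	}
}
```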