Decrease dynamic provisioning time by 1.5 seconds #2021
Conversation
Code Coverage Diff: This PR does not change the code coverage.
/retest
Great work! Thank you for digging into this and for the efficiency improvement; this PR is very significant 💯
Force-pushed from 09e2507 to fa7ca3d
Force-pushed from fa7ca3d to 485db6a
/lgtm
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: torredil. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
Is this a bug fix or adding a new feature?
Improvement
What is this PR about? / Why do we need it?
Cut median dynamic provisioning latency in half (~3.7 to 2 seconds) by polling EBS volume creation more aggressively.
Today, when dynamically provisioning a volume via our CreateVolume RPC, we wait 3s after an EC2 CreateVolume call before starting to poll DescribeVolumes. Based on testing in us-west-2, ap-northeast-1, and us-east-1, the p10 and median time between calling EC2 CreateVolume and the volume being created is ~1.5s. Polling every 3 seconds means that a typical CreateVolume RPC takes ~3.5 seconds (with a worst case of 6+ seconds roughly once every hundred volumes).

In this PR, we use a more aggressive initial delay and polling interval. Finally, we switch to an exponential backoff in order to decrease the likelihood of being rate-limited on DescribeVolumes if volume creation time slows down.
What testing is done?
Measured seconds between external-provisioner CreateVolume RPC first start and first success for 100 PVCs launched across 100 pods on a 30-node cluster. Repeated 3 times for each combination of parameters. The final column is what we went with in this PR.
NOTE: Tested with a 500ms maxDelay (instead of 1s) for the describeVolume batcher, because we will decrease that value in a future parameter tuning PR. That maxDelay value mostly affected p90 and p95 (I presume those affected were the first DescribeVolume calls in every batch).
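For context on why maxDelay shows up in the tail percentiles, here is a hypothetical sketch of a request batcher; the type and field names are invented for illustration and are not the driver's internal batcher API. The first request to arrive in an empty batch starts the maxDelay timer and therefore waits the longest before its batch is flushed, which is why lowering maxDelay from 1s to 500ms mainly moves p90/p95.

```go
package example

import (
	"time"
)

// batcher coalesces volume IDs and flushes them as a single call (e.g. one
// DescribeVolumes request), either when the batch is full or when maxDelay
// has elapsed since the first ID arrived.
type batcher struct {
	input      chan string
	maxEntries int
	maxDelay   time.Duration      // e.g. 500ms in this PR's testing, 1s previously
	flush      func(ids []string) // e.g. wraps a DescribeVolumes call
}

func (b *batcher) run() {
	var (
		pending []string
		timer   *time.Timer
		timeout <-chan time.Time // nil channel blocks, so no flush while the batch is empty
	)
	for {
		select {
		case id := <-b.input:
			if len(pending) == 0 {
				// The first request in the batch starts the maxDelay clock;
				// it pays the largest coalescing penalty.
				timer = time.NewTimer(b.maxDelay)
				timeout = timer.C
			}
			pending = append(pending, id)
			if len(pending) >= b.maxEntries {
				timer.Stop()
				b.flush(pending)
				pending, timeout = nil, nil
			}
		case <-timeout:
			b.flush(pending)
			pending, timeout = nil, nil
		}
	}
}
```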