handling describe instances consistency issue #801
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Hi @vdhanan. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, a member should reply to approve testing; once the patch is verified, the new status will be reflected on the PR.
dc55f33 to 51f33fa
Pull Request Test Coverage Report for Build 1851 (Coveralls)
/unlabel do-not-merge/work-in-progress |
b53aa12 to 1833d73
/ok-to-test |
1833d73 to 35e3261
b456309 to 21a67f7
Can you describe in detail the exact scenario this is fixing? If a volume is in the detaching state and we try to attach it back to the same node, what's the problem? BTW, I have been advocating for us to adopt the in-tree cloudprovider code instead of rolling our own and reinventing the wheel.
@wongma7 If a pod with a volume attached dies, the volume is detached, since we don't know where the pod will be recreated. The CSI driver confirms the detach by calling DescribeVolumes. If the pod is recreated on the same node, the driver attempts to reattach the volume. However, if the volume still appears in the DescribeInstances response during the attach workflow (that API is only eventually consistent, so it can report stale information), the driver assumes the volume is already assigned and does not try to attach it. By returning an error when we see a volume in the detaching state, we ensure the attach workflow retries, hopefully once DescribeInstances is reporting accurately. (Apologies if I used any incorrect terminology here.)
If the volume appears in the instance's attachment list, regardless of the volume's status, we treat it as attached to the instance. The driver then does not call the AttachVolume API and simply waits for the volume to become attached: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/cloud.go#L384
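The failure mode and the proposed guard can be sketched as follows. This is only an illustration under stated assumptions, not the driver's code: `attachment`, `decideAttach`, `errStaleAttachment`, and the plain-string states are hypothetical names invented here, whereas the real driver works with AWS SDK types.

```go
package main

import (
	"errors"
	"fmt"
)

// attachment is a hypothetical, simplified view of one entry in the
// attachment list that DescribeInstances reports for an instance.
type attachment struct {
	VolumeID string
	State    string // for example "attached" or "detaching"
}

// errStaleAttachment (hypothetical) tells the caller to retry the attach
// workflow later instead of trusting the listing.
var errStaleAttachment = errors.New("volume is listed as detaching; retry attach later")

// decideAttach reports whether AttachVolume needs to be called for volumeID.
// Without the "detaching" check, any listed volume would be treated as
// already assigned, which is the bug discussed in this thread.
func decideAttach(volumeID string, listed []attachment) (bool, error) {
	for _, a := range listed {
		if a.VolumeID != volumeID {
			continue
		}
		if a.State == "detaching" {
			// Possibly stale data: fail so the attach is retried.
			return false, errStaleAttachment
		}
		// Listed and not detaching: assume it is already assigned.
		return false, nil
	}
	// Not listed at all: the caller should attach it.
	return true, nil
}

func main() {
	stale := []attachment{{VolumeID: "vol-test-1234", State: "detaching"}}
	needAttach, err := decideAttach("vol-test-1234", stale)
	fmt.Println(needAttach, err)
}
```

Returning an error rather than waiting relies on the caller retrying the whole attach request, which is the behavior the comment above depends on.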
What if the volume gets detached and then in our reattach attempt the DescribeInstances is so stale that it returns the volume is still in state Notice in the cloud provider code they check the state in each polling attempt and break out if the volume unexpectedly becomes detached. https://github.com/kubernetes/kubernetes/blob/a55bd631728590045b51a4f65bba31aed1415571/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L2205. This is a better solution. I would like to see either |
@vdhanan In your testing, did you see the detached volume go back to "attached" status? I agree that we should port the in-tree waitForAttachmentStatus to CSI, as it is more robust.
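A polling loop that re-checks the volume state on every attempt, as discussed above, might look roughly like this. It is a simplified sketch loosely modeled on the behavior described for the in-tree waitForAttachmentStatus, not a copy of it: `volumeView`, `describeFunc`, `waitForState`, and the fixed-interval retry are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	stateAttached = "attached"
	stateDetached = "detached"
)

// volumeView is a hypothetical, simplified result of a DescribeVolumes call.
type volumeView struct {
	State      string // attachment state; "" means no attachment was reported
	InstanceID string
}

// describeFunc (hypothetical) abstracts the DescribeVolumes call so that it
// can be faked in tests.
type describeFunc func(volumeID string) (volumeView, error)

// waitForState polls until the volume reaches wantState. If the caller
// believed the volume was already assigned but a poll reports it detached,
// it fails immediately so the attach can be retried instead of waiting out
// the full timeout.
func waitForState(describe describeFunc, volumeID, wantState, wantInstance string,
	alreadyAssigned bool, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		v, err := describe(volumeID)
		if err != nil {
			return err // sketch only: no tolerance for transient errors
		}
		state := v.State
		if state == "" {
			state = stateDetached
		}
		if wantState == stateAttached && alreadyAssigned && state == stateDetached {
			return fmt.Errorf("volume %q: expected attached but found detached", volumeID)
		}
		if state != stateDetached && wantInstance != "" && v.InstanceID != wantInstance {
			continue // attachment to another instance: likely stale, poll again
		}
		if state == wantState {
			return nil
		}
	}
	return errors.New("timed out waiting for volume state")
}

func main() {
	// Fake that always reports the volume as detached.
	detached := func(string) (volumeView, error) { return volumeView{}, nil }
	err := waitForState(detached, "vol-test-1234", stateAttached, "1234", true, 5, time.Millisecond)
	fmt.Println(err)
}
```

Injecting the describe call as a function keeps the loop testable without AWS access, which matters here because the underlying race is hard to reproduce.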
21a67f7 to 2b9acda
I ported most of the waitForAttachmentStatus function from in-tree. I think after GA we should just consume the vendor code directly like Matthew mentioned. |
Can you add some unit tests for the updated function?
2b9acda to 870829a
/lgtm Thanks. I am much more confident if we just copy the code. I know it's not so glamorous, but since this issue is so tricky to debug and test (the race condition is hard to reproduce), I think it's the best option!
pkg/cloud/cloud_test.go
Outdated
ctx := context.Background()

switch tc.name {
case "success: detached":
(Just a style alternative): if you want to avoid depending on the test case name, you could make these anonymous functions, something like this: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/driver/node_test.go#L69
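The suggested style can be sketched like this: each table entry carries its own anonymous function, so nothing switches on the case name. All names here (`normalizeState`, `testCase`, `cases`, `runCases`) are hypothetical, and the sketch runs as a plain program; in a real `_test.go` file the field would typically be a `func(t *testing.T)` invoked through `t.Run`.

```go
package main

import "fmt"

// normalizeState is a hypothetical function under test: a volume with no
// reported attachment state is treated as detached.
func normalizeState(state string) string {
	if state == "" {
		return "detached"
	}
	return state
}

// testCase pairs a name with its own check, so the runner never needs to
// branch on the name.
type testCase struct {
	name  string
	check func() error // a nil error means the case passed
}

// cases returns the table of test cases.
func cases() []testCase {
	return []testCase{
		{
			name: "success: detached",
			check: func() error {
				if got := normalizeState(""); got != "detached" {
					return fmt.Errorf("got %q, want %q", got, "detached")
				}
				return nil
			},
		},
		{
			name: "success: attached",
			check: func() error {
				if got := normalizeState("attached"); got != "attached" {
					return fmt.Errorf("got %q, want %q", got, "attached")
				}
				return nil
			},
		},
	}
}

// runCases executes every case and returns a description of each failure.
func runCases(tcs []testCase) []string {
	var failed []string
	for _, tc := range tcs {
		if err := tc.check(); err != nil {
			failed = append(failed, tc.name+": "+err.Error())
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", len(runCases(cases())))
}
```

Renaming a case in this layout cannot silently change which assertions run, which is the dependency the comment above is trying to avoid.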
that's definitely cleaner. i'll use this style next time
pkg/cloud/cloud_test.go
Outdated
name: "failure: already assigned but wrong state",
volumeID: "vol-test-1234",
expectedState: volumeAttachedState,
expectedInstance: "1235",
should this say 1234? otherwise this test is giving us a false positive?
I think it's giving a false positive at the moment because we should be testing this case: if we set alreadyAttached to true and DescribeVolumes returns that the volume is detached, we want to error. Correct me if I'm wrong.
yup you're right, it should be 1234
870829a to 307ed14
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: AndyXiangLi, vdhanan The full list of commands accepted by this bot can be found here. The pull request process is described here
Is this a bug fix or adding new feature?
fixes #389
What is this PR about? / Why do we need it?
The DescribeInstances API follows an eventual consistency model. The CSI driver should handle the fact that it may get inconsistent responses from this API.
What testing is done?
Manual testing, attempting to reproduce the issue.