Failed to AttachVolume, expected device to be attached but was attaching #1801

Closed
adovy opened this issue Oct 25, 2023 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@adovy

adovy commented Oct 25, 2023

/kind bug

What happened?

Getting Pod error FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-XXX" : rpc error: code = Internal desc = Could not attach volume "vol-XXX" to node "i-XXX": attachment of disk "vol-XX" failed, expected device to be attached but was attaching

Error log from ebs-csi-controller:

I1025 11:08:34.489703       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-XXX" nodeID="i-XXX"
E1025 11:08:49.490233       1 driver.go:124] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"vol-XXX\" to node \"i-XXX\": context canceled"
I1025 11:08:49.497484       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-XXX" nodeID="i-XXX"
E1025 11:08:49.629393       1 driver.go:124] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"vol-XXX\" to node \"i-XXX\": attachment of disk \"vol-XXX\" failed, expected device to be attached but was attaching"

What you expected to happen?

EBS Volumes to be attached successfully to a node when a PVC is created.

Anything else we need to know?:

I can see in the AWS console that the volume actually is attached to the node.
P.S. This doesn't happen with all volumes.
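For reference, the attachment state that EC2 itself reports can be checked directly; a minimal sketch with the AWS CLI (vol-XXX is the redacted volume ID from the error above):

# Shows whether EC2 considers the volume attaching, attached, or detaching
aws ec2 describe-volumes --volume-ids vol-XXX \
  --query 'Volumes[0].Attachments[].{Instance:InstanceId,State:State,Device:Device}'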

Environment

  • Kubernetes version: v1.27.4-eks-2d98532
  • Driver version: v1.24.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 25, 2023
@ConnorJC3
Contributor

Do volumes get stuck in this state indefinitely (or for an abnormally long time)? It's normal to see this message when a volume is taking a long time to attach on EBS's side, but it should eventually resolve itself once the volume becomes attached.
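A quick way to check whether an attachment is actually stuck on the Kubernetes side (a sketch, assuming kubectl access to the cluster):

# List all VolumeAttachment objects; ATTACHED=false marks ones still in flight
kubectl get volumeattachment
# Describe one to see the CSI driver's last reported attach error, if any
kubectl describe volumeattachment <name>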

@adovy
Author

adovy commented Oct 30, 2023

In the AWS console I can see that the volume is attached to the node, but the controller keeps trying to attach it.

@adovy
Author

adovy commented Nov 15, 2023

It seems the problem was that we hit the limit of EBS volumes per node.
But even after setting volumeAttachLimit (see the sketch below), pods are still scheduled onto nodes that have already reached the attached-volume limit.
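For context, a minimal sketch of how the limit can be set via the Helm chart (node.volumeAttachLimit passes --volume-attach-limit to the node service; the value 25 is illustrative):

helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.volumeAttachLimit=25

The limit is advertised through the CSINode allocatable count, which the scheduler consults, so the node pods must re-register (restart) before a new value takes effect. Note that only CSI-managed volumes count against it; the root device or ENIs on Nitro instances can still consume attachment slots.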

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 13, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aws-patrickc

Hello,

Is there any workaround besides manual intervention? Could this be considered a feature request for the EBS CSI controller to throw an error, or better yet, to notify the scheduler to reschedule the pod onto a different worker node?

@divya898

I want to deploy Postgres on a k8s cluster built with kubeadm, and I'm getting this issue:

"E0718 13:31:52.030062 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
E0718 13:31:52.030114 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume "vol-xxxxxxxxxxxxx" to node "i-xxxxxxxxxxxxx": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded""

I have also created a PV and a PVC, which are bound with the help of the ebs-csi-driver, and I deployed the CSI driver using Helm.

"ubuntu@k8smaster:/application/postgres$ kubectl get pvc -n app
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
postgres-pvc Bound existing-ebs-pv 60Gi RWO ebs-postgres 115m
ubuntu@k8smaster:
/application/postgres$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
existing-ebs-pv 60Gi RWO Retain Bound app/postgres-pvc ebs-postgres 115m"

Also, I have configured the IAM role correctly and attached it to all nodes, with this policy:
"
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:CreateSnapshot",
"ec2:CreateTags",
"ec2:CreateVolume",
"ec2:DeleteSnapshot",
"ec2:DeleteTags",
"ec2:DeleteVolume",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeInstances",
"ec2:DescribeSnapshots",
"ec2:DescribeTags",
"ec2:DescribeVolumes",
"ec2:DescribeVolumesModifications",
"ec2:AttachVolume",
"ec2:DetachVolume"
],
"Resource": "*"
}
]
}
"

Can anyone tell me what the issue is?

@torredil
Member

@divya898 Deploying the driver with SDK debug mode enabled and taking a look at the logs once again will be very helpful for debugging the issue.
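The repeated `context deadline exceeded` on DescribeInstances also suggests the controller may not be reaching the EC2 API at all; a minimal reachability probe from one of the nodes (a sketch — substitute your region for us-east-1):

# A non-2xx status is fine here; what matters is that the endpoint answers at all
curl -sS -m 10 -o /dev/null -w '%{http_code}\n' https://ec2.us-east-1.amazonaws.com/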

@divya898

divya898 commented Jul 18, 2024

@torredil
Yes, I used this command: helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
and installed the CSI driver.

pod/ebs-csi-controller-86957f8467-785bg 5/5 Running 0 5h54m
pod/ebs-csi-controller-86957f8467-vxcmc 5/5 Running 0 5h54m
pod/ebs-csi-node-4jkrz 3/3 Running 0 5h54m
pod/ebs-csi-node-kk6qw 3/3 Running 0 5h54m
pod/ebs-csi-node-zrljn 3/3 Running 0 5h54m

And the PVC and PV bind successfully, but the issue persists: the pods are stuck in the creating state.

Events:
  Type     Reason              Age               From                     Message
  ----     ------              ----              ----                     -------
  Normal   Scheduled           72s               default-scheduler        Successfully assigned app/postgres-777d8c6765-d2tct to k8sworker2.example.net
  Warning  FailedAttachVolume  4s (x8 over 71s)  attachdetach-controller  AttachVolume.Attach failed for volume "existing-ebs-pv" : volume attachment is being deleted

CSI driver logs.

Controller pod:

E0718 17:06:24.989148 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
E0718 17:06:24.989220 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
I0718 17:36:25.048006 1 controller.go:464] "ControllerUnpublishVolume: detaching" volumeID="vol-02a0b78825ece9614" nodeID="i-0492a4a4bf1856321"
E0718 17:36:55.551709 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
E0718 17:36:55.551774 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
I0718 17:38:21.934625 1 controller.go:464] "ControllerUnpublishVolume: detaching" volumeID="vol-02a0b78825ece9614" nodeID="i-0492a4a4bf1856321"
E0718 17:38:52.436619 1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
E0718 17:38:52.436696 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume "vol-02a0b78825ece9614" from node "i-0492a4a4bf1856321": error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-controller-86957f8467-vxcmc --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, csi-provisioner, csi-attacher, csi-resizer, liveness-probe
I0718 11:18:13.486934 1 main.go:157] "Initializing metadata"
I0718 11:18:13.509506 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:13.510885 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
ubuntu@k8smaster:~/application/postgres$

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-4jkrz --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0718 11:18:12.593894 1 main.go:157] "Initializing metadata"
I0718 11:18:12.633046 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:12.633999 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
E0718 11:18:13.644744 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net"
E0718 11:18:14.152630 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net"
E0718 11:18:15.159355 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8smaster.example.net"
I0718 11:18:17.167929 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8smaster.example.net" count=38

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-kk6qw --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0718 11:18:12.840835 1 main.go:157] "Initializing metadata"
I0718 11:18:12.876464 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:12.878910 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
E0718 11:18:13.905032 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net"
E0718 11:18:14.413693 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net"
E0718 11:18:15.426821 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker2.example.net"
I0718 11:18:17.433850 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8sworker2.example.net" count=26

ubuntu@k8smaster:~/application/postgres$ k logs ebs-csi-node-zrljn --tail=100 -n kube-system
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0718 11:18:11.396228 1 main.go:157] "Initializing metadata"
I0718 11:18:11.405170 1 metadata.go:48] "Retrieved metadata from IMDS"
I0718 11:18:11.405756 1 driver.go:68] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.32.0"
E0718 11:18:12.418598 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net"
E0718 11:18:12.926204 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net"
E0718 11:18:13.932893 1 node.go:856] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node k8sworker1.example.net"
I0718 11:18:15.939380 1 node.go:936] "CSINode Allocatable value is set" nodeName="k8sworker1.example.net" count=38

Can you please tell me what I am missing?
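For reference, the `volume attachment is being deleted` event above means the old VolumeAttachment object is stuck terminating; a minimal way to inspect it (a sketch; `<name>` comes from the first command's output):

# Find the VolumeAttachment that references the stuck PV
kubectl get volumeattachment | grep existing-ebs-pv
# A set deletionTimestamp plus a remaining finalizer means the CSI detach never completed
kubectl get volumeattachment <name> \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'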

@torredil
Member

@divya898 It looks like SDK debug logging has not been enabled; you can enable it via the controller.sdkDebugLog=true parameter. Example:

helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.sdkDebugLog=true

You should then be able to see, directly in the CSI controller logs, more detailed information that will help troubleshoot why the HTTP request is timing out:

Action=DescribeInstances&InstanceId.1=i-0c73a509f58829453&Version=2016-11-15
SDK 2024/07/18 17:51:31 DEBUG Response
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Cache-Control: no-cache, no-store
Content-Type: text/xml;charset=UTF-8
Date: Thu, 18 Jul 2024 17:51:31 GMT
Server: AmazonEC2
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: accept-encoding
X-Amzn-Requestid: 693436ba-xxxx-xxxx-xxxx-a03a27ea5640
...

@divya898

Hi @torredil,
Thank you for your time. My issue was solved by restarting the csi-driver pods after updating the Helm repo.
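For anyone hitting the same thing, a sketch of that restart (assuming the chart's default workload names in kube-system, which match the pod names listed above):

kubectl rollout restart deployment/ebs-csi-controller -n kube-system
kubectl rollout restart daemonset/ebs-csi-node -n kube-system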

@torredil
Member

@divya898 Glad to hear 👍

Let us know if you run into any other issues and we'll be happy to help.
