
Karpenter Disrupted Nodes and EBS CSI Volume Attachment #2318

Open
jammerful opened this issue Jan 28, 2025 · 3 comments
jammerful commented Jan 28, 2025

I'm running into an issue where Karpenter wants to disrupt a node that has a StatefulSet pod running on it. Karpenter terminates all the non-DaemonSet pods on that node, but when the pod is scheduled to the new node it is unable to start, because the volume is still attached to the old node, and Karpenter is not able to finish terminating that node:

$ kubectl describe pod
Status:                    Terminating (lasts 3h5m)
...
Events:
  Type    Reason     Age                   From       Message
  ----    ------     ----                  ----       -------
  Normal  Nominated  6m1s (x79 over 164m)  karpenter  Pod should schedule on: nodeclaim/default-on-demand-p27q8, node/ip-10-221-64-33.ec2.internal
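
To map the pod's claim to its VolumeAttachment, one route is PVC → PV name → the attachment referencing that PV. A minimal sketch (the claim name and namespace are placeholders for your StatefulSet's PVC):

$ PV=$(kubectl get pvc <claim-name> -n <namespace> -o jsonpath='{.spec.volumeName}')
$ kubectl get volumeattachment \
    -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
    | grep "$PV"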

Describing the VolumeAttachment to see which node the volume is attached to:

$ kubectl describe volumeattachment csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
Name:         csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
Namespace:
Labels:       <none>
Annotations:  csi.alpha.kubernetes.io/node-id: i-048a0a6c6c9e79dd2
API Version:  storage.k8s.io/v1
Kind:         VolumeAttachment
Metadata:
  Creation Timestamp:  2025-01-24T03:25:51Z
  Finalizers:
    external-attacher/ebs-csi-aws-com
  Resource Version:  913618302
  UID:               26ce1744-6c4d-440b-a54b-aa4e9e02eb5c
Spec:
  Attacher:   ebs.csi.aws.com
  Node Name:  ip-10-221-66-172.ec2.internal

You can see that this is a different node from the one the pod is scheduled to. Looking at the EBS CSI driver's attacher, I don't see any mention of that attachment:

$ kubectl logs -n system-storage ebs-csi-driver-controller-659467997f-5rw4s -c csi-attacher | grep csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
<empty> (I confirmed this was the leader)

Once I run kubectl delete volumeattachment csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f, the pod that was stuck comes up on the new node.
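
For anyone hitting the same thing, a rough cleanup sketch that lists VolumeAttachments whose .spec.nodeName no longer matches a live Node. This assumes Karpenter has already removed the Node objects for terminated instances; review the output before deleting anything:

for va in $(kubectl get volumeattachment -o jsonpath='{.items[*].metadata.name}'); do
  node=$(kubectl get volumeattachment "$va" -o jsonpath='{.spec.nodeName}')
  # a Node lookup failure suggests the attachment points at a terminated instance
  kubectl get node "$node" >/dev/null 2>&1 || echo "stale: $va -> $node"
done

Each name it prints can then be removed with kubectl delete volumeattachment <name>, as above.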

What could be causing this? I would expect the EBS CSI attacher to detach the volume at some point.

@ElijahQuinones (Member) commented:

Hi @jammerful,

I'm looking into this now, but need a bit more information from you. Can you please let us know the following about your environment:

K8s version
Karpenter version
aws-ebs-csi-driver version
Whether you are using helm or the aws-ebs-csi-driver EKS addon

Also, is this an EKS or a self-managed cluster? A few commands that can surface most of this are sketched below.
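
For reference, these usually surface that information; the deployment names and namespaces below are the common defaults and may differ in your cluster:

kubectl version                                    # client and server K8s versions
kubectl get deployment -n kube-system ebs-csi-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'    # driver version from the image tag
kubectl get deployment -n karpenter karpenter \
  -o jsonpath='{.spec.template.spec.containers[0].image}'    # Karpenter version from the image tag
helm list -A                                       # if installed via helm
aws eks describe-addon --cluster-name <cluster> --addon-name aws-ebs-csi-driver   # if using the EKS addon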

Additionally, here is our FAQ, which contains some Karpenter best practices.

Thank you.

ElijahQuinones self-assigned this Jan 30, 2025
@AndrewSirenko (Contributor) commented:

/triage needs-information

k8s-ci-robot added the triage/needs-information label Feb 7, 2025
@mugdha-adhav commented:

We noticed a similar issue, but in our case the pod is scheduled on the same node as the one listed in the VolumeAttachment.

Even then, the PV is stuck in a terminating state and cannot recover.

The only relevant logs that I see are from the external-attacher -

ebs-csi-controller-866fcc7577-vwx5p csi-attacher I0227 14:46:04.338320       1 csi_handler.go:243] "Error processing" VolumeAttachment="csi-5e079e13957bfa0b6e7045d9e544afcdc47beaf74f862c3e321a70a655543f8e" err="failed to attach: PersistentVolume \"pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0\" is marked for deletion"

Details of pod stuck in init phase -

Node:             ip-10-141-167-0.sa-east-1.compute.internal/10.141.167.0
Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Warning  FailedAttachVolume  23s (x1428 over 2d)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0" : PersistentVolume "pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0" is marked for deletion

Details of the volume-attachment -

NAME                                                                   ATTACHER          PV                                         NODE                                         ATTACHED   AGE
csi-5e079e13957bfa0b6e7045d9e544afcdc47beaf74f862c3e321a70a655543f8e   ebs.csi.aws.com   pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0   ip-10-141-167-0.sa-east-1.compute.internal   false      2d
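
That "is marked for deletion" error from the external-attacher normally means the PersistentVolume object itself carries a deletionTimestamp (for example, the PV was deleted while still in use) and is only kept around by its finalizers; the attacher refuses to attach a PV in that state. A quick check, using the PV name from the events above:

$ kubectl get pv pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0 \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

If deletionTimestamp is set, the attach will keep failing until the PV deletion is resolved; deleting the VolumeAttachment alone won't help in that case.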
