
Karpenter Disrupted Nodes and EBS CSI Volume Attachment #2318

Open
jammerful opened this issue Jan 28, 2025 · 3 comments
jammerful commented Jan 28, 2025

I'm running into an issue where Karpenter wants to disrupt a node that has a StatefulSet pod running on it. Karpenter terminates all the non-DaemonSet pods on that node, but when the pod is scheduled to the new node it is unable to start, because the volume is still attached to the old node, and Karpenter is not able to finish terminating that node:

$ kubectl describe pod
Status:                    Terminating (lasts 3h5m)
...
Events:
  Type    Reason     Age                   From       Message
  ----    ------     ----                  ----       -------
  Normal  Nominated  6m1s (x79 over 164m)  karpenter  Pod should schedule on: nodeclaim/default-on-demand-p27q8, node/ip-10-221-64-33.ec2.internal
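
To map the pod's claim to its VolumeAttachment, one route is PVC → PV name → the attachment referencing that PV. A minimal sketch (the claim name and namespace are placeholders for your StatefulSet's PVC):

$ PV=$(kubectl get pvc <claim-name> -n <namespace> -o jsonpath='{.spec.volumeName}')
$ kubectl get volumeattachment \
    -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
    | grep "$PV"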

Describing the VolumeAttachment to see which node the volume is attached to:

$ kubectl describe volumeattachment csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
Name:         csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
Namespace:
Labels:       <none>
Annotations:  csi.alpha.kubernetes.io/node-id: i-048a0a6c6c9e79dd2
API Version:  storage.k8s.io/v1
Kind:         VolumeAttachment
Metadata:
  Creation Timestamp:  2025-01-24T03:25:51Z
  Finalizers:
    external-attacher/ebs-csi-aws-com
  Resource Version:  913618302
  UID:               26ce1744-6c4d-440b-a54b-aa4e9e02eb5c
Spec:
  Attacher:   ebs.csi.aws.com
  Node Name:  ip-10-221-66-172.ec2.internal

You can see that this is a different node from the one the pod is scheduled to. Looking at the EBS CSI driver's attacher, I don't see any mention of that attachment:

$ kubectl logs -n system-storage ebs-csi-driver-controller-659467997f-5rw4s -c csi-attacher | grep csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f
<empty> (I confirmed this was the leader)

Once I run kubectl delete volumeattachment csi-c72d43bef46cd68c80357ffa7c5e647f8351bd0b01b2b747cb11f5d702745f7f, the pod that was stuck comes up on the new node.
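
For anyone hitting the same thing, a rough cleanup sketch that lists VolumeAttachments whose .spec.nodeName no longer matches a live Node. This assumes Karpenter has already removed the Node objects for terminated instances; review the output before deleting anything:

for va in $(kubectl get volumeattachment -o jsonpath='{.items[*].metadata.name}'); do
  node=$(kubectl get volumeattachment "$va" -o jsonpath='{.spec.nodeName}')
  # a Node lookup failure suggests the attachment points at a terminated instance
  kubectl get node "$node" >/dev/null 2>&1 || echo "stale: $va -> $node"
done

Each name it prints can then be removed with kubectl delete volumeattachment <name>, as above.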

What could be causing this? I would expect the EBS CSI attacher to detach the volume at some point.

@ElijahQuinones (Member) commented:

Hi @jammerful,

I'm looking into this now, but need a bit more information from you. Can you please let us know the following about your environment:

K8s version
Karpenter version
aws-ebs-csi-driver version
Whether you are using helm or the aws-ebs-csi-driver EKS addon

Also, is this an EKS or a self-managed cluster? A few commands that can surface most of this are sketched below.
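
For reference, these usually surface that information; the deployment names and namespaces below are the common defaults and may differ in your cluster:

kubectl version                                    # client and server K8s versions
kubectl get deployment -n kube-system ebs-csi-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'    # driver version from the image tag
kubectl get deployment -n karpenter karpenter \
  -o jsonpath='{.spec.template.spec.containers[0].image}'    # Karpenter version from the image tag
helm list -A                                       # if installed via helm
aws eks describe-addon --cluster-name <cluster> --addon-name aws-ebs-csi-driver   # if using the EKS addon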

Additionally, here is our FAQ, which contains some Karpenter best practices.

Thank you.

ElijahQuinones self-assigned this Jan 30, 2025
@AndrewSirenko (Contributor) commented:

/triage needs-information

k8s-ci-robot added the triage/needs-information label Feb 7, 2025
@mugdha-adhav commented:

We noticed a similar issue, but in our case the pod is scheduled on the same node as the one listed in the VolumeAttachment.

Even then, the PV is stuck in a terminating state and cannot recover.

The only relevant logs that I see are from the external-attacher -

ebs-csi-controller-866fcc7577-vwx5p csi-attacher I0227 14:46:04.338320       1 csi_handler.go:243] "Error processing" VolumeAttachment="csi-5e079e13957bfa0b6e7045d9e544afcdc47beaf74f862c3e321a70a655543f8e" err="failed to attach: PersistentVolume \"pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0\" is marked for deletion"

Details of pod stuck in init phase -

Node:             ip-10-141-167-0.sa-east-1.compute.internal/10.141.167.0
Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Warning  FailedAttachVolume  23s (x1428 over 2d)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0" : PersistentVolume "pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0" is marked for deletion

Details of the volume-attachment -

NAME                                                                   ATTACHER          PV                                         NODE                                         ATTACHED   AGE
csi-5e079e13957bfa0b6e7045d9e544afcdc47beaf74f862c3e321a70a655543f8e   ebs.csi.aws.com   pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0   ip-10-141-167-0.sa-east-1.compute.internal   false      2d
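
That "is marked for deletion" error from the external-attacher normally means the PersistentVolume object itself carries a deletionTimestamp (for example, the PV was deleted while still in use) and is only kept around by its finalizers; the attacher refuses to attach a PV in that state. A quick check, using the PV name from the events above:

$ kubectl get pv pvc-8cce5fbf-43b5-4bcf-bfdc-604a7e5a0ff0 \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

If deletionTimestamp is set, the attach will keep failing until the PV deletion is resolved; deleting the VolumeAttachment alone won't help in that case.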
