-
Notifications
You must be signed in to change notification settings - Fork 813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ModifyVolume via VolumeAttributesClass forces ebs-csi-controller into CrashLoopback when modifying PVC from gp2 to gp3 #2340
Comments
Hi @gpapakyriakopoulos, thank you for your issue! One possible root cause is that Our list of valid modification parameters are listed in our modify-volume documentation. This includes We can do a better job distinguishing valid storageClass vs VolumeAttributesClass parameters by adding a section to our parameters.md, instead of having this list in this modify-volume documentation. I'll take an action item for that. As for why the nil pointer panic, that seems to be a bug with the upstream resizer sidecar itself: https://github.com/kubernetes-csi/external-resizer. However, I cannot reproduce this panic with either v1.12.0 or v1.13.1 of the resizer sidecar. These are the steps I was using to try to reproduce the issue:
`kubectl patch pvc ebs-claim1 --patch '{"spec": {"volumeAttributesClassName": "gp3-class"}}' The resizer sidecar does not throw nil pointer error, it just has following logs:
|
@AndrewSirenko Thanks for the amazingly fast response! I'll test your suggestion and get back to you as soon as possible. In the meantime, for your reproduction scenario, maybe it has to do with the fact that the original
|
@AndrewSirenko Just tested with your suggestion, by deleting and recreating the VolumeAttributesClass without the
The new VAC for reference :
|
Great catch on the in-tree sc. Yes I am able to now reproduce your issue... Looks like one cannot use VAC is PVC's SC's provisioner was an in-tree driver... I will raise an issue on the external-resizer project, and bring this up in the next SIG Storage implementation meeting on Monday. In the meantime, let me try to provide you a workaround to unblock your use case (swapping your PVC to no longer refer to the in-tree SC). For posterity, these are my reproduction steps for running into your issue:
|
Thanks @AndrewSirenko! From my (albeit clueless) investigation into the source code of the external-resizer I think the issue is on the following two lines (but I might be way off mark here) :
My assumption is since my PV does not have a If my assumption above is correct I wonder if I could manually patch a fake In any case, if you can find a workaround I would really appreciate it, since this basically nuked my ebs-csi-controller which is unable to start at all and I cannot remove the VAC since it seems after setting it you cannot unset it without deleting the PVC (which is out of the question for me due to potential data loss) |
Yep exactly. The PV resource of a migrated volume does not PV.Spec.CSI (it has PV.spec.awsElasticBlockStore instead). We can trivially fix the nil pointer exception (by adding an extra nil check), but we still have to figure out what we should do in the migrated volume case.
Let me try out a workaround where we set the PVC retention policy to retain, delete the PVC (leaving the PV, which is still attached to the workload), then create a new PVC and statically bind that new PVC to original PV, but without the in-tree sc. That workaround has worked in other corner cases. I will type up full instructions shortly. I'm not sure if patching will work because that property may be immutable but please do share if it does! Thanks a million for catching this corner case. |
If patching did not work, here is a different workaround @gpapakyriakopoulos, where we will hot swap the PVC and PV resources to replace the volume's Note that any procedure that forcefully deletes resources by removing finalizers is inherently risky. Please try with a non production workload first to ensure no data loss. Alternatively if you no longer care about volume modification and just want to rollback to a safe state, you can probably get away with just deleting PVC resource, and re-creating it without PVC.spec.volumeAttributesClass. We will change PV reclaim policy to retain, forcefully delete PVC and PV resources (by removing finalizers), then statically provision that EBS volume with the same PV and PVC names as before but no SC.
You may then need to restart your ebs-csi-controller deployment. But then you are able to modify your volume.
I did all this while my application was running normally. Afterwards I was then able to delete my workload pod, create new workload pod, and confirm that volume was successfully detached and re-attached. You may want to do some further testing. Hope this helps. I'll work with kubernetes-csi to ensure that folks cannot run into this failure mode (and maybe even will be able to modify volumes provisioned by in-tree drivers, by falling back to the PV migrated-to annotation). |
@AndrewSirenko Amazing stuff, I really appreciate all you are doing. I'll get to testing this tomorrow to see how it goes in a staging environment. In the meantime, since modifying the volume is not critical for us, I would be ok to just be able to get the ebs-csi-controller pods running, even if they do not modify the volume to gp3. Is there any chance that the extra I am not trying to put pressure on anyone I am rather asking because I know this needs to be updated in many different places to reach us as the end-users, namely : a) Patched on external-resizer From your experience does that take a week, a month or more ? If it's a week or two I would be willing to hold out instead of risking production data loss or messing with the PVs. Otherwise I'll try more risky approaches like what you described. Again appreciate all the help, awesome response and write-ups! |
Let me get back to you on that. I'm sure folks upstream will be open to a patch external-resizer release with EBS help. Worst case at that point you can rely on the self-managed EBS CSI Driver helm chart, and use the newer external-resizer image without waiting for a new EBS CSI Driver add-on release. |
Sidenote: It is also a good idea in general to migrate in-tree provisioned volumes to real EBS CSI driver volumes. This is recommended by upstream Kubernetes to use any newer CSI features. You can do this without messing with finalizers by restoring a new volume from a snapshot by following a guide like https://ryandeangraham.medium.com/migrate-a-pvc-to-ebs-csi-from-in-tree-ebs-c39178f586f6. |
You can track the merging of fix PR here: kubernetes-csi/external-resizer#471 You can track discussion about patch release on Kubernetes Slack #csi channel: https://kubernetes.slack.com/archives/C8EJ01Z46/p1739486018992389 |
Hmm doing that for the offending PVC would resolve my issue right ? The offending PVC would be deleted and a new one would be created right ? But do I still need the new PVC to also be gp2 to match the underlying PV and volume with this method ? (And potentially then try to migrate using the VAC after the PVC conversion has finished ? ) |
An update on the above and good news. We were able to delete the offending PVC and remove the finalizers and were able to recover the ebs-csi-controller. Since our workload is an Elastic ECK Stack, we are also able to migrate to new PVCs with a new storage class (CSI managed) using ECK Operator's capabilities by renaming our nodeSet. So we are now unstuck from this issue with the great help you provided @AndrewSirenko. You guys can take your time to patch the issue properly and let me know if there is anything more I can do. |
Thanks for catching this upstream VAC bug! Very glad to hear you resolved your issue. |
Update: Upstream resizer fix for the panic has been merged kubernetes-csi/external-resizer#471. Patch releases should come out by end of February. There are talks of allowing migrated volumes to be modified via VAC, no ETA on that yet. |
/kind bug
As the title suggests we are trying to modify an existing PVC from gp2 to gp3 EBS volume in our EKS cluster (version 1.31) using the VolumeAttributeClass approach of ModifyVolume.
We have created the following VolumeAttributeClass :
Our existing PVC looks as follows :
We apply the created VolumeAttributesClass via :
kubectl patch pvc elasticsearch-data-es-es-ingest-2 --patch '{"spec": {"volumeAttributesClassName": "gp3-update-class"}}'
Application is successful as seen here :
Immediately after applying the VolumeAttributesClass to the PVC, the csi-resizer container of the ebs-csi-controller (v. 1.39.0) throws the following panic :
The csi-resizer container runs with the following parameters :
The text was updated successfully, but these errors were encountered: