Cluster deletion stuck due to AWSCluster finalizer issue #5107

Open
a13x5 opened this issue Aug 26, 2024 · 9 comments · May be fixed by #5365
Labels
kind/bug · needs-priority · needs-triage

Comments

@a13x5

a13x5 commented Aug 26, 2024

/kind bug

What steps did you take and what happened:

We deploy CAPI objects using a Helm chart, so when helm uninstall <release> is executed, all objects get deleted simultaneously.

Sometimes the finalizer on the AWSCluster object (and the object itself) is removed before all AWSMachine resources are removed. This causes the AWSMachines to be stuck forever, because the awsmachine controller still tries to patch the (now missing) AWSCluster.

I0726 16:19:31.225463       1 awscluster_controller.go:208] "Reconciling AWSCluster delete" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1"
I0726 16:19:33.955431       1 securitygroups.go:320] "Deleted security group" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1" security-group-id="sg-068b633aae83d2e19" kind="cluster managed"
I0726 16:19:34.432437       1 securitygroups.go:320] "Deleted security group" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1" security-group-id="sg-05fe37ab8f0a3ab15" kind="cluster managed"
I0726 16:19:36.516438       1 vpc.go:550] "Deleted VPC" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1" vpc-id="vpc-03b7241ad6eae9ab1"
E0726 16:19:36.632931       1 controller.go:329] "Reconciler error" err="failed to patch AWSCluster default/aws-cl-1: awsclusters.infrastructure.cluster.x-k8s.io \"aws-cl-1\" not found" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0"
I0726 16:19:51.603067       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="c0acf9c4-8be9-413f-a906-483b59563d9f" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:19:52.434829       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="dac57c28-e165-472b-b20f-fa0521e4b2f1" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:19:59.970099       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="df4dc07d-5e1b-4d28-88a3-0f30fa7a76f8" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:19:59.970270       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="120b3a56-1855-4d44-8333-8502e8d04981" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:22:36.109923       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="4c0f6242-2f77-4194-b9dc-c1aed2034184" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:22:36.110149       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="90ca119f-d6c2-457f-9d79-be35c1ae70a8" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:29:30.719506       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="d9d8a14a-8e04-43ab-b596-b3bc9083af81" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:29:30.719543       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="7bba88c3-0555-4536-bb12-d366df08e338" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:32:52.252957       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="2ba19b22-ddb1-4323-aa07-4bd170f5e49b" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:32:52.253183       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="36079ee5-a4ec-407a-a181-1d7a3ca1058f" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"

Ultimately this makes cluster deletion hang indefinitely: the Cluster object gets stuck in the Deleting state.

To fix this, an operator must manually remove the finalizers from all AWSMachine objects left in the cluster.
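
For reference, clearing the finalizers can be done with kubectl; a minimal sketch (the machine name here is taken from the logs below, adjust it to whatever is stuck in your cluster):

```sh
# List the AWSMachines that are stuck in Deleting
kubectl get awsmachines -n default

# Drop all finalizers from a stuck AWSMachine (repeat per machine);
# a JSON merge patch with "finalizers": null removes the field entirely
kubectl patch awsmachine aws-cl-1-cp-0 -n default \
  --type merge -p '{"metadata":{"finalizers":null}}'
```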

Note that this is an intermittent issue, but it happens pretty often in my tests (~7 times out of 10). I should also note that the AWS resources themselves seem to be properly cleaned up.

What did you expect to happen:

The AWSCluster object's finalizer should only be removed when no dependent objects (like AWSMachines) are present.

Environment:

  • Cluster-api-provider-aws version: v2.6.1
  • Kubernetes version: (use kubectl version): v1.30.2+k0s
  • OS (e.g. from /etc/os-release): Amazon Linux 2
@k8s-ci-robot added the kind/bug and needs-priority labels on Aug 26, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage label on Aug 26, 2024
@a13x5
Author

a13x5 commented Aug 26, 2024

@randybias FYI

@fiunchinho
Contributor

fiunchinho commented Sep 12, 2024

We also use Helm charts to deploy CAPA resources and we've also run into issues when deleting clusters, but I believe we were making the same mistake that you seem to be making now. Let me try to explain our journey. Maintainers or other folks with more experience can correct me if I'm wrong, because I may be :)

CAPI and CAPA need to take care of deleting a cluster; specifically, they need to follow a certain order in which the different resources are removed. First the worker nodes are removed (i.e. MachinePools), then the control plane (i.e. KubeadmControlPlane). Only after all of these are removed is the AWSCluster removed. Finally, when the infrastructure cluster is gone, the parent Cluster is removed.

When using Helm charts, helm hijacks the whole deletion process, deleting all objects at once. The AWSCluster deletion logic is triggered before the workers have finished being deleted. This is your current issue if I understand correctly.
To fix this problem and allow CAPI and CAPA to manage the deletion of the resources, we now add the helm annotation helm.sh/resource-policy: keep to all resources but the Cluster CR. This means that when a chart with a cluster is deleted, the Cluster will be marked for deletion, and CAPI / CAPA will start the deletion process in the order they need. Once the process is completed, all the finalizers in the Cluster CR will be removed, and the Cluster CR will be garbage collected.
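
For illustration, a minimal sketch of what this looks like in a chart template (the `.Values` names and the fields shown are hypothetical placeholders, not from any particular chart):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: {{ .Values.clusterName }}
  annotations:
    # Helm leaves this object in place on `helm uninstall`; CAPI/CAPA
    # then delete it in the correct order once the Cluster CR is deleted.
    helm.sh/resource-policy: keep
spec:
  region: {{ .Values.region }}
```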

This may not apply to you, but as a final note: we ended up removing the helm.sh/resource-policy: keep annotation from MachinePools because we wanted to be able to change the name of the node pools. If we kept the annotation, Helm would render a new MachinePool on rename but keep the old one around. Because workers are the first thing CAPI/CAPA wants to remove, we decided it was safe to drop the annotation in this case.

I hope that helps.

So I believe this is working as intended. Maybe we could add documentation about this particular use case.

@a13x5
Author

a13x5 commented Sep 16, 2024

Thanks for your answer and suggestions, but I will have to disagree.

Kubernetes is a declarative system, so when operating on objects I shouldn't have to maintain any order of execution; the controller should maintain the required order itself. For example, it would be a shame if I could delete a PVC while it's still being used by a pod, or delete a namespace while it still has objects in it.

CAPA in this case fails to maintain its own state and the order of execution it requires: it removes the finalizer from its own resource while that resource still has dependent resources present. All of these resources are owned (reconciled) by a CAPA controller, so this is clearly a bug.

I'm considering the solution with annotations (thanks for the hint, btw) only as a workaround, not as a proper fix.

@fiunchinho
Contributor


Fair enough. Let's see if that could be improved, because it would be a good improvement for all users, not just the ones using Helm.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 14, 2025
@joshfrench

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Feb 11, 2025
@richardcase
Member

This is not a CAPA-specific thing; it applies to the whole of CAPI. In CAPI the only supported way to delete a cluster is by deleting the Cluster resource. If you try to delete all the resources at once, you will end up with orphaned resources. Please see https://cluster-api.sigs.k8s.io/user/quick-start.html?highlight=delete#clean-up:

IMPORTANT: In order to ensure a proper cleanup of your infrastructure you must always delete the cluster object. Deleting the entire cluster template with kubectl delete -f capi-quickstart.yaml might lead to pending resources to be cleaned up manually.

If you use Helm for the cluster definitions then, as @fiunchinho described above, you will need to use helm.sh/resource-policy: keep.
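
With the annotation in place, tearing down a cluster then reduces to deleting the Cluster object itself, e.g. (cluster name taken from the logs in this issue):

```sh
# Delete only the Cluster CR; CAPI/CAPA orchestrate the rest
# (workers first, then the control plane, then the AWSCluster)
kubectl delete cluster aws-cl-1 -n default
```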

@alexander-demicev linked pull request #5365 on Feb 27, 2025 that will close this issue