Delay before removing the startup taint from ebs #1945

vivekskrishna · 2024-02-26T02:31:04Z

/kind bug

The EBS CSI driver currently supports removing a startup taint from new nodes once it has started running successfully (implemented via #1581).

What you expected to happen?
The taint is removed once the CSI driver pod is running on the node, but the current code removes it just before the new node service is registered in func newNodeService. As a result, we still sometimes hit the issue described in kubernetes/kubernetes#95911: pods are assigned to new nodes even when their volume mounts would exceed the new node's volume limit.

How to reproduce it (as minimally and precisely as possible)?
An easy way to reproduce this is to create a pod with 26 EBS mounts and try to schedule it on a node that supports only, say, 25 (a t3 instance type, for example). When this is tried with Karpenter, depending on the race condition, the pod will eventually get scheduled onto a newly spun-up t3 instance node.

Anything else we need to know?:
This could potentially be fixed by introducing an initial sleep in removeTaintInBackground before it proceeds to remove the taint with backoff. That would delay the CSI node driver's taint removal, which might give enough time for the CSINode limits to be properly registered.

Environment
AWS EKS, where Karpenter + EBS is being used for a node group

Kubernetes version (use kubectl version):
1.28 with karpenter 0.31.4
Driver version:
ebs csi driver - 1.24.0

If needed, I can raise a PR for this.

k8s-ci-robot added the kind/bug label Feb 26, 2024

torredil mentioned this issue Feb 27, 2024

Ensure CSINode allocatable count is set on node before removing startup taint #1949

Merged

k8s-ci-robot closed this as completed in #1949 Feb 29, 2024

vivekskrishna mentioned this issue Feb 29, 2024

When a new node joins the cluster - scheduler doesn't respect CSI volume limit kubernetes/kubernetes#95911

Open
