Delay before removing the startup taint from ebs #1945

Closed
vivekskrishna opened this issue Feb 26, 2024 · 0 comments · Fixed by #1949
Labels
kind/bug Categorizes issue or PR as related to a bug.

/kind bug

The EBS CSI driver currently supports removing a taint from new nodes once it has started running successfully (implemented via #1581).

What you expected to happen?
I could see that the taint is removed once the CSI driver pod is running on that node, but the current code removes it just before the new node service is registered in func newNodeService. As a result, we still sometimes hit the issue described in kubernetes/kubernetes#95911: pods are assigned to new nodes even when their volume mounts would exceed the volume limit of the new node.

How to reproduce it (as minimally and precisely as possible)?
An easy way to reproduce this is to create a pod with 26 EBS mounts and try to schedule it on a node which supports, say, only 25 (a t3 instance type, for example). When this is tried using Karpenter, depending on the race condition, the pod will eventually get scheduled onto a newly spun up t3 instance node.
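To make the race concrete, here is a minimal sketch of a volume-count check. It assumes, based on my reading of kubernetes/kubernetes#95911, that a limit which has not yet been published on the CSINode object is treated as "no limit". The function name `fitsVolumeLimit` is hypothetical and is not the scheduler's actual code.

```go
package main

import "fmt"

// fitsVolumeLimit is a simplified, hypothetical stand-in for a scheduler-side
// volume count check. allocatable is nil while the CSINode object has not yet
// published a limit for the driver (an assumption of this sketch).
func fitsVolumeLimit(requested int, allocatable *int) bool {
	if allocatable == nil {
		// No limit is known yet, so the check cannot reject the pod.
		return true
	}
	return requested <= *allocatable
}

func main() {
	limit := 25 // example limit taken from the reproduction above

	// Taint already removed, limit not registered yet: the 26-volume pod fits.
	fmt.Println(fitsVolumeLimit(26, nil)) // true

	// Limit registered: the same pod is rejected.
	fmt.Println(fitsVolumeLimit(26, &limit)) // false
}
```

Under that assumption, the window between taint removal and limit registration is exactly when a 26-volume pod can land on a 25-volume node.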

Anything else we need to know?:
This can potentially be fixed by introducing an initial sleep in removeTaintInBackground before it proceeds to remove the taint with backoff. This would delay the CSI node driver's removal of the taint, giving enough time for the CSINode limits to be properly registered.

Environment
AWS EKS, where Karpenter + the EBS CSI driver are being used for a node group

  • Kubernetes version (use kubectl version): 1.28 with Karpenter 0.31.4
  • Driver version: EBS CSI driver 1.24.0

If needed I can raise a PR for this
