Kueue scheduler fragmentation optimization #4329
Comments
You might be interested in Topology Aware Scheduling (TAS): https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/ Note that this feature is still alpha.
Thanks @tenzen-y! I wasn't aware of this alpha feature; checking it out now.
Does TAS satisfy your request?
Hey @tenzen-y, I read through the docs. It looks like TAS addresses the static cluster topology (racks, blocks, etc.), but the challenge in this issue is mostly about the runtime deployment topology (i.e., how many resources are actually available per node at scheduling time). So I'm afraid TAS alone won't solve this issue, but please correct me if I'm wrong.
In that case, you can use a flat topology (see the sketch below). If you specify "kubernetes.io/hostname" as the only topology level, Kueue traverses every Node's allocatable resources and packs Pods onto nodes as tightly as possible (similar to the kube-scheduler MostAllocated strategy).
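For reference, a minimal sketch of what such a flat topology could look like, based on the Topology Aware Scheduling docs. The object names ("default", "tas-flavor") and the node-group label are placeholders, and the Topology API is still alpha, so fields may change:

```yaml
# Sketch only: a "flat" Topology whose single level is the node hostname,
# so topology-aware placement is computed per node.
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: default                 # placeholder name
spec:
  levels:
  - nodeLabel: "kubernetes.io/hostname"
---
# A ResourceFlavor opts into the Topology above via topologyName.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tas-flavor              # placeholder name
spec:
  nodeLabels:
    cloud.provider.com/node-group: tas   # placeholder node label
  topologyName: default
```

Workloads then request node-level placement per PodSet through the pod template annotation `kueue.x-k8s.io/podset-preferred-topology: "kubernetes.io/hostname"` (or `kueue.x-k8s.io/podset-required-topology` to make it a hard constraint).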
Thanks, we'll test this out and keep this issue updated.
I would recommend using the main branch to verify all TAS features, since only the main branch is guaranteed to support the "mostAllocated"-style packing described above; the older released versions do not.
What would you like to be added:
In the current Kueue implementation, each queue's resources are simply summed (e.g., a total of 8 GPUs) without awareness of the actual node topology (e.g., 1 × 8 GPUs vs. 2 × 4 GPUs). As a result, Kueue can admit a workload that is "admittable" by quota but cannot actually be scheduled at runtime. Such a wrongly admitted workload stays pending indefinitely until previously admitted workloads free up resources, while blocking new workloads that could have run (e.g., one requesting a single GPU). Overall, this fragmentation leads to a low cluster allocation rate.
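To make the scenario concrete, here is a hypothetical illustration (all object names, images, and GPU counts below are made up for this example): a ClusterQueue whose quota sums to 8 GPUs, backed by two nodes with 4 GPUs each, and a single-pod Job requesting all 8 GPUs. The quota check passes and the workload is admitted, but no single node can fit the Pod, so it stays pending while holding the quota:

```yaml
# Hypothetical setup: quota says "8 GPUs", but the cluster is 2 nodes x 4 GPUs each.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cq                        # placeholder name
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor            # placeholder flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8               # only the total is visible to admission
---
# Single-pod Job asking for all 8 GPUs: admitted by quota, unschedulable on 4-GPU nodes.
apiVersion: batch/v1
kind: Job
metadata:
  name: needs-8-gpus                  # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: gpu-lq # placeholder LocalQueue
spec:
  suspend: true                       # Kueue unsuspends the Job once admitted
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          requests:
            nvidia.com/gpu: 8
          limits:
            nvidia.com/gpu: 8
```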
The suggested solution would be to re-schedule (re-queue) a workload when this fragmentation issue happens, and to admit future workloads that are immediately schedulable.
Why is this needed:
Further improve the cluster allocation rate.
Completion requirements:
N/A
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.