
Kueue scheduler fragmentation optimization #4329

Open · 3 tasks
shaowei-su opened this issue Feb 19, 2025 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@shaowei-su

What would you like to be added:
In the current Kueue implementation, each queue's resources are simply summed (e.g., a total of 8 GPUs) without awareness of the actual node topology (e.g., 1 node x 8 GPUs vs. 2 nodes x 4 GPUs). As a result, Kueue may admit a workload that is "admittable" in aggregate but cannot actually be scheduled at runtime. Such a wrongly admitted workload stays pending indefinitely until previously admitted workloads free up resources, all while blocking new workloads that could have been running (e.g., one requesting a single GPU). This fragmentation leads to a low overall cluster allocation rate.
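
For illustration, a minimal sketch of the scenario (the Job name, queue name, and image are placeholders): on a cluster with two nodes of 4 GPUs each, the ClusterQueue quota sees 8 GPUs in aggregate, so Kueue would admit a single-pod Job like the following even though no single node can host it:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: big-trainer                          # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: user-queue    # placeholder LocalQueue
spec:
  suspend: true                              # Kueue unsuspends on admission
  template:
    spec:
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # placeholder image
        resources:
          requests:
            nvidia.com/gpu: 8   # fits the aggregate 8-GPU quota,
          limits:               # but no single 4-GPU node can run it
            nvidia.com/gpu: 8
      restartPolicy: Never
```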

The suggested solution is to re-schedule the workload when fragmentation occurs, and to admit future workloads that are immediately schedulable in its place.

Why is this needed:
Further improve the cluster allocation rate.

Completion requirements:
N/A

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@shaowei-su added the kind/feature label on Feb 19, 2025
@tenzen-y
Member

tenzen-y commented Feb 19, 2025

You might be interested in TopologyAwareScheduling: https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/

Note that this feature is still alpha.
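
For reference, a minimal TAS setup from the linked docs looks roughly like this (the node-label values and the flavor name are examples, not fixed names; as of this writing the Topology kind is v1alpha1):

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: default
spec:
  levels:    # ordered from widest to narrowest domain
  - nodeLabel: "cloud.provider.com/topology-block"
  - nodeLabel: "cloud.provider.com/topology-rack"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tas-flavor           # example flavor name
spec:
  nodeLabels:
    cloud.provider.com/node-group: "tas"
  topologyName: default      # ties the flavor to the Topology above
```

Workloads then opt in per PodSet via pod-template annotations such as kueue.x-k8s.io/podset-required-topology or kueue.x-k8s.io/podset-preferred-topology.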

@shaowei-su
Author

Thanks @tenzen-y! I wasn't aware of this alpha feature; checking it out now.

@tenzen-y
Member

> Thanks @tenzen-y! I wasn't aware of this alpha feature; checking it out now.

Does TAS satisfy your request?

@shaowei-su
Author

Hey @tenzen-y, I read through the docs. It looks like TAS addresses static cluster topology (racks, blocks, etc.), but the challenge in this issue is mostly around runtime deployment topology (i.e., how many resources are available per node at scheduling time). So I'm afraid TAS alone won't solve this issue, but please correct me if I'm wrong.

@tenzen-y
Member

> Hey @tenzen-y, I read through the docs. It looks like TAS addresses static cluster topology (racks, blocks, etc.), but the challenge in this issue is mostly around runtime deployment topology (i.e., how many resources are available per node at scheduling time). So I'm afraid TAS alone won't solve this issue, but please correct me if I'm wrong.

In that case, you can use a flat topology keyed on "kubernetes.io/hostname".

@tenzen-y
Member

> Hey @tenzen-y, I read through the docs. It looks like TAS addresses static cluster topology (racks, blocks, etc.), but the challenge in this issue is mostly around runtime deployment topology (i.e., how many resources are available per node at scheduling time). So I'm afraid TAS alone won't solve this issue, but please correct me if I'm wrong.

> In that case, you can use a flat topology keyed on "kubernetes.io/hostname".

If you specify "kubernetes.io/hostname" as the topology level, Kueue traverses every Node's allocatable resources and packs Pods onto nodes as tightly as possible (similar to the kube-scheduler's MostAllocated strategy).
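
Concretely, that would be a single-level Topology keyed on the hostname label, plus a required-topology annotation on the Job's pod template (a sketch under the same v1alpha1 assumptions as above; the Topology name is an example):

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: hostname-topology    # example name
spec:
  levels:
  - nodeLabel: "kubernetes.io/hostname"    # flat, per-node topology
```

```yaml
# Fragment of a Job spec: require each PodSet to fit within a single node.
spec:
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: "kubernetes.io/hostname"
```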

@shaowei-su
Author

Thanks, we'll test this out and keep this issue updated.

@tenzen-y
Member

tenzen-y commented Feb 19, 2025

> Thanks, we'll test this out and keep this issue updated.

I would recommend using the main branch to evaluate the full TAS feature set, since only the main branch is guaranteed to support the "mostAllocated"-style packing; older released versions do not.
