scheduling changes for lora affinity load balancing #423
base: main
On one hand, I can see why considering `waiting` is useful, because `waiting` LoRAs are going to be served next. However, I have concerns about this weakening the LoRA affinity: `running` is bounded by the max LoRA limit, while `waiting` is not bounded. If we enter an unstable state with a long `waiting` queue, we can lose the affinity benefit. An improved algorithm could prioritize `waiting` over `running`, what do you think?
Using waiting + running for affinity is always superior to using just running, because adapters get served on a first-come-first-serve basis, so we know for sure that a waiting adapter will get loaded. But yes, within waiting + running, prioritizing waiting over running makes sense I think, but I need to test it first.
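To make the trade-off above concrete, here is a minimal sketch of treating waiting + running adapters as the affinity set while preferring waiting. The types and field names (`podMetrics`, `ActiveAdapters`, `WaitingAdapters`) are assumptions for illustration, not the PR's actual code.

```go
// Hypothetical sketch, not the PR's implementation: a pod has affinity for an
// adapter if the adapter is either waiting (will be loaded next under FCFS)
// or already running; waiting is checked first so it can be preferred.
package main

import "fmt"

type podMetrics struct {
	Name            string
	ActiveAdapters  map[string]bool // adapters currently running on the pod
	WaitingAdapters map[string]bool // adapters queued on the pod, served next
}

// hasAffinity reports whether the pod serves or is about to serve the adapter,
// and whether that affinity comes from the waiting set.
func hasAffinity(pm podMetrics, adapter string) (affinity, viaWaiting bool) {
	if pm.WaitingAdapters[adapter] {
		return true, true
	}
	if pm.ActiveAdapters[adapter] {
		return true, false
	}
	return false, false
}

func main() {
	pm := podMetrics{
		Name:            "pod-a",
		ActiveAdapters:  map[string]bool{"lora-1": true},
		WaitingAdapters: map[string]bool{"lora-2": true},
	}
	affinity, viaWaiting := hasAffinity(pm, "lora-2")
	fmt.Printf("pod=%s affinity=%v viaWaiting=%v\n", pm.Name, affinity, viaWaiting)
}
```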
Can you clean up the TODOs and the metrics that are not currently used?
I think the TODOs are still relevant. I will remove the max token in KV cache metric because it's not being used.
Does this happen in practice? In what cases?
When there are no running requests, that is, QPS = 0 for the pod. In this case it's good to use the last adapters that ran to determine affinity (as a proxy for the LoRA metric for what's already loaded).
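A hedged sketch of that fallback; the struct and field names are assumptions, not the PR's actual metrics types:

```go
// Hypothetical sketch: when a pod reports zero running requests, fall back to
// the adapters it ran most recently as a proxy for what is still loaded.
package scheduling

type podState struct {
	RunningRequests int
	ActiveAdapters  map[string]bool // adapters currently serving requests
	LastRunAdapters map[string]bool // adapters that ran most recently
}

// loadedAdapters approximates which adapters are resident on the pod.
func loadedAdapters(ps podState) map[string]bool {
	if ps.RunningRequests == 0 {
		return ps.LastRunAdapters
	}
	return ps.ActiveAdapters
}
```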
This is essentially the `canAcceptNewLoraPredicate` function below, are we still using `canAcceptNewLoraPredicate`?
We are not using `canAcceptNewLoraPredicate` any more, but I think it would be good to keep.
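For context, a minimal sketch of what a predicate along the lines of `canAcceptNewLoraPredicate` could look like; the field names here are assumptions rather than the actual metrics struct:

```go
// Hypothetical sketch: a pod can accept a new adapter only if the number of
// adapters it has loaded is below the server's max-LoRA limit.
package scheduling

type podLoraMetrics struct {
	ActiveAdapters map[string]bool // adapters currently loaded on the pod
	MaxActiveLoras int             // server-side limit on concurrently loaded adapters
}

func canAcceptNewLoraPredicate(pm podLoraMetrics) bool {
	return len(pm.ActiveAdapters) < pm.MaxActiveLoras
}
```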
This can be a follow-up, but it sounds like we can extend the current filter framework to support such probability-based filtering. Instead of having one base filter, we would have a list of filters with weights. This way we can keep each filter very focused and make them more reusable.
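A rough sketch of the idea under discussion, assuming a simple `Filter` interface; none of these names come from the current code:

```go
// Hypothetical sketch: rather than one base filter, keep a list of focused
// filters with weights and pick one per request with probability proportional
// to its weight.
package scheduling

import "math/rand"

type Pod struct{ Name string }

// Filter narrows the candidate pods for a request.
type Filter interface {
	Filter(pods []Pod) []Pod
}

type weightedFilter struct {
	filter Filter
	weight float64 // relative probability of choosing this filter
}

// pickFilter chooses one filter at random, proportionally to its weight.
// It assumes the list is non-empty and the weights are positive.
func pickFilter(filters []weightedFilter, rng *rand.Rand) Filter {
	total := 0.0
	for _, wf := range filters {
		total += wf.weight
	}
	r := rng.Float64() * total
	for _, wf := range filters {
		r -= wf.weight
		if r <= 0 {
			return wf.filter
		}
	}
	return filters[len(filters)-1].filter
}
```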
agreed
I think we should make this configurable for now, perhaps via a flag. Different environments will likely need different thresholds.
I would rather leverage this to make it configurable: #16
I don't think we have time to do an API change for the next release. Given we already had to change it for different accelerator types, it's important to have this knob configurable. Exposing it as a flag seems straightforward and gives us time to gather feedback on this before making an API change.
I took a look; IIUC, adding this flag is not straightforward, given the way the scheduler is written. If it's needed for the next release, I would rather do it in another PR.
Defining a flag for each parameter is tedious; we can use a versioned configuration file instead. This is called ComponentConfig, and ideally we do that for #383.
Here is JobSet's config file as an example: https://github.com/kubernetes-sigs/jobset/tree/main/api/config/v1alpha1
Do you have some insights to show why this is needed and why this value was picked?
I picked it after some trial and error. This value worked well when we had skewed traffic for different adapters: it helped spread out high-QPS adapters while keeping low-QPS adapters less spread out.
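To make that behavior concrete, a hedged sketch of threshold-gated routing; the constant value and names are placeholders, not the ones chosen in this PR:

```go
// Hypothetical sketch: with probability loraAffinityThreshold, route to pods
// that already have the adapter (affinity); otherwise route to pods that can
// still load it (spread). High-QPS adapters hit the spread branch often enough
// to fan out, while low-QPS adapters mostly stay on their affinity pods.
package scheduling

import "math/rand"

type candidatePod struct{ Name string }

const loraAffinityThreshold = 0.95 // illustrative placeholder, not the PR's value

func selectCandidates(affinityPods, availablePods []candidatePod, rng *rand.Rand) []candidatePod {
	switch {
	case len(affinityPods) == 0:
		return availablePods
	case len(availablePods) == 0:
		return affinityPods
	case rng.Float64() < loraAffinityThreshold:
		return affinityPods
	default:
		return availablePods
	}
}
```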