Scheduling changes for LoRA affinity load balancing #423
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: kaushikmitr. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi @kaushikmitr. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
/ok-to-test
@kaushikmitr: The following tests failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
I didn't look at the algorithm change yet, left a couple of quick comments.
@@ -24,15 +24,23 @@ spec:
- "1"
- "--port"
- "8000"
- "--compilation-config"
what does this do?
we may not need this if using V0. It outputs the CUDA graph for optimization.
- "--lora-modules" | ||
- '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' | ||
- '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' | ||
env: | ||
- name: VLLM_USE_V1 | ||
value: "1" |
The released vLLM version doesn't support our metrics yet, right? If so, then we can't use it now.
Yes, that is why the tests are failing. I will switch back to V0
I don't think that's it; the integration test doesn't use this deployment YAML.
I think the test is failing because this PR introduces some randomness to the selection.
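If the flake really does come from the new randomized selection, one common remedy is to inject the random source instead of constructing it inline, so tests can pass a fixed seed. A minimal sketch under that assumption (the type and function names are hypothetical, not the PR's code):

```go
package scheduling

import "math/rand"

// loraAffinityChooser wraps the probabilistic choice so that production code
// can seed it from the clock while tests supply a fixed source.
type loraAffinityChooser struct {
	threshold float64
	rng       *rand.Rand
}

func newLoraAffinityChooser(threshold float64, src rand.Source) *loraAffinityChooser {
	return &loraAffinityChooser{threshold: threshold, rng: rand.New(src)}
}

// preferAffinity returns true with probability threshold, mirroring the coin
// flip between the affinity group and the available-capacity group.
func (c *loraAffinityChooser) preferAffinity() bool {
	return c.rng.Float64() < c.threshold
}
```

A test could then construct newLoraAffinityChooser(0.999, rand.NewSource(1)) and get a reproducible sequence of decisions.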
// Ignore metrics with both labels empty.
if running == "" && waiting == "" {
	// continue
commented out code
this was a bug.
The algorithm is not using the waiting_lora_adapters metric, right?
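For reference, a small sketch of the check once the commented-out continue is restored; the helper name is made up, and the surrounding loop over label pairs is assumed:

```go
package vllm

// hasAdapterInfo reports whether a vllm:lora_requests_info sample names any
// adapters; the metric-parsing loop would `continue` past samples where both
// the running and waiting labels are empty.
func hasAdapterInfo(running, waiting string) bool {
	return running != "" || waiting != ""
}
```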
name: "affinity LoRA", | ||
filter: toFilterFunc(loRAAffinityPredicate), | ||
nextOnSuccess: queueAndKVCacheFilter, | ||
nextOnFailure: &filter{ |
was this never executed before?
It was, but we had "hard" affinity, which was optimized for throughput. This change helped in lowering tail latency.
@@ -37,6 +37,7 @@ import (
const (
	LoraRequestInfoMetricName = "vllm:lora_requests_info"
	LoraRequestInfoRunningAdaptersMetricName = "running_lora_adapters"
	LoraRequestInfoWaitingAdaptersMetricName = "waiting_lora_adapters"
can you please document the semantics of each metric?
We can update the protocol doc and reference the protocol doc here.
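Until the protocol doc is updated, one possible way to document the semantics inline; the descriptions paraphrase this review thread and the vLLM metric's label format, so treat them as a sketch rather than authoritative wording:

```go
package vllm

const (
	// LoraRequestInfoMetricName is the vLLM gauge whose labels describe the
	// LoRA adapters the model server currently tracks.
	LoraRequestInfoMetricName = "vllm:lora_requests_info"
	// running_lora_adapters lists adapters that have requests actively being
	// served; the number of running adapters is bounded by max_lora.
	LoraRequestInfoRunningAdaptersMetricName = "running_lora_adapters"
	// waiting_lora_adapters lists adapters whose requests are queued but not
	// yet running; they are served next (FIFO) and are not bounded by max_lora.
	LoraRequestInfoWaitingAdaptersMetricName = "waiting_lora_adapters"
	// max_lora reports the maximum number of adapters the server can have
	// loaded at the same time.
	LoraRequestInfoMaxAdaptersMetricName = "max_lora"
)
```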
The changes here are not necessarily related to the algorithm change. Can we make this a separate PR?
I think it's related in the sense that the benchmark I did relies on both changes. But I can still break it into two.
This is just an example manifest for startup guide.
@@ -37,6 +37,7 @@ import (
const (
	LoraRequestInfoMetricName = "vllm:lora_requests_info"
	LoraRequestInfoRunningAdaptersMetricName = "running_lora_adapters"
	LoraRequestInfoWaitingAdaptersMetricName = "waiting_lora_adapters"
	LoraRequestInfoMaxAdaptersMetricName = "max_lora"
	// TODO: Replace these with the num_tokens_running/waiting below once we add those to the fork.
Can you clean up the TODOs and the metrics that are not currently used?
I think the TODOs are still relevant. I will remove max tokens in KV cache because it's not being used.
// The value of 50 was arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 was arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
I think we should make this configurable perhaps via a flag for now. Different environments will likely need different thresholds.
I would rather leverage this to make this configurable: #16
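A minimal sketch of the flag-based approach, assuming the flag name and wiring are illustrative rather than the project's actual CLI surface:

```go
package main

import (
	"flag"
	"fmt"
)

// The default mirrors the value proposed in this PR; different environments
// could override it at startup instead of recompiling.
var queueingThresholdLoRA = flag.Int(
	"lora-queueing-threshold", 128,
	"queueing threshold used by the LoRA scheduling filters (illustrative flag)")

func main() {
	flag.Parse()
	fmt.Println("queueingThresholdLoRA =", *queueingThresholdLoRA)
}
```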
// Returns:
//   - Filtered slice of pod metrics based on affinity and availability
//   - Error if any issues occur during filtering
func loRASoftAffinityPredicate(logger logr.Logger, req *LLMRequest, pods []*datastore.PodMetrics) ([]*datastore.PodMetrics, error) {
This is not a predicate, this is a filter, according to the current filter and predicate interface definitions.
agreed
// Categorize pods based on affinity and availability
for _, pod := range pods {
	if pod == nil {
		continue
Please add a warning log here and state that this should never happen.
Removed this, as this scenario is already captured upstream.
if _, exists := pod.ActiveModels[req.ResolvedTargetModel]; exists {
	filtered_affinity = append(filtered_affinity, pod)
} else if len(pod.ActiveModels) < pod.MaxActiveModels {
This is essentially the canAcceptNewLoraPredicate function below; are we still using canAcceptNewLoraPredicate?
We are not using canAcceptNewLoraPredicate any more, but it would be good to keep, I think.
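For contrast with the inlined check above, a sketch of what canAcceptNewLoraPredicate boils down to, with the pod-metrics type simplified for the example:

```go
package scheduling

// podMetrics is a trimmed stand-in for the datastore's pod metrics type.
type podMetrics struct {
	ActiveModels    map[string]int
	MaxActiveModels int
}

// canAcceptNewLora mirrors the predicate referenced in this thread: a pod can
// take on a new adapter when it has not yet reached its max-LoRA capacity.
func canAcceptNewLora(pod *podMetrics) bool {
	return len(pod.ActiveModels) < pod.MaxActiveModels
}
```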
}

// Seed a pseudo-random source for the probabilistic selection below
randSource := rand.NewSource(time.Now().UnixNano())
This can be a follow-up, but it sounds like we can extend the current filter framework to support such probability-based filtering. So instead of having one base filter, we have a list of filters with weights. This way we can keep each filter very focused and make them more reusable.
agreed
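To make the probability-based behavior concrete, here is a rough sketch of the kind of weighted choice under discussion, using the loraAffinityThreshold constant shown in the diff below; the empty-group fallback and all names are assumptions for illustration, not the PR's exact code:

```go
package scheduling

import (
	"math/rand"
	"time"
)

type pod struct{ name string } // placeholder for the datastore pod-metrics type

// pickGroup chooses between two pre-filtered groups: with probability
// threshold it returns the pods that already have the adapter loaded
// (affinity), otherwise the pods that still have room to load it. If either
// group is empty, the other wins by default.
func pickGroup(affinity, available []*pod, threshold float64) []*pod {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	switch {
	case len(affinity) == 0:
		return available
	case len(available) == 0:
		return affinity
	case rng.Float64() < threshold:
		return affinity
	default:
		return available
	}
}
```

With threshold = 0.999 the affinity group is picked almost always, but the occasional miss lets another pod warm the adapter, which is how the randomization spreads load.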
queueingThresholdLoRA = 128
// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable.
// loraAffinityThreshold indicates the probability with which we prefer a pod with LoRA affinity over a pod without but having room to fit more LoRA adapters.
loraAffinityThreshold = 0.999
do you have some insights to show why this is needed and why this value is picked?
@@ -37,6 +37,7 @@ import (
const (
	LoraRequestInfoMetricName = "vllm:lora_requests_info"
	LoraRequestInfoRunningAdaptersMetricName = "running_lora_adapters"
	LoraRequestInfoWaitingAdaptersMetricName = "waiting_lora_adapters"
On one hand, I can see why considering waiting is useful, because waiting LoRAs are going to be served next. However, I have concerns about this weakening the LoRA affinity: running is bounded by max_lora, while waiting is not bounded. If we enter an unstable state with a long waiting queue, we can lose the affinity benefit.
An improved algorithm could prioritize waiting over running; what do you think?
Using waiting + running for affinity is always superior to using just running, because adapters get served on a first-come, first-served basis, so we know for sure that a waiting adapter, if not already loaded, will get loaded. But yes, within waiting + running, prioritizing waiting over running makes sense I think, but we need to test first.
It is; we are now checking for both waiting + running to determine affinity.
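As a concrete illustration of "checking both waiting + running", a sketch of how the two label values could be merged into one adapter set for affinity decisions; the helper name and the comma-separated label format are assumptions based on this thread:

```go
package vllm

import "strings"

// adapterSetForAffinity merges the running and waiting label values of
// vllm:lora_requests_info into one set of adapter names that the scheduler
// can treat as "active" for affinity purposes.
func adapterSetForAffinity(running, waiting string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, label := range []string{running, waiting} {
		for _, name := range strings.Split(label, ",") {
			if name = strings.TrimSpace(name); name != "" {
				set[name] = struct{}{}
			}
		}
	}
	return set
}
```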
This pull request includes several changes to the deployment configuration, metrics collection, and scheduling logic. The most important changes include updating metrics collection to include waiting adapters, and implementing a new pod selection strategy that balances load while considering model affinity.
Scheduling Logic Enhancements:
pkg/epp/scheduling/filter.go: Replaced the loRAAffinityPredicate function with a new loRASoftAffinityPredicate function that prioritizes pods with existing model affinity while allowing for load balancing through randomization (as long as there is room to fit another adapter in the pod).
pkg/epp/scheduling/scheduler.go: Updated the scheduling configuration to use the new loRASoftAffinityPredicate function and increased the queueingThresholdLoRA value from 50 to 128. Added a new loraAffinityThreshold constant to indicate the probability of preferring pods with model affinity. [1] [2] [3]
Deployment Configuration Changes:
config/manifests/vllm/deployment.yaml: Added new command-line arguments for --compilation-config, --max-num-seqs, and --max-lora-rank. Added a new environment variable VLLM_USE_V1. [1] [2]
Metrics Collection Updates:
pkg/epp/backend/vllm/metrics.go: Added a new metric LoraRequestInfoWaitingAdaptersMetricName and updated the promToPodMetrics and getLatestLoraMetric functions to handle waiting adapters. It also picks the previous running + waiting adapters if there are no current running or waiting adapters. [1] [2] [3]
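The last point, falling back to the previously reported adapters when the newest sample is empty, could look roughly like this; the types and function name are illustrative, not the extension's actual getLatestLoraMetric signature:

```go
package vllm

// loraInfo is a simplified view of one vllm:lora_requests_info observation.
type loraInfo struct {
	Running string
	Waiting string
}

// latestLoraInfo walks observations in order and keeps the newest one that
// actually names adapters, so an empty sample does not wipe out the
// previously reported running/waiting adapters.
func latestLoraInfo(samples []loraInfo) loraInfo {
	var latest loraInfo
	for _, s := range samples {
		if s.Running == "" && s.Waiting == "" {
			continue
		}
		latest = s
	}
	return latest
}
```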