
refresh metrics per pod periodically #223

Open · wants to merge 10 commits into base: main

Conversation

spacewander
Contributor

Fix #99

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: spacewander
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 24, 2025
@k8s-ci-robot
Contributor

Hi @spacewander. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 24, 2025

netlify bot commented Jan 24, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 23e4c17
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67af1f9f1fc0ce0008dd6e2b
😎 Deploy Preview https://deploy-preview-223--gateway-api-inference-extension.netlify.app

@ahg-g
Contributor

ahg-g commented Jan 25, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 25, 2025
@spacewander
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2025
@@ -120,56 +151,13 @@ func (p *Provider) refreshPodsOnce() {
pod := k.(Pod)
if _, ok := p.datastore.pods.Load(pod); !ok {
p.podMetrics.Delete(pod)
Contributor

Should we remove this, given the LoadAndDelete below?

Contributor Author

The map that the pod is loaded from is different from the map that the pod is deleted from.

Contributor

I am confused: line 161 calls p.podMetrics.Delete(pod), and then line 162 calls p.podMetrics.LoadAndDelete(pod); won't line 162 always return false, since the pod was already deleted in line 161?

Contributor Author

@spacewander Feb 14, 2025

Thanks for catching this! Sorry for misunderstanding you at the beginning. The p.podMetrics.LoadAndDelete(pod) should be p.podMetricsRefresher.LoadAndDelete(pod). I will fix it soon.
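For anyone following along, a tiny self-contained sketch of the point being made, with hypothetical pod keys (not the PR's actual key type): LoadAndDelete right after Delete on the same sync.Map always misses, while LoadAndDelete on the separate refresher map still finds the entry that needs to be stopped.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Two separate maps, mirroring the PR's podMetrics and podMetricsRefresher.
	var podMetrics, podMetricsRefresher sync.Map
	podMetrics.Store("pod-a", "cached metrics")
	podMetricsRefresher.Store("pod-a", "refresher handle")

	podMetrics.Delete("pod-a")

	// On the same map, LoadAndDelete after Delete always reports "not found".
	_, ok := podMetrics.LoadAndDelete("pod-a")
	fmt.Println("podMetrics.LoadAndDelete:", ok) // false

	// On the separate refresher map, the entry is still there to be stopped.
	_, ok = podMetricsRefresher.LoadAndDelete("pod-a")
	fmt.Println("podMetricsRefresher.LoadAndDelete:", ok) // true
}
```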

Contributor

Why do we need another map in the first place? Can't we put the PodMetricsRefresher in the PodMetrics struct?

Contributor Author

@ahg-g
I tried that before, but if we put PodMetricsRefresher under PodMetrics, it's not easy to make it concurrency-safe without a big refactor, since PodMetricsRefresher needs to change PodMetrics on the fly.
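For illustration only, a hedged sketch (not the PR's code) of what embedding the refresher state inside PodMetrics could look like if the metrics snapshot is swapped atomically; the field and method names below are made up, and whether this fits the existing structs without a large refactor is exactly the question being discussed.

```go
package podmetrics

import (
	"sync/atomic"
	"time"
)

// Metrics is an illustrative snapshot of one pod's scraped values.
type Metrics struct {
	WaitingQueueSize int
	KVCacheUsage     float64
}

// PodMetrics embeds its own refresher state. The probing goroutine and any
// readers (e.g. the scheduler) share the struct, so the snapshot is replaced
// atomically instead of being mutated in place.
type PodMetrics struct {
	metrics atomic.Pointer[Metrics]
	stopCh  chan struct{}
}

func NewPodMetrics() *PodMetrics { return &PodMetrics{stopCh: make(chan struct{})} }

// Get returns the latest snapshot without locking.
func (pm *PodMetrics) Get() *Metrics { return pm.metrics.Load() }

// StartRefresher launches the per-pod probing loop; probe is whatever scrapes
// the pod's metrics endpoint.
func (pm *PodMetrics) StartRefresher(interval time.Duration, probe func() *Metrics) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-pm.stopCh:
				return // pod deleted: refresher exits
			case <-ticker.C:
				pm.metrics.Store(probe()) // concurrency-safe in-place update
			}
		}
	}()
}

// StopRefresher releases the probing goroutine.
func (pm *PodMetrics) StopRefresher() { close(pm.stopCh) }
```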

Contributor

I believe we should be able to do that. I sent out #350, which does a major refactor of the datastore/provider layer. It consolidates storage in one place under the datastore, and so it does in-place updates to the metrics. Please take a look and advise if there are any concurrency issues that I may have missed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2025
Signed-off-by: spacewander <[email protected]>
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 10, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
@spacewander
Contributor Author

@liu-cong
I have solved the merge conflicts. Would you review this PR again? Thanks!

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2025
}

// Schedule finds the target pod based on metrics and the requested lora adapter.
func (s *Scheduler) Schedule(req *LLMRequest) (targetPod backend.Pod, err error) {
klog.V(logutil.VERBOSE).Infof("request: %v; metrics: %+v", req, s.podMetricsProvider.AllPodMetrics())
pods, err := s.filter.Filter(req, s.podMetricsProvider.AllPodMetrics())
klog.V(logutil.VERBOSE).Infof("request: %v; metrics: %+v", req, s.podMetricsProvider.AllFreshPodMetrics())
Contributor

Can you remove the AllFreshPodMetrics() from this logging? This was probably added in the initial POC. This is really DEBUG-level logging and can be very heavy if there are many pods in the pool.

Contributor Author

@liu-cong
Updated.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2025
@liu-cong
Contributor

@spacewander The change looks good to me, thank you!

Since this is a pretty big change, I was wondering if you can perform some manual testing and update the testing results.

/lgtm

/hold For testing update

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 14, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2025
@ahg-g
Contributor

ahg-g commented Feb 15, 2025

> @spacewander The change looks good to me, thank you!
>
> Since this is a pretty big change, I was wondering if you can perform some manual testing and update the testing results.
>
> /lgtm
>
> /hold For testing update

I think we need to run a benchmark with this change.

@ahg-g
Contributor

ahg-g commented Feb 15, 2025

I have a couple of concerns with this design, and I am hesitant to move forward with it:

  1. We are not limiting the number of goroutines being created; at some point this is not going to scale. Ideally we should have a pool of threads that process work (in this case probing).

  2. We now have a pod reconciler; we can use it to schedule probing work by requeueing at specific intervals. This eliminates all the added logic related to creating and handling probing threads. The TTL feature in JobSet uses this pattern (see the sketch below): https://github.com/kubernetes-sigs/jobset/blob/main/pkg/controllers/ttl_after_finished.go#L113
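A minimal sketch of the requeue pattern referenced in point 2, assuming a controller-runtime pod reconciler; PodReconciler, RefreshInterval, and probeMetrics are hypothetical names, not code from this PR or from JobSet.

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// PodReconciler probes one pod's metrics per reconcile and re-queues itself,
// so probing concurrency is bounded by the controller's worker count instead
// of growing with the number of pods.
type PodReconciler struct {
	RefreshInterval time.Duration
}

func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Scrape the pod's metrics endpoint and update the cached entry.
	if err := r.probeMetrics(ctx, req); err != nil {
		// Returning the error lets the workqueue retry with backoff.
		return ctrl.Result{}, err
	}
	// Ask the workqueue to call us again after the refresh interval,
	// the same pattern the JobSet TTL controller uses.
	return ctrl.Result{RequeueAfter: r.RefreshInterval}, nil
}

// probeMetrics is a placeholder for the actual scrape/update logic.
func (r *PodReconciler) probeMetrics(ctx context.Context, req ctrl.Request) error {
	return nil
}
```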

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 15, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@liu-cong
Contributor

> I have a couple of concerns with this design, and I am hesitant to move forward with it:

Agree with both concerns; however, both already exist in the current implementation and are not something introduced by this PR. IMO this PR is an improvement, provided that the benchmark shows improvements (or at least no regression).

@ahg-g
Contributor

ahg-g commented Feb 16, 2025

> > I have a couple of concerns with this design, and I am hesitant to move forward with it:
>
> Agree with both concerns; however, both already exist in the current implementation and are not something introduced by this PR. IMO this PR is an improvement, provided that the benchmark shows improvements (or at least no regression).

Yes, they do exist now, but this PR tries to solve the problem using the same approach (with a higher likelihood of a goroutine leak). Another issue is that it adds yet another cache: after this PR we will have three caches tracking pods (datastore.pods, provider.podMetrics, and provider.podMetricsRefresher), each with its own type (backend.Pod, PodMetrics, and PodMetricsRefresher).

We only need one cache and one type. Cleaning up the datastore cache started in the informer cache PR, but then I realized that we would benefit from a pod reconciler to handle probing, hence I walked back that change.

We could do this over two PRs:

  1. Remove the pod cache in the datastore. The pod reconciler directly adds/removes entries from the PodMetrics cache.
  2. Have the pod reconciler execute the metrics probing logic, move the PodMetrics cache to the datastore, and completely remove provider.go.

I am happy to send a PR for step 1 to help accelerate execution in this direction.

@spacewander
Contributor Author

Agree with the "we have too many caches" part. Maybe we can do a refactor after this PR is merged (if the direction is acceptable, of course)? As it stands, this PR is broken by merge conflicts again.

> We are not limiting the number of goroutines being created; at some point this is not going to scale. Ideally we should have a pool of threads that process work (in this case probing).

Using requeue may be good, but here we use one goroutine per task so that slow probes won't affect the other probes.

The Go runtime already uses a thread pool to schedule goroutines, so it can manage them effectively. Inventing another level of thread pool may be less helpful.

BTW, is there any SLO for how many pods/models an inference extension should handle?

Anyway, we can reduce the chance of goroutine leaks by reducing the caches, which makes the lifecycle of pod/PodMetrics clearer.

@ahg-g
Contributor

ahg-g commented Feb 17, 2025

> Agree with the "we have too many caches" part. Maybe we can do a refactor after this PR is merged (if the direction is acceptable, of course)? As it stands, this PR is broken by merge conflicts again.
>
> > We are not limiting the number of goroutines being created; at some point this is not going to scale. Ideally we should have a pool of threads that process work (in this case probing).
>
> Using requeue may be good, but here we use one goroutine per task so that slow probes won't affect the other probes.
>
> The Go runtime already uses a thread pool to schedule goroutines, so it can manage them effectively. Inventing another level of thread pool may be less helpful.
>
> BTW, is there any SLO for how many pods/models an inference extension should handle?
>
> Anyway, we can reduce the chance of goroutine leaks by reducing the caches, which makes the lifecycle of pod/PodMetrics clearer.

Right, we could continue with this approach of using separate goroutines with careful tracking. I agree that the Go runtime will limit the number of actual threads anyway, but I am still concerned about goroutine leaks, which happened to us in the past in kube-scheduler.

My hunch is that using the pod controller will make probing straightforward, but I could be wrong. I sent out #350, which consolidates storage in the datastore, and the pod controller already accesses that to add/delete/update pod entries; I think it should be easy now to add probing updates as well.

@liu-cong
Contributor

I suggested a similar consolidation follow-up refactor in #223 (comment).

I will review #350 and hopefully this PR can be simplified based on that.

@liu-cong
Contributor

I reviewed PR #350 and I believe it will make this PR much cleaner. @spacewander If you don't mind, I suggest rebasing once PR #350 is merged and making the following changes:

  • Make a new package podmetrics that handles the lifecycle of the refreshers. You should be able to reuse most of the refresher code and just add some helper methods to create/stop the refreshers.
  • Updates to the pods map in the datastore should trigger those refresher lifecycle methods (e.g., deleting a pod should stop the refresher goroutine).
  • I think with the above we can get rid of provider.go.

@spacewander
Contributor Author

It seems that there will be other ongoing refactor PRs from #350 (comment).

I would like to continue the work once the codebase is stable.

@ahg-g
Contributor

ahg-g commented Feb 20, 2025

> It seems that there will be other ongoing refactor PRs from #350 (comment).
>
> I would like to continue the work once the codebase is stable.

Apologies for the churn. Note that the two main refactorings, which restructured the repo, are done. What is left is improving test coverage, which should not impact this enhancement. Renaming the PodMetrics struct is not high priority and we can hold off on that. I think this enhancement is the top priority now and we can prioritize getting it in.

To align on the direction: I think creating a prober type that we instantiate every time we add a new PodMetric instance to the map is likely a good path forward, wdyt?

@liu-cong
Contributor

@spacewander Thank you again for your patience!

+1 with Abdullah. All the important refactoring is done, and I support holding off on any further refactoring and prioritizing this one, as it addresses a big risk.

So just for the sake of clarity, I think we are really close to getting this done:

  • Remove provider.go. Its functionality is fully replaced by the new per-pod refresher.
  • Keep your pod_metrics_refresher.go; instead of passing the provider, you simply need the PodMetricsClient. Each pod will then hold its own pod_metrics_refresher.
  • The datastore now handles the lifecycle of pods, so you just need to start/stop the refresher when you add/remove a pod, in PodUpdateOrAddIfNotExist, PodDelete, and PodDeleteAll (sketched below).

I am happy to help benchmark this change once it's ready for review. What do you think, @spacewander?
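To make the proposed wiring concrete, here is a hedged, self-contained sketch under the assumptions above: the datastore owns the refresher lifecycle, the method names come from the list in this comment, and the Pod and refresher types are illustrative stand-ins rather than the repo's actual definitions.

```go
package datastore

import "sync"

// Pod and refresher are illustrative stand-ins for the repo's types.
type Pod struct{ Name string }

type refresher struct{ stopCh chan struct{} }

func newRefresher(pod Pod) *refresher { return &refresher{stopCh: make(chan struct{})} }

func (r *refresher) Start() { /* launch the per-pod probing goroutine */ }
func (r *refresher) Stop()  { close(r.stopCh) }

// Datastore tracks pods and the refresher attached to each of them.
type Datastore struct {
	pods       sync.Map // Pod.Name -> Pod
	refreshers sync.Map // Pod.Name -> *refresher
}

// PodUpdateOrAddIfNotExist stores the latest pod object and starts a
// refresher only the first time the pod is seen.
func (ds *Datastore) PodUpdateOrAddIfNotExist(pod Pod) {
	ds.pods.Store(pod.Name, pod)
	// LoadOrStore makes the "start the refresher exactly once" step race-free.
	if actual, loaded := ds.refreshers.LoadOrStore(pod.Name, newRefresher(pod)); !loaded {
		actual.(*refresher).Start()
	}
}

// PodDelete removes the pod and stops its refresher so the goroutine is not leaked.
func (ds *Datastore) PodDelete(pod Pod) {
	ds.pods.Delete(pod.Name)
	if v, ok := ds.refreshers.LoadAndDelete(pod.Name); ok {
		v.(*refresher).Stop()
	}
}

// PodDeleteAll clears everything, stopping every refresher.
func (ds *Datastore) PodDeleteAll() {
	ds.refreshers.Range(func(k, v any) bool {
		v.(*refresher).Stop()
		ds.refreshers.Delete(k)
		return true
	})
	ds.pods.Range(func(k, _ any) bool {
		ds.pods.Delete(k)
		return true
	})
}
```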

@spacewander
Contributor Author

@ahg-g @liu-cong
Thanks for your suggestions. I tried updating the PR this weekend, but found it too tough to resolve the merge conflicts; anyone who wants to continue this would need to rebuild the solution on the current main branch. Unfortunately, I am too busy with my job recently (at least for two or three weeks) and can't find spare time to carry on.

@liu-cong
Contributor

@spacewander Really appreciate your effort on this! Let me give it a try based on the great work you already have.

Successfully merging this pull request may close these issues.

The metrics refresh time might be much larger than the refreshMetricsInterval