
refresh metrics per pod periodically #223

Open · wants to merge 10 commits into base: main

Conversation

spacewander
Contributor

Fix #99

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: spacewander
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 24, 2025
@k8s-ci-robot
Contributor

Hi @spacewander. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 24, 2025

netlify bot commented Jan 24, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 23e4c17
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67af1f9f1fc0ce0008dd6e2b
😎 Deploy Preview https://deploy-preview-223--gateway-api-inference-extension.netlify.app

@ahg-g
Contributor

ahg-g commented Jan 25, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 25, 2025
@spacewander
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2025
@@ -120,56 +151,13 @@ func (p *Provider) refreshPodsOnce() {
pod := k.(Pod)
if _, ok := p.datastore.pods.Load(pod); !ok {
p.podMetrics.Delete(pod)
Contributor

Should we remove this, given the LoadAndDelete below?

Contributor Author

The map that the pod is loaded from is different from the map that the pod is deleted from.

Contributor

I am confused: line 161 calls p.podMetrics.Delete(pod), and then line 162 calls p.podMetrics.LoadAndDelete(pod); won't line 162 always return false, since the pod was already deleted in line 161?

Contributor Author

@spacewander Feb 14, 2025

Thanks for catching this! Sorry for misunderstanding you at the beginning. The p.podMetrics.LoadAndDelete(pod) should be p.podMetricsRefresher.LoadAndDelete(pod). I will fix it soon.
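For anyone following along, a tiny self-contained sketch of the point being made, with hypothetical pod keys (not the PR's actual key type): LoadAndDelete right after Delete on the same sync.Map always misses, while LoadAndDelete on the separate refresher map still finds the entry that needs to be stopped.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Two separate maps, mirroring the PR's podMetrics and podMetricsRefresher.
	var podMetrics, podMetricsRefresher sync.Map
	podMetrics.Store("pod-a", "cached metrics")
	podMetricsRefresher.Store("pod-a", "refresher handle")

	podMetrics.Delete("pod-a")

	// On the same map, LoadAndDelete after Delete always reports "not found".
	_, ok := podMetrics.LoadAndDelete("pod-a")
	fmt.Println("podMetrics.LoadAndDelete:", ok) // false

	// On the separate refresher map, the entry is still there to be stopped.
	_, ok = podMetricsRefresher.LoadAndDelete("pod-a")
	fmt.Println("podMetricsRefresher.LoadAndDelete:", ok) // true
}
```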

Contributor

Why do we need another map in the first place? Can't we put the PodMetricsRefresher in the PodMetrics struct?

Contributor Author

@ahg-g
I tried that before, but if we put PodMetricsRefresher under PodMetrics, it's not easy to make it concurrency-safe without a big refactor, since PodMetricsRefresher needs to change PodMetrics on the fly.
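For illustration only, a hedged sketch (not the PR's code) of what embedding the refresher state inside PodMetrics could look like if the metrics snapshot is swapped atomically; the field and method names below are made up, and whether this fits the existing structs without a large refactor is exactly the question being discussed.

```go
package podmetrics

import (
	"sync/atomic"
	"time"
)

// Metrics is an illustrative snapshot of one pod's scraped values.
type Metrics struct {
	WaitingQueueSize int
	KVCacheUsage     float64
}

// PodMetrics embeds its own refresher state. The probing goroutine and any
// readers (e.g. the scheduler) share the struct, so the snapshot is replaced
// atomically instead of being mutated in place.
type PodMetrics struct {
	metrics atomic.Pointer[Metrics]
	stopCh  chan struct{}
}

func NewPodMetrics() *PodMetrics { return &PodMetrics{stopCh: make(chan struct{})} }

// Get returns the latest snapshot without locking.
func (pm *PodMetrics) Get() *Metrics { return pm.metrics.Load() }

// StartRefresher launches the per-pod probing loop; probe is whatever scrapes
// the pod's metrics endpoint.
func (pm *PodMetrics) StartRefresher(interval time.Duration, probe func() *Metrics) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-pm.stopCh:
				return // pod deleted: refresher exits
			case <-ticker.C:
				pm.metrics.Store(probe()) // concurrency-safe in-place update
			}
		}
	}()
}

// StopRefresher releases the probing goroutine.
func (pm *PodMetrics) StopRefresher() { close(pm.stopCh) }
```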

Contributor

I believe we should be able to do that. I sent out #350, which does a major refactor of the datastore/provider layer. It consolidates storage in one place under the datastore, and so it does in-place updates to the metrics. Please take a look and advise if there are any concurrency issues that I may have missed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2025
Signed-off-by: spacewander <[email protected]>
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 10, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
@spacewander
Contributor Author

@liu-cong
I have solved the merge conflicts. Would you review this PR again? Thanks!

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2025
}

// Schedule finds the target pod based on metrics and the requested lora adapter.
func (s *Scheduler) Schedule(req *LLMRequest) (targetPod backend.Pod, err error) {
klog.V(logutil.VERBOSE).Infof("request: %v; metrics: %+v", req, s.podMetricsProvider.AllPodMetrics())
pods, err := s.filter.Filter(req, s.podMetricsProvider.AllPodMetrics())
klog.V(logutil.VERBOSE).Infof("request: %v; metrics: %+v", req, s.podMetricsProvider.AllFreshPodMetrics())
Contributor

Can you remove the AllFreshPodMetrics() from this logging? This was probably added in the initial POC. This is really DEBUG-level logging and can be very heavy if there are many pods in the pool.

Contributor Author

@liu-cong
Updated.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2025
@liu-cong
Contributor

@spacewander The change looks good to me, thank you!

Since this is a pretty big change, I was wondering if you can perform some manual testing and update the testing results.

/lgtm

/hold For testing update

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 14, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2025
@ahg-g
Contributor

ahg-g commented Feb 15, 2025

> @spacewander The change looks good to me, thank you!
>
> Since this is a pretty big change, I was wondering if you can perform some manual testing and update the testing results.
>
> /lgtm
>
> /hold For testing update

I think we need to run a benchmark with this change.

@ahg-g
Contributor

ahg-g commented Feb 15, 2025

I have a couple of concerns with this design, and I am hesitant to move forward with it:

  1. We are not limiting the number of goroutines being created; at some point this is not going to scale. Ideally we should have a pool of threads that process work (in this case probing).

  2. We now have a pod reconciler; we can use it to schedule probing work by requeueing at specific intervals. This eliminates all the added logic related to creating and handling probing threads. The TTL feature in JobSet uses this pattern (see the sketch below): https://github.com/kubernetes-sigs/jobset/blob/main/pkg/controllers/ttl_after_finished.go#L113
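A minimal sketch of the requeue pattern referenced in point 2, assuming a controller-runtime pod reconciler; PodReconciler, RefreshInterval, and probeMetrics are hypothetical names, not code from this PR or from JobSet.

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// PodReconciler probes one pod's metrics per reconcile and re-queues itself,
// so probing concurrency is bounded by the controller's worker count instead
// of growing with the number of pods.
type PodReconciler struct {
	RefreshInterval time.Duration
}

func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Scrape the pod's metrics endpoint and update the cached entry.
	if err := r.probeMetrics(ctx, req); err != nil {
		// Returning the error lets the workqueue retry with backoff.
		return ctrl.Result{}, err
	}
	// Ask the workqueue to call us again after the refresh interval,
	// the same pattern the JobSet TTL controller uses.
	return ctrl.Result{RequeueAfter: r.RefreshInterval}, nil
}

// probeMetrics is a placeholder for the actual scrape/update logic.
func (r *PodReconciler) probeMetrics(ctx context.Context, req ctrl.Request) error {
	return nil
}
```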

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 15, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@liu-cong
Contributor

> I have a couple of concerns with this design, and I am hesitant to move forward with it:

Agree with both concerns; however, both already exist in the current implementation and are not something introduced by this PR. IMO this PR is an improvement, provided that the benchmark shows improvements (or at least no regression).

@ahg-g
Contributor

ahg-g commented Feb 16, 2025

> > I have a couple of concerns with this design, and I am hesitant to move forward with it:
>
> Agree with both concerns; however, both already exist in the current implementation and are not something introduced by this PR. IMO this PR is an improvement, provided that the benchmark shows improvements (or at least no regression).

Yes, they do exist now, but this PR tries to solve the problem using the same approach (with a higher likelihood of a goroutine leak). Another issue is that it adds yet another cache: after this PR we will have three caches tracking pods (datastore.pods, provider.podMetrics, and provider.podMetricsRefresher), each with its own type (backend.Pod, PodMetrics, and PodMetricsRefresher).

We only need one cache and one type. Cleaning up the datastore cache started in the informer cache PR, but then I realized that we would benefit from a pod reconciler to handle probing, hence I walked back that change.

We could do this over two PRs:

  1. Remove the pod cache in the datastore. The pod reconciler directly adds/removes entries from the PodMetrics cache.
  2. Have the pod reconciler execute the metrics probing logic, move the PodMetrics cache to the datastore, and completely remove provider.go.

I am happy to send a PR for step 1 to help accelerate execution in this direction.

@spacewander
Contributor Author

Agree with the "we have too many caches" part. Maybe we can do a refactor after this PR is merged (if the direction is acceptable, of course)? As it stands, this PR is broken by merge conflicts again.

> We are not limiting the number of goroutines being created; at some point this is not going to scale. Ideally we should have a pool of threads that process work (in this case probing).

Using requeue may be good, but here we use one goroutine per task so that slow probes won't affect the other probes.

The Go runtime already uses a thread pool to schedule goroutines, so it can manage them effectively. Inventing another level of thread pool may be less helpful.

BTW, is there any SLO for how many pods/models an inference extension should handle?

Anyway, we can reduce the chance of goroutine leaks by reducing the caches, which makes the lifecycle of pod/PodMetrics clearer.

@ahg-g
Contributor

ahg-g commented Feb 17, 2025

> Agree with the "we have too many caches" part. Maybe we can do a refactor after this PR is merged (if the direction is acceptable, of course)? As it stands, this PR is broken by merge conflicts again.
>
> > We are not limiting the number of goroutines being created; at some point this is not going to scale. Ideally we should have a pool of threads that process work (in this case probing).
>
> Using requeue may be good, but here we use one goroutine per task so that slow probes won't affect the other probes.
>
> The Go runtime already uses a thread pool to schedule goroutines, so it can manage them effectively. Inventing another level of thread pool may be less helpful.
>
> BTW, is there any SLO for how many pods/models an inference extension should handle?
>
> Anyway, we can reduce the chance of goroutine leaks by reducing the caches, which makes the lifecycle of pod/PodMetrics clearer.

Right, we could continue with this approach of using separate goroutines with careful tracking. I agree that the Go runtime will limit the number of actual threads anyway, but I am still concerned about goroutine leaks, which happened to us in the past in kube-scheduler.

My hunch is that using the pod controller will make probing straightforward, but I could be wrong. I sent out #350, which consolidates storage in the datastore, and the pod controller already accesses that to add/delete/update pod entries; I think it should be easy now to add probing updates as well.

@liu-cong
Contributor

I suggested a similar consolidation follow-up refactor in #223 (comment).

I will review #350 and hopefully this PR can be simplified based on that.

@liu-cong
Contributor

I reviewed PR #350 and I believe it will make this PR much cleaner. @spacewander If you don't mind, I suggest rebasing once PR #350 is merged and making the following changes:

  • Make a new package podmetrics that handles the lifecycle of the refreshers. You should be able to reuse most of the refresher code and just add some helper methods to create/stop the refreshers.
  • Updates to the pods map in the datastore should trigger those refresher lifecycle methods (e.g., deleting a pod should stop the refresher goroutine).
  • I think with the above we can get rid of provider.go.

@spacewander
Contributor Author

It seems that there will be other ongoing refactor PRs from #350 (comment).

I would like to continue the work once the codebase is stable.

@ahg-g
Contributor

ahg-g commented Feb 20, 2025

> It seems that there will be other ongoing refactor PRs from #350 (comment).
>
> I would like to continue the work once the codebase is stable.

Apologies for the churn. Note that the two main refactorings, which restructured the repo, are done. What is left is improving test coverage, which should not impact this enhancement. Renaming the PodMetrics struct is not high priority and we can hold off on that. I think this enhancement is the top priority now and we can prioritize getting it in.

To align on the direction: I think creating a prober type that we instantiate every time we add a new PodMetric instance to the map is likely a good path forward, wdyt?

@liu-cong
Contributor

@spacewander Thank you again for your patience!

+1 with Abdullah. All the important refactoring is done, and I support holding off on any further refactoring and prioritizing this one, as it addresses a big risk.

So just for the sake of clarity, I think we are really close to getting this done:

  • Remove provider.go. Its functionality is fully replaced by the new per-pod refresher.
  • Keep your pod_metrics_refresher.go; instead of passing the provider, you simply need the PodMetricsClient. Each pod will then hold its own pod_metrics_refresher.
  • The datastore now handles the lifecycle of pods, so you just need to start/stop the refresher when you add/remove a pod, in PodUpdateOrAddIfNotExist, PodDelete, and PodDeleteAll (sketched below).

I am happy to help benchmark this change once it's ready for review. What do you think, @spacewander?
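To make the proposed wiring concrete, here is a hedged, self-contained sketch under the assumptions above: the datastore owns the refresher lifecycle, the method names come from the list in this comment, and the Pod and refresher types are illustrative stand-ins rather than the repo's actual definitions.

```go
package datastore

import "sync"

// Pod and refresher are illustrative stand-ins for the repo's types.
type Pod struct{ Name string }

type refresher struct{ stopCh chan struct{} }

func newRefresher(pod Pod) *refresher { return &refresher{stopCh: make(chan struct{})} }

func (r *refresher) Start() { /* launch the per-pod probing goroutine */ }
func (r *refresher) Stop()  { close(r.stopCh) }

// Datastore tracks pods and the refresher attached to each of them.
type Datastore struct {
	pods       sync.Map // Pod.Name -> Pod
	refreshers sync.Map // Pod.Name -> *refresher
}

// PodUpdateOrAddIfNotExist stores the latest pod object and starts a
// refresher only the first time the pod is seen.
func (ds *Datastore) PodUpdateOrAddIfNotExist(pod Pod) {
	ds.pods.Store(pod.Name, pod)
	// LoadOrStore makes the "start the refresher exactly once" step race-free.
	if actual, loaded := ds.refreshers.LoadOrStore(pod.Name, newRefresher(pod)); !loaded {
		actual.(*refresher).Start()
	}
}

// PodDelete removes the pod and stops its refresher so the goroutine is not leaked.
func (ds *Datastore) PodDelete(pod Pod) {
	ds.pods.Delete(pod.Name)
	if v, ok := ds.refreshers.LoadAndDelete(pod.Name); ok {
		v.(*refresher).Stop()
	}
}

// PodDeleteAll clears everything, stopping every refresher.
func (ds *Datastore) PodDeleteAll() {
	ds.refreshers.Range(func(k, v any) bool {
		v.(*refresher).Stop()
		ds.refreshers.Delete(k)
		return true
	})
	ds.pods.Range(func(k, _ any) bool {
		ds.pods.Delete(k)
		return true
	})
}
```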

@spacewander
Contributor Author

@ahg-g @liu-cong
Thanks for your suggestions. I tried updating the PR this weekend, but found it too tough to resolve the merge conflicts; anyone who wants to continue this would need to rebuild the solution on the current main branch. Unfortunately, I am too busy with my job recently (at least for two or three weeks) and can't find spare time to carry on.

@liu-cong
Contributor

@spacewander Really appreciate your effort on this! Let me give it a try based on the great work you already have.

Successfully merging this pull request may close these issues.

The metrics refresh time might be much larger than the refreshMetricsInterval