
ensure metrics refresh time <= refreshMetricsInterval #207

Closed
wants to merge 1 commit

Conversation

spacewander
Contributor

Fix #99

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: spacewander
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 20, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 20, 2025
@k8s-ci-robot
Contributor

Hi @spacewander. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 20, 2025

netlify bot commented Jan 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: d3f5765
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/678e545fe707bb00080e832d
😎 Deploy Preview: https://deploy-preview-207--gateway-api-inference-extension.netlify.app

@spacewander spacewander marked this pull request as ready for review January 20, 2025 13:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kfswain January 20, 2025 13:49
	}
	defer func() {
		_ = resp.Body.Close()
	}()

	if resp.StatusCode != http.StatusOK {
		klog.Errorf("unexpected status code from %s: %v", pod, resp.StatusCode)
Contributor Author

The returned error will be gathered into errCh and eventually logged outside this function:

errCh <- fmt.Errorf("failed to parse metrics from %s: %v", pod, err)

I removed the error log here so it won't be logged twice.
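
For reference, a minimal sketch of the error-gathering pattern described here; refreshAll and fetchOne are invented names standing in for the PR's surrounding code, and only the errCh send mirrors the actual diff:

```go
import (
	"context"
	"fmt"
	"log"
	"sync"
)

// refreshAll fetches metrics for every pod concurrently. Workers do not log
// their own failures; each error is sent to errCh and logged exactly once
// after all workers finish.
func refreshAll(ctx context.Context, pods []string, fetchOne func(context.Context, string) error) {
	errCh := make(chan error, len(pods)) // buffered so workers never block on send
	var wg sync.WaitGroup
	for _, pod := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			if err := fetchOne(ctx, pod); err != nil {
				errCh <- fmt.Errorf("failed to fetch metrics from %s: %w", pod, err)
			}
		}(pod)
	}
	wg.Wait()
	close(errCh)
	for err := range errCh {
		log.Printf("metrics refresh error: %v", err) // the single logging site
	}
}
```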

@@ -49,22 +49,23 @@ func (p *PodMetricsClientImpl) FetchMetrics(
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		klog.Errorf("failed to fetch metrics from %s: %v", pod, err)
Contributor

		return nil, fmt.Errorf("failed to fetch metrics from %s: %w", pod, err)
		// Because we use a short fetch timeout to keep the metrics up to date, there will be many
		// timeout errors even if only 0.1% of requests time out.
		// Return the raw error so that the caller can filter it out via errors.Is(err, context.Canceled).
Contributor

The %w works with errors.Is, right? I don't think there is a need to change this.

Contributor Author

Thanks for pointing that out. I'll fix it in the next commit.
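
For reference, a small runnable example showing that an error wrapped with %w is still matched by errors.Is (the pod name here is a placeholder):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the refresh context being canceled

	// Wrapping with %w keeps the original error in the chain...
	err := fmt.Errorf("failed to fetch metrics from %s: %w", "pod-a", ctx.Err())
	// ...so errors.Is can still identify it.
	fmt.Println(errors.Is(err, context.Canceled)) // prints: true
}
```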

@@ -35,23 +32,46 @@ type PodMetricsClient interface {
	FetchMetrics(ctx context.Context, pod Pod, existing *PodMetrics) (*PodMetrics, error)
}

func isPodMetricsStale(pm *PodMetrics) bool {
	// TODO: make it configurable
	return time.Since(pm.UpdatedTime) > 5*time.Second
Contributor

nit: Can we make the 5 seconds a constant?
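
A possible shape for that suggestion, within the same file (which already imports time); the constant name is illustrative:

```go
// metricsStaleThreshold bounds how old a pod's metrics may be before they are
// treated as stale. (Name is illustrative; the PR's TODO: make it configurable.)
const metricsStaleThreshold = 5 * time.Second

func isPodMetricsStale(pm *PodMetrics) bool {
	return time.Since(pm.UpdatedTime) > metricsStaleThreshold
}
```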

	p.podMetrics.Store(pod, pm)
}

func (p *Provider) GetPodMetrics(pod Pod) (*PodMetrics, bool) {
	val, ok := p.podMetrics.Load(pod)
	if ok {
		// For now, we don't exclude stale metrics with GET operation.
Contributor

Can you create an issue to track this? We may want to deprioritize or skip a pod if its metrics are stale in the future.
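
If that future change lands, it could look roughly like the sketch below (purely illustrative, not part of this PR; it assumes p.podMetrics is a sync.Map holding *PodMetrics values):

```go
func (p *Provider) GetPodMetrics(pod Pod) (*PodMetrics, bool) {
	val, ok := p.podMetrics.Load(pod)
	if !ok {
		return nil, false
	}
	pm := val.(*PodMetrics)
	// Hypothetical future behavior: treat stale metrics as missing so callers
	// can deprioritize or skip this pod.
	if isPodMetricsStale(pm) {
		return nil, false
	}
	return pm, true
}
```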

-func (p *Provider) refreshMetricsOnce() error {
-	ctx, cancel := context.WithTimeout(context.Background(), fetchMetricsTimeout)
+func (p *Provider) refreshMetricsOnce(interval time.Duration) error {
+	ctx, cancel := context.WithTimeout(context.Background(), interval)
Contributor

I think we should allow a timeout longer than the interval, so that slow backends still get a chance to return metrics. Otherwise we have to pick a reasonable interval very carefully for each environment.

I think the fix should be: each pod has its own independent refresh loop.
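
A rough sketch of the per-pod refresh loop idea, not this PR's implementation; startPodRefreshLoop, stopCh, p.pmc, and p.UpdatePodMetrics are invented wiring, while fetchMetricsTimeout and FetchMetrics come from the existing code:

```go
func (p *Provider) startPodRefreshLoop(pod Pod, interval time.Duration, stopCh <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stopCh:
				return
			case <-ticker.C:
				// Give the fetch a timeout longer than the interval so slow
				// backends can still answer; a late response beats none.
				ctx, cancel := context.WithTimeout(context.Background(), fetchMetricsTimeout)
				existing, _ := p.GetPodMetrics(pod)
				updated, err := p.pmc.FetchMetrics(ctx, pod, existing)
				cancel()
				if err != nil {
					klog.V(4).Infof("failed to refresh metrics for %s: %v", pod, err)
					continue
				}
				p.UpdatePodMetrics(pod, updated)
			}
		}
	}()
}
```

A slow pod then only delays its own loop instead of the whole refresh pass, at the cost of the cross-pod simultaneity discussed below.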

Contributor Author

I previously considered having each pod obtain metrics independently, but couldn't figure out how to adapt to the fact that pods are dynamically added.

Since pods are added dynamically and metrics are collected at fixed intervals, the samples won't be simultaneous across pods when we collect all pod metrics (AllPodMetrics) at a given point in time.

Consider the following scenario, where metrics are collected once per time unit, and the queue length increases by one every half-time unit:

Pod A is added starting at 0, metrics are collected at 0, 1, and 2, and by point 2, its queue length is 4.

Pod B is added starting at 0.5, metrics are collected at 0.5 and 1.5, and by point 2, its queue length is 3.

However, if we make a decision at point 2, using metrics collected at points 2 and 1.5, we get values of 4 and 2, which doesn't reflect the actual situation.

For slow backends, both approaches behave the same, because metrics obtained from slow backends are delayed either way. While collecting metrics independently for each pod could prevent slow backends from slowing down the overall refresh, it would sacrifice the simultaneity of the metrics from normal backends. Since the vast majority of backends are normal, it's better to stick with the existing approach.

Contributor

IMO there are two separate things: 1. The metrics refresh interval, which is supposed to refresh metrics as fast as possible and is optimistic; it's OK if some backends are sometimes not fast enough to keep up with the interval. 2. The metrics API call timeout, which needs to be long enough to cover non-ideal situations where the metrics API call takes longer than normal. A slow API response is better than no response.

My concern with the current implementation is that the metrics call fails if it times out at the interval. This can happen, for example, in an environment with slow networking, and it forces the operator to pick the interval very carefully. Even then, degraded networking conditions can lead to no metrics refresh at all.

I don't think it's a big issue that pod metrics are not refreshed synchronously. Even if they were refreshed synchronously, there would always be a delay between the real-time state and what the ext proc sees.

Contributor Author

@liu-cong
I agree with you. What do you think about the following duration settings?
Refresh interval: 50ms
Fetch timeout: 1s
Stale time: 5s

That gives us a staleness limit to filter out pods that don't have fresh data, and for slow backends we avoid aborting abnormally and still get a chance to retry.

Contributor

We should add some metrics on the metrics API latency to make a more educated guess, but your proposed numbers look good to me for now. And we should make them configurable (adding them as startup flags is OK to begin with). Thanks again for the discussion!
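
A sketch of what the startup flags could look like; the flag names are illustrative and the defaults are the values proposed above (this would live in the binary's main package, which imports flag and time):

```go
// Illustrative only: flag names are made up; defaults are the values proposed
// in this thread (refresh 50ms, fetch timeout 1s, stale 5s).
var (
	refreshMetricsInterval = flag.Duration("refreshMetricsInterval", 50*time.Millisecond,
		"how often to refresh pod metrics")
	fetchMetricsTimeout = flag.Duration("metricsFetchTimeout", time.Second,
		"timeout for a single metrics fetch; longer than the refresh interval so slow backends can still respond")
	metricsStaleThreshold = flag.Duration("metricsStaleThreshold", 5*time.Second,
		"age beyond which a pod's metrics are considered stale")
)
```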

@@ -89,7 +117,7 @@ func (p *Provider) Init(refreshPodsInterval, refreshMetricsInterval time.Duration
 	go func() {
 		for {
 			time.Sleep(5 * time.Second)
-			klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetrics())
+			klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetricsIncludingStale())
Contributor

Can you also add another debug line for the stale metrics? It can be a klog.Warning.
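
One way the requested warning could look; AllStalePodMetrics is a hypothetical helper that filters AllPodMetricsIncludingStale using the staleness check:

```go
go func() {
	for {
		time.Sleep(5 * time.Second)
		klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetricsIncludingStale())
		// Hypothetical helper returning only entries that fail isPodMetricsStale.
		if stale := p.AllStalePodMetrics(); len(stale) > 0 {
			klog.Warningf("===DEBUG: Stale Pods and metrics: %+v", stale)
		}
	}
}()
```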

@spacewander
Contributor Author

@liu-cong
Thanks for your review!
I opened a new PR, #223, to address the issue, since we changed the direction for solving it.

Successfully merging this pull request may close these issues.

The metrics refresh time might be much larger than the refreshMetricsInterval