
ensure metrics refresh time <= refreshMetricsInterval #207

Closed
wants to merge 1 commit

Conversation

spacewander
Contributor

Fix #99

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: spacewander
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 20, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 20, 2025
@k8s-ci-robot
Contributor

Hi @spacewander. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 20, 2025

netlify bot commented Jan 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: d3f5765
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/678e545fe707bb00080e832d
😎 Deploy Preview: https://deploy-preview-207--gateway-api-inference-extension.netlify.app

@spacewander spacewander marked this pull request as ready for review January 20, 2025 13:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kfswain January 20, 2025 13:49
	}
	defer func() {
		_ = resp.Body.Close()
	}()

	if resp.StatusCode != http.StatusOK {
		klog.Errorf("unexpected status code from %s: %v", pod, resp.StatusCode)
Contributor Author

The returned error will be gathered into errCh and eventually logged outside this function:

errCh <- fmt.Errorf("failed to parse metrics from %s: %v", pod, err)

I removed the error log here so it won't be logged twice.
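
For reference, a minimal sketch of the error-gathering pattern described here; refreshAll and fetchOne are invented names standing in for the PR's surrounding code, and only the errCh send mirrors the actual diff:

```go
import (
	"context"
	"fmt"
	"log"
	"sync"
)

// refreshAll fetches metrics for every pod concurrently. Workers do not log
// their own failures; each error is sent to errCh and logged exactly once
// after all workers finish.
func refreshAll(ctx context.Context, pods []string, fetchOne func(context.Context, string) error) {
	errCh := make(chan error, len(pods)) // buffered so workers never block on send
	var wg sync.WaitGroup
	for _, pod := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			if err := fetchOne(ctx, pod); err != nil {
				errCh <- fmt.Errorf("failed to fetch metrics from %s: %w", pod, err)
			}
		}(pod)
	}
	wg.Wait()
	close(errCh)
	for err := range errCh {
		log.Printf("metrics refresh error: %v", err) // the single logging site
	}
}
```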

@@ -49,22 +49,23 @@ func (p *PodMetricsClientImpl) FetchMetrics(
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		klog.Errorf("failed to fetch metrics from %s: %v", pod, err)
Contributor

		return nil, fmt.Errorf("failed to fetch metrics from %s: %w", pod, err)
		// Because we use a short fetch timeout to keep the metrics up to date, there will be many
		// timeout errors even if only 0.1% of requests time out.
		// Return the raw error so that the caller can filter it out via errors.Is(err, context.Canceled).
Contributor

The %w works with errors.Is, right? I don't think there is a need to change this.

Contributor Author

Thanks for pointing that out. I'll fix it in the next commit.
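
For reference, a small runnable example showing that an error wrapped with %w is still matched by errors.Is (the pod name here is a placeholder):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the refresh context being canceled

	// Wrapping with %w keeps the original error in the chain...
	err := fmt.Errorf("failed to fetch metrics from %s: %w", "pod-a", ctx.Err())
	// ...so errors.Is can still identify it.
	fmt.Println(errors.Is(err, context.Canceled)) // prints: true
}
```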

@@ -35,23 +32,46 @@ type PodMetricsClient interface {
	FetchMetrics(ctx context.Context, pod Pod, existing *PodMetrics) (*PodMetrics, error)
}

func isPodMetricsStale(pm *PodMetrics) bool {
	// TODO: make it configurable
	return time.Since(pm.UpdatedTime) > 5*time.Second
Contributor

nit: Can we make the 5 seconds a constant?
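
A possible shape for that suggestion, within the same file (which already imports time); the constant name is illustrative:

```go
// metricsStaleThreshold bounds how old a pod's metrics may be before they are
// treated as stale. (Name is illustrative; the PR's TODO: make it configurable.)
const metricsStaleThreshold = 5 * time.Second

func isPodMetricsStale(pm *PodMetrics) bool {
	return time.Since(pm.UpdatedTime) > metricsStaleThreshold
}
```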

	p.podMetrics.Store(pod, pm)
}

func (p *Provider) GetPodMetrics(pod Pod) (*PodMetrics, bool) {
	val, ok := p.podMetrics.Load(pod)
	if ok {
		// For now, we don't exclude stale metrics with GET operation.
Contributor

Can you create an issue to track this? We may want to deprioritize or skip a pod if its metrics are stale in the future.
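
If that future change lands, it could look roughly like the sketch below (purely illustrative, not part of this PR; it assumes p.podMetrics is a sync.Map holding *PodMetrics values):

```go
func (p *Provider) GetPodMetrics(pod Pod) (*PodMetrics, bool) {
	val, ok := p.podMetrics.Load(pod)
	if !ok {
		return nil, false
	}
	pm := val.(*PodMetrics)
	// Hypothetical future behavior: treat stale metrics as missing so callers
	// can deprioritize or skip this pod.
	if isPodMetricsStale(pm) {
		return nil, false
	}
	return pm, true
}
```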

-func (p *Provider) refreshMetricsOnce() error {
-	ctx, cancel := context.WithTimeout(context.Background(), fetchMetricsTimeout)
+func (p *Provider) refreshMetricsOnce(interval time.Duration) error {
+	ctx, cancel := context.WithTimeout(context.Background(), interval)
Contributor

I think we should allow a timeout longer than the interval, so that slow backends still get a chance to return metrics. Otherwise we have to pick a reasonable interval very carefully for each environment.

I think the fix should be: each pod has its own independent refresh loop.
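
A rough sketch of the per-pod refresh loop idea, not this PR's implementation; startPodRefreshLoop, stopCh, p.pmc, and p.UpdatePodMetrics are invented wiring, while fetchMetricsTimeout and FetchMetrics come from the existing code:

```go
func (p *Provider) startPodRefreshLoop(pod Pod, interval time.Duration, stopCh <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stopCh:
				return
			case <-ticker.C:
				// Give the fetch a timeout longer than the interval so slow
				// backends can still answer; a late response beats none.
				ctx, cancel := context.WithTimeout(context.Background(), fetchMetricsTimeout)
				existing, _ := p.GetPodMetrics(pod)
				updated, err := p.pmc.FetchMetrics(ctx, pod, existing)
				cancel()
				if err != nil {
					klog.V(4).Infof("failed to refresh metrics for %s: %v", pod, err)
					continue
				}
				p.UpdatePodMetrics(pod, updated)
			}
		}
	}()
}
```

A slow pod then only delays its own loop instead of the whole refresh pass, at the cost of the cross-pod simultaneity discussed below.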

Contributor Author

I previously considered having each pod obtain metrics independently, but couldn't figure out how to adapt to the fact that pods are dynamically added.

Since pods are added dynamically and metrics are collected at fixed intervals, the samples won't be simultaneous across pods when we collect all pod metrics (AllPodMetrics) at a given point in time.

Consider the following scenario, where metrics are collected once per time unit, and the queue length increases by one every half-time unit:

Pod A is added starting at 0, metrics are collected at 0, 1, and 2, and by point 2, its queue length is 4.

Pod B is added starting at 0.5, metrics are collected at 0.5 and 1.5, and by point 2, its queue length is 3.

However, if we make a decision at point 2, using metrics collected at points 2 and 1.5, we get values of 4 and 2, which doesn't reflect the actual situation.

For slow backends, both approaches behave the same, because metrics obtained from slow backends are delayed either way. While collecting metrics independently for each pod could prevent slow backends from slowing down the overall refresh, it would sacrifice the simultaneity of the metrics from normal backends. Since the vast majority of backends are normal, it's better to stick with the existing approach.

Contributor

IMO there are two separate things: 1. The metrics refresh interval, which is supposed to refresh metrics as fast as possible and is optimistic; it's OK if some backends are sometimes not fast enough to keep up with the interval. 2. The metrics API call timeout, which needs to be long enough to cover non-ideal situations where the metrics API call takes longer than normal. A slow API response is better than no response.

My concern with the current implementation is that the metrics call fails if it times out at the interval. This can happen, for example, in an environment with slow networking, and it forces the operator to pick the interval very carefully. Even then, degraded networking conditions can lead to no metrics refresh at all.

I don't think it's a big issue that pod metrics are not refreshed synchronously. Even if they were refreshed synchronously, there would always be a delay between the real-time state and what the ext proc sees.

Contributor Author

@liu-cong
I agree with you. What do you think about the following duration settings?
Refresh interval: 50ms
Fetch timeout: 1s
Stale time: 5s

That gives us a staleness limit to filter out pods that don't have fresh data, and for slow backends we avoid aborting abnormally and still get a chance to retry.

Contributor

We should add some metrics on the metrics API latency to make a more educated guess, but your proposed numbers look good to me for now. And we should make them configurable (adding them as startup flags is OK to begin with). Thanks again for the discussion!
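
A sketch of what the startup flags could look like; the flag names are illustrative and the defaults are the values proposed above (this would live in the binary's main package, which imports flag and time):

```go
// Illustrative only: flag names are made up; defaults are the values proposed
// in this thread (refresh 50ms, fetch timeout 1s, stale 5s).
var (
	refreshMetricsInterval = flag.Duration("refreshMetricsInterval", 50*time.Millisecond,
		"how often to refresh pod metrics")
	fetchMetricsTimeout = flag.Duration("metricsFetchTimeout", time.Second,
		"timeout for a single metrics fetch; longer than the refresh interval so slow backends can still respond")
	metricsStaleThreshold = flag.Duration("metricsStaleThreshold", 5*time.Second,
		"age beyond which a pod's metrics are considered stale")
)
```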

@@ -89,7 +117,7 @@ func (p *Provider) Init(refreshPodsInterval, refreshMetricsInterval time.Duration
 	go func() {
 		for {
 			time.Sleep(5 * time.Second)
-			klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetrics())
+			klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetricsIncludingStale())
Contributor

Can you also add another debug line for the stale metrics? It can be a klog.Warning.
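
One way the requested warning could look; AllStalePodMetrics is a hypothetical helper that filters AllPodMetricsIncludingStale using the staleness check:

```go
go func() {
	for {
		time.Sleep(5 * time.Second)
		klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetricsIncludingStale())
		// Hypothetical helper returning only entries that fail isPodMetricsStale.
		if stale := p.AllStalePodMetrics(); len(stale) > 0 {
			klog.Warningf("===DEBUG: Stale Pods and metrics: %+v", stale)
		}
	}
}()
```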

@spacewander
Contributor Author

@liu-cong
Thanks for your review!
I opened a new PR, #223, to address the issue, since we changed the direction for solving it.

Successfully merging this pull request may close these issues.

The metrics refresh time might be much larger than the refreshMetricsInterval