[Flake] Change image behavior of high-priority-group pod #4438

Open
wants to merge 1 commit into
base: main

mszadkow
Contributor

What type of PR is this?

/kind flake

What this PR does / why we need it:

Prevent the pods of the high-priority group from finishing too early.
Add control over when the pod group finishes.

Which issue(s) this PR fixes:

Fixes #4434

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 28, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszadkow
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 28, 2025

netlify bot commented Feb 28, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit fdd8fc7
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67c1a27f2cdd6d0008e6a44e
😎 Deploy Preview https://deploy-preview-4438--kubernetes-sigs-kueue.netlify.app

@@ -517,6 +517,10 @@ var _ = ginkgo.Describe("Pod groups", func() {
}, util.Timeout, util.Interval).Should(gomega.Succeed())
})

ginkgo.By("Call high priority group pods to complete", func() {
util.WaitForActivePodsAndTerminate(ctx, k8sClient, restClient, cfg, ns.Name, 2, 0)
Contributor

Actually, I don't think we should be terminating them via /exit, because the Pod is already being deleted due to preemption. So we just need to wait for the SIGKILL from the kubelet. To make it faster, we can specify spec.terminationGracePeriodSeconds. See #4434 (comment)

Contributor

Ah, sorry, this is already terminating the high-priority group. Makes sense.

Contributor Author

If we do not terminate them, they will never finish, and the replacement pods can't be ungated.

Contributor

yeah, got it.

@mimowo
Contributor

mimowo commented Feb 28, 2025

@mszadkow were you able to repro the issue locally before fix and confirm the code fixes it?

@mszadkow
Contributor Author

mszadkow commented Feb 28, 2025

> @mszadkow were you able to repro the issue locally before fix and confirm the code fixes it?

I repeated the test 100 times but did not catch the failure even once.
Hence the idea of changing the approach a little to make sure we have more control over the test.

@mimowo
Contributor

mimowo commented Feb 28, 2025

OK, but since you use the "BehaviorWaitForDeletion" command, isn't the "Check that the preempted pods are deleted" step now taking a long time because we need to wait 30s for SIGKILL? If that is the case, we could just limit the graceful termination period to 1s.

@mszadkow
Contributor Author

mszadkow commented Feb 28, 2025

> OK, but since you use the "BehaviorWaitForDeletion" command, isn't the "Check that the preempted pods are deleted" step now taking a long time because we need to wait 30s for SIGKILL? If that is the case, we could just limit the graceful termination period to 1s.

It's a different group (default) that I didn't touch.
But now that you mention it, I think we can decrease the time.
Instead of deleting the pods, we could use a different behaviour and send exit code 1; that would have the same effect, only faster.

Update:
I was wrong, since the deletion is triggered by the preemption, right?

Development

Successfully merging this pull request may close these issues.

Flaky Test: Pod groups when Single CQ should allow to preempt the lower priority group
3 participants