Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TAS: fix topology assignment for the RayJob's submitter Job #4341

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

mszadkow
Copy link
Contributor

@mszadkow mszadkow commented Feb 21, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix the bug which didn't respect the TAS annotations on the template for the Ray submitter Job.
We needed a sanity check e2e test that proves RayJob support with TAS.

Which issue(s) this PR fixes:

Fixes #3716

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix the bug which didn't respect the TAS annotations on the template for the Ray submitter Job.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 21, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszadkow
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 21, 2025
Copy link

netlify bot commented Feb 21, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 90412ea
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67c0976e4b4bb00008dfed64

@mszadkow mszadkow force-pushed the feature/e2e-tas-rayjob branch from 596cc28 to fee109f Compare February 21, 2025 08:10
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2025
@mszadkow mszadkow force-pushed the feature/e2e-tas-rayjob branch 2 times, most recently from 12d622e to 10b4701 Compare February 21, 2025 09:44
@gabesaba
Copy link
Contributor

/assign

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 24, 2025
@mszadkow mszadkow force-pushed the feature/e2e-tas-rayjob branch from 10b4701 to af12b3f Compare February 25, 2025 14:47
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 25, 2025
@mszadkow
Copy link
Contributor Author

@gabesaba it's rebased and ready for review

@mszadkow mszadkow force-pushed the feature/e2e-tas-rayjob branch from af12b3f to 22d367c Compare February 25, 2025 16:00
@gabesaba
Copy link
Contributor

/unassign

As I won't be able to take a look until next week

@mimowo
Copy link
Contributor

mimowo commented Feb 27, 2025

/retitle TAS: fix topology assignment for the RayJob's submitter Job

@k8s-ci-robot k8s-ci-robot changed the title [Feature] E2E tests for RayJob with TAS TAS: fix topology assignment for the RayJob's submitter Job Feb 27, 2025
@mimowo
Copy link
Contributor

mimowo commented Feb 27, 2025

/remove-kind feature
/kind bug
@mszadkow please update also PR description to say "/kind bug"

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2025
@k8s-ci-robot k8s-ci-robot removed the kind/feature Categorizes issue or PR as related to a new feature. label Feb 27, 2025
gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns)).To(gomega.Succeed())
util.ExpectObjectToBeDeleted(ctx, k8sClient, clusterQueue, true)
util.ExpectObjectToBeDeleted(ctx, k8sClient, tasFlavor, true)
util.ExpectObjectToBeDeleted(ctx, k8sClient, topology, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we should ensure all the pods are also deleted in the AfterEach to make sure they don't occupy the nodes affter tests. PTAL how this is done for other tests. cc @mbobrovskyi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be done by this: DeleteAllRayJobsInNamespace

@mszadkow mszadkow force-pushed the feature/e2e-tas-rayjob branch from 1a8cd39 to 2f30451 Compare February 27, 2025 14:32
ginkgo.By("verify the assignment of pods are as expected with rank-based ordering", func() {
gomega.Expect(k8sClient.List(ctx, pods, client.InNamespace(ns.Name))).To(gomega.Succeed())
workersAssignedNodes := readWorkersAssignedNodes(pods.Items)
gomega.Expect(workersAssignedNodes).Should(gomega.HaveLen(workerReplicas))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, could we keep the check on the actual node names and counts as before? It seems wasteful to compute them in the helper fuction, but not use to assert on. Unless it is flaky?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, it is going to be flaky for CPU based asserts because it will depend on some other Pods we don't have control over, so could change with kind version for example, hmm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, asserting on workers should be enough, but still, I would like to assert for the actual set of workers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, we can
no it's not flaky, it's just different from other types where we have replica/pod index label - where I could think of some general function with params

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I wouldn't bind to specific worker node numbers just if they were properly spread, I will rethink this slightly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thus let's stick to the set and check set size, that's it

@mszadkow mszadkow force-pushed the feature/e2e-tas-rayjob branch from f229159 to 90412ea Compare February 27, 2025 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TAS: Implement e2e tests for RayJob with TAS
4 participants