TAS: fix topology assignment for the RayJob's submitter Job #4341
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mszadkow
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
596cc28 to fee109f
12d622e to 10b4701
/assign
10b4701 to af12b3f
@gabesaba it's rebased and ready for review
af12b3f to 22d367c
/unassign
As I won't be able to take a look until next week.
8c51803 to 1a8cd39
/retitle TAS: fix topology assignment for the RayJob's submitter Job
/remove-kind feature
gomega.Expect(util.DeleteNamespace(ctx, k8sClient, ns)).To(gomega.Succeed())
util.ExpectObjectToBeDeleted(ctx, k8sClient, clusterQueue, true)
util.ExpectObjectToBeDeleted(ctx, k8sClient, tasFlavor, true)
util.ExpectObjectToBeDeleted(ctx, k8sClient, topology, true)
IIRC we should ensure all the pods are also deleted in the AfterEach, so that they don't occupy the nodes after the tests. PTAL at how this is done for other tests. cc @mbobrovskyi
This will be done by DeleteAllRayJobsInNamespace.
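The cleanup concern in this thread can be illustrated with a small standalone sketch: after the owning jobs are deleted, poll until no pods remain, so that leftover pods do not occupy nodes in later tests. This is not Kueue's helper code. `waitForNoPods` and the `count` callback are hypothetical names introduced for illustration, and the real suite would presumably rely on its own utilities and gomega's polling assertions instead.

```go
package main

import (
	"fmt"
	"time"
)

// waitForNoPods polls count up to `attempts` times, sleeping `interval`
// between polls, and returns nil once count reports zero pods.
// Hypothetical helper: count stands in for listing pods in the namespace.
func waitForNoPods(count func() (int, error), attempts int, interval time.Duration) error {
	last := 0
	for i := 0; i < attempts; i++ {
		n, err := count()
		if err != nil {
			return fmt.Errorf("listing pods: %w", err)
		}
		if n == 0 {
			return nil
		}
		last = n
		time.Sleep(interval)
	}
	return fmt.Errorf("%d pods still present after %d attempts", last, attempts)
}

func main() {
	// Fake counter: one pod disappears on every poll.
	remaining := 3
	fake := func() (int, error) {
		if remaining > 0 {
			remaining--
		}
		return remaining, nil
	}
	if err := waitForNoPods(fake, 5, 10*time.Millisecond); err != nil {
		fmt.Println("cleanup incomplete:", err)
		return
	}
	fmt.Println("namespace is free of pods")
}
```

Bounding the number of attempts keeps a stuck pod from hanging the suite; the returned error reports how many pods were still present.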
1a8cd39 to 2f30451
2f30451 to f229159
test/e2e/tas/rayjob_test.go
Outdated
ginkgo.By("verify the assignment of pods are as expected with rank-based ordering", func() {
gomega.Expect(k8sClient.List(ctx, pods, client.InNamespace(ns.Name))).To(gomega.Succeed())
workersAssignedNodes := readWorkersAssignedNodes(pods.Items)
gomega.Expect(workersAssignedNodes).Should(gomega.HaveLen(workerReplicas))
Actually, could we keep the check on the actual node names and counts as before? It seems wasteful to compute them in the helper function but not use them in the assertions. Unless it is flaky?
Ah, yes, it is going to be flaky for CPU-based asserts because it will depend on some other Pods we don't have control over, so it could change with the kind version, for example, hmm.
So, asserting on workers should be enough, but still, I would like to assert for the actual set of workers.
Sure, we can.
No, it's not flaky; it's just different from the other types, where we have a replica/pod index label and I could think of some general function with params.
Right, I wouldn't bind to specific worker node numbers, just check that they were properly spread. I will rethink this slightly.
thus let's stick to the set and check set size, that's it
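The conclusion of this thread (collect the set of nodes that host worker pods and assert only on its size) can be sketched in plain Go. This is not the body of `readWorkersAssignedNodes` from the PR, which is not shown here. The `pod` struct is a hypothetical simplified stand-in for `corev1.Pod`, and the `ray.io/node-type` label with the value `worker` is an assumption about how worker pods are identified.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, simplified stand-in for corev1.Pod that keeps only
// the fields this sketch needs.
type pod struct {
	Name     string
	NodeName string
	Labels   map[string]string
}

// workerNodeSet returns the set of node names that host worker pods.
// Assumption: workers carry the label "ray.io/node-type" = "worker".
func workerNodeSet(pods []pod) map[string]struct{} {
	nodes := make(map[string]struct{})
	for _, p := range pods {
		if p.Labels["ray.io/node-type"] != "worker" {
			continue // head or submitter pod
		}
		if p.NodeName == "" {
			continue // not scheduled yet
		}
		nodes[p.NodeName] = struct{}{}
	}
	return nodes
}

// sortedKeys returns the set members in a stable order for printing.
func sortedKeys(set map[string]struct{}) []string {
	keys := make([]string, 0, len(set))
	for k := range set {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	const workerReplicas = 2
	pods := []pod{
		{Name: "head", NodeName: "kind-worker", Labels: map[string]string{"ray.io/node-type": "head"}},
		{Name: "worker-a", NodeName: "kind-worker2", Labels: map[string]string{"ray.io/node-type": "worker"}},
		{Name: "worker-b", NodeName: "kind-worker3", Labels: map[string]string{"ray.io/node-type": "worker"}},
	}
	nodes := workerNodeSet(pods)
	// The size check passes only if every worker landed on a distinct node.
	fmt.Println(len(nodes) == workerReplicas, sortedKeys(nodes))
}
```

Checking only the set size avoids binding the test to specific node names, which may shift with other pods on the cluster, while still failing when two workers share a node.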
f229159 to 90412ea
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fix the bug where the TAS annotations on the template for the Ray submitter Job were not respected.
We also needed a sanity-check e2e test that proves RayJob support with TAS.
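As a rough illustration of what "respecting the TAS annotations on the template" means, the sketch below reads a topology request from a pod template's annotations and applies the same lookup to every pod set, including the submitter. It is not the code from this PR. The annotation keys are assumed to be Kueue's `kueue.x-k8s.io/podset-required-topology` and `kueue.x-k8s.io/podset-preferred-topology`, and `topologyRequest` and `topologyRequestFrom` are hypothetical names.

```go
package main

import "fmt"

// Assumed TAS annotation keys; check them against the Kueue version in use.
const (
	requiredTopologyAnnotation  = "kueue.x-k8s.io/podset-required-topology"
	preferredTopologyAnnotation = "kueue.x-k8s.io/podset-preferred-topology"
)

// topologyRequest is a hypothetical, simplified view of a pod set's
// topology request.
type topologyRequest struct {
	Level    string // topology level, e.g. a node label key
	Required bool   // true for "required", false for "preferred"
}

// topologyRequestFrom extracts the topology request from pod template
// annotations. The second result is false when no TAS annotation is set.
func topologyRequestFrom(annotations map[string]string) (topologyRequest, bool) {
	if level := annotations[requiredTopologyAnnotation]; level != "" {
		return topologyRequest{Level: level, Required: true}, true
	}
	if level := annotations[preferredTopologyAnnotation]; level != "" {
		return topologyRequest{Level: level}, true
	}
	return topologyRequest{}, false
}

func main() {
	// Hypothetical pod templates of a RayJob, in a fixed order.
	templates := []struct {
		podSet      string
		annotations map[string]string
	}{
		{"head", map[string]string{requiredTopologyAnnotation: "kubernetes.io/hostname"}},
		{"workers", map[string]string{preferredTopologyAnnotation: "kubernetes.io/hostname"}},
		{"submitter", map[string]string{requiredTopologyAnnotation: "kubernetes.io/hostname"}},
	}
	for _, t := range templates {
		req, ok := topologyRequestFrom(t.annotations)
		fmt.Printf("%s: request=%+v found=%v\n", t.podSet, req, ok)
	}
}
```

The point of the sketch is only that the submitter's template goes through the same lookup as the head and worker templates, rather than being skipped.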
Which issue(s) this PR fixes:
Fixes #3716
Special notes for your reviewer:
Does this PR introduce a user-facing change?