[k8s] better support for scale-to-zero autoscaling node pools #4935
base: master
Conversation
Force-pushed e8c6353 to 91fb916
Moving review comments to new locations after the last commit
sky/provision/kubernetes/utils.py (Outdated)

    container_service = gcp.build('container',
                                  'v1',
                                  credentials=credentials)
    cluster = container_service.projects().locations().clusters() \
TODO: return True if the API call fails due to credential issues or otherwise.
We may want to just wrap the whole thing in try/except and return True if there is any error.
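A minimal sketch of that conservative fallback, assuming a hypothetical helper name and that the fully qualified cluster resource name is already known; any failure (missing credentials, missing libraries, API errors) keeps the context eligible rather than excluding it:

    # Illustrative only; names here are assumptions, not the exact PR code.
    from sky.adaptors import gcp


    def _gke_cluster_may_fit(cluster_resource_name: str) -> bool:
        """Returns False only when we positively know no node pool can help."""
        try:
            import google.auth  # pylint: disable=import-outside-toplevel
            credentials, _ = google.auth.default()
            container_service = gcp.build('container', 'v1',
                                          credentials=credentials)
            cluster = container_service.projects().locations().clusters().get(
                name=cluster_resource_name).execute()
            # The real check would compare each node pool's machine type
            # against the job request; here we only look at autoscaling.
            return any(
                node_pool.get('autoscaling', {}).get('enabled', False)
                for node_pool in cluster.get('nodePools', []))
        except Exception:  # pylint: disable=broad-except
            # Credential issues, missing libraries, API errors: assume the
            # cluster may fit rather than wrongly excluding it.
            return True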
sky/provision/kubernetes/utils.py (Outdated)

    # disk_size = node_pool['config']['diskSizeGb']
    # print(f"vcpus: {vcpus}, mem: {mem}, diskSizeGb: {disk_size}, maxNodeCount: {max_node_count}")
kubernetes_utils.check_instance_fits (the function I modeled this one on) doesn't check disk size. I can get this information here but left it commented out for now. Is the lack of a disk-size check in kubernetes_utils.check_instance_fits intentional?
sky/provision/kubernetes/utils.py (Outdated)

    # pylint: disable=import-outside-toplevel
    import google.auth
We should avoid this.
We should also avoid crashing if the user doesn't have the relevant Python library installed. If you only pip install skypilot[kubernetes], you probably won't have it.
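A minimal sketch of that guard, assuming a hypothetical helper name; callers could then fall back to the pre-existing behavior (keep the context) instead of raising:

    def _gcp_libraries_available() -> bool:
        """Returns False when the GCP client libraries are not installed."""
        try:
            # Deferred imports: a user who only ran
            # `pip install "skypilot[kubernetes]"` will likely not have these.
            # pylint: disable=import-outside-toplevel,unused-import
            import google.auth
            import googleapiclient.discovery
        except ImportError:
            return False
        return True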
Force-pushed 40839da to c702207
    @@ -21,6 +21,7 @@
     from sky import models
     from sky import sky_logging
     from sky import skypilot_config
    +from sky.adaptors import gcp
Flag for self: potentially problematic if the user did not install skypilot[gcp].
Force-pushed 9505cf6 to 41001c4
    fits, reason = kubernetes_utils.check_instance_fits(
        context, instance_type)
Since the whole check takes some time, we may want to avoid going into the loop at all when we have an unsupported autoscaler, because we will eventually end up adding every region anyway. We can short-circuit in that case.
If we have an unsupported autoscaler, a nice improvement would be to reorder the contexts to put known-good contexts before unknown ones. That is, if a cluster already has a node that our job will fit on, we should try it before trying a zero-scaled cluster that might support our job.
That said, I don't know whether reordering the regions returned by this function will actually affect the order we try them in. My guess is no. It might need some additional plumbing in that case; that could be a follow-up.
The latter paragraph is a good point: even for supported autoscalers, if one context already has a suitable node and another would have to autoscale one, the former should be preferred.
But I agree that can be addressed as a follow-up. For now I implemented the short-circuit to return all regions in the case of an unsupported autoscaler.
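A minimal sketch of that short-circuit, with hypothetical names (SUPPORTED_AUTOSCALERS, autoscaler_may_fit); the actual PR wires this into the existing region-filtering code:

    from typing import Callable, List, Optional

    from sky.provision.kubernetes import utils as kubernetes_utils

    # Assumption for illustration: only the GKE autoscaler can be queried.
    SUPPORTED_AUTOSCALERS = {'gke'}


    def _feasible_contexts(
        contexts: List[str],
        instance_type: str,
        autoscaler_type: Optional[str],
        # Hypothetical hook, e.g. the GKE node-pool query.
        autoscaler_may_fit: Callable[[str, str], bool],
    ) -> List[str]:
        if (autoscaler_type is not None and
                autoscaler_type not in SUPPORTED_AUTOSCALERS):
            # Unsupported autoscaler: we cannot tell what it could scale up,
            # so skip the (slow) per-context checks and keep every context.
            return list(contexts)
        feasible = []
        for context in contexts:
            fits, _ = kubernetes_utils.check_instance_fits(
                context, instance_type)
            if fits or (autoscaler_type in SUPPORTED_AUTOSCALERS and
                        autoscaler_may_fit(context, instance_type)):
                feasible.append(context)
        return feasible

Preferring contexts that already have a ready node over scale-to-zero contexts would then be an ordering change inside this loop, which can land as the follow-up discussed above.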
Force-pushed 75b2f86 to 58e8251
Force-pushed cb2d04e to 4999bc1
Force-pushed 6151f03 to 0d999db
Currently, if a node autoscaler is configured in a k8s cluster, the only thing SkyPilot knows about the autoscaler is the configuration provided by the user. In particular, SkyPilot has no idea whether a node pool that has simply been autoscaled to zero contains a node type that could handle a job. Today, SkyPilot gets around this by submitting a pod to each context with the autoscaler enabled and seeing whether the pod is scheduled before a timeout.
While this approach is functional, it is inefficient because:
^ note on above: the k8s autoscaler configuration is global, not per-context. A per-context autoscaler config could also solve this specific bullet point.
This PR attempts to solve these challenges for the GKE autoscaler specifically. It does so by querying each context for its node pools, detecting whether any node pool has autoscaling configured, and checking whether a node that satisfies the job request could be spun up.
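As a rough illustration of that per-context query (not the exact code in this PR; all names are assumptions and the resource comparison is elided), the GKE Container API exposes each cluster's node pools with their autoscaling settings and machine type:

    import google.auth

    from sky.adaptors import gcp


    def _autoscalable_node_pools(project_id: str, location: str,
                                 cluster_name: str):
        """Yields node pools that GKE could scale up from zero."""
        credentials, _ = google.auth.default()
        container_service = gcp.build('container', 'v1',
                                      credentials=credentials)
        name = (f'projects/{project_id}/locations/{location}/'
                f'clusters/{cluster_name}')
        cluster = container_service.projects().locations().clusters().get(
            name=name).execute()
        for node_pool in cluster.get('nodePools', []):
            autoscaling = node_pool.get('autoscaling', {})
            if autoscaling.get('enabled') and autoscaling.get(
                    'maxNodeCount', 0) > 0:
                # node_pool['config']['machineType'] (plus any accelerators)
                # still has to be resolved to vCPU/memory/GPU counts and
                # compared against the job's resource request.
                yield node_pool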
Assumptions in code:
- The kubeconfig context name for a GKE cluster follows the default format: gke_PROJECT-ID_ZONE_CLUSTER-NAME
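A small illustrative parser for that assumption (not the exact code in the PR): the default context name written by gcloud container clusters get-credentials splits cleanly, since GCP project IDs, locations, and GKE cluster names cannot contain underscores:

    from typing import Optional, Tuple


    def _parse_gke_context(context: str) -> Optional[Tuple[str, str, str]]:
        """Returns (project_id, location, cluster_name), or None if not GKE."""
        # e.g. 'gke_my-project_us-central1-a_my-cluster'
        parts = context.split('_')
        if len(parts) != 4 or parts[0] != 'gke':
            return None
        _, project_id, location, cluster_name = parts
        return project_id, location, cluster_name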
Tested (run the relevant ones):
- bash format.sh
- /smoke-test (CI) or pytest tests/test_smoke.py (local)
- /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
- /quicktest-core (CI) or conda deactivate; bash -i tests/backward_compatibility_tests.sh (local)