[k8s] better support for scale-to-zero autoscaling node pools #4935

Draft
SeungjinYang wants to merge 13 commits into master from k8s-gke-autoscaler
Conversation

SeungjinYang (Collaborator) commented:

Currently, if a node autoscaler is configured in a Kubernetes cluster, the only thing SkyPilot knows about the autoscaler is the configuration provided by the user. In particular, SkyPilot cannot tell whether a node pool that has been autoscaled to zero contains a node type that could handle a given job. SkyPilot currently works around this by submitting a pod to each context with an autoscaler enabled and checking whether the pod is scheduled before a timeout.

While this approach is functional, it is inefficient because:

  • A context (= cluster) may have an autoscaling node pool, but that pool may not provide the node type needed to satisfy the request. For example, an autoscaler may manage a node pool of A100 GPU VMs; SkyPilot only knows that an autoscaler group exists, so it will still try to launch H100 resources there.
  • The autoscaling node pool may have the correct accelerator type but a different accelerator count, or different CPU/memory constraints. For example, a node pool that spins up VMs with 1 A100 cannot handle a launch request for A100:8, but again SkyPilot does not know that.
  • If only some of the allowed contexts have autoscalers, SkyPilot has no way to know which ones do, so it may try to schedule a pod on a context without an autoscaler that cannot schedule that pod.
    (Note on the above: the Kubernetes autoscaler configuration is global, not per-context. A per-context autoscaler config could also solve this specific point.)

This PR addresses these challenges for the GKE autoscaler specifically. It queries each context for its node pools, detects whether any node pool has autoscaling configured, and checks whether a node could be spun up that satisfies the job request.

Assumptions in code:

  • The context name follows the gke_PROJECT-ID_ZONE_CLUSTER-NAME convention.
  • The user has GCP auth set up so that SkyPilot can query GKE cluster details.
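
To make the approach concrete, here is a minimal sketch of the idea, not the PR's actual code: parse the GKE context name, fetch the cluster's node pools through the GKE REST API, and check whether an autoscaling pool could provide the requested accelerators. The helper name and the naive accelerator-type matching below are illustrative only.

import re

from googleapiclient import discovery  # shipped with the GCP extras

# Matches the assumed convention gke_PROJECT-ID_ZONE_CLUSTER-NAME.
_GKE_CONTEXT_RE = re.compile(
    r'^gke_(?P<project>[^_]+)_(?P<location>[^_]+)_(?P<cluster>[^_]+)$')


def gke_pool_may_fit(context: str, acc_type: str, acc_count: int) -> bool:
    """Best-effort check: could an autoscaling pool provide the accelerators?"""
    match = _GKE_CONTEXT_RE.match(context)
    if match is None:
        return True  # Not a GKE-style context name; assume it may fit.
    name = (f"projects/{match['project']}/locations/{match['location']}"
            f"/clusters/{match['cluster']}")
    try:
        container = discovery.build('container', 'v1')
        cluster = container.projects().locations().clusters().get(
            name=name).execute()
    except Exception:  # pylint: disable=broad-except
        return True  # Credential or API failure: fall back to optimism.
    for pool in cluster.get('nodePools', []):
        if not pool.get('autoscaling', {}).get('enabled', False):
            continue  # Pool cannot scale up on its own.
        for acc in pool.get('config', {}).get('accelerators', []):
            # GKE reports types such as 'nvidia-tesla-a100'; the substring
            # match below is a simplification for illustration only.
            if (acc_type.lower() in acc.get('acceleratorType', '') and
                    int(acc.get('acceleratorCount', 0)) >= acc_count):
                return True
    return False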

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or conda deactivate; bash -i tests/backward_compatibility_tests.sh (local)

SeungjinYang (Collaborator, Author) left a comment:

Moving review comments to new locations after the last commit

container_service = gcp.build('container',
                              'v1',
                              credentials=credentials)
cluster = container_service.projects().locations().clusters() \
SeungjinYang (Collaborator, Author) commented:

TODO: return True if the API call fails due to credential issues or otherwise.

A collaborator replied:

We may want to just wrap the whole thing in try/catch and return True if there is any error.
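
For instance, that suggestion could take roughly this shape (a sketch with a hypothetical decorator name, not code from this PR):

import functools
import logging

logger = logging.getLogger(__name__)


def _assume_scalable_on_error(fn):
    """If the GKE query fails for any reason, assume the pool can scale up."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:  # pylint: disable=broad-except
            logger.debug('GKE autoscaler check failed (%s); assuming the '
                         'node pool can satisfy the request.', e)
            return True

    return wrapper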

Comment on lines 653 to 654
# disk_size = node_pool['config']['diskSizeGb']
# print(f"vcpus: {vcpus}, mem: {mem}, diskSizeGb: {disk_size}, maxNodeCount: {max_node_count}")
SeungjinYang (Collaborator, Author) commented:

kubernetes_utils.check_instance_fits (the function this one is modeled on) doesn't check disk size. I can get that information here but have left it commented out for now. Is the lack of a disk-size check in kubernetes_utils.check_instance_fits intentional?

Comment on lines 604 to 615
# pylint: disable=import-outside-toplevel
import google.auth
A collaborator commented:

We should avoid this.

A collaborator commented:

We should also avoid crashing if the user doesn't have the relevant python library installed. If you only pip install skypilot[kubernetes] you probably won't have it.
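
One way to address both comments is a guarded, lazy import that degrades gracefully (a sketch; the helper name is hypothetical, not the PR's code):

def _gke_credentials_available() -> bool:
    """Return False when the GKE autoscaler query cannot even be attempted."""
    try:
        # pylint: disable=import-outside-toplevel
        # Lazy import so a bare `pip install "skypilot[kubernetes]"` does not
        # crash; google-auth is only installed with the GCP extras.
        import google.auth
    except ImportError:
        return False  # Caller falls back to the existing pod-probing path.
    try:
        google.auth.default()
    except Exception:  # pylint: disable=broad-except
        return False  # No usable application-default credentials.
    return True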

SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from 40839da to c702207, on March 12, 2025 at 21:56.
@@ -21,6 +21,7 @@
from sky import models
from sky import sky_logging
from sky import skypilot_config
from sky.adaptors import gcp
SeungjinYang (Collaborator, Author) commented:

Flag for self: potentially problematic if the user did not install skypilot[gcp].

SeungjinYang force-pushed the k8s-gke-autoscaler branch 2 times, most recently from 9505cf6 to 41001c4, on March 13, 2025 at 00:18.
Comment on lines +240 to +250
fits, reason = kubernetes_utils.check_instance_fits(
    context, instance_type)
A collaborator commented:

Since the whole thing takes some time, we may want to avoid going into the loop when we have an unsupported autoscaler, since we will eventually end up adding every region anyway. We can short-circuit in that case.

If we have an unsupported autoscaler, a nice improvement would be to reorder the contexts to put known-good contexts before unknown ones. That is, if a cluster already has a node that our job will fit on, we should try it before trying a zero-scaled cluster that might support our job.
That said, I don't know whether reordering the regions returned by this function will actually affect the order we try them in; my guess is no. It might need some additional plumbing in that case, which could be a follow-up.

SeungjinYang (Collaborator, Author) replied on Mar 13, 2025:

The latter paragraph is a good point: even for supported autoscalers, if one context already has a suitable node and another would have to autoscale one, the former should be preferred.
But I agree that can be addressed as a follow-up. For now, I implemented the short-circuit to return all regions in the case of an unsupported autoscaler.
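
Roughly the shape of that short-circuit, for context (helper names other than check_instance_fits are illustrative, and the import path may differ from the PR):

from typing import List, Optional

from sky.provision.kubernetes import utils as kubernetes_utils


def candidate_contexts(contexts: List[str], instance_type: str,
                       autoscaler_type: Optional[str]) -> List[str]:
    if autoscaler_type is not None and autoscaler_type != 'gke':
        # Unsupported autoscaler: its scaled-to-zero pools cannot be
        # inspected, so every context is a candidate. Skip the slow loop.
        return list(contexts)
    fitting = []
    for context in contexts:
        fits, _ = kubernetes_utils.check_instance_fits(context, instance_type)
        if fits:
            fitting.append(context)
        elif autoscaler_type == 'gke' and gke_pool_may_fit_instance(
                context, instance_type):
            # gke_pool_may_fit_instance is a hypothetical helper along the
            # lines of the sketch in the PR description above.
            fitting.append(context)
    return fitting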
