[k8s] better support for scale-to-zero autoscaling node pools #4935

Draft
SeungjinYang wants to merge 13 commits into master from k8s-gke-autoscaler
Conversation

SeungjinYang (Collaborator) commented:

Currently, if a node autoscaler is configured in a Kubernetes cluster, the only thing SkyPilot knows about the autoscaler is the configuration provided by the user. In particular, SkyPilot cannot tell whether a node pool that has been autoscaled to zero contains a node type that could handle a given job. SkyPilot currently works around this by submitting a pod to each context with an autoscaler enabled and checking whether the pod is scheduled before a timeout.

While this approach is functional, it is inefficient because:

  • A context (= cluster) may have an autoscaling node pool, but that pool may not provide the node type needed to satisfy the request. For example, an autoscaler may manage a node pool of A100 GPU VMs; SkyPilot only knows that an autoscaler group exists, so it will still try to launch H100 resources there.
  • The autoscaling node pool may have the correct accelerator type but a different accelerator count, or different CPU/memory constraints. For example, a node pool that spins up VMs with 1 A100 cannot handle a launch request for A100:8, but again SkyPilot does not know that.
  • If only some of the allowed contexts have autoscalers, SkyPilot has no way to know which ones do, so it may try to schedule a pod on a context without an autoscaler that cannot schedule that pod.
    (Note on the above: the Kubernetes autoscaler configuration is global, not per-context. A per-context autoscaler config could also solve this specific point.)

This PR addresses these challenges for the GKE autoscaler specifically. It queries each context for its node pools, detects whether any node pool has autoscaling configured, and checks whether a node could be spun up that satisfies the job request.

Assumptions in code:

  • The context name follows the gke_PROJECT-ID_ZONE_CLUSTER-NAME convention.
  • The user has GCP auth set up so that SkyPilot can query GKE cluster details.
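
To make the approach concrete, here is a minimal sketch of the idea, not the PR's actual code: parse the GKE context name, fetch the cluster's node pools through the GKE REST API, and check whether an autoscaling pool could provide the requested accelerators. The helper name and the naive accelerator-type matching below are illustrative only.

import re

from googleapiclient import discovery  # shipped with the GCP extras

# Matches the assumed convention gke_PROJECT-ID_ZONE_CLUSTER-NAME.
_GKE_CONTEXT_RE = re.compile(
    r'^gke_(?P<project>[^_]+)_(?P<location>[^_]+)_(?P<cluster>[^_]+)$')


def gke_pool_may_fit(context: str, acc_type: str, acc_count: int) -> bool:
    """Best-effort check: could an autoscaling pool provide the accelerators?"""
    match = _GKE_CONTEXT_RE.match(context)
    if match is None:
        return True  # Not a GKE-style context name; assume it may fit.
    name = (f"projects/{match['project']}/locations/{match['location']}"
            f"/clusters/{match['cluster']}")
    try:
        container = discovery.build('container', 'v1')
        cluster = container.projects().locations().clusters().get(
            name=name).execute()
    except Exception:  # pylint: disable=broad-except
        return True  # Credential or API failure: fall back to optimism.
    for pool in cluster.get('nodePools', []):
        if not pool.get('autoscaling', {}).get('enabled', False):
            continue  # Pool cannot scale up on its own.
        for acc in pool.get('config', {}).get('accelerators', []):
            # GKE reports types such as 'nvidia-tesla-a100'; the substring
            # match below is a simplification for illustration only.
            if (acc_type.lower() in acc.get('acceleratorType', '') and
                    int(acc.get('acceleratorCount', 0)) >= acc_count):
                return True
    return False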

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or conda deactivate; bash -i tests/backward_compatibility_tests.sh (local)

SeungjinYang (Collaborator, Author) left a comment:

Moving review comments to new locations after the last commit

container_service = gcp.build('container',
                              'v1',
                              credentials=credentials)
cluster = container_service.projects().locations().clusters() \
SeungjinYang (Collaborator, Author) commented:

TODO: return True if the API call fails due to credential issues or otherwise.

A collaborator replied:

We may want to just wrap the whole thing in try/catch and return True if there is any error.
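
For instance, that suggestion could take roughly this shape (a sketch with a hypothetical decorator name, not code from this PR):

import functools
import logging

logger = logging.getLogger(__name__)


def _assume_scalable_on_error(fn):
    """If the GKE query fails for any reason, assume the pool can scale up."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:  # pylint: disable=broad-except
            logger.debug('GKE autoscaler check failed (%s); assuming the '
                         'node pool can satisfy the request.', e)
            return True

    return wrapper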

Comment on lines 653 to 654
# disk_size = node_pool['config']['diskSizeGb']
# print(f"vcpus: {vcpus}, mem: {mem}, diskSizeGb: {disk_size}, maxNodeCount: {max_node_count}")
SeungjinYang (Collaborator, Author) commented:

kubernetes_utils.check_instance_fits (the function this one is modeled on) doesn't check disk size. I can get that information here but have left it commented out for now. Is the lack of a disk-size check in kubernetes_utils.check_instance_fits intentional?

Comment on lines 604 to 615
# pylint: disable=import-outside-toplevel
import google.auth
A collaborator commented:

We should avoid this.

A collaborator commented:

We should also avoid crashing if the user doesn't have the relevant python library installed. If you only pip install skypilot[kubernetes] you probably won't have it.
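
One way to address both comments is a guarded, lazy import that degrades gracefully (a sketch; the helper name is hypothetical, not the PR's code):

def _gke_credentials_available() -> bool:
    """Return False when the GKE autoscaler query cannot even be attempted."""
    try:
        # pylint: disable=import-outside-toplevel
        # Lazy import so a bare `pip install "skypilot[kubernetes]"` does not
        # crash; google-auth is only installed with the GCP extras.
        import google.auth
    except ImportError:
        return False  # Caller falls back to the existing pod-probing path.
    try:
        google.auth.default()
    except Exception:  # pylint: disable=broad-except
        return False  # No usable application-default credentials.
    return True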

SeungjinYang force-pushed the k8s-gke-autoscaler branch 3 times, most recently from 40839da to c702207, on March 12, 2025 at 21:56.
@@ -21,6 +21,7 @@
from sky import models
from sky import sky_logging
from sky import skypilot_config
from sky.adaptors import gcp
SeungjinYang (Collaborator, Author) commented:

Flag for self: potentially problematic if the user did not install skypilot[gcp].

SeungjinYang force-pushed the k8s-gke-autoscaler branch 2 times, most recently from 9505cf6 to 41001c4, on March 13, 2025 at 00:18.
Comment on lines +240 to +250
fits, reason = kubernetes_utils.check_instance_fits(
    context, instance_type)
A collaborator commented:

Since the whole thing takes some time, we may want to avoid going into the loop when we have an unsupported autoscaler, since we will eventually end up adding every region anyway. We can short-circuit in that case.

If we have an unsupported autoscaler, a nice improvement would be to reorder the contexts to put known-good contexts before unknown ones. That is, if a cluster already has a node that our job will fit on, we should try it before trying a zero-scaled cluster that might support our job.
That said, I don't know whether reordering the regions returned by this function will actually affect the order we try them in; my guess is no. It might need some additional plumbing in that case, which could be a follow-up.

SeungjinYang (Collaborator, Author) replied on Mar 13, 2025:

The latter paragraph is a good point: even for supported autoscalers, if one context already has a suitable node and another would have to autoscale one, the former should be preferred.
But I agree that can be addressed as a follow-up. For now, I implemented the short-circuit to return all regions in the case of an unsupported autoscaler.
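
Roughly the shape of that short-circuit, for context (helper names other than check_instance_fits are illustrative, and the import path may differ from the PR):

from typing import List, Optional

from sky.provision.kubernetes import utils as kubernetes_utils


def candidate_contexts(contexts: List[str], instance_type: str,
                       autoscaler_type: Optional[str]) -> List[str]:
    if autoscaler_type is not None and autoscaler_type != 'gke':
        # Unsupported autoscaler: its scaled-to-zero pools cannot be
        # inspected, so every context is a candidate. Skip the slow loop.
        return list(contexts)
    fitting = []
    for context in contexts:
        fits, _ = kubernetes_utils.check_instance_fits(context, instance_type)
        if fits:
            fitting.append(context)
        elif autoscaler_type == 'gke' and gke_pool_may_fit_instance(
                context, instance_type):
            # gke_pool_may_fit_instance is a hypothetical helper along the
            # lines of the sketch in the PR description above.
            fitting.append(context)
    return fitting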
