[API server] cleanup executor processes on shutdown #4912

Open
wants to merge 5 commits into
base: master
from

aylei
Collaborator

@aylei aylei commented Mar 7, 2025

Closes #4894

SIGTERM handling matters not only for sky cancel, but also because Kubernetes sends SIGTERM to restart a failed server when the liveness probe fails. In addition, some frameworks on k8s, such as OpenKruise, send SIGTERM to restart the Pod in place when updating the container image.

This also closes #4856 by properly interrupting the executors.
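For readers unfamiliar with the pattern, here is a minimal sketch of cleaning up child processes on SIGTERM. It is not the code in this PR: `terminate_children` and `install_sigterm_cleanup` are hypothetical names, and it assumes the executors are plain `subprocess.Popen` children on a POSIX system.

```python
import signal
import subprocess
import sys
from typing import List


def terminate_children(children: List[subprocess.Popen],
                       timeout: float = 5.0) -> None:
    """Ask every live child to exit, then force-kill any stragglers."""
    for child in children:
        if child.poll() is None:
            child.terminate()  # sends SIGTERM on POSIX
    for child in children:
        try:
            child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            child.kill()  # sends SIGKILL on POSIX
            child.wait()


def install_sigterm_cleanup(children: List[subprocess.Popen]) -> None:
    """Register a SIGTERM handler that cleans up children before exiting.

    Hypothetical helper for illustration only.
    """

    def _handler(signum, frame):
        terminate_children(children)
        sys.exit(0)

    # signal.signal may only be called from the main thread.
    signal.signal(signal.SIGTERM, _handler)
```

Terminating first and killing only after a timeout gives each executor a chance to run its own cleanup, which matters when Kubernetes follows SIGTERM with SIGKILL once the grace period expires.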

Future work:

  1. Move the cleanup done by sky api stop to the server side, so that sky api stop only needs to send SIGTERM, for consistency.
  2. Handle SIGKILL, typically sent by the OOM killer. I think this requires some local state, either a monitor process or a file lock that tells the restarted server that there is leftover state and that cleanup is needed.
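Since SIGKILL cannot be caught, the second item has to be handled after the fact. One way to picture the "local state" idea is a marker file holding the server PID, checked on startup. This is only a sketch under that assumption; the marker layout and the names `record_server`, `clear_marker`, and `needs_cleanup` are hypothetical and not part of this PR.

```python
import os
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def record_server(marker: Path) -> None:
    """Write the current PID so a later restart can detect an unclean exit."""
    marker.write_text(str(os.getpid()))


def clear_marker(marker: Path) -> None:
    """Remove the marker on a clean shutdown."""
    marker.unlink(missing_ok=True)


def needs_cleanup(marker: Path) -> bool:
    """Call on startup, before record_server().

    True if a previous server left the marker behind and is gone.
    """
    if not marker.exists():
        return False
    try:
        old_pid = int(marker.read_text().strip())
    except ValueError:
        return True  # unreadable marker: be conservative
    # A restarted container often reuses the same PID, so our own PID
    # in a leftover marker also means the previous run died uncleanly.
    return old_pid == os.getpid() or not pid_alive(old_pid)
```

A PID check alone can be fooled by PID reuse, which is one reason a file lock or a monitor process, as suggested above, may be more robust.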

@cg505 @Michaelvll PTAL if this makes sense

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manually tested sending SIGTERM to a server with and without --foreground
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@aylei aylei requested a review from Michaelvll March 7, 2025 06:39
@aylei aylei marked this pull request as ready for review March 7, 2025 06:39
@aylei aylei requested a review from cg505 March 7, 2025 06:39
@aylei aylei changed the title from "[API server] cleanup executor on shutdown" to "[API server] cleanup executor processes on shutdown" Mar 7, 2025
aylei added 2 commits March 7, 2025 14:50
    while True:
        process_request(executor)
except KeyboardInterrupt:
Collaborator

Ideally, it seems we shouldn't use KeyboardInterrupt for the worker's SIGTERM handling. It makes sense for request execution, since all the core code is designed to work with KeyboardInterrupt, but here we fully control the exception type we throw and catch. We could use a custom exception.
Super low priority; it's probably better to just merge this. We could add a TODO comment.
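A rough sketch of what the custom-exception suggestion could look like. `WorkerTerminated` and `run_worker` are hypothetical names used only for illustration; `process_request` and `executor` are taken from the snippet under review, and their real signatures are assumed.

```python
import os
import signal


class WorkerTerminated(BaseException):
    """Hypothetical exception raised in a worker when SIGTERM arrives.

    Derives from BaseException, like KeyboardInterrupt, so that a broad
    `except Exception` in request code does not swallow it.
    """


def _raise_worker_terminated(signum, frame):
    raise WorkerTerminated(f'received signal {signum}')


def run_worker(process_request, executor) -> str:
    """Run the request loop until Ctrl-C or SIGTERM; report which one hit."""
    # signal.signal may only be called from the main thread.
    previous = signal.signal(signal.SIGTERM, _raise_worker_terminated)
    try:
        while True:
            process_request(executor)
    except KeyboardInterrupt:
        return 'interrupted'  # Ctrl-C in --foreground mode
    except WorkerTerminated:
        return 'terminated'  # SIGTERM from the parent process
    finally:
        signal.signal(signal.SIGTERM, previous)
```

Keeping the two exceptions separate lets the worker share a cleanup path where that is convenient while still telling a user interrupt apart from a shutdown request.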

Collaborator Author

Sure! My original intent was to handle Ctrl-C (in --foreground mode) and SIGTERM from the parent process in a uniform way, but I agree it is better to use a distinct exception for SIGTERM.

Collaborator

@cg505 cg505 left a comment

Thanks for the change, I think this is good. @Michaelvll should review as well.

  1. Handle SIGKILL, typically sent by the OOM killer. I think this requires some local state, either a monitor process or a file lock that tells the restarted server that there is leftover state and that cleanup is needed.

Another option could be that the worker itself periodically checks if the api server PID is still alive, and exits if not. This could avoid needing a separate monitor process. Just an idea, not sure what's the best option.
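To make the idea concrete, here is a minimal sketch of a worker-side watchdog that polls the server PID. It assumes the worker knows the API server's PID; `watch_parent` and `pid_alive` are hypothetical helpers, not existing SkyPilot functions.

```python
import os
import threading
from typing import Callable, Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def watch_parent(parent_pid: int,
                 on_orphaned: Callable[[], None],
                 interval: float = 1.0,
                 stop: Optional[threading.Event] = None) -> threading.Thread:
    """Poll parent_pid in a daemon thread; call on_orphaned once it is gone."""
    if stop is None:
        stop = threading.Event()

    def _loop():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            if not pid_alive(parent_pid):
                on_orphaned()
                return

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```

In a worker, `on_orphaned` could be something like `lambda: os._exit(1)`. Two caveats apply: the check also succeeds for a zombie parent that has not been reaped, and it can be fooled by PID reuse, so comparing `os.getppid()` against the original parent PID may be a useful extra check.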

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @aylei! LGTM.

@@ -0,0 +1,139 @@
"""Unit tests for subprocess_utils.py."""
Collaborator

Should this be under tests/unit_tests/sky/utils/?

Collaborator Author

Ah, good catch

@Michaelvll
Collaborator

/smoke-test -k test_multi_echo
/smoke-test -k test_cancel_pytorch
/smoke-test -k test_cancel_aws --aws
/smoke-test -k test_job_queue

aylei and others added 2 commits March 13, 2025 08:52
@aylei
Collaborator Author

aylei commented Mar 13, 2025

/smoke-test -k test_multi_echo
/smoke-test -k test_cancel_pytorch
/smoke-test -k test_cancel_aws --aws
/smoke-test -k test_job_queue

@aylei
Collaborator Author

aylei commented Mar 13, 2025

/smoke-test -k test_job_queue

3 participants