[API server] cleanup executor processes on shutdown #4912

aylei · 2025-03-07T06:39:02Z

SIGTERM handling is relevant not only because of sky cancel, but also because Kubernetes sends SIGTERM to restart a failed server when the liveness probe fails. In addition, some frameworks on k8s, such as OpenKruise, send SIGTERM to restart the Pod in place when updating the container image.

This also closes #4856 by properly interrupting executors.
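
For readers skimming the description, here is a minimal sketch of the shutdown pattern discussed above. It is not the code in this PR: `terminate_children` and `install_sigterm_cleanup` are hypothetical names, and the sketch assumes the server tracks its executors as `subprocess.Popen` objects.

```python
import signal
import subprocess
import sys
from typing import List


def terminate_children(children: List[subprocess.Popen],
                       grace_seconds: float = 5.0) -> None:
    """Ask every live child to exit, then force-kill stragglers."""
    # First pass: send SIGTERM to every child that is still running.
    for proc in children:
        if proc.poll() is None:
            proc.terminate()
    # Second pass: wait for each child in turn (up to grace_seconds each),
    # and SIGKILL any child that has not exited by then.
    for proc in children:
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def install_sigterm_cleanup(children: List[subprocess.Popen]) -> None:
    """Register a SIGTERM handler that cleans up children before exiting.

    Must be called from the main thread, as required by signal.signal().
    """

    def _handler(signum, frame):
        terminate_children(children)
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handler)
```

The same handler would run whether the SIGTERM comes from a liveness-probe restart or an in-place container update, which is the point of handling the signal on the server side.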

Future work:

unify the cleanup in sky api stop on the server side, so that sky api stop only sends SIGTERM, for consistency
handle SIGKILL, typically sent by the OOM killer. I think this requires setting up some local state, either a monitor process or a file lock that tells the restarted server there is uncleaned state and that some cleanup work is needed
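
One way the local-state idea for SIGKILL could look, as a rough sketch only: the server writes a marker file on startup and removes it on graceful shutdown, so a marker that survives a restart hints at an unclean exit. The function names and the marker path are assumptions for illustration, and a plain marker file stands in here for the file lock mentioned above.

```python
import os
from pathlib import Path


def start_server(marker: Path) -> bool:
    """Record that a server is running.

    Returns True if a marker from a previous run is still present, meaning
    that run never reached shutdown_server() (for example, it was SIGKILLed)
    and leftover executors may need cleanup.
    """
    unclean_exit = marker.exists()
    marker.write_text(str(os.getpid()))
    return unclean_exit


def shutdown_server(marker: Path) -> None:
    """Remove the marker on graceful shutdown (e.g. from a SIGTERM handler)."""
    marker.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
```

Storing the PID in the marker would let the restarted server find the old process tree, although PID reuse means that value should be treated as a hint rather than proof.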

@cg505 @Michaelvll PTAL if this makes sense

Tested (run the relevant ones):

Code formatting: bash format.sh
Manually tested SIGTERM a server with and without --foreground
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Signed-off-by: Aylei <[email protected]>

cg505 · 2025-03-10T18:19:15Z

sky/server/requests/executor.py

        while True:
            process_request(executor)
+    except KeyboardInterrupt:


Seems that ideally we shouldn't use KeyboardInterrupt for the worker SIGTERM. It makes sense for request execution since all the core code is designed to work with KeyboardInterrupt, but here we can fully control the exception type we throw and catch. We could use a custom exception.
Super low priority, probably better to just merge this. Could add a TODO comment.

Sure! My original intent was to handle Ctrl-C (in --foreground mode) and SIGTERM from the parent process in a uniform way, but I agree it is better to use a distinct exception for SIGTERM.

cg505

Thanks for the change, I think this is good. @Michaelvll should review as well.

handle SIGKILL, typically sent by the OOM killer. I think this requires setting up some local state, either a monitor process or a file lock that tells the restarted server there is uncleaned state and that some cleanup work is needed

Another option could be that the worker itself periodically checks if the api server PID is still alive, and exits if not. This could avoid needing a separate monitor process. Just an idea, not sure what's the best option.
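
A rough sketch of that idea, assuming a Unix host and that the worker is told the API server PID at startup. `pid_alive` and `run_until_server_exits` are hypothetical names, and the `is_alive` parameter exists only so the loop can be exercised without a real server.

```python
import os
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Best-effort liveness check using signal 0 (Unix only)."""
    try:
        os.kill(pid, 0)  # Signal 0 checks for existence without signalling.
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def run_until_server_exits(server_pid: int,
                           do_work: Callable[[], None],
                           poll_interval: float = 1.0,
                           is_alive: Callable[[int], bool] = pid_alive) -> None:
    """Keep working while the server is alive; return once it is gone."""
    while is_alive(server_pid):
        do_work()
        time.sleep(poll_interval)
```

Two caveats with this approach: an exited but unreaped (zombie) server still counts as alive, and the PID could be reused by an unrelated process, so the check is a heuristic rather than a guarantee.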

Michaelvll

Thanks @aylei! LGTM.

sky/utils/subprocess_utils.py

Michaelvll · 2025-03-12T18:43:27Z

tests/unit_tests/sky/test_subprocess_utils.py

@@ -0,0 +1,139 @@
+"""Unit tests for subprocess_utils.py."""


Should this be under tests/unit_tests/sky/utils/?

Ah, good catch

Michaelvll · 2025-03-13T00:07:53Z

/smoke-test -k test_multi_echo
/smoke-test -k test_cancel_pytorch
/smoke-test -k test_cancel_aws --aws
/smoke-test -k test_job_queue

Co-authored-by: Zhanghao Wu <[email protected]>

Signed-off-by: Aylei <[email protected]>

aylei · 2025-03-13T01:08:55Z

/smoke-test -k test_multi_echo
/smoke-test -k test_cancel_pytorch
/smoke-test -k test_cancel_aws --aws
/smoke-test -k test_job_queue

aylei · 2025-03-13T02:25:14Z

/smoke-test -k test_job_queue

[API server] cleanup executor on shutdown

559ba5f

Signed-off-by: Aylei <[email protected]>

aylei requested a review from Michaelvll March 7, 2025 06:39

aylei marked this pull request as ready for review March 7, 2025 06:39

aylei requested a review from cg505 March 7, 2025 06:39

aylei changed the title from "[API server] cleanup executor on shutdown" to "[API server] cleanup executor processes on shutdown" Mar 7, 2025

aylei added 2 commits March 7, 2025 14:50

refine

0381b40

Signed-off-by: Aylei <[email protected]>

just raise impossible exceptions

1799a28

Signed-off-by: Aylei <[email protected]>

cg505 reviewed Mar 10, 2025

cg505 approved these changes Mar 10, 2025

Michaelvll approved these changes Mar 13, 2025

aylei and others added 2 commits March 13, 2025 08:52

Update sky/utils/subprocess_utils.py

6e5855e

Co-authored-by: Zhanghao Wu <[email protected]>

Address review comments

2b14efc

Signed-off-by: Aylei <[email protected]>
