-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[API server] cleanup executor processes on shutdown #4912
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
while True: | ||
process_request(executor) | ||
except KeyboardInterrupt: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that ideally we shouldn't use KeyboardInterrupt for the worker SIGTERM. It makes sense for request execution since all the core code is designed to work with KeyboardInterrupt, but here we can fully control the exception type we throw and catch. We could use a custom exception.
Super low priority, probably better to just merge this. Could add a TODO comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! My original intent is to handle Ctrl-C (in --foreground
mode) and SIGTERM from parent process in a uniform way, but agree it is better to use a distinct exception for SIGTERM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change, I think this is good. @Michaelvll should review as well.
- handle SIGKILL, typically sent by OOM killer. This requires setting up some local state I think, either a monitor process or a filelock that hints the restarted server there is a uncleaned state and some cleanup work should be taken
Another option could be that the worker itself periodically checks if the api server PID is still alive, and exits if not. This could avoid needing a separate monitor process. Just an idea, not sure what's the best option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aylei! LGTM.
@@ -0,0 +1,139 @@ | |||
"""Unit tests for subprocess_utils.py.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be under tests/unit_tests/sky/utils/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch
/smoke-test -k test_multi_echo |
Co-authored-by: Zhanghao Wu <[email protected]>
Signed-off-by: Aylei <[email protected]>
/smoke-test -k test_multi_echo |
/smoke-test -k test_job_queue |
close #4894
SIGTERM handling is relevant not only because
sky cancel
, but also because Kubernetes sends SIGTERM to restart failed server when liveness probe fails. Besides, some frameworks like openkruise on k8s also send SIGTERM to restart the Pod in-place when updating the container image.This also closes #4856 via properly interrupting executors.
Future works:
sky api stop
to server-side, thensky api stop
only sends SIGTERM for consistency@cg505 @Michaelvll PTAL if this makes sense
Tested (run the relevant ones):
bash format.sh
--foreground
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh