Proposal: More efficient sky logs
#4792
Conversation
Signed-off-by: Aylei <[email protected]>
```diff
@@ -134,6 +135,19 @@ def process_subprocess_stream(proc, args: _ProcessingArgs) -> Tuple[str, str]:
         stderr = ''
     return stdout, stderr
 
 
+# TODO(aylei): Prototype class to support async call, include process_stream
+# thread to be more general.
+class ProcFuture:
```
I am wondering if this overcomplicates it a bit. Would it be possible to just start a thread for logging in the `logs` FastAPI function without calling into the executor, i.e., making the `logs` API an interactive call?
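One possible shape of that idea, as a minimal sketch: serve `logs` interactively from the FastAPI app itself. The endpoint path and helper below are hypothetical; Starlette iterates a sync generator in a worker thread, which effectively gives the request its own logging thread instead of an executor process.

```python
# Minimal sketch of the suggestion: serve `logs` interactively from the
# FastAPI app instead of dispatching to an executor process. The endpoint
# path and helper below are hypothetical.
from typing import Iterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


def _tail_log_lines(cluster: str) -> Iterator[str]:
    # Placeholder for the blocking tail loop (e.g. wrapping core.tail_logs);
    # Starlette runs this sync generator in a worker thread.
    yield f'tailing logs of {cluster}\n'


@app.get('/logs')
def logs(cluster: str) -> StreamingResponse:
    return StreamingResponse(_tail_log_lines(cluster), media_type='text/plain')
```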
Sounds good! Totally missed this point, will do some tests!
Interruption would be the major problem if we run `logs` in a thread. `core.tail_logs` relies on `KeyboardInterrupt` to cancel the execution from deep call stacks. I think switching to a sub-thread requires non-trivial refactoring to make each call stack of `tail_logs` cooperative. For example, `run_with_log` would have to do something like:
```diff
     stdout, stderr = process_subprocess_stream(proc, args)
-    proc.wait()
+    done = False
+    while not self_thread.stop:
+        try:
+            proc.wait(timeout=0.1)
+            done = True
+            break
+        except subprocess.TimeoutExpired:
+            continue
+    if not done:
+        proc.kill()
     if require_outputs:
         return proc.returncode, stdout, stderr
     return proc.returncode
```
I feel like asyncio is a more mature solution if we head toward making the call cooperative, but that still requires an overhaul, I think, for #4767.
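For illustration, a minimal sketch of what a cooperative, asyncio-based tail could look like, assuming the call stack can be rewritten as coroutines (the function and argument names below are hypothetical):

```python
# Minimal sketch of cooperative cancellation with asyncio: the caller cancels
# the task instead of relying on KeyboardInterrupt. Names are hypothetical.
import asyncio
from typing import List


async def tail_logs_async(cmd: List[str]) -> int:
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT)
    assert proc.stdout is not None
    try:
        # Stream output lines until the process exits.
        async for line in proc.stdout:
            print(line.decode(), end='', flush=True)
        return await proc.wait()
    except asyncio.CancelledError:
        # Cancellation arrives here instead of KeyboardInterrupt; clean up the
        # child process before re-raising.
        proc.kill()
        await proc.wait()
        raise
```

With this shape, the server would cancel a tail by calling `task.cancel()` on the coroutine's task rather than raising `KeyboardInterrupt` deep in the call stack.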
This PR proposes adding async call support for `CommandRunner` to prevent `tail_log()` from blocking the background executor processes (`executors` hereafter). To demonstrate this change, a rough draft code patch is also included in this PR.
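For reference, a rough sketch of the shape such an async call could take. This is illustrative only: the `TailHandle` name and its methods are hypothetical, and the actual `ProcFuture` prototype in the patch may differ.

```python
# Hypothetical sketch of a non-blocking handle for the tail command: the
# executor can return right after starting the process, and the server can
# later wait on it or cancel it. Not the actual ProcFuture implementation.
import subprocess
from typing import List, Optional


class TailHandle:
    """Illustrative non-blocking handle around a log-tailing subprocess."""

    def __init__(self, cmd: List[str]) -> None:
        self._proc = subprocess.Popen(cmd)

    def done(self) -> bool:
        # poll() returns None while the process is still running.
        return self._proc.poll() is not None

    def wait(self, timeout: Optional[float] = None) -> int:
        return self._proc.wait(timeout=timeout)

    def cancel(self) -> None:
        # Cancellation targets the lightweight command process directly,
        # rather than an executor process.
        self._proc.terminate()
```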
Background

As discussed in #4731, the `executors` are memory-consuming, so the number of `executors` is limited by system resources. However, the `sky logs` command is not time-bounded, which can block all executors in the worst case.

Mitigation
A `sky logs` request can be divided into 2 phases:
1. The part handled by the `executor` process.
2. Running the `ssh` or `kubectl` command to tail the remote logs.

While an `executor` process can stably consume up to ~250MB, the `ssh` or `kubectl` command only consumes about ~3MB and ~40MB respectively. So phase 2 is orders of magnitude lighter on resources than phase 1. This leads to the primary modification: release the executor once phase 1 is done, and keep only the lightweight command process for phase 2.

Though the memory footprint of the command process is relatively low and I don't expect users to suffer from hanging requests due to too many parallel `sky logs`, a safety belt on the max number of parallel tail requests is still necessary to prevent the server from being overwhelmed by unbounded tail requests. I expect we can set a default value that is unreachable in most normal scenarios, for example 1024.

There are still many details that have not yet been fully explored, e.g. the server cancels a request by sending SIGTERM to the executor process. After this change, the signal should be sent to the executor process in phase 1 and to the command process in phase 2, which requires careful handling of race conditions. But most of the uncertainties are engineering issues, I think, which don't affect the overall design.
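As a sketch of that safety belt, the cap could be enforced with something as simple as a bounded semaphore. The names and the limit below are illustrative and not taken from the patch.

```python
# Illustrative safety belt: cap the number of concurrent log-tail commands
# with a bounded semaphore. The default of 1024 and all names here are
# hypothetical.
import threading

_MAX_PARALLEL_TAIL_REQUESTS = 1024
_tail_slots = threading.BoundedSemaphore(_MAX_PARALLEL_TAIL_REQUESTS)


def run_tail_with_limit(start_tail):
    """Run the blocking tail callable only if a slot is free."""
    if not _tail_slots.acquire(blocking=False):
        raise RuntimeError('Too many parallel sky logs requests; try later.')
    try:
        return start_tail()
    finally:
        _tail_slots.release()
```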
This PR is also a workable MVP: I launched 100 concurrent `sky logs` processes against an API server limited to 2c4g (in this case, only 2 long executors and 4 short executors are launched), and it seems just fine except for an error that I think is caused by too many SSH connections:

This is also relevant to efficient `sky logs`, but I would like to move this problem to follow-up PRs to keep this one focused.

Alternatives
Provision a dedicated `executor` for `sky logs` requests, so that `sky logs` won't block the long-running workers in the fixed pool:
- `sky logs` requests still block each other

This actually works. Consequently, a key consideration arises: is this a case of premature optimization? I think the judgment is very subjective, and to me the added complexity looks worth the payoff. I want to hear everyone's opinions, thanks!
Appendix
The memory consumption of 100 log tail processes: