
Proposal: More efficient sky logs #4792

Draft · wants to merge 1 commit into master
Conversation

@aylei (Collaborator) commented on Feb 21, 2025

for #4767

This PR proposes adding async-call support to CommandRunner so that tail_log() does not block the background executor processes (hereafter, executors). A rough draft patch is also included in this PR to demonstrate the change.

Background

As discussed in #4731, executors are memory-hungry, so their number is limited by system resources. However, the sky logs command is not time-bounded, so in the worst case it can block all executors.

Mitigation

A sky logs request can be divided into two phases:

  • phase 1: pre-checks, cloud auth, look-ups, and finally generating an ssh or kubectl command to tail the remote logs
  • phase 2: running the command until it is done or interrupted

While an executor process can stably consume up to ~250MB, the ssh and kubectl commands only consume about ~3MB and ~40MB respectively. Phase 2 is thus orders of magnitude lighter on resources than phase 1. This leads to the primary modification:

  • The tail-log call stack returns the command process to the uppermost executor instead of waiting for the process;
  • The executor then starts a thread to monitor the command and proceeds to run the next requests without blocking.
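The flow above could be sketched roughly as follows. This is an illustration, not the actual patch: `handle_tail_request` is a hypothetical helper standing in for the executor-side handler.

```python
import subprocess
import threading


def handle_tail_request(cmd):
    """Hypothetical executor-side handler for a tail request.

    Phase 1 (in the heavy executor process) builds and launches the
    light ssh/kubectl command; phase 2 is delegated to a cheap daemon
    thread so the executor is freed immediately.
    """
    # Phase 1: launch the tail command instead of waiting on it.
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)

    # Phase 2: a daemon thread reaps the process; the executor returns
    # right away and can pick up the next request.
    def _monitor():
        proc.wait()

    threading.Thread(target=_monitor, daemon=True).start()
    return proc  # the executor loop moves on without blocking
```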

Although the memory footprint of the command process is relatively low and I don't expect users to suffer hung requests from too many parallel sky logs, a safety limit on the maximum number of parallel tail requests is still necessary to prevent the server from being overwhelmed by unbounded tail requests. We should be able to set a default that is unreachable in most normal scenarios, for example 1024.
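A minimal sketch of such a limit, assuming a hypothetical module-level semaphore guarding tail slots (names and the 1024 default are illustrative):

```python
import threading

MAX_PARALLEL_TAILS = 1024  # illustrative default from the proposal

_tail_slots = threading.BoundedSemaphore(MAX_PARALLEL_TAILS)


def try_start_tail() -> bool:
    """Non-blocking admission check: reject instead of queueing, so a
    flood of sky logs requests cannot pile up unboundedly."""
    return _tail_slots.acquire(blocking=False)


def finish_tail() -> None:
    """Release the slot when the tail command exits (BoundedSemaphore
    raises if released more times than acquired)."""
    _tail_slots.release()
```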

There are still many details that have not yet been fully explored, e.g. the server currently cancels a request by sending SIGTERM to the executor process. After this change, the signal should be sent to the executor process in phase 1 but to the command process in phase 2, which requires careful handling of race conditions. But most of the uncertainties are engineering issues, I think, and do not affect the overall design.
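One possible shape for the phase-aware cancellation, with a per-request lock to close the race where the executor hands off to the command process concurrently with a cancel (all names here are hypothetical, not from the patch):

```python
import os
import signal
import subprocess
import threading


class RequestState:
    """Hypothetical per-request record tracking which phase owns the work."""

    def __init__(self):
        self.lock = threading.Lock()
        self.proc = None          # set by the executor when phase 2 starts
        self.cancelled = False


def cancel(state: RequestState, executor_pid: int) -> None:
    """Route cancellation to whichever process owns the request.

    Phase 1: the executor process gets SIGTERM, as today.
    Phase 2: the tail command process is terminated instead.
    The lock must also be held by the executor during the handoff.
    """
    with state.lock:
        state.cancelled = True
        if state.proc is not None:          # phase 2: kill the tail command
            state.proc.terminate()
        else:                               # phase 1: signal the executor
            os.kill(executor_pid, signal.SIGTERM)
```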

This PR is also a workable MVP: I launched 100 concurrent sky logs processes against an API server limited to 2c4g (in this case, only 2 long executors and 4 short executors are launched), and it worked fine except for an error that I believe is caused by too many SSH connections:

7:ControlSocket /tmp/skypilot_ssh_57339c81/ac3b06fa0c/81d65dd183d50155f879e9c989af2b560c072c50 already exists, disabling multiplexing

This is also relevant to efficient sky logs, but I would like to defer it to follow-up PRs to keep this one focused.

Alternatives

Provision a dedicated executor for sky logs requests, so that sky logs won't block the long-running workers in the fixed pool:

  • pros:
    • easy and straightforward
  • cons:
    • concurrent sky logs requests still block each other
    • the more executor groups we have, the more likely resources are wasted

This actually works, so a key question arises: is this a case of premature optimization? The judgment is subjective; to me, the added complexity seems worth it. I want to hear everyone's opinions, thanks!

Appendix

The memory consumption of 100 log tail processes: (screenshot omitted)

@@ -134,6 +135,19 @@ def process_subprocess_stream(proc, args: _ProcessingArgs) -> Tuple[str, str]:
stderr = ''
return stdout, stderr

# TODO(aylei): Prototype class to support async call, include process_stream
# thread to be more general.
class ProcFuture:
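A hedged sketch of what such a ProcFuture could look like; the methods and names below are illustrative, not the actual patch (which also needs to fold in the process_stream thread, per the TODO):

```python
import subprocess
import threading


class ProcFuture:
    """Illustrative future wrapping a running command process, so the
    executor can hand it back and completion is observed from a cheap
    monitor thread instead of a blocked executor."""

    def __init__(self, proc: subprocess.Popen):
        self._proc = proc
        self._finished = threading.Event()
        threading.Thread(target=self._reap, daemon=True).start()

    def _reap(self):
        # Reap the child so it never lingers as a zombie.
        self._proc.wait()
        self._finished.set()

    def done(self) -> bool:
        return self._finished.is_set()

    def result(self, timeout=None) -> int:
        """Block up to `timeout` seconds for the exit code."""
        if not self._finished.wait(timeout):
            raise TimeoutError('command still running')
        return self._proc.returncode

    def cancel(self) -> None:
        self._proc.terminate()
```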
A collaborator commented:

I am wondering if this overcomplicates it a bit. Would it be possible to just start a thread for logging in the logs FastAPI function without calling into the executor, i.e., making the logs API an interactive call?

@aylei (Author) replied on Feb 22, 2025:

Sounds good! Totally missed this point, will do some tests!

@aylei (Author) replied on Feb 26, 2025:

Interruption would be the major problem if we run logs in a thread. core.tail_logs relies on KeyboardInterrupt to cancel execution from deep in the call stack. Switching to a sub-thread would require non-trivial refactoring to make each call stack of tail_logs cooperative. For example, run_with_log would have to do something like:

     stdout, stderr = process_subprocess_stream(proc, args)
-    proc.wait()
+    done = False
+    while not self_thread.stop:
+        try:
+            proc.wait(timeout=0.1)
+            done = True
+            break
+        except subprocess.TimeoutExpired:
+            continue
+    if not done:
+        proc.kill()
     if require_outputs:
         return proc.returncode, stdout, stderr
     return proc.returncode
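A self-contained version of this polling loop, assuming a threading.Event stop flag in place of the illustrative self_thread.stop:

```python
import subprocess
import threading


def wait_cooperatively(proc: subprocess.Popen,
                       stop: threading.Event,
                       poll_interval: float = 0.1) -> int:
    """Wait for `proc`, but wake periodically to honor a stop flag,
    since a sub-thread cannot receive KeyboardInterrupt."""
    while not stop.is_set():
        try:
            proc.wait(timeout=poll_interval)
            return proc.returncode   # command finished on its own
        except subprocess.TimeoutExpired:
            continue                 # still running; re-check the flag
    # Cancelled: tear the command down and reap it.
    proc.kill()
    proc.wait()
    return proc.returncode
```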

I feel like asyncio is a more mature solution if we move in the direction of making the call cooperative, but that still requires an overhaul, I think.
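For comparison, a minimal asyncio sketch (assumed names, not the SkyPilot code): task cancellation replaces the KeyboardInterrupt path, and the subprocess is awaited cooperatively.

```python
import asyncio


async def tail_logs(cmd):
    """Illustrative cooperative tail: awaiting proc.wait() yields to the
    event loop, so cancelling the task interrupts the wait cleanly."""
    proc = await asyncio.create_subprocess_exec(*cmd)
    try:
        return await proc.wait()
    except asyncio.CancelledError:
        proc.terminate()          # propagate cancellation to the command
        await proc.wait()
        raise


async def main():
    task = asyncio.create_task(tail_logs(['sleep', '30']))
    await asyncio.sleep(0.2)
    task.cancel()                 # replaces the KeyboardInterrupt path
    try:
        await task
    except asyncio.CancelledError:
        print('cancelled')


asyncio.run(main())
```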
