
Proposal: More efficient sky logs #4792

Draft · wants to merge 1 commit into master
Conversation

@aylei (Collaborator) commented on Feb 21, 2025

for #4767

This PR proposes adding async-call support to CommandRunner so that tail_log() does not block the background executor processes (hereafter, executors). A rough draft patch is also included in this PR to demonstrate the change.

Background

As discussed in #4731, executors are memory-hungry, so their number is limited by system resources. However, the sky logs command is not time-bounded, so in the worst case it can block all executors.

Mitigation

A sky logs request can be divided into two phases:

  • phase 1: pre-checks, cloud auth, look-ups, and finally generating an ssh or kubectl command to tail the remote logs
  • phase 2: running the command until it is done or interrupted

While an executor process can stably consume up to ~250MB, the ssh and kubectl commands only consume about ~3MB and ~40MB respectively. Phase 2 is thus orders of magnitude lighter on resources than phase 1. This leads to the primary modification:

  • The tail-log call stack returns the command process to the uppermost executor instead of waiting for the process;
  • The executor then starts a thread to monitor the command and proceeds to run the next requests without blocking.
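The flow above could be sketched roughly as follows. This is an illustration, not the actual patch: `handle_tail_request` is a hypothetical helper standing in for the executor-side handler.

```python
import subprocess
import threading


def handle_tail_request(cmd):
    """Hypothetical executor-side handler for a tail request.

    Phase 1 (in the heavy executor process) builds and launches the
    light ssh/kubectl command; phase 2 is delegated to a cheap daemon
    thread so the executor is freed immediately.
    """
    # Phase 1: launch the tail command instead of waiting on it.
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)

    # Phase 2: a daemon thread reaps the process; the executor returns
    # right away and can pick up the next request.
    def _monitor():
        proc.wait()

    threading.Thread(target=_monitor, daemon=True).start()
    return proc  # the executor loop moves on without blocking
```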

Although the memory footprint of the command process is relatively low and I don't expect users to suffer hung requests from too many parallel sky logs, a safety limit on the maximum number of parallel tail requests is still necessary to prevent the server from being overwhelmed by unbounded tail requests. We should be able to set a default that is unreachable in most normal scenarios, for example 1024.
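A minimal sketch of such a limit, assuming a hypothetical module-level semaphore guarding tail slots (names and the 1024 default are illustrative):

```python
import threading

MAX_PARALLEL_TAILS = 1024  # illustrative default from the proposal

_tail_slots = threading.BoundedSemaphore(MAX_PARALLEL_TAILS)


def try_start_tail() -> bool:
    """Non-blocking admission check: reject instead of queueing, so a
    flood of sky logs requests cannot pile up unboundedly."""
    return _tail_slots.acquire(blocking=False)


def finish_tail() -> None:
    """Release the slot when the tail command exits (BoundedSemaphore
    raises if released more times than acquired)."""
    _tail_slots.release()
```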

There are still many details that have not yet been fully explored, e.g. the server currently cancels a request by sending SIGTERM to the executor process. After this change, the signal should be sent to the executor process in phase 1 but to the command process in phase 2, which requires careful handling of race conditions. But most of the uncertainties are engineering issues, I think, and do not affect the overall design.
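One possible shape for the phase-aware cancellation, with a per-request lock to close the race where the executor hands off to the command process concurrently with a cancel (all names here are hypothetical, not from the patch):

```python
import os
import signal
import subprocess
import threading


class RequestState:
    """Hypothetical per-request record tracking which phase owns the work."""

    def __init__(self):
        self.lock = threading.Lock()
        self.proc = None          # set by the executor when phase 2 starts
        self.cancelled = False


def cancel(state: RequestState, executor_pid: int) -> None:
    """Route cancellation to whichever process owns the request.

    Phase 1: the executor process gets SIGTERM, as today.
    Phase 2: the tail command process is terminated instead.
    The lock must also be held by the executor during the handoff.
    """
    with state.lock:
        state.cancelled = True
        if state.proc is not None:          # phase 2: kill the tail command
            state.proc.terminate()
        else:                               # phase 1: signal the executor
            os.kill(executor_pid, signal.SIGTERM)
```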

This PR is also a workable MVP: I launched 100 concurrent sky logs processes against an API server limited to 2c4g (in this case, only 2 long executors and 4 short executors are launched), and it worked fine except for an error that I believe is caused by too many SSH connections:

7:ControlSocket /tmp/skypilot_ssh_57339c81/ac3b06fa0c/81d65dd183d50155f879e9c989af2b560c072c50 already exists, disabling multiplexing

This is also relevant to efficient sky logs, but I would like to defer it to follow-up PRs to keep this one focused.

Alternatives

Provision a dedicated executor for sky logs requests, so that sky logs won't block the long-running workers in the fixed pool:

  • pros:
    • easy and straightforward
  • cons:
    • concurrent sky logs requests still block each other
    • the more executor groups we have, the more likely resources are wasted

This actually works, so a key question arises: is this a case of premature optimization? The judgment is subjective; to me, the added complexity seems worth it. I want to hear everyone's opinions, thanks!

Appendix

The memory consumption of 100 log tail processes: (screenshot omitted)

@@ -134,6 +135,19 @@ def process_subprocess_stream(proc, args: _ProcessingArgs) -> Tuple[str, str]:
stderr = ''
return stdout, stderr

# TODO(aylei): Prototype class to support async call, include process_stream
# thread to be more general.
class ProcFuture:
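A hedged sketch of what such a ProcFuture could look like; the methods and names below are illustrative, not the actual patch (which also needs to fold in the process_stream thread, per the TODO):

```python
import subprocess
import threading


class ProcFuture:
    """Illustrative future wrapping a running command process, so the
    executor can hand it back and completion is observed from a cheap
    monitor thread instead of a blocked executor."""

    def __init__(self, proc: subprocess.Popen):
        self._proc = proc
        self._finished = threading.Event()
        threading.Thread(target=self._reap, daemon=True).start()

    def _reap(self):
        # Reap the child so it never lingers as a zombie.
        self._proc.wait()
        self._finished.set()

    def done(self) -> bool:
        return self._finished.is_set()

    def result(self, timeout=None) -> int:
        """Block up to `timeout` seconds for the exit code."""
        if not self._finished.wait(timeout):
            raise TimeoutError('command still running')
        return self._proc.returncode

    def cancel(self) -> None:
        self._proc.terminate()
```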
A collaborator commented:

I am wondering if this overcomplicates it a bit. Would it be possible to just start a thread for logging in the logs FastAPI function without calling into the executor, i.e., making the logs API an interactive call?

@aylei (Author) replied on Feb 22, 2025:

Sounds good! Totally missed this point, will do some tests!

@aylei (Author) replied on Feb 26, 2025:

Interruption would be the major problem if we run logs in a thread. core.tail_logs relies on KeyboardInterrupt to cancel execution from deep in the call stack. Switching to a sub-thread would require non-trivial refactoring to make each call stack of tail_logs cooperative. For example, run_with_log would have to do something like:

     stdout, stderr = process_subprocess_stream(proc, args)
-    proc.wait()
+    done = False
+    while not self_thread.stop:
+        try:
+            proc.wait(timeout=0.1)
+            done = True
+            break
+        except subprocess.TimeoutExpired:
+            continue
+    if not done:
+        proc.kill()
     if require_outputs:
         return proc.returncode, stdout, stderr
     return proc.returncode
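A self-contained version of this polling loop, assuming a threading.Event stop flag in place of the illustrative self_thread.stop:

```python
import subprocess
import threading


def wait_cooperatively(proc: subprocess.Popen,
                       stop: threading.Event,
                       poll_interval: float = 0.1) -> int:
    """Wait for `proc`, but wake periodically to honor a stop flag,
    since a sub-thread cannot receive KeyboardInterrupt."""
    while not stop.is_set():
        try:
            proc.wait(timeout=poll_interval)
            return proc.returncode   # command finished on its own
        except subprocess.TimeoutExpired:
            continue                 # still running; re-check the flag
    # Cancelled: tear the command down and reap it.
    proc.kill()
    proc.wait()
    return proc.returncode
```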

I feel like asyncio is a more mature solution if we move in the direction of making the call cooperative, but that still requires an overhaul, I think.
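For comparison, a minimal asyncio sketch (assumed names, not the SkyPilot code): task cancellation replaces the KeyboardInterrupt path, and the subprocess is awaited cooperatively.

```python
import asyncio


async def tail_logs(cmd):
    """Illustrative cooperative tail: awaiting proc.wait() yields to the
    event loop, so cancelling the task interrupts the wait cleanly."""
    proc = await asyncio.create_subprocess_exec(*cmd)
    try:
        return await proc.wait()
    except asyncio.CancelledError:
        proc.terminate()          # propagate cancellation to the command
        await proc.wait()
        raise


async def main():
    task = asyncio.create_task(tail_logs(['sleep', '30']))
    await asyncio.sleep(0.2)
    task.cancel()                 # replaces the KeyboardInterrupt path
    try:
        await task
    except asyncio.CancelledError:
        print('cancelled')


asyncio.run(main())
```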
