Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][1/N] Make executor to be a long-running Python thread #50644

Closed

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Feb 16, 2025

Why are these changes needed?

Currently, tasks are sometimes executed on the main thread and sometimes on other threads which is created by C++. However, C++ threads are only considered Python threads when they execute Python functions. Hence, when a task finishes, the thread-local state will be garbage-collected. See #46336 (comment) for more details.

This PR uses the CPython API to treat C++ threads as Python threads, even if they do not execute Python functions.
https://docs.python.org/3/c-api/init.html#non-python-created-threads

This PR handles synchronous tasks. After this PR is merged, the follow-up tasks are:

  • Always create a default executor.
  • Move the constructor to the default executor.
  • Support asynchronous tasks.

Related issue number

Part of #46336

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Collaborator

@aslonnie aslonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline:

@kevin85421 kevin85421 force-pushed the 20250209-devbox1-tmux-12-ray6 branch from 2b66d24 to 4218e5e Compare February 23, 2025 23:59
@kevin85421
Copy link
Member Author

TODO: undefined symbol for Java

Screenshot 2025-02-23 at 10 07 27 PM

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Feb 27, 2025
cdef function[void()] callback;
with gil:
gstate = PyGILState_Ensure()
callback = GetThreadReleaser(gstate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetThreadReleaser should be defined as another cdef function in this file instead of in core worker cpp code (that is another abstraction leak similar to the original PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied in #50644 (comment). I agree that it is the best case if we can avoid the C++ util function.

Would you mind sharing more details why it's an abstraction leak?

  1. GetThreadReleaser is exclusively used by Cython and is not included in the shared code path for any other language frontends.
  2. It doesn't add any new dependencies.
  3. There is a directory core_worker/lib/java.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core_worker/lib/java is just autogenerated bindings for the core worker API (it could/should be moved elsewhere)

Comment on lines 24 to 26
inline std::function<void()> GetThreadReleaser(PyGILState_STATE gstate) {
return [gstate]() { PyGILState_Release(gstate); };
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to exist -- you can define it in cython and pass it down

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the best of my knowledge, this may not be possible in Cython. I haven't seen any documentation that explicitly states it's impossible. I tried looking for examples on GitHub, but I didn't find any way to capture gstate and return a callable.

I tried multiple ways to capture gstate and return a callable, but I failed. In addition, this method is referenced in the unit tests for Cython's <functional>. You can take a look at _get_function_ptr_from_name in cython/cython@954f0ec.

I'm not familiar with Cython, but this is my current guess based on my investigation. It would be helpful if we could avoid defining this C++ function. Would you mind sharing more details if you have related experiences?

Copy link
Contributor

@edoakes edoakes Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to use std::bind and std::placeholder1 to bind the gstate as an arg and then pass the lambda into the c++ layer

or alternatively (maybe easier?) you can wrap the gstate as a unique_ptr<void*>, pass it into c++, then cast it back to gstate in the callback. we do something similar currently for the callback passed to c++ for async get:

def set_get_async_callback(self, ObjectRef object_ref, user_callback: Callable):

@israbbani has a follow-up to avoid the manual incref/decref on that codepath and use a unique_ptr instead

Copy link
Member Author

@kevin85421 kevin85421 Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use std::bind and it works! Initially, I tried using a lambda function to capture gstate. I didn't know that std::bind existed. In addition, I haven't seen any CPython examples where users need to control the reference count of PyGILState_STATE.

@kevin85421 kevin85421 force-pushed the 20250209-devbox1-tmux-12-ray6 branch from 80f8f0e to 1b63aa9 Compare February 28, 2025 05:30
@kevin85421 kevin85421 marked this pull request as ready for review February 28, 2025 07:57
@kevin85421 kevin85421 requested a review from edoakes February 28, 2025 08:12
Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for figuring out the tricky cython bits! All style nits now.

Please add a test for concurrency_group_manager that encodes the expected semantics of the initialize_thread_callback (called before thread executes anything, the returned callback is called on shutdown).

Are we certain that the gstate object does not need to be reference counted? Just want to make sure there is no concern of crashing during the shutdown sequence due to a memory corruption issue.

@kevin85421
Copy link
Member Author

Please add a test for concurrency_group_manager that encodes the expected semantics of the initialize_thread_callback (called before thread executes anything, the returned callback is called on shutdown).

Test added. ee93309

@kevin85421
Copy link
Member Author

Are we certain that the gstate object does not need to be reference counted? Just want to make sure there is no concern of crashing during the shutdown sequence due to a memory corruption issue.

I will conduct some experiments to verify it.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@edoakes
Copy link
Contributor

edoakes commented Feb 28, 2025

Let me know when ready for merge. DCO is failing btw.

Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
@kevin85421
Copy link
Member Author

I followed the instructions on the DCO page to fix the DCO issue, but somehow it messed up the commits. I will open a new PR instead.

@kevin85421 kevin85421 closed this Feb 28, 2025
@kevin85421
Copy link
Member Author

Open a new PR: #51005

@kevin85421
Copy link
Member Author

Open a new PR: #51016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.