
Remove tracer provider guard. #444

Merged 10 commits into main on Feb 27, 2021
Conversation

TommyCpp
Contributor

@TommyCpp TommyCpp commented Jan 28, 2021

Resolve #427
Resolve #364

Problem statement

The problem is how to let users send remaining spans before the application exits. The current approach works with the multi-threaded tokio runtime, but not with the current-thread tokio runtime.

Also, we currently allow users to use multiple tracer providers throughout the application, which I believe is not a common situation; most applications should have one global tracer provider.

Proposed solution

We currently have a set_tracer_provider function to set the tracer provider. It returns a guard, and dropping the guard triggers the shutdown of the tracer provider.

Instead of returning the guard, the function should return the replaced tracer provider. Users who need multiple tracer providers can then manage them on their own.

To shut down the tracer provider properly, we could add a shut_down_provider function that blocks in place until the current tracer provider shuts down. The problem then becomes how to spawn the background task so that it does not block forever during shutdown.

Different situations

The objective is: if users call shut_down_provider before exit, BatchSpanProcessor should send all of its remaining spans. If not, all remaining spans in it will be dropped.

I tried a few different setups to see if we can do it.

| Runtime | How the worker task is spawned in `BatchProcessor` | `shut_down_provider` called before exit? | Result |
| --- | --- | --- | --- |
| Single thread tokio | `tokio::spawn` | no | exits successfully |
| Multiple thread tokio | `tokio::spawn` | no | exits successfully |
| Single thread tokio | `tokio::spawn` | yes | hangs forever |
| Multiple thread tokio | `tokio::spawn` | yes | sends spans and exits |
| Single thread tokio | `spawn_blocking`\* | no | hangs forever |
| Multiple thread tokio | `spawn_blocking`\* | no | hangs forever |
| Single thread tokio | `spawn_blocking`\* | yes | sends spans and exits |
| Multiple thread tokio | `spawn_blocking`\* | yes | sends spans and exits |

\* `spawn_blocking` = `|fut| tokio::spawn_blocking(|| futures::executor::block_on(fut))`

So my theory on why opentelemetry hangs forever on the single-thread tokio runtime:

When shutting down the BatchSpanProcessor, we need to wait for the worker task to finish. But by blocking here, we prevent the worker task from ever being polled. If we don't block and wait here, then when the shutdown finishes the runtime will go ahead and drop the worker task, so we can't guarantee the spans in BatchSpanProcessor are sent.

```rust
fn shutdown(&mut self) -> TraceResult<()> {
    let mut sender = self.message_sender.lock().map_err(|_| {
        TraceError::from("When shutting down the BatchSpanProcessor, the message sender's lock has been poisoned")
    })?;
    let (res_sender, res_receiver) = oneshot::channel::<Vec<ExportResult>>();
    sender.try_send(BatchMessage::Shutdown(res_sender))?;
    for result in futures::executor::block_on(res_receiver)? {
        result?;
    }
    Ok(())
}
```

I have been poking around and tried random stuff, but none of it seems to work. So I would love some advice here, as I believe some of you have more experience with how tokio works.

Final solution

  • The set_tracer_provider function should return the replaced tracer provider.
  • A shut_down_tracer_provider function will be used to shut down the tracer provider.
  • tokio::spawn will be used to spawn background tasks in BatchProcessor if the runtime has multiple threads.
  • If the tokio runtime only has one thread, we will create a new thread with a new runtime on it to spawn background tasks.

@codecov

codecov bot commented Jan 28, 2021

Codecov Report

Merging #444 (2695bdf) into main (57f76ac) will decrease coverage by 0.80%.
The diff coverage is 2.75%.


@@            Coverage Diff             @@
##             main     #444      +/-   ##
==========================================
- Coverage   47.73%   46.93%   -0.81%     
==========================================
  Files          95       95              
  Lines        8798     8934     +136     
==========================================
- Hits         4200     4193       -7     
- Misses       4598     4741     +143     
| Impacted Files | Coverage Δ |
| --- | --- |
| opentelemetry-datadog/src/exporter/mod.rs | 14.81% <0.00%> (+0.35%) ⬆️ |
| opentelemetry-datadog/src/lib.rs | 85.83% <ø> (ø) |
| opentelemetry-jaeger/src/exporter/mod.rs | 40.56% <0.00%> (+0.25%) ⬆️ |
| opentelemetry-jaeger/src/lib.rs | 89.36% <ø> (ø) |
| opentelemetry-zipkin/src/exporter/mod.rs | 0.00% <0.00%> (ø) |
| opentelemetry-zipkin/src/lib.rs | 100.00% <ø> (ø) |
| opentelemetry/src/global/trace.rs | 19.69% <0.00%> (-35.19%) ⬇️ |
| opentelemetry/src/lib.rs | 100.00% <ø> (ø) |
| opentelemetry/src/sdk/export/trace/mod.rs | 98.11% <ø> (ø) |
| opentelemetry/src/sdk/export/trace/stdout.rs | 9.09% <ø> (-8.56%) ⬇️ |

... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TommyCpp
Contributor Author

TommyCpp commented Feb 5, 2021

So for the single-thread runtime, one solution that would do the trick is to spawn a new thread with a new runtime on it to run the batch processor. That solves the deadlock issue. For example:

```rust
let spawn = |box_future: BoxFuture<'static, ()>| {
    thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(box_future);
    });
};
```

(We also need to handle the result and make some changes to the batch processor.)

@djc
Contributor

djc commented Feb 8, 2021

That seems like a fine solution!

@TommyCpp TommyCpp force-pushed the tommycpp/427 branch 2 times, most recently from 21f1b56 to fa19431 Compare February 10, 2021 02:39
@jtescher
Member

May want to consider if any of these potentially deadlock if used in tests which execute in parallel.

@TommyCpp
Contributor Author

May want to consider if any of these potentially deadlock if used in tests which execute in parallel.

I don't think it will cause a deadlock, as we guard the global tracer provider with a lock. But if two tests try to modify the global tracer provider at the same time, the result will likely be non-deterministic. Thus, I run the tests in global/trace.rs sequentially.

This adds a suite of tests to check whether the global tracer provider has trouble shutting down in different runtimes.
@TommyCpp TommyCpp force-pushed the tommycpp/427 branch 2 times, most recently from 8dcfcd2 to d65f1f9 Compare February 23, 2021 02:01
…kio runtime use case.

To improve performance, it would be good if we could use tokio::spawn instead of tokio::spawn_blocking.

The problem with using tokio::spawn to spawn the background task that sends spans is that, when shutting down, we need to block on the shutdown function and send the spans at the same time. In a multi-threaded runtime, those tasks can be scheduled on different threads, but in a single-threaded runtime we get a deadlock.

The proposed solution is to spawn the background task on a separate thread outside the tokio runtime.
@TommyCpp TommyCpp force-pushed the tommycpp/427 branch 2 times, most recently from 8232412 to 0049e43 Compare February 23, 2021 04:14
@TommyCpp TommyCpp marked this pull request as ready for review February 24, 2021 01:57
@TommyCpp TommyCpp requested a review from a team February 24, 2021 01:57
Member

@frigus02 frigus02 left a comment


This is great!

I used opentelemetry recently for a small CLI and ran into 2 issues:

  • Using the Tokio current thread runtime
  • Waiting for remaining spans to export when also using std::process::exit(1); to specify an exit code

I'm beginning to think that I would prefer an async shutdown function that I could call explicitly. It seems to be the hardest thing to accidentally misuse. I had worked around the above-mentioned issues by doing something like the following. When I tried to use this PR, I added the shutdown call and it didn't work:

```rust
async fn async_main() -> i32 {
    // install_pipeline ... create spans ...
}

fn main() {
    let exit_code = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async_main());
    // opentelemetry::global::shutdown_tracer_provider(); <-- works without this line on master. spans don't export when I add this line using this PR
    std::process::exit(exit_code);
}
```

The error I got was "send failed because receiver is gone" (using the Jaeger exporter). I think this makes sense, because I tried to shut down the pipeline after shutting down the Tokio runtime. If shutdown_tracer_provider was async, I wouldn't have tried this code, because I couldn't have awaited the future.

That said, though, this PR already seems like a huge improvement. Both of the following much simpler versions of my main function worked just fine based on this PR.

```rust
#[tokio::main]
async fn main() {
    let exit_code = async_main().await;
    pipeline::shutdown_pipeline();
    std::process::exit(exit_code);
}
```

```rust
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let exit_code = async_main().await;
    pipeline::shutdown_pipeline();
    std::process::exit(exit_code);
}
```

So to me this is definitely a nice improvement. And having the tests to make sure we don't regress is nice as well. 👍

Member

@jtescher jtescher left a comment


Thanks for all the work on this @TommyCpp, looks good to me. Users may forget to call shutdown and miss a few last unexported spans, but that is preferable to the current situation where shutdown can happen silently and no spans are exported at all.

@jtescher
Member

@djc / @frigus02 any other thoughts on this one? global function name and behavior / new tokio feature flag all look good?

@frigus02
Member

The global function name seems fine to me.

I think I would love to unify the naming of the different runtime features. Maybe something along the lines of rt-tokio, rt-tokio-current-thread and rt-async-std. That would more clearly indicate to me that they are mutually exclusive.

We could possibly even throw compile errors if more than one runtime feature is activated. I saw this pattern in another project. Not sure if this requires any special features (translated to our use case but untested):

```rust
#[cfg(any(
    not(any(feature = "rt-tokio", feature = "rt-tokio-current-thread", feature = "rt-async-std")),
    all(feature = "rt-tokio", feature = "rt-tokio-current-thread"),
    all(feature = "rt-tokio", feature = "rt-async-std"),
    all(feature = "rt-tokio-current-thread", feature = "rt-async-std"),
))]
compile_error!(
    "exactly one of the features ['rt-tokio', 'rt-tokio-current-thread', 'rt-async-std'] must be enabled"
);
```

But I'm happy to discuss both of those things in separate PRs.

@djc
Contributor

djc commented Feb 25, 2021

I think I would love to unify the naming of the different runtime features. Maybe something along the lines of rt-tokio, rt-tokio-current-thread and rt-async-std. That would more clearly indicate to me that they are mutually exclusive.

Sounds good to me.

@TommyCpp
Contributor Author

We could possibly even throw compile errors if more than one runtime feature is activated. I saw this pattern in another project. Not sure if this requires any special features (translated to our use case but untested):

```rust
#[cfg(any(
    not(any(feature = "rt-tokio", feature = "rt-tokio-current-thread", feature = "rt-async-std")),
    all(feature = "rt-tokio", feature = "rt-tokio-current-thread"),
    all(feature = "rt-tokio", feature = "rt-async-std"),
    all(feature = "rt-tokio-current-thread", feature = "rt-async-std"),
))]
compile_error!(
    "exactly one of the features ['rt-tokio', 'rt-tokio-current-thread', 'rt-async-std'] must be enabled"
);
```

I think the current approach when multiple conflicting features are enabled is to pick one of them. For example, enabling tokio-support and async-std at the same time is equivalent to enabling only tokio-support.

I'd vote to change this behavior but it seems we should address this in a different PR.

@frigus02
Member

I'd vote to change this behavior but it seems we should address this in a different PR.

Definitely agree, this should be a different PR.

@TommyCpp
Contributor Author

This is good to merge I guess?

@jtescher jtescher merged commit c697b58 into open-telemetry:main Feb 27, 2021
@TommyCpp TommyCpp deleted the tommycpp/427 branch March 2, 2021 03:54
TommyCpp added a commit to TommyCpp/opentelemetry-rust that referenced this pull request Mar 4, 2021
… same time.

As per open-telemetry#444 (comment).

We should try to throw compile error when users set two exclusive features instead of default to use one of them.
TommyCpp added a commit to TommyCpp/opentelemetry-rust that referenced this pull request Mar 9, 2021
… same time.

As per open-telemetry#444 (comment).

We should try to throw compile error when users set two exclusive features instead of default to use one of them.