
Remove tracer provider guard. #444

Merged 10 commits into main on Feb 27, 2021
Conversation

TommyCpp
Contributor

@TommyCpp TommyCpp commented Jan 28, 2021

Resolve #427
Resolve #364

Problem statement

The problem is how to let users send remaining spans before the application exits. The current approach works with the multi-threaded tokio runtime, but not with the current-thread tokio runtime.

Also, we currently allow users to use multiple tracer providers throughout the application, which I believe is not a common situation; most applications should have one global tracer provider.

Proposed solution

We currently have a set_tracer_provider function to set the tracer provider. It returns a guard, and dropping the guard triggers the shutdown of the tracer provider.

Instead of returning the guard, the function should return the replaced tracer provider. Users who need multiple tracer providers can then manage them on their own.

To shut down the tracer provider properly, we could add a shut_down_provider function that blocks in place until the current tracer provider shuts down. The problem then becomes how to spawn the background task so that it does not block forever during shutdown.

Different situations

The objective is: if users call shut_down_provider before exit, BatchSpanProcessor should send all of its remaining spans. If not, all remaining spans in it will be dropped.

I tried a few different setups to see if we can do it.

| Runtime | How the worker task is spawned in `BatchProcessor` | `shut_down_provider` called before exit? | Result |
| --- | --- | --- | --- |
| Single thread tokio | `tokio::spawn` | no | exits successfully |
| Multiple thread tokio | `tokio::spawn` | no | exits successfully |
| Single thread tokio | `tokio::spawn` | yes | hangs forever |
| Multiple thread tokio | `tokio::spawn` | yes | sends spans and exits |
| Single thread tokio | `spawn_blocking`\* | no | hangs forever |
| Multiple thread tokio | `spawn_blocking`\* | no | hangs forever |
| Single thread tokio | `spawn_blocking`\* | yes | sends spans and exits |
| Multiple thread tokio | `spawn_blocking`\* | yes | sends spans and exits |

\* `spawn_blocking` = `|fut| tokio::spawn_blocking(|| futures::executor::block_on(fut))`

So my theory on why opentelemetry hangs forever on the single-thread tokio runtime:

When shutting down the BatchSpanProcessor, we need to wait for the worker task to finish. But by blocking here, we prevent the worker task from ever being polled. If we don't block and wait here, then when the shutdown finishes the runtime will go ahead and drop the worker task, so we can't guarantee the spans in BatchSpanProcessor are sent.

```rust
fn shutdown(&mut self) -> TraceResult<()> {
    let mut sender = self.message_sender.lock().map_err(|_| {
        TraceError::from("When shutting down the BatchSpanProcessor, the message sender's lock has been poisoned")
    })?;
    let (res_sender, res_receiver) = oneshot::channel::<Vec<ExportResult>>();
    sender.try_send(BatchMessage::Shutdown(res_sender))?;
    for result in futures::executor::block_on(res_receiver)? {
        result?;
    }
    Ok(())
}
```

I have been poking around and tried random stuff, but none of it seems to work. So I would love some advice here, as I believe some of you have more experience with how tokio works.

Final solution

  • The set_tracer_provider function should return the replaced tracer provider.
  • A shut_down_tracer_provider function will be used to shut down the tracer provider.
  • tokio::spawn will be used to spawn background tasks in BatchProcessor if the runtime has multiple threads.
  • If the tokio runtime only has one thread, we will create a new thread with a new runtime on it to spawn background tasks.

@codecov

codecov bot commented Jan 28, 2021

Codecov Report

Merging #444 (2695bdf) into main (57f76ac) will decrease coverage by 0.80%.
The diff coverage is 2.75%.


@@            Coverage Diff             @@
##             main     #444      +/-   ##
==========================================
- Coverage   47.73%   46.93%   -0.81%     
==========================================
  Files          95       95              
  Lines        8798     8934     +136     
==========================================
- Hits         4200     4193       -7     
- Misses       4598     4741     +143     
| Impacted Files | Coverage Δ |
| --- | --- |
| opentelemetry-datadog/src/exporter/mod.rs | 14.81% <0.00%> (+0.35%) ⬆️ |
| opentelemetry-datadog/src/lib.rs | 85.83% <ø> (ø) |
| opentelemetry-jaeger/src/exporter/mod.rs | 40.56% <0.00%> (+0.25%) ⬆️ |
| opentelemetry-jaeger/src/lib.rs | 89.36% <ø> (ø) |
| opentelemetry-zipkin/src/exporter/mod.rs | 0.00% <0.00%> (ø) |
| opentelemetry-zipkin/src/lib.rs | 100.00% <ø> (ø) |
| opentelemetry/src/global/trace.rs | 19.69% <0.00%> (-35.19%) ⬇️ |
| opentelemetry/src/lib.rs | 100.00% <ø> (ø) |
| opentelemetry/src/sdk/export/trace/mod.rs | 98.11% <ø> (ø) |
| opentelemetry/src/sdk/export/trace/stdout.rs | 9.09% <ø> (-8.56%) ⬇️ |

... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TommyCpp
Contributor Author

TommyCpp commented Feb 5, 2021

So for the single-thread runtime, one solution that would do the trick is to spawn a new thread with a new runtime on it to run the batch processor. That solves the deadlock issue. For example:

```rust
let spawn = |box_future: BoxFuture<'static, ()>| {
    thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(box_future);
    });
};
```

(We also need to handle the result and make some changes to the batch processor.)

@djc
Contributor

djc commented Feb 8, 2021

That seems like a fine solution!

@TommyCpp TommyCpp force-pushed the tommycpp/427 branch 2 times, most recently from 21f1b56 to fa19431 Compare February 10, 2021 02:39
@jtescher
Member

May want to consider if any of these potentially deadlock if used in tests which execute in parallel.

@TommyCpp
Contributor Author

May want to consider if any of these potentially deadlock if used in tests which execute in parallel.

I don't think it will cause a deadlock, as we guard the global tracer provider with a lock. But if two tests try to modify the global tracer provider at the same time, the result will likely be non-deterministic. Thus, I run the tests in global/trace.rs sequentially.

This adds a suite of tests to check whether the global tracer provider has trouble shutting down in different runtimes.
@TommyCpp TommyCpp force-pushed the tommycpp/427 branch 2 times, most recently from 8dcfcd2 to d65f1f9 Compare February 23, 2021 02:01
…kio runtime use case.

To improve performance, it would be good if we could use tokio::spawn instead of tokio::spawn_blocking.

The problem with using tokio::spawn to spawn the background task that sends spans is that, when shutting down, we need to block on the shutdown function and send the spans at the same time. In a multi-threaded runtime, those tasks can be scheduled on different threads, but in a single-threaded runtime we get a deadlock.

The proposed solution is to spawn the background task on a separate thread outside the tokio runtime.
@TommyCpp TommyCpp force-pushed the tommycpp/427 branch 2 times, most recently from 8232412 to 0049e43 Compare February 23, 2021 04:14
@TommyCpp TommyCpp marked this pull request as ready for review February 24, 2021 01:57
@TommyCpp TommyCpp requested a review from a team February 24, 2021 01:57
Member

@frigus02 frigus02 left a comment


This is great!

I used opentelemetry recently for a small CLI and ran into 2 issues:

  • Using the Tokio current thread runtime
  • Waiting for remaining spans to export when also using std::process::exit(1); to specify an exit code

I'm beginning to think that I would prefer an async shutdown function that I could call explicitly. It seems to be the hardest thing to accidentally misuse. I had worked around the above-mentioned issues by doing something like the following. When I tried to use this PR, I added the shutdown call and it didn't work:

```rust
async fn async_main() -> i32 {
    // install_pipeline ... create spans ...
}

fn main() {
    let exit_code = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async_main());
    // opentelemetry::global::shutdown_tracer_provider(); <-- works without this line on master. spans don't export when I add this line using this PR
    std::process::exit(exit_code);
}
```

The error I got was "send failed because receiver is gone" (using the Jaeger exporter). I think this makes sense, because I tried to shut down the pipeline after shutting down the Tokio runtime. If shutdown_tracer_provider was async, I wouldn't have tried this code, because I couldn't have awaited the future.

That said, though, this PR already seems like a huge improvement. Both of the following much simpler versions of my main function worked just fine based on this PR.

```rust
#[tokio::main]
async fn main() {
    let exit_code = async_main().await;
    pipeline::shutdown_pipeline();
    std::process::exit(exit_code);
}
```

```rust
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let exit_code = async_main().await;
    pipeline::shutdown_pipeline();
    std::process::exit(exit_code);
}
```

So to me this is definitely a nice improvement. And having the tests to make sure we don't regress is nice as well. 👍

Member

@jtescher jtescher left a comment


Thanks for all the work on this @TommyCpp, looks good to me. Users may forget to call shutdown and miss a few last unexported spans, but that is preferable to the current situation where shutdown can happen silently and no spans are exported at all.

@jtescher
Member

@djc / @frigus02 any other thoughts on this one? global function name and behavior / new tokio feature flag all look good?

@frigus02
Member

The global function name seems fine to me.

I think I would love to unify the naming of the different runtime features. Maybe something along the lines of rt-tokio, rt-tokio-current-thread and rt-async-std. That would more clearly indicate to me that they are mutually exclusive.

We could possibly even throw compile errors if more than one runtime feature is activated. I saw this pattern in another project. Not sure if this requires any special features (translated to our use case but untested):

```rust
#[cfg(any(
    not(any(feature = "rt-tokio", feature = "rt-tokio-current-thread", feature = "rt-async-std")),
    all(feature = "rt-tokio", feature = "rt-tokio-current-thread"),
    all(feature = "rt-tokio", feature = "rt-async-std"),
    all(feature = "rt-tokio-current-thread", feature = "rt-async-std"),
))]
compile_error!(
    "exactly one of the features ['rt-tokio', 'rt-tokio-current-thread', 'rt-async-std'] must be enabled"
);
```

But I'm happy to discuss both of those things in separate PRs.

@djc
Contributor

djc commented Feb 25, 2021

I think I would love to unify the naming of the different runtime features. Maybe something along the lines of rt-tokio, rt-tokio-current-thread and rt-async-std. That would more clearly indicate to me that they are mutually exclusive.

Sounds good to me.

@TommyCpp
Contributor Author

We could possibly even throw compile errors if more than one runtime feature is activated. I saw this pattern in another project. Not sure if this requires any special features (translated to our use case but untested):

```rust
#[cfg(any(
    not(any(feature = "rt-tokio", feature = "rt-tokio-current-thread", feature = "rt-async-std")),
    all(feature = "rt-tokio", feature = "rt-tokio-current-thread"),
    all(feature = "rt-tokio", feature = "rt-async-std"),
    all(feature = "rt-tokio-current-thread", feature = "rt-async-std"),
))]
compile_error!(
    "exactly one of the features ['rt-tokio', 'rt-tokio-current-thread', 'rt-async-std'] must be enabled"
);
```

I think the current approach when multiple conflicting features are enabled is to pick one of them. For example, enabling tokio-support and async-std at the same time is equivalent to enabling only tokio-support.

I'd vote to change this behavior but it seems we should address this in a different PR.

@frigus02
Member

I'd vote to change this behavior but it seems we should address this in a different PR.

Definitely agree, this should be a different PR.

@TommyCpp
Contributor Author

This is good to merge I guess?

@jtescher jtescher merged commit c697b58 into open-telemetry:main Feb 27, 2021
@TommyCpp TommyCpp deleted the tommycpp/427 branch March 2, 2021 03:54
TommyCpp added a commit to TommyCpp/opentelemetry-rust that referenced this pull request Mar 4, 2021
… same time.

As per open-telemetry#444 (comment).

We should try to throw compile error when users set two exclusive features instead of default to use one of them.
TommyCpp added a commit to TommyCpp/opentelemetry-rust that referenced this pull request Mar 9, 2021
… same time.

As per open-telemetry#444 (comment).

We should try to throw compile error when users set two exclusive features instead of default to use one of them.