Fix OOM in span_processor_with_async_runtime::BatchSpanProcessor
#2793
Conversation
opentelemetry-sdk/src/trace/span_processor_with_async_runtime.rs
Codecov Report

@@ Coverage Diff @@
##            main   #2793     +/-   ##
=======================================
- Coverage   79.6%   79.6%   -0.1%
=======================================
  Files        124     124
  Lines      23174   23181      +7
=======================================
- Hits       18456   18455      -1
- Misses      4718    4726      +8
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tokio::sync::RwLock;
We can't assume that the Tokio runtime is always available in this context. Instead, the runtime abstraction is used to ensure flexibility and avoid locking the code to a specific async runtime.
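To illustrate the idea, here is a standalone sketch (not the SDK's actual trait, nor the code in this file): the processor only asks some runtime to spawn its background work, so nothing in it is tied to Tokio.

use std::future::Future;
use std::pin::Pin;

// Hypothetical, simplified runtime abstraction for illustration only.
trait Runtime {
    fn spawn(&self, fut: Pin<Box<dyn Future<Output = ()> + Send>>);
}

struct TokioRuntime;
impl Runtime for TokioRuntime {
    fn spawn(&self, fut: Pin<Box<dyn Future<Output = ()> + Send>>) {
        // Requires a Tokio runtime to be running when called.
        tokio::spawn(fut);
    }
}

struct AsyncStdRuntime;
impl Runtime for AsyncStdRuntime {
    fn spawn(&self, fut: Pin<Box<dyn Future<Output = ()> + Send>>) {
        async_std::task::spawn(fut);
    }
}

A processor generic over such a trait compiles and behaves the same way regardless of which executor the application chose.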
@lalitb Please read the following: https://docs.rs/tokio/latest/tokio/sync/index.html#runtime-compatibility
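The key point from that page is that tokio::sync primitives don't need a Tokio reactor, so they can be awaited from other executors. A quick check of that claim (a hedged sketch; async-std is used here only to demonstrate the point):

use tokio::sync::RwLock;

fn main() {
    async_std::task::block_on(async {
        let lock = RwLock::new(0u32);
        *lock.write().await += 1;
        assert_eq!(*lock.read().await, 1);
    });
}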
Thanks for the link, this makes sense. However, I was just thinking of the scenario where the experimental_async_runtime and rt-async-std features are enabled for otel-sdk. I believe compilation will fail in that case, since tokio is an optional dependency. Or am I missing something?
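One hypothetical way to surface this locally (not a command from this PR; the feature names are the ones discussed above):

cargo check -p opentelemetry-sdk --features experimental_async_runtime,rt-async-std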
That makes me think - we don't have a CI test for this :)
@lalitb Makes sense. We can make experimental_async_runtime depend on tokio/sync (it contains only Tokio's sync primitives) to avoid pulling in some other library for an asynchronous RwLock. The result would be:
# Cargo.toml
[dependencies]
async-std = { workspace = true, features = ["unstable"], optional = true }
tokio = { workspace = true, optional = true }
tokio-stream = { workspace = true, optional = true }
[features]
experimental_async_runtime = ["dep:tokio", "tokio/sync"]
rt-tokio = ["dep:tokio", "dep:tokio-stream", "tokio/rt", "tokio/time", "experimental_async_runtime"]
rt-tokio-current-thread = ["dep:tokio", "dep:tokio-stream", "tokio/rt", "tokio/time", "experimental_async_runtime"]
rt-async-std = ["dep:async-std", "experimental_async_runtime"]
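With the features wired this way, code behind experimental_async_runtime can rely on Tokio's sync primitives being available even when only rt-async-std selects the executor. A minimal sketch of what that enables (the actual cfg gate used in the crate may differ):

#[cfg(feature = "experimental_async_runtime")]
use tokio::sync::RwLock; // tokio/sync is pulled in by the feature itself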
Looks good to me. Also, another option could be to use async_std::sync::RwLock for the rt-async-std feature.
Also, another option could be to use the async_std::sync::RwLock

This would bring additional #[cfg] attributes depending on the runtime, which would increase code complexity (a sketch of what that would look like follows below). As long as Tokio's RwLock works on any runtime, consider using it.
Also, here's a demo of using async-std: https://github.com/50U10FCA7/otel-oom/tree/debug-async-std. Everything compiles (as we discussed above), but it seems the opentelemetry-otlp crate relies on some Tokio functionality (opentelemetry-otlp uses hyper-util, which relies on the Tokio runtime). Not part of this PR, just mentioning it.
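For reference, the per-runtime alternative mentioned above would mean conditional imports roughly like this (illustrative only, not code from this PR):

#[cfg(feature = "rt-async-std")]
use async_std::sync::RwLock;
#[cfg(not(feature = "rt-async-std"))]
use tokio::sync::RwLock;

// ...and every call site awaiting the lock has to compile against both types.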
Everything compiles (as we discussed above), but it seems the opentelemetry-otlp crate relies on some Tokio functionality (opentelemetry-otlp uses hyper-util, which relies on the Tokio runtime). Not part of this PR, just mentioning it.

Yes, this is expected for otlp/gRPC and otlp/http with hyper.
Haven't reviewed this in detail, but I suggest describing the fix in the PR description to make it easier to review. This looks like a lot of changes, and unless they are directly related to fixing the OOM bug, I'd suggest separate PRs. (I really prefer small, focused PRs.)
Additionally, I also suggest fixing the default BatchSpanProcessor (which spawns its own thread). It is unknown how we'll support custom async runtimes - it could be by enriching the BatchSpanProcessor itself to work better in Tokio/other runtimes, and not necessarily in span_processor_with_async_runtime.
(Not opposed to fixing bugs meanwhile, just sharing that the future of this struct is not decided.)
There is only one additional change, described in #2793 (comment) (which is minor, I think). But OK, I'll replace those changes with a TODO.
The problem relates only to the async batch processor. I tested the sync version in https://github.com/50U10FCA7/otel-oom/tree/master/src (just replaced the async processor with the sync one), and no OOM happens.
Force-pushed from 4106e4b to 6c6c596
if self.spans.len() == self.config.max_export_batch_size {
    // Replace the oldest span with the new one to avoid suspending message
    // processing.
    self.spans.pop_front();
This seems to be silently dropping the old span - Should we log a warning here?
@lalitb We should. Also, we probably need to bound on self.config.max_queue_size instead of self.config.max_export_batch_size. BatchConfig documents that spans will be dropped only when max_queue_size is reached, not max_export_batch_size.
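A rough sketch of that suggestion (illustrative only, not the code merged in this PR; the eprintln! is a stand-in for whatever internal warning mechanism the SDK uses):

if self.spans.len() >= self.config.max_queue_size {
    // Queue is full: drop the oldest span instead of growing the buffer
    // without bound or blocking message processing, and make the drop visible.
    eprintln!("BatchSpanProcessor queue full; dropping the oldest span");
    self.spans.pop_front();
}
self.spans.push_back(span);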
It's not clear how the current code is leading to OOM. #2793 (comment)
Fixes #2787

Changes
Fixes BatchSpanProcessor overflowing.

Merge requirement checklist
CHANGELOG.md files updated for non-trivial, user-facing changes