Fix OOM in span_processor_with_async_runtime::BatchSpanProcessor #2793

Closed
wants to merge 3 commits

Conversation

50U10FCA7

Fixes #2787

Changes

Fixes the BatchSpanProcessor's span buffer growing without bound (leading to OOM).

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@50U10FCA7 50U10FCA7 requested a review from a team as a code owner March 12, 2025 12:56

linux-foundation-easycla bot commented Mar 12, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.


codecov bot commented Mar 12, 2025

Codecov Report

Attention: Patch coverage is 61.53846% with 5 lines in your changes missing coverage. Please review.

Project coverage is 79.6%. Comparing base (ad88615) to head (b13bb71).

Files with missing lines Patch % Lines
...sdk/src/trace/span_processor_with_async_runtime.rs 61.5% 5 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2793     +/-   ##
=======================================
- Coverage   79.6%   79.6%   -0.1%     
=======================================
  Files        124     124             
  Lines      23174   23181      +7     
=======================================
- Hits       18456   18455      -1     
- Misses      4718    4726      +8     

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tokio::sync::RwLock;
Member

We can't assume that the Tokio runtime is always available in this context. Instead, the runtime abstraction is used to ensure flexibility and avoid locking the code to a specific async runtime.
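To illustrate the point about the runtime abstraction: a minimal sketch of the idea, where the processor is written against a small trait rather than calling a specific async runtime directly. The names here (SpawnRuntime, StdThreadRuntime, run_export) are hypothetical, not the SDK's actual API; a simple thread-backed implementation stands in for a real Tokio- or async-std-backed one.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical abstraction: anything that can spawn a background task.
trait SpawnRuntime {
    fn spawn(&self, task: Box<dyn FnOnce() + Send + 'static>);
}

// Trivial implementation backed by std threads; a Tokio-backed
// implementation would call into tokio::spawn instead.
struct StdThreadRuntime;

impl SpawnRuntime for StdThreadRuntime {
    fn spawn(&self, task: Box<dyn FnOnce() + Send + 'static>) {
        thread::spawn(task);
    }
}

// Code written against the trait works with any runtime implementation,
// which is the flexibility the comment above is preserving.
fn run_export<R: SpawnRuntime>(rt: &R) -> mpsc::Receiver<&'static str> {
    let (tx, rx) = mpsc::channel();
    rt.spawn(Box::new(move || {
        // Stand-in for the actual export work.
        tx.send("exported").unwrap();
    }));
    rx
}
```

Importing tokio::sync::RwLock directly, as in the diff above, bypasses this abstraction, which is the concern being raised.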

Member

Thanks for the link. This makes sense. However, I was thinking of the scenario where the experimental_async_runtime and rt-async-std features are enabled for otel-sdk. I believe compilation will fail in that case, since tokio is an optional dependency. Or am I missing something?

Member

That makes me think - we don't have a CI test for this :)

Author

@50U10FCA7 50U10FCA7 Mar 12, 2025

@lalitb Makes sense. We can make experimental_async_runtime depend on tokio/sync (which contains only Tokio's sync primitives) to avoid pulling in some other library for an asynchronous RwLock. The result would be:

# Cargo.toml
[dependencies]
async-std = { workspace = true, features = ["unstable"], optional = true }
tokio = { workspace = true, optional = true }
tokio-stream = { workspace = true, optional = true }

[features]
experimental_async_runtime = ["dep:tokio", "tokio/sync"]
rt-tokio = ["dep:tokio", "dep:tokio-stream", "tokio/rt", "tokio/time", "experimental_async_runtime"]
rt-tokio-current-thread = ["dep:tokio", "dep:tokio-stream", "tokio/rt", "tokio/time", "experimental_async_runtime"]
rt-async-std = ["dep:async-std", "experimental_async_runtime"]

Member

Looks good to me. Also, another option could be to use async_std::sync::RwLock for the rt-async-std feature.

Author

@50U10FCA7 50U10FCA7 Mar 12, 2025

@lalitb

Also, another option could be to use the async_std::sync::RwLock

This would bring in additional #[cfg] attributes depending on the runtime, which would increase code complexity. As long as Tokio's RwLock works on any runtime, I'd rather keep using it.

Also, here's a demo using async-std: https://github.com/50U10FCA7/otel-oom/tree/debug-async-std.

Everything compiles (as we discussed above), but it seems the opentelemetry-otlp crate relies on some Tokio functionality (opentelemetry-otlp uses hyper-util, which relies on the Tokio runtime). Not part of this PR, but just mentioning it.

Member

Everything compiles (as we discussed above), but it seems the opentelemetry-otlp crate relies on some Tokio functionality (opentelemetry-otlp uses hyper-util, which relies on the Tokio runtime). Not part of this PR, but just mentioning it.

Yes, this is expected for otlp/gRPC and otlp/http with hyper.

Member

@cijothomas cijothomas left a comment

Haven't reviewed this in detail, but I suggest describing the fix in the PR description to make it easier to review. This looks like a lot of changes, and unless they are all directly related to fixing the OOM bug, I'd suggest separate PRs. (I really prefer small, focused PRs.)

Additionally, I also suggest fixing the default BatchSpanProcessor (which spawns its own thread). It is not yet decided how we'll support custom async runtimes - it could mean enriching the BatchSpanProcessor itself to work better in Tokio/other runtimes, rather than in span_processor_with_async_runtime.
(Not opposed to fixing bugs meanwhile, just sharing that the future of this struct is not decided.)

@50U10FCA7
Author

50U10FCA7 commented Mar 12, 2025

@cijothomas

This looks like a lot of changes, and unless they are directly related to fixing the OOM bug.

There is only one additional change, described in #2793 (comment), which is minor, I think. But OK, I'll replace these changes with a TODO.

Additionally, I also suggest to fix the default BatchSpanProcessor (which spawn own thread)

The problem is related only to the async batch processor. I tested the sync version in https://github.com/50U10FCA7/otel-oom/tree/master/src (just replaced the async processor with the sync one), and no OOM happened.

@50U10FCA7 50U10FCA7 force-pushed the 2787-fix-oom branch 2 times, most recently from 4106e4b to 6c6c596 Compare March 12, 2025 16:59
if self.spans.len() == self.config.max_export_batch_size {
// Replace the oldest span with the new span to avoid suspending messages
// processing.
self.spans.pop_front();
Member

This seems to be silently dropping the oldest span - should we log a warning here?

Author

@lalitb We should. Also, we probably need to bound the queue by self.config.max_queue_size instead of self.config.max_export_batch_size: BatchConfig documents that spans are dropped only once max_queue_size is reached, not max_export_batch_size.
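Putting the two points together, a minimal sketch of the bounded-queue behavior being discussed: cap the buffer at max_queue_size, drop the oldest entry when full, and record/warn about the drop instead of doing it silently. Names here (BoundedBuffer, its fields) are illustrative, not the PR's actual code, and a type parameter stands in for the SDK's span type.

```rust
use std::collections::VecDeque;

// Hypothetical bounded buffer: capped at `max_queue_size`, as suggested
// above, rather than at `max_export_batch_size`.
struct BoundedBuffer<T> {
    max_queue_size: usize,
    spans: VecDeque<T>,
    dropped: usize, // count of spans dropped so far, for visibility
}

impl<T> BoundedBuffer<T> {
    fn new(max_queue_size: usize) -> Self {
        Self {
            max_queue_size,
            spans: VecDeque::new(),
            dropped: 0,
        }
    }

    fn push(&mut self, span: T) {
        if self.spans.len() == self.max_queue_size {
            // Drop the oldest span instead of growing without bound
            // (unbounded growth is what caused the OOM in #2787),
            // and make the drop observable rather than silent.
            self.spans.pop_front();
            self.dropped += 1;
            eprintln!("BatchSpanProcessor: queue full, dropped oldest span");
        }
        self.spans.push_back(span);
    }
}
```

With max_queue_size = 2, pushing three spans leaves the two newest in the buffer and increments the drop counter once.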

Contributor

@utpilla utpilla left a comment

It's not clear how the current code is leading to OOM. #2793 (comment)

Successfully merging this pull request may close these issues.

[BUG]: OOM in BatchSpanProcessor