Skip to content

feat: span processor api refactor #2962

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

paullegranddc
Copy link
Contributor

@paullegranddc paullegranddc commented May 13, 2025

Design discussion issue (if applicable)

#2940

Changes

This PR refactors the SpanProcessor API.

The API currently has the following issues:

  1. It's not possible to read a span fields in a processor without calling export_data, which completely copies the data
  2. The SpanProcessor::on_end API means owned span data has to be passed to each span processor, even if it does nothing with it.

One way that was explored to fix number 2. was to pass a &mut Span to on_end do processor can use std::mem::take to consume data without copying, but this is not ideal for at least two reasons:

  • The otel sdk spec specifically forbids mutation of the span in on_end, and modifications done in one processor should not impact data passed to other processors
  • If there are multiple processors, and any processor but the last one grabs the data and leaves the span empty, this will cause bugs dependent on span processor ordering...

This PR thus proposes the following changes changes to the API:

Introduce a ReadableSpan trait, implemented for SdkSpan.

This trait define getters for span fields so we can read the span without copying or consuming data. This is inspired by other Otel libraries such as Java which has a similar interface.

Modify on_end to pass it a FinishedSpan.

FinishedSpan is a new abstraction around a span that has been closed. It implements the ReadableSpan trait so we can read from it without consuming the data.
if a span processor needs to get an owned SpanData on it. This will do one of two things:

  • if this is the last span processor that will be invoked, it gives ownership of the span data without copying
  • otherwise it copies the span data

This design

  1. allows not copying in span processors that don't need owned data during on_end
  2. doesn't expose mutable spans in on_end
  3. eludes the span data copy if the exporter is the last span processor (which is usually the case).

Add on_ending method on thr SpanProcessor trait. [Spec link

](https://github.com/open-telemetry/opentelemetry-specification/blob/e5bc8e18e647a47b25d264b0c67ce6c0d0e1ec93/specification/trace/sdk.md#onending)
This method is called during span ending, before on_end. it is given a mutable span and is very useful to implement obfuscation for instance.
It makes possible the use cases that were the motivation for passing mutable data to on_end while respecting the spec.

Benchmarking

A benchmark running creating a span and dropping it, invoking the on_start, on_ending and on_end methods for a varying number of span processors has been added.

Baseline - main branch:

SpanProcessorApi/0_processors
    time:   [339.66 ns 340.56 ns 341.47 ns]
SpanProcessorApi/1_processors
    time:   [373.10 ns 374.36 ns 375.60 ns]
SpanProcessorApi/2_processors
    time:   [803.10 ns 804.99 ns 807.03 ns]
SpanProcessorApi/4_processors
    time:   [1.2096 µs 1.2137 µs 1.2179 µs]

Candidate - paullegranddc:paullgdc/sdk/span_processor_api_refactor:

SpanProcessorApi/0_processors
    time:   [385.15 ns 386.14 ns 387.25 ns]
SpanProcessorApi/1_processors
    time:   [385.73 ns 387.17 ns 388.85 ns]
SpanProcessorApi/2_processors
    time:   [384.84 ns 385.66 ns 386.50 ns]
SpanProcessorApi/4_processors
    time:   [386.78 ns 388.17 ns 389.58 ns]

On main, the cost of a span lifecycle is proportional to the number of span processors, because the Clone operation on span data is as expensive as creating the span.

After this PR, since span processors don't clone data unless they need it, the cost is constant regardless of the number of span processors.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

Currently there is no way to read the span data without cloning it.
This causes performance issues when the span for span
processors that need to read the span data.

This commit introduces a new trait `ReadableSpan` in the sdk
implemented by SDK Spans that allows to read the span data without
cloning it.
This API allows to mutations of the span when it is ending.
It's marked as on development in the spec, but it is useful for
span obfuscation for example, which needs to done after
attributes can added to the span anymore.
@paullegranddc paullegranddc requested a review from a team as a code owner May 13, 2025 16:00
Copy link

codecov bot commented May 13, 2025

Codecov Report

Attention: Patch coverage is 44.76190% with 116 lines in your changes missing coverage. Please review.

Project coverage is 81.1%. Comparing base (8fe3dcc) to head (1d3e6d8).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
opentelemetry-sdk/src/trace/span.rs 35.1% 116 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2962     +/-   ##
=======================================
- Coverage   81.4%   81.1%   -0.4%     
=======================================
  Files        126     126             
  Lines      24305   24465    +160     
=======================================
+ Hits       19808   19860     +52     
- Misses      4497    4605    +108     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@scottgerring scottgerring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @paullegranddc thanks for opening the PR!
I've had a quick zoom through now and have some comments and queries.

This feels like a sensible way of addressing the problems you mentioned (and highlighted in the examples 👍 ). Of course it will break the existing on_end, but as we're not stable here, yet and there's a clear upshot, from my perspective that seems reasonable. I think @cijothomas will also have helpful opinions here - Cijo?

Adding on_ending seems like a no brainer.

It would be great to look at some big downstream projects (tracing-opentelemetry comes to mind), and check if they are impacted. It would also be great to have some perf numbers from benchmarks touching the path through ensure_ended_and_exported which you may already have handy?

@@ -79,11 +79,24 @@ pub trait SpanProcessor: Send + Sync + std::fmt::Debug {
/// synchronously on the thread that started the span, therefore it should
/// not block or throw exceptions.
fn on_start(&self, span: &mut Span, cx: &Context);
#[cfg(feature = "experimental_span_processor_on_ending")]
/// `on_ending` is called when a `Span` is ending. The en timestampe has already
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// `on_ending` is called when a `Span` is ending. The en timestampe has already
/// `on_ending` is called when a `Span` is ending. The end timestamp has already

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also clarify that these are called before on_end - just to make the ordering super explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reformulated the comment, it should be a bit clearer

/// `on_end` is called after a `Span` is ended (i.e., the end timestamp is
/// already set). This method is called synchronously within the `Span::end`
/// API, therefore it should not block or throw an exception.
/// TODO - This method should take reference to `SpanData`
fn on_end(&self, span: SpanData);
fn on_end(&self, span: &mut FinishedSpan);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting this is a breaking change

/// * if it called twice in the same SpanProcessor::on_end
pub fn consume(&mut self) -> crate::trace::SpanData {
if self.is_consummed {
panic!("Span data has already been consumed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reckon this shouldn't panic; we can use the internal logging macros to emit an error

}

fn on_end(&self, span: &mut FinishedSpan) {
if !matches!(span.span_kind(), SpanKind::Server) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice motivating example for the on_end changes!

@@ -4,10 +4,10 @@ use opentelemetry::trace::{
SpanContext, SpanId, SpanKind, Status, TraceFlags, TraceId, TraceState,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you run the benchmark suite to do a performance regression against main ? Would be great to include.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've run the existing BatchSpanProcessor benchmarks.
With experimental_span_processor_on_ending not enabled, the benchmark show no detectable changes in performance.

If the experimental_span_processor_on_ending feature is enabled, there is a slight cost (5%) which I believe comes from the fact we have to clone the tracer field, because we have to iterate over the span processor list through a shared reference, while passing the span mutably. Thus we have to clone the tracer associated with the span so we can split the ownership (it's in an Arc cell so this is not that expensive)

Some(data) => data,
None => return,
};
let span_context: SpanContext =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to comment around this bit to make it easier to re-grok -- "take ownership of the span_context leaving an empty context in its place; this saves us a clone" or so, right?

/// If `consume`` is never called, the on_ending method will not perform any copy of
/// the span data.
///
/// Calling any `ReadableSpan` method on the `FinishedSpan` will panic if the span data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably error out through the internal logging macros here and not panic. We try to avoid panic-ing wherever we can.

Adds a benchmark creating a span and dropping it, without an exporter.
The goal is to estimate the cost of the SDK running without the exporter
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants