caddy/modules/caddyhttp/tracing
wangdongyong 69d6ace32e
tracing: fix BatchSpanProcessor goroutine leak on config reload (#7826)
* tracing: fix BatchSpanProcessor goroutine leak on config reload

When the `tracing` directive is enabled, each config reload leaks one
`go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue`
goroutine. On a server that reloads frequently (e.g. polling a remote
config source every few seconds) this accumulates into tens of thousands
of leaked goroutines over time.

A goroutine dump shows many identical stacks:

```
goroutine ... [select]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(...)
	.../sdk/trace/batch_span_processor.go:327
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func2()
	.../sdk/trace/batch_span_processor.go:129
created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor
	.../sdk/trace/batch_span_processor.go:127
```

`tracing` keeps a global, reference-counted `TracerProvider` so it can be
reused across reloads (`tracerProvider.getTracerProvider`). Caddy reloads
provision the new config *before* cleaning up the old one, so the counter
never drops to 0 and the provider is correctly reused — `Shutdown` is
never called, by design.

The problem is on the caller side in `newOpenTelemetryWrapper`:

```go
traceExporter, err := autoexport.NewSpanExporter(ctx)
...
tracerProvider := globalTracerProvider.getTracerProvider(
    sdktrace.WithBatcher(traceExporter),   // evaluated on every reload
    sdktrace.WithResource(res),
)
```

`sdktrace.WithBatcher(e)` is `WithSpanProcessor(NewBatchSpanProcessor(e))`,
and `NewBatchSpanProcessor` **starts its `processQueue` goroutine eagerly at
construction time** — not when the option is applied. The option is built
on every `Provision` (every reload), but `getTracerProvider` only applies
it when it actually creates a new provider (`t.tracerProvider == nil`). On
the reuse path the option is silently discarded, so the just-started
BatchSpanProcessor goroutine is orphaned: it is never registered with any
provider and therefore never shut down. Result: one leaked goroutine (plus
a leaked exporter) per reload.

Defer construction of the exporter/batcher until a new provider is actually
needed. `getTracerProvider` now takes a `buildOpts` factory that is invoked
only on the create path, so nothing with a side effect is built on the
reuse path.

The reference counter is now incremented only after the provider is
successfully obtained, preserving the previous semantics where a failed
exporter creation did not affect the counter.

- `Test_tracersProvider_buildOptsOnlyOnCreate` — asserts `buildOpts` runs
  exactly once across one create + five reuses (the regression guard).
- `Test_tracersProvider_buildOptsError` — asserts that on a build error the
  provider stays nil and the counter is not incremented.
- Existing tracing tests updated for the new signature and still pass.

Verified manually with a reload loop: before the fix, 50 reloads leaked 50
`processQueue` goroutines; after the fix, 0 are leaked while the provider is
still reused (counter stays at 1).

* add test for buldOpts error path

---------

Co-authored-by: Zen Dodd <mail@steadytao.com>
2026-06-18 11:42:58 -06:00
..
module.go tracing: Add span attributes to tracing module (#7269) 2025-12-31 11:33:18 -07:00
module_test.go tracing: Add span attributes to tracing module (#7269) 2025-12-31 11:33:18 -07:00
tracer.go tracing: fix BatchSpanProcessor goroutine leak on config reload (#7826) 2026-06-18 11:42:58 -06:00
tracer_test.go tracing: Add span attributes to tracing module (#7269) 2025-12-31 11:33:18 -07:00
tracerprovider.go tracing: fix BatchSpanProcessor goroutine leak on config reload (#7826) 2026-06-18 11:42:58 -06:00
tracerprovider_test.go tracing: fix BatchSpanProcessor goroutine leak on config reload (#7826) 2026-06-18 11:42:58 -06:00