caddy/modules
wangdongyong 69d6ace32e
tracing: fix BatchSpanProcessor goroutine leak on config reload (#7826)
* tracing: fix BatchSpanProcessor goroutine leak on config reload

When the `tracing` directive is enabled, each config reload leaks one
`go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue`
goroutine. On a server that reloads frequently (e.g. polling a remote
config source every few seconds) this accumulates into tens of thousands
of leaked goroutines over time.

A goroutine dump shows many identical stacks:

```
goroutine ... [select]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(...)
	.../sdk/trace/batch_span_processor.go:327
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func2()
	.../sdk/trace/batch_span_processor.go:129
created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor
	.../sdk/trace/batch_span_processor.go:127
```

`tracing` keeps a global, reference-counted `TracerProvider` so it can be
reused across reloads (`tracerProvider.getTracerProvider`). Caddy reloads
provision the new config *before* cleaning up the old one, so the counter
never drops to 0 and the provider is correctly reused — `Shutdown` is
never called, by design.

The problem is on the caller side in `newOpenTelemetryWrapper`:

```go
traceExporter, err := autoexport.NewSpanExporter(ctx)
...
tracerProvider := globalTracerProvider.getTracerProvider(
    sdktrace.WithBatcher(traceExporter),   // evaluated on every reload
    sdktrace.WithResource(res),
)
```

`sdktrace.WithBatcher(e)` is `WithSpanProcessor(NewBatchSpanProcessor(e))`,
and `NewBatchSpanProcessor` **starts its `processQueue` goroutine eagerly at
construction time** — not when the option is applied. The option is built
on every `Provision` (every reload), but `getTracerProvider` only applies
it when it actually creates a new provider (`t.tracerProvider == nil`). On
the reuse path the option is silently discarded, so the just-started
BatchSpanProcessor goroutine is orphaned: it is never registered with any
provider and therefore never shut down. Result: one leaked goroutine (plus
a leaked exporter) per reload.

Defer construction of the exporter/batcher until a new provider is actually
needed. `getTracerProvider` now takes a `buildOpts` factory that is invoked
only on the create path, so nothing with a side effect is built on the
reuse path.

The reference counter is now incremented only after the provider is
successfully obtained, preserving the previous semantics where a failed
exporter creation did not affect the counter.

- `Test_tracersProvider_buildOptsOnlyOnCreate` — asserts `buildOpts` runs
  exactly once across one create + five reuses (the regression guard).
- `Test_tracersProvider_buildOptsError` — asserts that on a build error the
  provider stays nil and the counter is not incremented.
- Existing tracing tests updated for the new signature and still pass.

Verified manually with a reload loop: before the fix, 50 reloads leaked 50
`processQueue` goroutines; after the fix, 0 are leaked while the provider is
still reused (counter stays at 1).

* add test for buldOpts error path

---------

Co-authored-by: Zen Dodd <mail@steadytao.com>
2026-06-18 11:42:58 -06:00
..
caddyevents refactor: replace HasPrefix+TrimPrefix with CutPrefix (#7095) 2025-06-27 22:04:09 +03:00
caddyfs events: Refactor; move Event into core, so core can emit events (#6930) 2025-03-29 08:15:43 -06:00
caddyhttp tracing: fix BatchSpanProcessor goroutine leak on config reload (#7826) 2026-06-18 11:42:58 -06:00
caddypki tls: Add tls_resolvers global option for DNS challenge configuration (#7297) 2026-03-01 15:32:04 -05:00
caddytls caddytls: skip idna.ToASCII for pure ASCII SNI values (#7770) 2026-05-28 11:18:09 +10:00
filestorage caddyfile: Normalize & flatten all unmarshalers (#6037) 2024-01-23 19:36:59 -05:00
internal/network core: add modular network_proxy support (#6399) 2025-03-21 17:06:15 +00:00
logging chore: clean up wording and typo fixes (#7745) 2026-05-20 16:36:30 +10:00
metrics metrics: scope metrics to active config, add optional per-host metrics (#6531) 2024-10-02 08:23:26 -06:00
standard filesystem: Globally declared filesystems, fs directive (#5833) 2024-01-13 20:12:43 +00:00