Observability¶
Courier ships a single, opt-in observability layer that produces structured logs (tracing), metrics, and W3C-propagated traces from the same in-process call sites. Every signal is exported via OTLP to an OpenTelemetry Collector; Courier itself does not host a Prometheus scrape endpoint or a UI.
This keeps Courier focused, lets operators pick any backend the Collector supports, and avoids a hard dependency on opentelemetry-prometheus. A reference Collector config and Compose stack ship with the repository.
What you get for free¶
Built-in components and any third-party node that uses BasicTransform / ManagedSink are instrumented automatically:
- a structured log line for every error and retry, with `node_id`, `pipeline`, and `error` fields — never the payload or the key by default
- six counters and two histograms per node, plus a channel-depth gauge per mpsc edge
- one span per source / transform / sink iteration, linked end-to-end through `Meta.headers["traceparent"]`
No instrumentation work is required to publish a custom transform or sink — just implement MapOne or WriteOne.
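For example, a filtering transform needs nothing beyond the trait method itself. The sketch below is illustrative: it assumes `MapOne` exposes a single `map` method that returns `Ok(None)` to filter and `Ok(Some(env))` to pass through (Courier's exact signature may differ), and that the payload exposes `is_empty()`.

```rust
use anyhow::Result;

/// Hypothetical transform that drops envelopes with an empty payload.
struct DropEmpty;

impl MapOne for DropEmpty {
    // Signature assumed for illustration; see the MapOne docs for the real one.
    fn map(&mut self, env: Envelope) -> Result<Option<Envelope>> {
        if env.payload.is_empty() {
            Ok(None) // counted as courier_envelopes_filtered_total
        } else {
            Ok(Some(env)) // counted as courier_envelopes_processed_total
        }
    }
}
```

Wrapped in `BasicTransform`, this node gets its span, counters, and error logs without any tracing or metrics calls of its own.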
Configuration¶
Observability is driven entirely by an optional [observability] block in your config file. An empty or absent block keeps Courier silent: text logs to stderr, no metrics, no traces. Setting an otlp_endpoint is what activates each signal.
[observability]
service_name = "courier" # OTel resource service.name
log_format = "json" # "text" (default) or "json"
log_level = "info" # default filter when RUST_LOG is unset
log_keys = false # allow meta.key in span fields (off by default)
[observability.metrics]
otlp_endpoint = "http://collector:4317"
export_interval_ms = 15000 # OTLP push interval (default 15 s)
[observability.tracing]
otlp_endpoint = "http://collector:4317"
sample_ratio = 0.1 # parent-based sampling (0.0–1.0)
[observability.logs]
otlp_endpoint = "http://collector:4317" # bridge tracing events to OTLP logs
Precedence rules:
- `RUST_LOG` always wins over `log_level`, matching Courier's prior `env_logger` behavior.
- `metrics.otlp_endpoint = ""` (or unset) disables metric export — counters and histograms still exist in-process, but observations are dropped to a private no-op meter.
- `tracing.otlp_endpoint = ""` (or unset) disables span export the same way.
- `logs.otlp_endpoint = ""` (or unset) disables OTLP log export — events still flow to stderr through the local fmt layer.
Validation runs at startup: sample_ratio must be a finite value in [0.0, 1.0], service_name must be non-empty, export_interval_ms must be greater than zero, and log_level must be a valid EnvFilter directive. Bad values fail courier validate and courier run with a path-annotated error.
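Those checks amount to a handful of guards; a minimal sketch, assuming a deserialized config struct (`ObservabilityConfig` and its fields are illustrative names, not Courier's actual types):

```rust
use tracing_subscriber::EnvFilter;

// Illustrative startup validation mirroring the rules above.
fn validate(cfg: &ObservabilityConfig) -> Result<(), String> {
    if cfg.service_name.trim().is_empty() {
        return Err("observability.service_name: must be non-empty".into());
    }
    if let Some(t) = &cfg.tracing {
        if !t.sample_ratio.is_finite() || !(0.0..=1.0).contains(&t.sample_ratio) {
            return Err("observability.tracing.sample_ratio: must be within [0.0, 1.0]".into());
        }
    }
    if let Some(m) = &cfg.metrics {
        if m.export_interval_ms == 0 {
            return Err("observability.metrics.export_interval_ms: must be greater than zero".into());
        }
    }
    // log_level must parse as an EnvFilter directive, e.g. "info" or "courier=debug".
    EnvFilter::try_new(&cfg.log_level).map_err(|e| format!("observability.log_level: {e}"))?;
    Ok(())
}
```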
Logs¶
Courier installs a tracing-subscriber registry on startup. The format is text by default (one line per event, written to stderr so stdout stays clean) and can be switched to structured JSON with `log_format = "json"`. Existing `log::` call sites are bridged through `tracing_log::LogTracer`, so log level filtering still works the way it always did.
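The layering looks roughly like this; a sketch, not Courier's exact code (`init_logging` is an illustrative name):

```rust
use tracing_subscriber::{fmt, layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

fn init_logging(log_level: &str, json: bool) -> anyhow::Result<()> {
    // RUST_LOG wins; fall back to the configured log_level otherwise.
    let filter = EnvFilter::try_from_default_env().or_else(|_| EnvFilter::try_new(log_level))?;

    // One line per event on stderr, so stdout stays clean.
    let fmt_layer = fmt::layer().with_writer(std::io::stderr);

    if json {
        tracing_subscriber::registry().with(filter).with(fmt_layer.json()).init();
    } else {
        tracing_subscriber::registry().with(filter).with(fmt_layer).init();
    }

    // Bridge legacy log:: call sites into tracing (no-op if a bridge is already set).
    let _ = tracing_log::LogTracer::init();
    Ok(())
}
```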
Every span and event carries the same canonical fields:
| Field | Meaning |
|---|---|
| `pipeline` | Pipeline name from config. |
| `node_id` | Hierarchical id: `{pipeline}/src`, `…/t0`, `…/sink0`, `…/edge/…`. |
| `node_kind` | One of `source`, `transform`, `sink`, `splitter`, `edge`. |
| `envelope.source_id` | `Meta.source_id` of the envelope being processed. |
| `envelope.key` | `Meta.key` — only emitted when `log_keys = true`. |
| `error` | Stringified error chain on failure / retry / dead-letter logs. |
Payloads and arbitrary header values are never logged. This is enforced as a privacy default: tracing calls in the runtime never reference env.payload, and meta.key only flows into a span when an operator opts in.
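As a concrete illustration, a retry warning carries the canonical fields from the table above and nothing payload-derived; the call site below is illustrative, not Courier's actual code, and the values are made up:

```rust
use anyhow::Error;

fn log_retry(err: &Error) {
    tracing::warn!(
        pipeline = "orders",
        node_id = "orders/sink0",
        node_kind = "sink",
        envelope.source_id = "kafka-orders/3/1841",
        error = %err, // stringified error chain
        "write failed, retrying"
    );
}
```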
OTLP log export¶
Set [observability.logs].otlp_endpoint to forward every tracing event as an OpenTelemetry LogRecord over OTLP/gRPC. The bridge is additive — local stderr output via the fmt layer is untouched — so operators can keep kubectl logs working while shipping the same lines to a backend.
The exporter shares the service_name resource attribute with metrics and traces, and each emitted log record automatically carries the active span's trace_id / span_id (Loki, Datadog, etc. show this as structured metadata you can pivot on). LOGGER_PROVIDER is force-flushed and shut down alongside the metrics and tracer providers in the SIGINT path, so the last batch is not lost.
The reference Collector ships logs to Loki via OTLP/HTTP (/otlp/v1/logs); see Sample Compose stack below for how to query them in Grafana.
Metrics¶
Every metric carries the labels pipeline, node_id, node_kind. No other labels are added by the runtime: payload-derived values like meta.key or source_id are span fields only, never metric attributes, so cardinality stays bounded by your topology.
| Name | Type | Unit | Reported by | Meaning |
|---|---|---|---|---|
| `courier_envelopes_processed_total` | counter | 1 | transforms, sinks | Envelopes that completed successfully. Sinks only count writes that returned `Written` (not dead-lettered). |
| `courier_envelopes_filtered_total` | counter | 1 | transforms | Envelopes intentionally dropped by `MapOne::map` returning `Ok(None)`. Not counted as processed. |
| `courier_envelopes_failed_total` | counter | 1 | transforms, sinks | Envelopes that errored after any retries were exhausted. Includes envelopes routed to dead-letter (they are both failed and dead-lettered). |
| `courier_retries_total` | counter | 1 | sinks | One increment per retry attempt — the first try is not counted. |
| `courier_dead_lettered_total` | counter | 1 | sinks | Envelopes successfully appended to a dead-letter file after retries were exhausted. |
| `courier_stage_duration_milliseconds` | histogram | ms | transforms, sinks | Wall-clock time for one envelope through the node, including any retries. |
| `courier_end_to_end_latency_milliseconds` | histogram | ms | sinks | `now − env.meta.timestamp_ms` at write completion. Skipped if the clock looks skewed (envelope timestamp in the future). |
| `courier_channel_capacity_used` | histogram | 1 | per mpsc edge (sampled) | `channel_capacity − sender.capacity()` — items currently in flight on that edge. Sampled every 300 ms (bursts shorter than 300 ms may be missed); carries `node_kind = "edge"` and `node_id = "{pipeline}/edge/{src}->{dst}"`. |
A few semantic rules worth knowing:
- Filtered ≠ failed. A transform that filters does not bump `processed`, `failed`, or `retries`. Dashboards summing "throughput" should add `processed + filtered` if they want input volume.
- Dead-letter accounting is double-stamped on purpose. A successfully dead-lettered envelope increments both `failed` and `dead_lettered`, so an operator alerting on `failed` always sees the event, while a dead-letter dashboard gets a precise sub-count. The dead-letter file write itself is best-effort: if appending fails, the original error is propagated to the configured `on_error` policy and `dead_lettered_total` is not incremented.
- Channel-depth is a gauge sampled as a histogram. The 300 ms cadence is fixed — fine enough to catch sustained backpressure, coarse enough to avoid metric spam on busy pipelines. A sketch of the sampler loop is shown after this list.
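The edge sampler can be sketched against a tokio `mpsc::Sender`; `record` stands in for the OTel histogram's record call, and the function name is illustrative:

```rust
use std::time::Duration;
use tokio::sync::mpsc::Sender;

async fn sample_edge_depth<T>(sender: Sender<T>, record: impl Fn(u64)) {
    let mut tick = tokio::time::interval(Duration::from_millis(300));
    loop {
        tick.tick().await;
        // Items currently in flight = total permits minus free permits.
        let used = (sender.max_capacity() - sender.capacity()) as u64;
        record(used);
        if sender.is_closed() {
            break; // the edge has shut down
        }
    }
}
```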
Metrics are pushed via OTLP/gRPC on the interval configured by export_interval_ms. The SIGINT handler force-flushes the MeterProvider and TracerProvider before joining task handles, so the last batch survives a graceful drain.
Tracing¶
Each iteration of a source / transform / sink runs inside a span:
| Span name | Emitted by | Default level |
|---|---|---|
| `courier.source` | `SourceCtx::send` | INFO |
| `courier.transform` | `BasicTransform` | INFO |
| `courier.sink` | `ManagedSink` | INFO |
Spans carry the same pipeline / node_id / node_kind / envelope.source_id fields as logs. envelope.key is added only when log_keys = true.
W3C trace-context propagation¶
Courier carries trace context on Envelope.Meta.headers using the W3C traceparent and tracestate keys — no envelope schema change. The flow is:
- `SourceCtx::send` opens a `courier.source` span, extracts any incoming `traceparent` (e.g. set by `KafkaSource` from message headers), and writes the resulting context back into the outgoing envelope's headers.
- Each downstream `BasicTransform` and `ManagedSink` extracts the parent from `meta.headers`, opens its stage span as a child, and re-injects on the way out so the next hop sees a refreshed `traceparent` (see the sketch after this list).
- Built-in egress sinks copy `meta.headers["traceparent"]` onto the wire — `ApiSink` as an HTTP request header, `KafkaSink` as a Kafka message header — so downstream consumers see the same trace.
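One hop of that extract / child-span / re-inject cycle can be sketched with the OpenTelemetry propagation API. Here `Meta.headers` is modeled as a plain `HashMap<String, String>`, and the snippet assumes the `tracing-opentelemetry` layer is installed so span contexts are real OTel contexts; it is a sketch of the mechanism, not Courier's code:

```rust
use std::collections::HashMap;

use opentelemetry::propagation::{Extractor, Injector, TextMapPropagator};
use opentelemetry_sdk::propagation::TraceContextPropagator;
use tracing_opentelemetry::OpenTelemetrySpanExt;

/// Adapter so the W3C propagator can read and write traceparent/tracestate
/// in an envelope header map.
struct Headers<'a>(&'a mut HashMap<String, String>);

impl Injector for Headers<'_> {
    fn set(&mut self, key: &str, value: String) {
        self.0.insert(key.to_owned(), value);
    }
}

impl Extractor for Headers<'_> {
    fn get(&self, key: &str) -> Option<&str> {
        self.0.get(key).map(String::as_str)
    }
    fn keys(&self) -> Vec<&str> {
        self.0.keys().map(String::as_str).collect()
    }
}

/// One transform hop: parent in, child span, refreshed traceparent out.
fn transform_hop(headers: &mut HashMap<String, String>) {
    let propagator = TraceContextPropagator::new();

    // Extract the upstream context carried in meta.headers.
    let parent = propagator.extract(&Headers(headers));

    // Open the stage span as a child of that context.
    let span = tracing::info_span!("courier.transform");
    span.set_parent(parent);

    // Re-inject so the next hop (or an egress sink) sees this span as its parent.
    propagator.inject_context(&span.context(), &mut Headers(headers));
}
```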
Sampling is parent-based with the configured sample_ratio at root. A non-sampled trace coming in stays non-sampled all the way through. The default 0.1 is conservative; raise it for low-volume pipelines or when actively debugging.
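In SDK terms that corresponds to a parent-based sampler with a trace-id ratio at the root; a sketch (the exact tracer setup is Courier's):

```rust
use opentelemetry_sdk::trace::Sampler;

// sample_ratio = 0.1 samples 10% of new roots, but always follows the
// parent's decision when a traceparent already arrived with the envelope.
fn sampler_for(sample_ratio: f64) -> Sampler {
    Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(sample_ratio)))
}
```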
Privacy and cardinality¶
- Payloads are never logged or attached to spans. A CI grep test fails the build if any `tracing::*!` call references `env.payload` outside the dedicated `trace_context` module (a sketch of such a check follows this list).
- `meta.key` is off by default. Set `log_keys = true` only if your keys are non-sensitive — they will appear as the `envelope.key` span field on every transform and sink hop.
- Metric labels are restricted to topology: `pipeline`, `node_id`, `node_kind`, plus `outcome` where applicable. Anything derived from envelope contents is a span field, not a label.
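A guard like that can live as an ordinary repo test; the sketch below uses the `walkdir` crate and line-based matching, both of which are illustrative simplifications rather than the real CI check:

```rust
/// Fail if a tracing macro call mentions env.payload anywhere under src/,
/// except in the dedicated trace_context module.
#[test]
fn tracing_calls_never_reference_payload() {
    let mut offenders = Vec::new();
    for entry in walkdir::WalkDir::new("src").into_iter().filter_map(Result::ok) {
        let path = entry.path();
        if path.extension().map_or(false, |e| e == "rs") && !path.ends_with("trace_context.rs") {
            let text = std::fs::read_to_string(path).unwrap();
            for (n, line) in text.lines().enumerate() {
                // Line-based matching; multi-line macro calls would need a smarter scan.
                if line.contains("tracing::") && line.contains('!') && line.contains("env.payload") {
                    offenders.push(format!("{}:{}", path.display(), n + 1));
                }
            }
        }
    }
    assert!(offenders.is_empty(), "payload referenced in tracing calls: {offenders:?}");
}
```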
Sample Collector¶
A reference Collector config lives at examples/otel-collector.yaml. It receives OTLP/gRPC and OTLP/HTTP from Courier, re-exposes a Prometheus scrape endpoint on :9464, forwards traces to Tempo (or, by uncommenting a single block, Jaeger), and ships logs to Loki via OTLP/HTTP:
receivers:
otlp:
protocols:
grpc: { endpoint: 0.0.0.0:4317 }
http: { endpoint: 0.0.0.0:4318 }
exporters:
prometheus:
endpoint: 0.0.0.0:9464
otlp/tempo:
endpoint: tempo:4317
tls: { insecure: true }
otlphttp/loki:
endpoint: http://loki:3100/otlp
service:
pipelines:
metrics:
receivers: [otlp]
exporters: [prometheus]
traces:
receivers: [otlp]
exporters: [otlp/tempo]
logs:
receivers: [otlp]
exporters: [otlphttp/loki]
Point Courier's otlp_endpoint at http://<collector>:4317 and Prometheus at http://<collector>:9464/metrics to scrape Courier through the Collector.
Sample Compose stack¶
The repository's examples/docker-compose.observability.yml boots a full local stack you can point Courier at:
- otel-collector on `4317`/`4318` (OTLP) and `9464` (Prometheus scrape)
- prometheus scraping the Collector
- tempo receiving traces over OTLP
- loki receiving logs over OTLP/HTTP
- grafana preconfigured with all three datasources and the bundled `examples/dashboards/courier.json`
Set every signal's endpoint to the Collector and boot the stack:
[observability]
log_format = "json"
[observability.metrics]
otlp_endpoint = "http://localhost:4317"
[observability.tracing]
otlp_endpoint = "http://localhost:4317"
[observability.logs]
otlp_endpoint = "http://localhost:4317"
docker compose -f examples/docker-compose.observability.yml up -d
COURIER_CONFIG=examples/config.toml courier run
Viewing logs in Grafana¶
- Open Grafana at `http://localhost:3000` (anonymous admin is enabled).
- Go to Explore and select the Loki datasource.
- Run `{service_name="courier"}` to see every line Courier emits. Add `|= "error"` to grep, or `| json` to break out structured fields like `node_id`, `pipeline`, `error`.
- Each log line carries `trace_id` as structured metadata — click it to jump straight to the matching trace in Tempo without leaving the line.
Useful starter queries:
| Query | What it shows |
|---|---|
| `{service_name="courier"}` | All Courier logs. |
| `{service_name="courier"} \|= "retry"` | Retry warnings from `write_with_retry`. |
| `{service_name="courier"} \|= "dead_letter"` | Envelopes routed to the dead-letter sink. |
| `{service_name="courier"} \| json \| node_kind="sink" \| level="ERROR"` | Sink failures, broken out by structured field. |
Field reference¶
Logs / spans:

- `pipeline` (string): pipeline name
- `node_id` (string): `{pipeline}/{src|t{i}|sink{i}|broadcast|edge/...}`
- `node_kind` (enum): `source` | `transform` | `sink` | `splitter` | `edge`
- `envelope.source_id` (string): `Envelope.Meta.source_id`
- `envelope.key` (string): `Envelope.Meta.key`, only when `log_keys = true`
- `error` (string): formatted `anyhow` chain on failure logs

Metrics labels:

- `pipeline`, `node_id`, `node_kind` (no payload-derived attributes)