Lifecycle, health probes, and shutdown¶

Courier tracks the lifecycle state of every pipeline and exposes HTTP endpoints for container orchestrators (Kubernetes, systemd, Docker, etc.) to probe liveness and readiness. The same state machine drives graceful shutdown and exit codes.

Pipeline states¶

Each pipeline transitions through a well-defined set of states:

Starting → Running → Draining → Stopped   (normal shutdown)
               ↘ Failed                    (unrecoverable error)

State	Meaning
`Starting`	Tasks have been spawned but have not yet begun processing envelopes.
`Running`	The pipeline is actively processing envelopes. Reached once `spawn_pipeline` finishes wiring channels.
`Draining`	A shutdown signal was received. Sources stop pulling; sinks drain their channel receivers until upstream closes.
`Stopped`	All tasks completed. Terminal state for graceful shutdown.
`Failed`	An unrecoverable error terminated the pipeline (the `fail_pipeline` error policy). Terminal state — the pipeline does not recover.

State transitions are forward-only (except for Failed, which can be reached from any non-terminal state). Once a pipeline enters Failed or Stopped, no subsequent transition can overwrite it — a concurrent SIGINT arriving after a FailPipeline error will not erase the failure.

Health probes¶

Enable the health server with a [health] block in your config:

[health]
address = "0.0.0.0:9090"

The server provides two endpoints:

Endpoint	Method	Returns
`/health/live`	GET	`200 OK` with body `ok` — the process is alive.
`/health/ready`	GET	`200 OK` when every pipeline is `Running` and no shutdown has been requested; `503 Service Unavailable` otherwise.

The readiness response includes the state of every pipeline:

{
  "status": "not_ready",
  "pipelines": [
    { "name": "orders", "state": "starting" },
    { "name": "events", "state": "running" }
  ]
}

Kubernetes example¶

livenessProbe:
  httpGet:
    path: /health/live
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10

If [health] is not present in the config, no health server is started.

Graceful shutdown¶

On SIGINT/Ctrl+C, Courier:

Transitions every pipeline to Draining.
Cancels the shared CancellationToken — sources stop pulling, transforms finish their current item, sinks drain their channel receivers.
Waits up to shutdown.timeout_secs (default 30 s) for all tasks to complete.
If the deadline expires, logs a warning and orphan remaining tasks (they keep running until the process exits; dropping a JoinHandle does not abort the underlying tokio task).
Finalizes pipeline states: non-failed pipelines move to Stopped, failed pipelines stay Failed.
Force-flushes metrics, traces, and logs providers.
Aborts the health server task.

Shutdown timeout¶

Configure the drain deadline with a [shutdown] block:

[shutdown]
timeout_secs = 60

Field	Required	Default	Description
`timeout_secs`	no	30	Maximum seconds to wait for in-flight envelopes to drain after SIGINT. Must be greater than 0.

A shorter timeout means faster process exit but risks dropping envelopes that are still in channel buffers. A longer timeout gives sinks more time to flush but delays termination.

What happens to in-flight envelopes¶

Sources stop producing new envelopes immediately when the cancellation token fires. Sinks continue consuming from their channel receivers until the upstream closes (the source task drops its sender). Envelopes already in channel buffers are processed as long as they drain within the timeout. If the timeout expires first, those envelopes are lost.

Exit codes¶

Courier's run command exits with:

Code	Condition
0	All pipelines shut down cleanly (SIGINT or natural source exhaustion).
1	Configuration or startup error, or at least one pipeline hit `fail_pipeline`.
2	Invalid CLI arguments (handled by `clap`).

The fail_pipeline error policy sets the pipeline state to Failed and cancels its token. At process exit, Courier::run() inspects CourierState::has_failures() and returns RunOutcome::Failed, which maps to exit code 1.

This lets supervisors and init systems detect runtime failures:

courier run -c config.toml || echo "Courier exited with error code $?"

In a Kubernetes pod spec, a non-zero exit code triggers pod restart policies automatically.

Interaction with error policies¶

The fail_pipeline error policy and lifecycle state are tightly coupled:

A sink or transform with on_error = "fail_pipeline" encounters an error.
NodeCtx::mark_pipeline_failed() transitions the pipeline to Failed.
The node calls cancel.cancel() to signal all other tasks in the pipeline.
Sources, transforms, and sinks see the cancellation and exit their loops.
At the end of run(), has_failures() returns true and the process exits with code 1.

The drop error policy does not affect lifecycle state — the pipeline continues running.