Lifecycle, health probes, and shutdown¶
Courier tracks the lifecycle state of every pipeline and exposes HTTP endpoints for container orchestrators (Kubernetes, systemd, Docker, etc.) to probe liveness and readiness. The same state machine drives graceful shutdown and exit codes.
Pipeline states¶
Each pipeline transitions through a well-defined set of states:
| State | Meaning |
|---|---|
Starting |
Tasks have been spawned but have not yet begun processing envelopes. |
Running |
The pipeline is actively processing envelopes. Reached once spawn_pipeline finishes wiring channels. |
Draining |
A shutdown signal was received. Sources stop pulling; sinks drain their channel receivers until upstream closes. |
Stopped |
All tasks completed. Terminal state for graceful shutdown. |
Failed |
An unrecoverable error terminated the pipeline (the fail_pipeline error policy). Terminal state — the pipeline does not recover. |
State transitions are forward-only (except for Failed, which can be reached from any non-terminal state). Once a pipeline enters Failed or Stopped, no subsequent transition can overwrite it — a concurrent SIGINT arriving after a FailPipeline error will not erase the failure.
Health probes¶
Enable the health server with a [health] block in your config:
The server provides two endpoints:
| Endpoint | Method | Returns |
|---|---|---|
/health/live |
GET | 200 OK with body ok — the process is alive. |
/health/ready |
GET | 200 OK when every pipeline is Running and no shutdown has been requested; 503 Service Unavailable otherwise. |
The readiness response includes the state of every pipeline:
{
"status": "not_ready",
"pipelines": [
{ "name": "orders", "state": "starting" },
{ "name": "events", "state": "running" }
]
}
Kubernetes example¶
livenessProbe:
httpGet:
path: /health/live
port: 9090
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 9090
initialDelaySeconds: 5
periodSeconds: 10
If [health] is not present in the config, no health server is started.
Graceful shutdown¶
On SIGINT/Ctrl+C, Courier:
- Transitions every pipeline to
Draining. - Cancels the shared
CancellationToken— sources stop pulling, transforms finish their current item, sinks drain their channel receivers. - Waits up to
shutdown.timeout_secs(default 30 s) for all tasks to complete. - If the deadline expires, logs a warning and orphan remaining tasks (they keep running until the process exits; dropping a
JoinHandledoes not abort the underlying tokio task). - Finalizes pipeline states: non-failed pipelines move to
Stopped, failed pipelines stayFailed. - Force-flushes metrics, traces, and logs providers.
- Aborts the health server task.
Shutdown timeout¶
Configure the drain deadline with a [shutdown] block:
| Field | Required | Default | Description |
|---|---|---|---|
timeout_secs |
no | 30 | Maximum seconds to wait for in-flight envelopes to drain after SIGINT. Must be greater than 0. |
A shorter timeout means faster process exit but risks dropping envelopes that are still in channel buffers. A longer timeout gives sinks more time to flush but delays termination.
What happens to in-flight envelopes¶
Sources stop producing new envelopes immediately when the cancellation token fires. Sinks continue consuming from their channel receivers until the upstream closes (the source task drops its sender). Envelopes already in channel buffers are processed as long as they drain within the timeout. If the timeout expires first, those envelopes are lost.
Exit codes¶
Courier's run command exits with:
| Code | Condition |
|---|---|
| 0 | All pipelines shut down cleanly (SIGINT or natural source exhaustion). |
| 1 | Configuration or startup error, or at least one pipeline hit fail_pipeline. |
| 2 | Invalid CLI arguments (handled by clap). |
The fail_pipeline error policy sets the pipeline state to Failed and cancels its token. At process exit, Courier::run() inspects CourierState::has_failures() and returns RunOutcome::Failed, which maps to exit code 1.
This lets supervisors and init systems detect runtime failures:
In a Kubernetes pod spec, a non-zero exit code triggers pod restart policies automatically.
Interaction with error policies¶
The fail_pipeline error policy and lifecycle state are tightly coupled:
- A sink or transform with
on_error = "fail_pipeline"encounters an error. NodeCtx::mark_pipeline_failed()transitions the pipeline toFailed.- The node calls
cancel.cancel()to signal all other tasks in the pipeline. - Sources, transforms, and sinks see the cancellation and exit their loops.
- At the end of
run(),has_failures()returnstrueand the process exits with code 1.
The drop error policy does not affect lifecycle state — the pipeline continues running.