When a pipeline fails

A failed pipeline run tells you not just that it failed, but which step, why, and what to do — the way a CI job does. This page covers how to read that, and how to recover.

Where to look

Every run carries per-step history. Fetch it by run id:

curl -s -H "Authorization: Bearer $JWT" \
  http://localhost:3000/api/v1/pipelines/<pipeline-id>/runs/<run-id>

Agents get the same shape from the helmdeck__pipeline-run-status MCP tool, and the Management UI /pipelines panel renders it with a colored badge per failure class. A failed run looks like:

{
  "status": "failed",
  "failure_class": "caller_fixable",
  "failure_reason": "The inputs given to this step were invalid. Fix them and re-run — the step's error message says what was wrong.",
  "steps": [
    { "step_id": "ground", "status": "succeeded", "...": "..." },
    {
      "step_id": "render",
      "status": "failed",
      "error_code": "invalid_input",
      "failure_class": "caller_fixable",
      "failure_reason": "…",
      "error": "invalid_input: …"
    }
  ]
}

The failing step has the typed error_code; the run mirrors the failing step's failure_class and failure_reason at the top level so you don't have to scan every step.

The four failure classes

`failure_class`	What it means	What to do
`caller_fixable`	The inputs or model handed to the step were invalid — e.g. a model the gateway can't route, or a bad `${{ }}` reference in the pipeline definition.	Fix the input (the step's `error` says what was wrong) and re-run. For a bad model, read `helmdeck://models` for routable IDs.
`pack_bug`	A code-level error inside helmdeck: a pack handler failed in an uncategorized way, broke its own output contract, or hit an engine invariant. Not your input's fault.	File an issue — the `failure_reason` includes a prefilled `github.com/tosin2013/helmdeck/issues/new` link with the pack, code, and message already filled in.
`transient`	A timeout, a session that couldn't be acquired, or an artifact-store blip.	Re-run — it may simply succeed.
`state_changed`	The external state the step acted on moved under it (e.g. a non-fast-forward push).	Refresh the target state and re-run.

The class is derived from the step's typed error code (invalid_input → caller_fixable, handler_failed/internal/invalid_output → pack_bug, timeout/session_unavailable/artifact_failed → transient, schema_mismatch → state_changed).

Re-running

Once you've addressed the cause, re-run with the same pipeline and inputs:

curl -s -X POST -H "Authorization: Bearer $JWT" \
  http://localhost:3000/api/v1/pipelines/<pipeline-id>/runs/<run-id>/rerun
# → { "run_id": "<new-run-id>", "status": "pending" }

Agents use helmdeck__pipeline-rerun; the UI has a Re-run button on each run. This starts a fresh run — every step executes again from the top.

Roadmap

Re-run is the first recovery action. Two more are designed in ADR 044 and land in a later release:

Resume from the failed step — replay the already-succeeded steps' outputs and re-execute only from the failure, so you don't redo expensive work. (The sharp edges: an expired session from a repo.fetch step, and not double-firing a side effect like email.send.)
Auto-retry transient failures — retry transient-class steps a bounded number of times before failing, so an environment blip doesn't surface as a hard failure.

Both build on the attribution above: you can only safely automate recovery from a failure you can first classify.

Where to look​

The four failure classes​

Re-running​

Roadmap​

Where to look

The four failure classes

Re-running

Roadmap