Pipelines Docs is in beta — content is actively being added.
AgentsEnd-to-End Workflows

Runbooks

End-to-end walkthroughs of agent setup.

All runbooks assume:

  • Role. Registration and the Agents pages are visible only to Org Admins and Project Admin Owners.
  • A judge model. Every run is graded by an LLM judge. Pick one per agent field (Models popover in the Pipeline Builder) or set an org default under Settings → Models once and forget it. A run with neither fails as agent_model_unresolved.
  • Publish. A draft agent can't be dispatched. After saving, click Publish.

Any sandbox agent

1. Sandbox a custom Python agent

Goal: run your own Python code in a platform sandbox, end to end, with no repo and no tools: register, seed one task, and read the trajectory and verdict.

Step 1: Register. In the sidebar, Agents → Register agent, pick the Sandbox Agents mode card. Set a Name (e.g. hello-sandbox). Under How your agent runs, select Python function.

Step 2: Paste the code. In the Code source picker choose Paste code and paste the following module:

import platform
import subprocess
from pathlib import Path


def run(task_input, *, proxy_url, run_token):
    instruction = task_input["user_instruction"]

    # Do some real work so the trajectory has steps to show.
    Path("/tmp/notes.txt").write_text(f"task: {instruction}\n")
    uname = subprocess.run(["uname", "-a"], capture_output=True, text=True)

    return {
        "final_response": (
            f"Hello from the sandbox (Python {platform.python_version()}). "
            f"You asked: {instruction!r}. Kernel: {uname.stdout.strip()}"
        )
    }

The contract: a top-level callable taking (task_input, *, proxy_url, run_token), returning {"final_response": ...} (or a plain string, which the platform wraps). An unhandled exception is captured and graded, it doesn't count as an infra failure.

The sandbox cannot import the Pipelines SDK, write SDK-free code. A real agent calls your model provider directly: declare the package (e.g. anthropic) under Sandbox environment (advanced) → Python requirements and the API key as an Environment variable → From credential (runbook 6 covers the environment; the credential path is the only encrypted one).

Step 3: Entrypoint. Leave Entrypoint as run (it must name a top-level callable, dotted paths are rejected). Save, then Publish.

Step 4: Wire and seed. In the Pipeline Builder, add an agent-mode field and select hello-sandbox. In the field's Models popover pick a Judge model (skip if your org default is set). The seed's user column is always required, create one task with this CSV:

user
"Summarize what environment you are running in."

Verify, open the task from the Data Explorer (View trace). The run completes, the Agent Trace tab shows the final response, a Judge verdict (PASS/FAIL with rubric reasoning), and a Trajectory that fills in after the run completes, for this example, a file write and a shell step. No diff and no scorer badges: there's no workspace by design.

If it fails

ErrorCauseFix
FAILED agent_model_unresolvedNo judge model is resolvable for the run.Pick one on the field's Models popover or set the org default.
Trajectory is emptyCapture is best-effort and keys off real syscalls like subprocess, open, and httpx; pure compute returns can produce few steps.Trigger at least one real filesystem, process, or network action in the agent path you are testing.
You expected a diffPlain sandbox runs have no workspace attached.Attach a coding scenario (runbook 7).

2. Ship agent code as a ZIP

Goal: Run a multi-file agent when inline paste limits are exceeded (200 KB for pasted code, 1 MB across Multiple files) or when non-.py assets are required.

You'll need: A .zip archive of your agent (≤ 100 MB compressed, ≤ 500 MB uncompressed) and Org Admin permissions in the agent's organization. Archive uploads are organization-scoped and are not granted by project-level roles.

Step 1: Structure the archive. A single top-level folder is flattened on unzip, so both of these land the same way in /home/user/agent:

my-agent.zip                 my-agent.zip
├── main.py                  └── my-agent/
└── helpers/                     ├── main.py
    └── prompts.py               └── helpers/
                                     └── prompts.py

main.py defines the entrypoint, exactly as in runbook 1:

from helpers.prompts import SYSTEM_PROMPT


def run(task_input, *, proxy_url, run_token):
    return {"final_response": f"{SYSTEM_PROMPT}: {task_input['user_instruction']}"}

Step 2: Upload. Register agent → Sandbox Agents → Python function, then in the Code source picker choose Upload ZIP and drop the file. The form uploads it to storage with a short-lived signed URL and blocks submit until the upload confirms, only the confirmed archive id is saved on the agent, never the bytes.

Step 3: Entrypoint. Set Entrypoint to run and Entrypoint file to main.py (the .py inside the zip that defines it). Save and Publish.

The zip's contents aren't known at save time, so Entrypoint file is only shape-checked on save and verified inside the sandbox at dispatch, a wrong path fails the run, not the save.

Step 4: Seed and dispatch exactly as in runbook 1, steps 4–5.

Verify, at dispatch the archive is validated (size caps, zip-slip and symlink guards) and unzipped into the agent directory /home/user/agent, never into the graded workspace. The run produces a trajectory just like a pasted-code run. Fetch and validation happen before a sandbox boots, so a bad archive costs nothing.

If it fails

ErrorCauseFix
FAILED agent_code_fetch_failed before any sandboxThe file is unconfirmed, cross-org, or failed archive validation (size, zip-slip, or symlink checks).Re-upload the archive into the agent's organization and wait for upload confirmation.
Run fails on the entrypoint probeThe configured Entrypoint file path does not match a valid path inside the zip.Correct the path, accounting for top-level folder flattening, then run again.
Register stays disabledThe upload is still in progress.Wait for upload confirmation and avoid reloading during upload.

3. Pull agent code from a private git repo

Goal: Clone agent source code from a private repository at dispatch time, while ensuring the token never appears in output, logs, or stored configuration.

You'll need: An https clone URL and a PAT, or an organization credential that stores the PAT.

Step 1: Point at the repo. Register agent → Sandbox Agents → Python function, Code source picker → Git repository. Enter the Repository URL (https only, SSH URLs and user:pass@ userinfo are rejected, and an SSRF guard blocks private/loopback hosts) and an optional Ref (optional), branch, tag, or commit SHA, e.g. v1.2.0.

Step 2: Auth. Pick one of the three Auth modes:

  • None, public repo.
  • From credential, an existing org credential; decrypted server-side at dispatch.
  • Inline token, paste the PAT once. It's write-only: stored in a hidden platform-managed credential, never echoed back (edit mode shows rotate-or-keep, never the token).

Step 3: Entrypoint. Set Entrypoint (run) and Entrypoint file (the module path inside the repo, default main.py). Save, Publish, then seed and dispatch as in runbook 1, step 4.

Verify, the PAT is decrypted only when the clone command is built, injected into the clone URL, and the remote is dropped right after the clone. It's registered as a run secret, so it renders as *** on every trajectory and final-response surface, even if your agent echoes it. The clone resolves before a sandbox boots; on later turns of a multi-turn session the populated sandbox is reused without re-cloning.

If it fails

ErrorCauseFix
FAILED agent_code_fetch_failedAgent code checkout is strict, so a missing branch, tag, or SHA is treated as a hard error. Clone failures can also come from an invalid URL or PAT.Correct the Ref value, or fix the repository URL or PAT if clone authentication failed.
FAILED agent_secret_unresolvedThe git credential is missing or cannot be decrypted.Re-bind From credential or re-paste the Inline token.
Save rejectedThe configuration uses a non-https URL, includes embedded user:pass@, or sets both a credential and an inline token.Use https only and configure exactly one authentication method.

4. Give a sandboxed agent platform tools

Goal: Enable a sandbox agent to call platform tools, either simulated against seeded world state or passed through to a registered endpoint, with each call recorded as a ledger row.

You'll need: A registered sandbox agent (runbooks 1–3) or a coding CLI agent (runbook 7).

Step 1: Declare the tools. On the agent form's Tools step, add each tool's name and input_schema and pick its execution mode, sandbox (the platform simulates the response from world state) or passthrough (forwarded verbatim to a bound endpoint). Or import in bulk: Import JSON for a raw array, Import from MCP to discover a connected server's catalog. A minimal sandbox-mode tool:

{
  "name": "get_order",
  "description": "Look up an order by id.",
  "input_schema": {
    "type": "object",
    "properties": { "order_id": { "type": "string" } },
    "required": ["order_id"]
  }
}

Field rules and passthrough bindings: Tools schema.

Step 2: Call them from a Python-function agent. No SDK, no MCP, raw HTTP against the per-run proxy. Declare httpx under Python requirements and paste:

import os

import httpx


def call_tool(name, args):
    r = httpx.post(
        f"{os.environ['PIPELINES_ODYSSEY_PROXY_URL']}/tools/{name}",
        headers={"Authorization": f"Bearer {os.environ['PIPELINES_RUN_TOKEN']}"},
        json=args,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()


def run(task_input, *, proxy_url, run_token):
    order = call_tool("get_order", {"order_id": "4521"})
    return {"final_response": f"Order status: {order}"}

(proxy_url / run_token arrive both as kwargs and as the PIPELINES_* env vars, use either.)

Step 2′: Or let a CLI harness reach them. A shell-command agent (runbook 7) gets the same tools through the pipelines MCP shim, auto-registered into the harness when tools are attached, nothing to wire. Claude Code, Codex, and Cursor are supported; details in MCP tools and Harness customization.

Verify, run a task and open its trace: each tool call is a ledger row with the arguments, the response, and a source badge, simulated, passthrough, injected (a failure rule fired), or error, plus a step-by-step world-state diff.

If it fails

ErrorCauseFix
tool_not_executable on a coding runCoding runs do not silently simulate tool execution.Bind the tool to a real passthrough endpoint, or have the agent perform the work directly in the repository.
503 lock_contention on parallel callsTool calls are serialized per run and concurrent calls contend on the same lock.Retry the call after contention clears.
Tools dropped with a warning at dispatchThe run uses Aider (no MCP support) or an unrecognized run command.Use Claude Code, Codex, or Cursor, or call the proxy directly from your own code as in Step 2.

5. Run a multi-turn conversation

Goal: Run a simulated user through multiple turns against the agent within one persistent session, and obtain a session-level verdict.

You'll need: A registered agent with healthy one-shot runs (runbook 1).

Step 1: Configure the simulator. On the agent field, open the Multi-turn popover and set:

  • simulator_mode, persona (an LLM plays the user; give it a short persona describing goals, tone, escalation) or scripted (a fixed JSON array of user turns, replayed in order).
  • max_turns, hard cap, 1..50 (default 10; 6–10 is a good start).
  • memory_mode, replay (default; the platform re-sends the transcript in input.messages each turn) or stateful (your agent keeps its own memory, keyed by input.session_id).

The popover sets turn_mode = model_as_user on the seed for you. To vary any of these per row instead, opt each column in under the popover's Vary per row from CSV section, a wired column becomes a required CSV column at upload:

user,turn_mode,max_turns,simulator_mode,memory_mode,user_simulator_persona
"Help me fix my failed refund for order #4521.","model_as_user","8","persona","replay","Frustrated but cooperative; expects the agent to remember order details and not repeat verification."

Step 2: Dispatch and read. Run the task and open it: multi-turn rows show the canonical transcript across turns, a per-turn trajectory, and the session-level judge verdict over the whole conversation.

Verify, the session terminates when the simulator signals done, max_turns is hit, or a termination_keyword appears in an agent reply. On a coding session (runbook 7 + this one) the sandbox and repo are seeded once at turn 0 and carried forward; each turn's diff is cumulative against the original baseline, and the session judge sees the cumulative diff alongside the transcript. Full axis reference: Multi-turn testing.

If it fails

ErrorCauseFix
Session FAILED simulator_model_unresolvedNo user-simulator model is configured for the run.Pick one on the field's Models popover or set the org default.
Ran single-shot despite the popoverA turn_mode typo (multi_turn, model_as_users) or a stale seed causes fallback to one-shot mode.Use the exact value model_as_user, then re-check column mapping and re-seed.
Session ends immediatelysimulator_mode = scripted is set with an empty or non-JSON-array scripted_user_turns value.Provide scripted_user_turns as a JSON array of strings.

6. Customize the sandbox runtime

Goal, add system packages, Python deps, a boot step, or a persistent custom image to any code agent. The default image is Python 3.13 with git, ripgrep, unzip, uv, and pytest, start here only if that's not enough.

You'll need, a registered sandbox agent (any of runbooks 1–3, 7).

Step 1: Boot-time layering (per-run, no image build). Open Sandbox environment (advanced) and set any of:

  • System packages (one per line), apt packages (≤ 50), installed as root once at sandbox boot, before your agent.
  • Setup command, one shell command run at sandbox start, after package installs, with your resolved env (so it can use a credential-backed token). A nonzero exit fails the run.
  • Python requirements (one per line) and Python version (3.9–3.13; blank = 3.13), Python-function agents only; pip-installed before your agent runs.

Step 2: Persistent image (for heavier tooling). Switch Base image to Custom Dockerfile and write only the body, the platform prepends FROM pipelines-workspace-base:

RUN sudo apt-get update && sudo apt-get install -y jq
RUN sudo npm install -g @anthropic-ai/claude-code
ENV NODE_OPTIONS=--max-old-space-size=4096

Constraints: only RUN / ENV / WORKDIR (no COPY/ADD, there's no build context, and no second FROM, ENTRYPOINT, CMD, or USER); ≤ 32 KB of text. Two gotchas that account for most build failures:

  • The build runs as a normal user, global installs (npm install -g, system pip, apt-get) need sudo (passwordless in the base image).
  • The base is Python 3.13, Python CLIs pinning old deps won't build there. Use uv tool install --python 3.11 <tool> and call the binary by full path (/home/user/.local/bin/<tool>).

Step 3: Build. Leave Build image now checked when saving (or click Build image on the agent detail page, building is always an explicit action). The Custom image card shows the chip: Not built → Building… (live log) → Ready (or Build failed with the log tail and a Rebuild button). An identical Dockerfile already built in your org is reused without rebuilding.

Verify, wait for Ready, then dispatch. Runs with a custom Dockerfile are pinned to the built image: while Building… or Build failed, dispatch is a hard error, never a silent fallback to the default image.

If it fails

ErrorCauseFix
FAILED environment_setup_failedA System packages install step or the Setup command exited nonzero.Fix the failing command, remembering it runs as the default user and requires sudo for system-level changes.
Chip shows Build failedThe Dockerfile build failed, commonly due to missing sudo or a Python 3.13-incompatible package install.Open the failure log on the agent page, correct the Dockerfile, then click Rebuild.
FAILED image_not_ready / image_build_failedDispatch was attempted before the image reached Ready status.Build or rebuild the image first, then run again.
Build request returns 409Only one build can run at a time per organization.Wait for the in-flight build to complete, then retry.

Coding agents

7. Run a coding CLI agent on a repo task

Goal, register Claude Code (or Codex / Cursor / Aider) as an agent, point it at a seeded repo, and read the trajectory, final diff, scorer badges, and verdict.

You'll need, a model-provider API key stored as an org credential, and a repo for the agent to work on (any public repo works for a first run).

Step 1: Register. Agents → Register agent → Sandbox Agents. Under How your agent runs, pick Shell command (any CLI).

Step 2: Run command. Click the Claude Code preset chip, it fills in:

claude -p "$(cat $PIPELINES_TASK_FILE)"

The platform writes the task brief to $PIPELINES_TASK_FILE and, for a recognized CLI, appends the headless/approval flags for you (--print for Claude Code, --yes-always for Aider, --force for Cursor). Call the binary directly, wrapping the command in bash -c "…" is the one form that isn't recognized, and it costs you the rich trajectory.

Step 3: Install the CLI in the image. The base image ships no coding CLIs. Under Sandbox environment (advanced) → Base image → Custom Dockerfile, add the install line and leave Build image now checked:

CLIInstall line
Claude CodeRUN sudo npm install -g @anthropic-ai/claude-code
CodexRUN sudo npm install -g @openai/codex
CursorRUN curl https://cursor.com/install -fsSL | bash
AiderRUN uv tool install --python 3.11 aider-chat

Step 4: Key. Add an Environment variable row, mode From credential (the only encrypted path): ANTHROPIC_API_KEY for Claude Code, OPENAI_API_KEY for Codex, CURSOR_API_KEY for Cursor, the --model provider's key for Aider. Also raise the Run timeout, the 300 s default is tight for coding runs (max 1800 s). Save and Publish; wait for the Custom image chip to reach Ready.

Step 5: Attach a coding setup and seed. A shell-command agent always needs a workspace. In the Pipeline Builder, select the agent on an agent-mode field and open the field's Coding setup popover: pick the Git URL workspace tile and enter your repo's URL, add an optional Setup command (runs before the baseline commit, so it stays out of the graded diff), and Scorers rows if you want mechanical gates. Pick a Judge model, then seed one task:

user
"Fix the failing test in tests/test_parser.py and make the suite pass."

Verify, the platform seeds the repo at /home/user/workspace, commits a baseline, runs your command, and the task page shows all four surfaces: a Trajectory timeline (Shell steps with output and exit codes, Edit steps as red/green diffs, Read/Search, Assistant reasoning, it fills in after the run completes), the Final diff against the baseline, scorer badges, and the Judge verdict. Full UI tour: Inspecting runs.

If it fails

ErrorCauseFix
FAILED agent_command_failed (in the error detail)The command exited nonzero and left an empty diff, typically because the CLI waited on an interactive prompt.Use a recognized binary so headless flags are appended automatically, or pass equivalent headless flags manually. (A nonzero exit with a real diff is not treated as failure; scorers and the judge determine outcome.)
FAILED in_sandbox_requires_workspaceNo coding scenario is attached to the task.Attach a coding scenario via the Coding setup popover (Step 5), then re-seed.
FAILED image_not_readyThe custom image is still building.Wait for Ready status (Step 4), then run again.
Trajectory empty but diff presentAn unrecognized run command (for example, bash -c "…") triggered coarse fallback capture.Invoke the CLI binary directly.
Banner "Eval phase failed, diff/scorers may be missing"Diff or scorer capture failed, but the run itself completed and trajectory data is still available.Expand the banner to inspect the redacted failure cause and re-run if needed.

8. Seed many coding tasks from CSV

Goal, define the repo, setup, and scorers once on the agent field, then fan a CSV of prompts out into many tasks, each frozen against that definition.

You'll need, a coding agent (runbook 7) and a CSV of task prompts.

Step 1: Define the coding setup. On the agent-mode field, open the Coding setup popover: pick the Workspace source (Git URL / Upload ZIP / Empty), an optional Setup, and Scorers rows. The definition saves onto the field, and every task seeded from the workflow starts from it.

The two Setup modes differ in whether the setup work is graded: Platform runs a command executes the Setup command before the baseline commit, so installs and fixtures never pollute the agent's diff; Agent sets up sends the Setup instructions to the agent, whose setup work does land in the graded diff. Use platform mode unless setting up is itself the task.

On the workspace seed path, a bad git Ref silently falls back to the repo's default branch, the run proceeds with no error (unlike the strict agent-code checkout of runbook 3). Double-check branch/tag/SHA spelling.

Step 2: Upload the CSV. The CSV carries the field's wired seed columns, user is always required, one prompt per row:

user
"Fix the failing test in tests/test_parser.py."
"Add a --json flag to the CLI entrypoint."
"Refactor parse() to return a dataclass without changing behavior."

Each row becomes one task carrying a deep-copied, frozen snapshot of the field's coding setup at creation time. Editing the setup later never changes already-seeded tasks, re-seed to pick up edits.

Step 3: Per-row scenario control (API only). A saved, org-scoped scenario library lives at /api/coding-scenarios (names unique per org), and the seeding service accepts per-row CSV axes on top of it: scenario_ref picks a saved scenario by name per row, and workspace_seed / setup_command / eval override it per row (a row cell always wins; the scenario fills gaps). These columns flow only when the workflow's agent field wires them in agentConfig.odyssey_seed_columns via the workflow API, the builder UI doesn't expose toggles for them; its UI path is the Coding setup popover of Step 1.

Verify, one task per CSV row in the Data Explorer, each with the frozen workspace seed, setup, and scorers; open any row to confirm the seed snapshot on the trace tab. Full axis semantics: Coding scenarios.

If it fails

ErrorCauseFix
A row fails with no task created (API path)A present-but-blank or malformed scenario_ref, workspace_seed, or eval cell triggers a hard-error axis. Coding tasks are not silently downgraded to non-workspace runs.Correct the malformed cell. Omitting the entire column is valid if that axis is not needed.
ScenarioResolutionError (API path)The scenario_ref value is unknown, cross-org, or archived.Use the exact name of a live scenario in the task's organization. Use GET /api/coding-scenarios?include_archived=true to locate archived entries.
POST /api/coding-scenarios returns 409Scenario name collision occurred because names are unique per organization.Rename the scenario and retry creation.
FAILED workspace_seed_failedThe repository could not be cloned or unzipped in the sandbox due to bad URL/host, disallowed scheme, invalid Subdirectory (optional), or archive validation failure.Correct the workspace seed configuration and run again.
FAILED workspace_setup_failedThe platform-mode Setup command failed after repository seeding.Reproduce the setup command locally against the same repository, fix the failure, then re-run.