
Evaluations Overview

Layer custom scoring dimensions on top of agent outputs with human ratings, LLM judges, and programmatic checks.

The built-in LLM judge already renders the primary pass/fail verdict on each agent run — see Inspecting runs. Evaluations let you layer your own additional scoring dimensions on top of those agent outputs (and on any field output). Scores can come from a human evaluating an agent trace, an LLM acting as a judge, or an automated programmatic check (regex, keyword, JSON validity, etc.).
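As a rough illustration of what an automated programmatic check might look like, the sketch below implements the three kinds named above (regex, keyword, JSON validity) as plain Python functions. The function names and signatures are hypothetical and are not part of the product's API; they only show the kind of logic such a check performs on an output string.

```python
import json
import re


def is_valid_json(output: str) -> bool:
    """Return True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def contains_keyword(output: str, keyword: str) -> bool:
    """Case-insensitive keyword check."""
    return keyword.lower() in output.lower()


def matches_regex(output: str, pattern: str) -> bool:
    """Return True if the pattern is found anywhere in the output."""
    return re.search(pattern, output) is not None
```

Each function returns a boolean, which maps naturally onto a pass/fail style criterion such as "Response is valid JSON".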

The building blocks

  1. Criterion: a single, reusable scoring metric (e.g. "Factual accuracy, 1–5" or "Response is valid JSON"). Criteria live in the Evaluations → Criteria library and are the atomic unit: every evaluation result in the system is tied to exactly one criterion at a specific version.
  2. Evaluator: a criterion that has been attached to a specific node so it scores that node's agent output. This is what actually produces a result for a task. Evaluators are configured in one of two places:
    • Inline, via the Form Builder: surfaced as a form field a human reviewer sees and acts on.
    • Hidden, via the Evaluators panel: run server-side, invisible to the reviewer.
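One way to picture how these building blocks relate is the data-model sketch below. All class and field names here are assumptions made for illustration and do not reflect the platform's actual schema; the point is only that an evaluator binds a criterion to a node, and every result records exactly one criterion at a specific version.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    INLINE = "inline"  # configured in the Form Builder, visible to the reviewer
    HIDDEN = "hidden"  # configured in the Evaluators panel, runs server-side


@dataclass(frozen=True)
class Criterion:
    name: str
    version: int


@dataclass(frozen=True)
class Evaluator:
    criterion: Criterion  # the criterion attached to the node
    node_id: str
    placement: Placement


@dataclass(frozen=True)
class EvaluationResult:
    task_id: str
    criterion_name: str
    criterion_version: int  # the version that produced this result
    score: object


def record_result(evaluator: Evaluator, task_id: str, score: object) -> EvaluationResult:
    """Tie a score to the evaluator's criterion and its current version."""
    return EvaluationResult(
        task_id=task_id,
        criterion_name=evaluator.criterion.name,
        criterion_version=evaluator.criterion.version,
        score=score,
    )
```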

Inline vs. Hidden evaluators

Both kinds produce evaluation results attached to tasks; the difference is who sees the evaluator while reviewing an agent's output.

Aspect | Inline evaluator | Hidden evaluator
Where configured | Form Builder (as a field) | Evaluators panel
Visible to a human reviewer | Yes (part of the form) | No (runs in the background)
Supported criterion types | Human Rating, LLM Judge, Programmatic | LLM Judge, Programmatic only (not Human Rating)
When to use | You want a reviewer to score it themselves or see the score, or the agent output under evaluation is also something the reviewer should rate or re-run | You want automated scoring that the reviewer should not see or be influenced by

You can mix both on the same node — for example, a reviewer fills in an inline human rating while a hidden LLM judge scores the same agent response in the background.
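The supported-type rule from the table above can be expressed as a small lookup. This is a minimal sketch: the string identifiers (`"inline"`, `"human_rating"`, and so on) and the `can_attach` helper are hypothetical names chosen for the example, not values the platform is known to use.

```python
# Criterion types each placement accepts, per the comparison table.
ALLOWED_TYPES = {
    "inline": {"human_rating", "llm_judge", "programmatic"},
    "hidden": {"llm_judge", "programmatic"},  # no Human Rating when hidden
}


def can_attach(placement: str, criterion_type: str) -> bool:
    """Return True if a criterion of this type may be attached with this placement."""
    return criterion_type in ALLOWED_TYPES.get(placement, set())
```

Under this sketch, the mixed setup described above corresponds to `can_attach("inline", "human_rating")` and `can_attach("hidden", "llm_judge")` both being true for the same node.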

Where things live in the UI

Thing | How to get there
Criteria library | Sidebar → Evaluations → Criteria tab
Attach an inline evaluator | Pipeline Builder → select a node → open the Form Builder, add a field (see Running evaluations)
Attach a hidden evaluator | Pipeline Builder → select a node → Evaluators panel → Add
Trigger a manual evaluator on existing tasks | Pipeline → Data Explorer → select rows → Evaluate button
View results per task | Data Explorer table, or click View on a row to open the Task Detail panel
View aggregate analytics | Pipeline → Data Explorer → Evaluation Analytics tab

End-to-end flow

  1. Create the criteria you need in the Criteria tab, or reuse existing ones.

  2. Open the Pipeline Builder, select the node whose agent output you want to score, and attach evaluators — either inline via the Form Builder or hidden via the Evaluators panel. See Running evaluations for the three ways to make a field evaluative.

  3. For hidden evaluators only, choose a Trigger:

    • On Submit — runs automatically when the node is submitted.
    • Manual — only runs when triggered from the Data Explorer.

    Inline evaluative fields (Criteria fields and toggled-evaluative form fields) do not have a trigger selector — they always execute inline as part of the node's lifecycle.

  4. Run the agent and submit tasks. Inline and On-Submit evaluators produce results as the node is submitted (inline fields populate immediately; hidden On-Submit evaluators queue as a background job).

  5. For Manual hidden evaluators, open the Data Explorer, select tasks, click Evaluate, and confirm in the dialog.

  6. View results in the Data Explorer table (per-task) and the Evaluation Analytics tab (aggregate charts and breakdowns).
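The trigger behaviour in steps 3 to 5 can be summarised as a small decision function. This is a sketch under assumptions: the `EvaluatorConfig` type, the event names `"submit"` and `"manual_evaluate"`, and the trigger strings are all hypothetical, and the real scheduling (for example, how background jobs are queued) is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvaluatorConfig:
    name: str
    placement: str                 # "inline" or "hidden"
    trigger: Optional[str] = None  # "on_submit" or "manual"; hidden evaluators only


def runs_on(event: str, ev: EvaluatorConfig) -> bool:
    """Decide whether an evaluator fires for a given event.

    event is "submit" (the node is submitted) or "manual_evaluate"
    (the Evaluate button in the Data Explorer).
    """
    if ev.placement == "inline":
        # Inline evaluative fields have no trigger selector.
        return event == "submit"
    if ev.trigger == "on_submit":
        return event == "submit"
    return ev.trigger == "manual" and event == "manual_evaluate"
```

For example, a hidden evaluator with `trigger="manual"` is skipped when the node is submitted and only fires on the `"manual_evaluate"` event.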

Versioning

Criteria are versioned. A new criterion version is created when its display label, config, or output schema changes on save; name- or description-only edits do not bump the version. Pipelines that reference the criterion do not auto-update — they keep running against the version they were pinned to until the pipeline is edited.

Existing evaluation results stay attached to the version that produced them, so historical data is never silently rewritten.
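The version-bump rule can be sketched as a comparison over the fields that count as versioned. The field names (`display_label`, `config`, `output_schema`, `name`, `description`) and the dictionary representation are assumptions for illustration only; the platform's actual storage format is not described here.

```python
# Fields whose change creates a new criterion version on save (assumed names).
VERSIONED_FIELDS = ("display_label", "config", "output_schema")


def next_version(current_version: int, old: dict, new: dict) -> int:
    """Return the version after saving `new` over `old`.

    Changes to name or description alone leave the version unchanged.
    """
    changed = any(old.get(f) != new.get(f) for f in VERSIONED_FIELDS)
    return current_version + 1 if changed else current_version
```

A pipeline pinned to version 1 would, under this model, keep referencing version 1 even after `next_version` returns 2 for a later save, until the pipeline itself is edited.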