LLM Graders

LLM graders use a language model to evaluate agent responses against custom criteria defined in a prompt file.

Explicit LLM Graders

Put semantic grading requirements in assert. Use type: llm-grader when you need a custom prompt, target, or grader-specific preprocessing. Plain strings are handled by the built-in llm-rubric grader, as in this example:

tests:
  - id: simple-eval
    input: "Debug this function..."
    assert:
      - Correctly explains the bug and proposes a fix

expected_output is passive gold/reference data. It is available to graders but does not create an LLM grading call by itself. Depending on the grader, it can be used as an exact target, a semantic reference answer, a structured object, or supporting context. See How reference fields and assertions interact.

Configuration

Reference an LLM grader in your eval file:

assert:
  - name: semantic_check
    type: llm-grader
    prompt: file://graders/correctness.md
    target: grader_gpt_5_mini   # optional: route this grader to a named LLM target

Use target: when you want different llm-grader entries in the same eval to run on different grader models. This is useful for grader panels, majority-vote ensembles, and grader A/B benchmarks.

Prompt Files

The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template.

Markdown Template

Write evaluation instructions as markdown. Template variables are interpolated:

# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{input}}
**Criteria:** {{criteria}}
**Reference Answer:** {{expected_output}}
**Candidate Answer:** {{output}}

## Scoring

Score the response from 0.0 to 1.0 based on:
1. Correctness — does the output match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?

Available Template Variables

Variable	Source
`criteria`	Test `criteria` field
`input`	Resolved input text
`expected_output`	Reference answer text
`output`	Candidate answer text
`metadata`	Test metadata as formatted JSON
`metadata_json`	Test metadata as compact JSON
`rubric`	Rubric data as structured JSON when available, or criteria text otherwise
`rubrics`	LLM-grader rubric items as formatted JSON
`rubrics_json`	LLM-grader rubric items as compact JSON
`file_changes`	Unified diff of workspace file changes (populated when `workspace` is configured)
`tool_calls`	Formatted summary of tool calls from agent execution (tool name + key inputs per call)
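As a rough illustration of how the variables in the table might be interpolated, here is a minimal sketch. The `renderTemplate` helper is hypothetical and is not AgentV's renderer; it assumes that both `{{name}}` and `{{ name }}` are accepted and that unknown names are left untouched.

```typescript
// Hypothetical helper: AgentV's real renderer is not shown in this document.
type TemplateVars = Record<string, string>;

function renderTemplate(template: string, vars: TemplateVars): string {
  // Match {{name}} and {{ name }}; unknown names stay as written (an assumption).
  return template.replace(/\{\{\s*([a-z_]+)\s*\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    return value === undefined ? match : value;
  });
}

const rendered = renderTemplate(
  '**Question:** {{input}}\n**Candidate Answer:** {{ output }}',
  { input: 'What is 2+2?', output: '4' },
);
console.log(rendered);
```

Leaving unknown placeholders in place makes a typo in a variable name visible in the rendered prompt instead of silently producing an empty string.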

Use prompt: ./path/to/prompt.md for the common relative-path case. Use prompt: file://path/to/prompt.md only when you need to force file-reference resolution explicitly.

Structured task input belongs in input. If input is a message whose content is a JSON object, {{input}} renders that object as formatted JSON for the grader prompt; no separate grader-only input field is required. Use metadata for provenance or suite-level source fields, and rubrics_json for rubric arrays.
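The difference between formatted and compact JSON can be sketched as follows. `renderInputContent` is a hypothetical helper, and the two-space indent is an assumption about what "formatted" means here.

```typescript
// Hypothetical helper illustrating the rule above; not AgentV's actual code.
function renderInputContent(content: unknown): string {
  // Plain strings pass through unchanged.
  if (typeof content === 'string') return content;
  // Objects are rendered as formatted JSON (2-space indent is an assumption).
  return JSON.stringify(content, null, 2) ?? '';
}

const structuredInput = { company: 'Apple', ticker: 'AAPL' };

// Formatted form, as {{input}} is described for JSON object content.
const formattedInput = renderInputContent(structuredInput);

// Compact form, in the style described for metadata_json and rubrics_json.
const compactInput = JSON.stringify(structuredInput);

console.log(formattedInput);
console.log(compactInput);
```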

Suite-level metadata is inherited by every test. When rubric items vary per test, keep the grader on each test and reuse the prompt file:

metadata:
  source_repo: https://github.com/virattt/dexter
  source_commit: 8d9419829f443f84b804d033bb2c3b1fbd788629
  source_file: src/evals/dataset/finance_agent.csv

tests:
  - id: apple-research
    input:
      company: Apple
      ticker: AAPL
    metadata:
      row: 1
    assert:
      - name: dexter_semantic
        type: llm-grader
        prompt: file://prompts/dexter-grader.md
        rubrics:
          - operator: correctness
            criteria: Uses the provided ticker and company.
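
Metadata inheritance for the example above can be pictured as a shallow merge. This is only a sketch: `mergeMetadata` is a hypothetical name, and the rule that test-level keys win on conflict is an assumption that the source does not state.

```typescript
type Metadata = Record<string, unknown>;

// Assumption: test-level keys override suite-level keys with the same name.
function mergeMetadata(suite: Metadata, test: Metadata): Metadata {
  return { ...suite, ...test };
}

const mergedMetadata = mergeMetadata(
  {
    source_repo: 'https://github.com/virattt/dexter',
    source_file: 'src/evals/dataset/finance_agent.csv',
  },
  { row: 1 },
);

// The test sees both the inherited suite fields and its own row number.
console.log(JSON.stringify(mergedMetadata, null, 2));
```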

Per-Grader Target

By default, an llm-grader uses the suite target’s grader_target. Override it per grader when you need multiple grader models in one run:

assert:
  - name: grader-gpt
    type: llm-grader
    target: grader_gpt_5_mini
    prompt: ./prompts/pass-fail.md
  - name: grader-haiku
    type: llm-grader
    target: grader_claude_haiku
    prompt: ./prompts/pass-fail.md

Each target: value must match a named LLM target in .agentv/targets.yaml.
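
For a majority-vote panel built from several per-grader targets, the verdicts have to be combined somewhere. The source does not describe how, or whether, AgentV aggregates them, so the following is a hypothetical post-processing sketch. The `Verdict` shape and the tie-breaking rule are assumptions.

```typescript
// Hypothetical shape: one pass/fail verdict per grader in the panel.
interface Verdict {
  grader: string;
  pass: boolean;
}

// Strict majority: a tie counts as a fail (an assumption of this sketch).
function majorityVote(verdicts: Verdict[]): boolean {
  const passes = verdicts.filter((verdict) => verdict.pass).length;
  return passes * 2 > verdicts.length;
}

const panelPasses = majorityVote([
  { grader: 'grader-gpt', pass: true },
  { grader: 'grader-haiku', pass: true },
  { grader: 'grader-third', pass: false }, // hypothetical third grader
]);
console.log(panelPasses);
```

An odd number of graders avoids ties altogether, which is why panels of three are a common choice.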

TypeScript Template

For dynamic prompt generation, use the definePromptTemplate function from @agentv/sdk:

#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/sdk';

function textFromMessages(messages: Array<{ content?: unknown }>): string {
  return messages
    .map((message) => typeof message.content === 'string' ? message.content : '')
    .filter(Boolean)
    .join('\n');
}

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;
  const question = textFromMessages(ctx.input.filter((message) => message.role === 'user'));
  const referenceAnswer = textFromMessages(ctx.expectedOutput);
  const candidateAnswer = ctx.output ?? '';

  return `You are evaluating an AI assistant's response.

## Question
${question}

## Candidate Answer
${candidateAnswer}

${referenceAnswer ? `## Reference Answer\n${referenceAnswer}` : ''}

${rubric ? `## Evaluation Criteria\n${rubric}` : ''}

Evaluate and provide a score from 0 to 1.`;
});

How It Works

1. AgentV renders the prompt template with variables from the test.
2. The rendered prompt is sent to the grader target (configured in targets.yaml).
3. The LLM returns a structured evaluation with a score, an assertions array, and reasoning.
4. Results are recorded in the output JSONL.
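
If you post-process grader output yourself, a defensive parse of the structured evaluation might look like this sketch. The field names follow the description above (score, assertions, reasoning). The exact schema, the 0 to 1 range check, and the `parseGraderResult` helper are assumptions.

```typescript
// Assumed shape; the element type of `assertions` is not specified in the source.
interface GraderResult {
  score: number;
  assertions: unknown[];
  reasoning: string;
}

function parseGraderResult(raw: string): GraderResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== 'object' || data === null) {
    throw new Error('grader result must be a JSON object');
  }
  const { score, assertions, reasoning } = data as Record<string, unknown>;
  // Scores are expected in the 0.0 to 1.0 range used by the prompt examples.
  if (typeof score !== 'number' || score < 0 || score > 1) {
    throw new Error('score must be a number between 0 and 1');
  }
  if (!Array.isArray(assertions)) {
    throw new Error('assertions must be an array');
  }
  if (typeof reasoning !== 'string') {
    throw new Error('reasoning must be a string');
  }
  return { score, assertions, reasoning };
}

const parsedResult = parseGraderResult(
  '{"score": 0.8, "assertions": [], "reasoning": "Mostly correct."}',
);
console.log(parsedResult.score);
```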

Command Configuration

When using TypeScript templates, configure them in YAML with optional config data passed to the command:

assert:
  - name: custom-eval
    type: llm-grader
    prompt:
      command: [bun, run, ../prompts/custom-grader.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true

The config object is available as ctx.config inside the template function.
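
Because config values arrive from YAML, it can help to narrow them before use. A minimal sketch, assuming the `rubric` and `strictMode` keys from the example above; `readGraderConfig` is a hypothetical helper, not part of @agentv/sdk.

```typescript
// Hypothetical typed view over the YAML config shown above.
interface GraderConfig {
  rubric: string | undefined;
  strictMode: boolean;
}

function readGraderConfig(config: Record<string, unknown> | undefined): GraderConfig {
  const rubric = config?.['rubric'];
  const strictMode = config?.['strictMode'];
  return {
    // Ignore non-string rubrics rather than interpolating them into the prompt.
    rubric: typeof rubric === 'string' ? rubric : undefined,
    // Only an explicit `true` enables strict mode; a missing key means false.
    strictMode: strictMode === true,
  };
}

const graderConfig = readGraderConfig({ rubric: 'Your rubric here', strictMode: true });
console.log(graderConfig.strictMode);
```

Inside a template you would pass `ctx.config` to such a helper instead of casting each field inline.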

Preprocessing File Outputs

If an agent returns a ContentFile block instead of plain text, you can preprocess that file into text before llm-grader builds the candidate prompt.

AgentV always tries a default UTF-8 text read first. That is enough for text-based formats such as CSV, JSON, SQL, Markdown, YAML, HTML, XML, and plain text. For binary formats such as .xlsx, .pdf, or .docx, add a preprocessor command:

preprocessors:
  - type: xlsx
    command: ["bun", "run", "scripts/preprocessors/xlsx-to-csv.ts"]

tests:
  - id: spreadsheet-output
    input: Generate the spreadsheet report
    assert:
      - Output includes the revenue rows
      - name: spreadsheet-check
        type: llm-grader
        prompt: |
          Check whether the transformed spreadsheet text contains the revenue rows:

          {{ output }}

type accepts either a short alias such as xlsx or a full MIME type such as application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.

Resolution order:

1. Per-grader preprocessors override suite-level entries.
2. If no preprocessor matches, AgentV falls back to a UTF-8 text read.
3. If the fallback read looks binary or invalid, the grader receives a warning note instead of failing the test run.
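
The resolution order above can be sketched as a small lookup. All of it is illustrative: the `Preprocessor` shape mirrors the YAML example, the alias table contains only the one pairing named in the text, and the NUL-byte check is a crude stand-in for whatever binary detection AgentV actually performs.

```typescript
interface Preprocessor {
  type: string;
  command: string[];
}

// Hypothetical alias table; only the xlsx pairing appears in the text above.
const MIME_BY_ALIAS: Record<string, string> = {
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
};

function canonicalType(type: string): string {
  return MIME_BY_ALIAS[type] ?? type;
}

function resolvePreprocessor(
  fileType: string,
  perGrader: Preprocessor[],
  suite: Preprocessor[],
): Preprocessor | undefined {
  const wanted = canonicalType(fileType);
  // Per-grader entries are searched first, so they override suite-level ones.
  for (const entry of [...perGrader, ...suite]) {
    if (canonicalType(entry.type) === wanted) return entry;
  }
  // No match: the caller falls back to a UTF-8 text read.
  return undefined;
}

// Crude heuristic (an assumption): a NUL byte means the content is not text.
function looksBinary(bytes: Uint8Array): boolean {
  return bytes.indexOf(0) !== -1;
}
```

A caller would run the returned command when there is a match, and otherwise attempt the text read and attach a warning note if `looksBinary` returns true.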

See examples/features/preprocessors/ for a runnable example with a file-producing target and a custom preprocessor script.

Available Context Fields

TypeScript templates receive a context object with these fields:

Field	Type	Description
`input`	`Message[]`	Full resolved input messages
`output`	`string | null`	Candidate final answer / scored result
`answer`	`string`	Same final answer string, exposed for ergonomic handler code
`messages`	`Message[]`	Transcript messages from the target execution
`criteria`	`string`	Test `criteria` field
`expectedOutput`	`Message[]`	Full resolved expected output
`trace`	`Trace`	Full execution trace with messages, events, metrics, and provenance
`traceSummary`	`TraceSummary`	Lightweight execution metrics summary
`metadata`	`object`	Test metadata after suite defaults are merged
`config`	`object`	Custom config from YAML

The raw prompt-template stdin uses snake_case keys such as expected_output, trace_summary, and token_usage. definePromptTemplate() converts them to SDK camelCase fields before calling your handler.
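
The key conversion can be illustrated with a small sketch. This is not the SDK's implementation: `snakeToCamel` and `camelizeKeys` are hypothetical helpers, and the conversion here is shallow only.

```typescript
// Convert a single snake_case key to camelCase.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z])/g, (_match: string, letter: string) => letter.toUpperCase());
}

// Shallow conversion only (an assumption); nested objects are left as they are.
function camelizeKeys(payload: Record<string, unknown>): Record<string, unknown> {
  const converted: Record<string, unknown> = {};
  for (const key of Object.keys(payload)) {
    converted[snakeToCamel(key)] = payload[key];
  }
  return converted;
}

const camelized = camelizeKeys({ expected_output: [], trace_summary: {}, output: '4' });
console.log(Object.keys(camelized).join(', '));
```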

Template Variable Derivation

Template variables are derived internally through three layers:

1. Authoring Layer

What users write in YAML or JSONL:

input may be a shorthand string or a full message array. input: "What is 2+2?" expands to [{ role: "user", content: "What is 2+2?" }].
expected_output may be a shorthand string or a full message array. expected_output: "4" expands to [{ role: "assistant", content: "4" }].
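
The shorthand expansion described above can be sketched as one function. `expandShorthand` is a hypothetical name, and the `TestMessage` interface here is a simplified stand-in for the real type.

```typescript
// Simplified stand-in for the real message type.
interface TestMessage {
  role: string;
  content: unknown;
}

// A string becomes a single message with the given role; arrays pass through unchanged.
function expandShorthand(value: string | TestMessage[], role: 'user' | 'assistant'): TestMessage[] {
  return typeof value === 'string' ? [{ role, content: value }] : value;
}

const resolvedInput = expandShorthand('What is 2+2?', 'user');
const resolvedExpected = expandShorthand('4', 'assistant');
console.log(JSON.stringify(resolvedInput));
console.log(JSON.stringify(resolvedExpected));
```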

2. Resolved Layer

After parsing, canonical message arrays replace the shorthand fields:

input: TestMessage[] — canonical resolved input
expected_output: TestMessage[] — canonical resolved expected output

At this layer, the shorthand string forms of input and expected_output no longer exist; only the canonical message arrays remain.

3. Template Variable Layer

Derived strings injected into grader prompts:

Variable	Derivation
`criteria`	Passed through from the test field
`input`	Resolved input text
`expected_output`	Reference answer text
`output`	Candidate answer text
`metadata_json`	Test metadata, compact JSON
`file_changes`	Unified diff of workspace file changes (populated when `workspace` is configured)
`tool_calls`	Formatted summary of tool calls from agent execution (tool name + key inputs per call)

Example flow:

# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input:    [{ role: "user", content: "What is 2+2?" }]
expected_output: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
input:           "What is 2+2?"
expected_output: "The answer is 4"
output:          (extracted from provider output at runtime)