LLM Graders
LLM graders use a language model to evaluate agent responses against custom criteria defined in a prompt file.
Explicit LLM Graders
Put semantic grading requirements in `assert`. Plain strings are handled by the built-in `llm-rubric` rubric grader. Use `type: llm-grader` when you need a custom prompt, target, or grader-specific preprocessing:

```yaml
tests:
  - id: simple-eval
    input: "Debug this function..."
    assert:
      - Correctly explains the bug and proposes a fix
```

`expected_output` is passive gold/reference data. It is available to graders but does not create an LLM grading call by itself. Depending on the grader, it can be used as an exact target, a semantic reference answer, a structured object, or supporting context. See How reference fields and assertions interact.
Configuration
Reference an LLM grader in your eval file:

```yaml
assert:
  - name: semantic_check
    type: llm-grader
    prompt: file://graders/correctness.md
    target: grader_gpt_5_mini # optional: route this grader to a named LLM target
```

Use `target:` when you want different `llm-grader` entries in the same eval to run on different grader models. This is useful for grader panels, majority-vote ensembles, and grader A/B benchmarks.
Prompt Files
The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template.
Markdown Template
Write evaluation instructions as markdown. Template variables are interpolated:

```markdown
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{input}}
**Criteria:** {{criteria}}
**Reference Answer:** {{expected_output}}
**Candidate Answer:** {{output}}

## Scoring

Score the response from 0.0 to 1.0 based on:
1. Correctness: does the output match the expected outcome?
2. Completeness: does it address all parts of the question?
3. Clarity: is the response clear and well-structured?
```

Available Template Variables
| Variable | Source |
|---|---|
| `criteria` | Test criteria field |
| `input` | Resolved input text |
| `expected_output` | Reference answer text |
| `output` | Candidate answer text |
| `metadata` | Test metadata as formatted JSON |
| `metadata_json` | Test metadata as compact JSON |
| `rubric` | Rubric data as structured JSON when available, or criteria text otherwise |
| `rubrics` | LLM-grader rubric items as formatted JSON |
| `rubrics_json` | LLM-grader rubric items as compact JSON |
| `file_changes` | Unified diff of workspace file changes (populated when workspace is configured) |
| `tool_calls` | Formatted summary of tool calls from agent execution (tool name + key inputs per call) |
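To make the interpolation step concrete, here is a minimal sketch of `{{variable}}` substitution. `renderTemplate` is a hypothetical helper written for illustration, not AgentV's actual renderer, and the choice to leave unknown placeholders untouched is an assumption.

```typescript
// Hypothetical helper for illustration; AgentV's real renderer may behave differently.
type TemplateVars = { [name: string]: string };

function renderTemplate(template: string, vars: TemplateVars): string {
  // Match {{name}} or {{ name }}; names are letters, digits, and underscores.
  return template.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (match: string, name: string) => {
      // Assumption: unknown placeholders are left as-is so typos stay visible.
      return Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match;
    }
  );
}

const rendered = renderTemplate(
  '**Question:** {{input}}\n**Candidate Answer:** {{ output }}',
  { input: 'What is 2+2?', output: '4' }
);
console.log(rendered);
```

Using a replacer function rather than a replacement string means that `$` sequences inside a candidate answer are inserted literally.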
Use prompt: ./path/to/prompt.md for the common relative-path case. Use prompt: file://path/to/prompt.md only when you need to force file-reference resolution explicitly.
Structured task input belongs in input. If input is a message whose content is a JSON object, {{input}} renders that object as formatted JSON for the grader prompt; no separate grader-only input field is required. Use metadata for provenance or suite-level source fields, and rubrics_json for rubric arrays.
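The difference between the formatted and compact JSON variables, and the way a JSON-object `input` might be rendered, can be sketched as follows. `renderInputContent` is a hypothetical name, and the two-space indentation is an assumption; the text above only says the JSON is "formatted".

```typescript
// Hypothetical helper: strings pass through, objects become formatted JSON.
function renderInputContent(content: unknown): string {
  if (typeof content === 'string') return content;
  // Assumption: "formatted JSON" means two-space indentation.
  return JSON.stringify(content, null, 2);
}

const metadata = { row: 1, source_file: 'src/evals/dataset/finance_agent.csv' };

// Multi-line form, as a variable like {{metadata}} might carry it.
const formattedMetadata = JSON.stringify(metadata, null, 2);
// Single-line form, as a variable like {{metadata_json}} might carry it.
const compactMetadata = JSON.stringify(metadata);

console.log(renderInputContent({ company: 'Apple', ticker: 'AAPL' }));
console.log(formattedMetadata);
console.log(compactMetadata);
```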
Suite-level metadata is inherited by every test. When rubric items vary per test, keep the grader on each test and reuse the prompt file:
```yaml
metadata:
  source_repo: https://github.com/virattt/dexter
  source_commit: 8d9419829f443f84b804d033bb2c3b1fbd788629
  source_file: src/evals/dataset/finance_agent.csv

tests:
  - id: apple-research
    input:
      company: Apple
      ticker: AAPL
    metadata:
      row: 1
    assert:
      - name: dexter_semantic
        type: llm-grader
        prompt: file://prompts/dexter-grader.md
        rubrics:
          - operator: correctness
            criteria: Uses the provided ticker and company.
```

Per-Grader Target
By default, an `llm-grader` uses the suite target’s `grader_target`. Override it per grader when you need multiple grader models in one run:

```yaml
assert:
  - name: grader-gpt
    type: llm-grader
    target: grader_gpt_5_mini
    prompt: ./prompts/pass-fail.md
  - name: grader-haiku
    type: llm-grader
    target: grader_claude_haiku
    prompt: ./prompts/pass-fail.md
```

Each `target:` value must match a named LLM target in `.agentv/targets.yaml`.
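If you collect the scores from several grader entries yourself, a majority-vote ensemble could be computed along these lines. This is a sketch of post-hoc aggregation under assumptions: the `GraderVerdict` shape, the `grader-third` entry, and the 0.5 pass threshold are all hypothetical, and nothing here implies AgentV aggregates results this way.

```typescript
// Hypothetical result shape; AgentV's actual result records may differ.
interface GraderVerdict {
  grader: string; // assert entry name, e.g. "grader-gpt"
  score: number;  // score in the range 0..1
}

function majorityVote(verdicts: GraderVerdict[], passThreshold: number): boolean {
  let passes = 0;
  for (const verdict of verdicts) {
    if (verdict.score >= passThreshold) passes += 1;
  }
  // Strict majority: a tie (or an empty panel) counts as a fail.
  return passes * 2 > verdicts.length;
}

const panel: GraderVerdict[] = [
  { grader: 'grader-gpt', score: 0.9 },
  { grader: 'grader-haiku', score: 0.4 },
  { grader: 'grader-third', score: 0.8 }, // hypothetical third grader
];

const panelPassed = majorityVote(panel, 0.5);
console.log(panelPassed); // true: two of the three graders pass
```

An odd panel size avoids ties; with an even panel this sketch treats a split vote as a fail.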
TypeScript Template
For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/sdk`:

```typescript
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/sdk';

// Join the string contents of a message list, skipping non-string content.
function textFromMessages(messages: Array<{ content?: unknown }>): string {
  return messages
    .map((message) => typeof message.content === 'string' ? message.content : '')
    .filter(Boolean)
    .join('\n');
}

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;
  const question = textFromMessages(ctx.input.filter((message) => message.role === 'user'));
  const referenceAnswer = textFromMessages(ctx.expectedOutput);
  const candidateAnswer = ctx.output ?? '';

  // Optional sections are emitted only when their data is present.
  return `You are evaluating an AI assistant's response.

## Question
${question}

## Candidate Answer
${candidateAnswer}

${referenceAnswer ? `## Reference Answer\n${referenceAnswer}` : ''}

${rubric ? `## Evaluation Criteria\n${rubric}` : ''}

Evaluate and provide a score from 0 to 1.`;
});
```

How It Works
- AgentV renders the prompt template with variables from the test
- The rendered prompt is sent to the grader target (configured in targets.yaml)
- The LLM returns a structured evaluation with score, assertions array, and reasoning
- Results are recorded in the output JSONL
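The structured evaluation returned in the third step can be handled defensively, as in the sketch below. The `GraderEvaluation` interface is an assumption built only from the three field names mentioned above; the element type of `assertions` and the exact schema are not specified here, so they are left as `unknown`.

```typescript
// Assumed shape, based only on the field names listed above.
interface GraderEvaluation {
  score: number;
  assertions: unknown[];
  reasoning: string;
}

function parseGraderEvaluation(raw: string): GraderEvaluation {
  const parsed = JSON.parse(raw);
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('grader response is not a JSON object');
  }
  if (typeof parsed.score !== 'number') {
    throw new Error('grader response is missing a numeric score');
  }
  return {
    // Clamp to the 0..1 range used by the scoring guidelines.
    score: Math.min(1, Math.max(0, parsed.score)),
    assertions: Array.isArray(parsed.assertions) ? parsed.assertions : [],
    reasoning: typeof parsed.reasoning === 'string' ? parsed.reasoning : '',
  };
}

const evaluation = parseGraderEvaluation(
  '{"score": 1.4, "assertions": [{"text": "explains the bug"}], "reasoning": "ok"}'
);
console.log(evaluation.score); // 1 (clamped from 1.4)
```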
Command Configuration
When using TypeScript templates, configure them in YAML with optional config data passed to the command:

```yaml
assert:
  - name: custom-eval
    type: llm-grader
    prompt:
      command: [bun, run, ../prompts/custom-grader.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true
```

The `config` object is available as `ctx.config` inside the template function.
Preprocessing File Outputs
If an agent returns a `ContentFile` block instead of plain text, you can preprocess that file into text before `llm-grader` builds the candidate prompt.
AgentV always tries a default UTF-8 text read first. That is enough for text-based formats such as CSV, JSON, SQL, Markdown, YAML, HTML, XML, and plain text. For binary formats such as .xlsx, .pdf, or .docx, add a preprocessor command:
```yaml
preprocessors:
  - type: xlsx
    command: ["bun", "run", "scripts/preprocessors/xlsx-to-csv.ts"]

tests:
  - id: spreadsheet-output
    input: Generate the spreadsheet report
    assert:
      - Output includes the revenue rows
      - name: spreadsheet-check
        type: llm-grader
        prompt: |
          Check whether the transformed spreadsheet text contains the revenue rows:

          {{ output }}
```

`type` accepts either a short alias such as `xlsx` or a full MIME type such as `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`.
Resolution order:
- per-grader `preprocessors` override suite-level entries
- if no preprocessor matches, AgentV falls back to a UTF-8 text read
- if the fallback read looks binary or invalid, the grader receives a warning note instead of failing the test run
See examples/features/preprocessors/ for a runnable example with a file-producing target and a custom preprocessor script.
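The resolution order described above can be sketched roughly as follows. All names here (`Preprocessor`, `resolvePreprocessor`, `looksBinary`, the alias table, and the `scripts/custom-xlsx.ts` path) are hypothetical, and the NUL-byte check is only one plausible heuristic for "looks binary"; AgentV's actual logic may differ.

```typescript
// Hypothetical types: field names mirror the YAML, everything else is assumed.
interface Preprocessor {
  type: string;      // short alias ("xlsx") or full MIME type
  command: string[]; // argv of the converter command
}

type Resolution =
  | { kind: 'preprocessor'; command: string[] }
  | { kind: 'utf8-fallback' };

// Only the alias named in this section; a real table would be larger.
const MIME_ALIASES: { [alias: string]: string } = {
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
};

function normalizeType(type: string): string {
  const key = type.toLowerCase();
  return Object.prototype.hasOwnProperty.call(MIME_ALIASES, key) ? MIME_ALIASES[key] : key;
}

function resolvePreprocessor(
  fileType: string,
  graderLevel: Preprocessor[],
  suiteLevel: Preprocessor[]
): Resolution {
  const wanted = normalizeType(fileType);
  // Per-grader entries are checked first, so they override suite-level ones.
  for (const list of [graderLevel, suiteLevel]) {
    for (const entry of list) {
      if (normalizeType(entry.type) === wanted) {
        return { kind: 'preprocessor', command: entry.command };
      }
    }
  }
  return { kind: 'utf8-fallback' };
}

// Crude heuristic (an assumption): a NUL byte signals binary content.
function looksBinary(bytes: Uint8Array): boolean {
  for (let i = 0; i < bytes.length; i += 1) {
    if (bytes[i] === 0) return true;
  }
  return false;
}

const suiteLevel: Preprocessor[] = [
  { type: 'xlsx', command: ['bun', 'run', 'scripts/preprocessors/xlsx-to-csv.ts'] },
];
const graderLevel: Preprocessor[] = [
  // Hypothetical per-grader override, declared with the full MIME type.
  {
    type: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    command: ['bun', 'run', 'scripts/custom-xlsx.ts'],
  },
];

const xlsxResolution = resolvePreprocessor('xlsx', graderLevel, suiteLevel);
const pdfResolution = resolvePreprocessor('pdf', graderLevel, suiteLevel);
console.log(xlsxResolution.kind, pdfResolution.kind);
```

When the result is `utf8-fallback` and `looksBinary` returns true, the grader would receive a warning note rather than the raw bytes, matching the last bullet above.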
Available Context Fields
TypeScript templates receive a context object with these fields:
| Field | Type | Description |
|---|---|---|
| `input` | `Message[]` | Full resolved input messages |
| `output` | `string \| null` | Candidate final answer / scored result |
| `answer` | `string` | Same final answer string, exposed for ergonomic handler code |
| `messages` | `Message[]` | Transcript messages from the target execution |
| `criteria` | `string` | Test criteria field |
| `expectedOutput` | `Message[]` | Full resolved expected output |
| `trace` | `Trace` | Full execution trace with messages, events, metrics, and provenance |
| `traceSummary` | `TraceSummary` | Lightweight execution metrics summary |
| `metadata` | `object` | Test metadata after suite defaults are merged |
| `config` | `object` | Custom config from YAML |
The raw prompt-template stdin uses snake_case keys such as expected_output, trace_summary, and token_usage. definePromptTemplate() converts them to SDK camelCase fields before calling your handler.
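The key renaming can be illustrated with a small sketch. `snakeToCamel` and `camelizeTopLevel` are hypothetical helpers, not the SDK's implementation; this version renames top-level keys only, and whether the SDK also converts nested keys is not covered here.

```typescript
// Hypothetical helper: convert one snake_case key to camelCase.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Rename top-level keys only; nested values are copied unchanged.
function camelizeTopLevel(payload: { [key: string]: unknown }): { [key: string]: unknown } {
  const result: { [key: string]: unknown } = {};
  for (const key of Object.keys(payload)) {
    result[snakeToCamel(key)] = payload[key];
  }
  return result;
}

const rawPayload = {
  expected_output: [{ role: 'assistant', content: '4' }],
  trace_summary: { token_usage: 42 },
};
const ctxLike = camelizeTopLevel(rawPayload);
console.log(Object.keys(ctxLike)); // [ 'expectedOutput', 'traceSummary' ]
```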
Template Variable Derivation
Template variables are derived internally through three layers:
1. Authoring Layer
What users write in YAML or JSONL:
- `input` may be a shorthand string or a full message array. `input: "What is 2+2?"` expands to `[{ role: "user", content: "What is 2+2?" }]`.
- `expected_output` may be a shorthand string or a full message array. `expected_output: "4"` expands to `[{ role: "assistant", content: "4" }]`.
2. Resolved Layer
After parsing, canonical message arrays replace the shorthand fields:
- `input: TestMessage[]`: canonical resolved input
- `expected_output: TestMessage[]`: canonical resolved expected output
At this layer, `input` and `expected_output` are always message arrays; the shorthand string forms no longer exist.
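The expansion from the authoring layer to the resolved layer can be sketched as below. `resolveMessages` is a hypothetical helper and the `TestMessage` interface is a simplified assumption; only the string and message-array forms described above are handled.

```typescript
// Simplified, assumed message shape.
interface TestMessage {
  role: string;
  content: unknown;
}

// Hypothetical helper: expand a shorthand string into a one-message array.
function resolveMessages(value: string | TestMessage[], defaultRole: string): TestMessage[] {
  return typeof value === 'string' ? [{ role: defaultRole, content: value }] : value;
}

// Shorthand input defaults to the "user" role, expected output to "assistant".
const resolvedInput = resolveMessages('What is 2+2?', 'user');
const resolvedExpected = resolveMessages('4', 'assistant');

console.log(JSON.stringify(resolvedInput));
console.log(JSON.stringify(resolvedExpected));
```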
3. Template Variable Layer
Derived strings injected into grader prompts:
| Variable | Derivation |
|---|---|
| `criteria` | Passed through from the test field |
| `input` | Resolved input text |
| `expected_output` | Reference answer text |
| `output` | Candidate answer text |
| `metadata_json` | Test metadata, compact JSON |
| `file_changes` | Unified diff of workspace file changes (populated when workspace is configured) |
| `tool_calls` | Formatted summary of tool calls from agent execution (tool name + key inputs per call) |
Example flow:
```yaml
# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input: [{ role: "user", content: "What is 2+2?" }]
expected_output: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
input: "What is 2+2?"
expected_output: "The answer is 4"
output: (extracted from provider output at runtime)
```