Agent Skills evals.json Adapter

Overview

Agent Skills uses evals.json for lightweight, skill-scoped datasets. Each entry has a prompt, an optional expected outcome, optional fixture files, and natural-language assertions or expectations.

AgentV treats evals.json as input to a built-in read adapter, not as a native core eval format. Detection requires a top-level skill_name string and a top-level evals array, so arbitrary .json files are still rejected. You can run a detected Agent Skills file directly:

agentv eval evals.json --target claude

Or convert it to AgentV EVAL YAML when you want to edit the generated suite:

agentv convert evals.json --out EVAL.yaml
agentv eval EVAL.yaml --target claude

This keeps AgentV’s core authoring formats focused on YAML, JSONL, and TypeScript while still making Agent Skills suites easy to onboard. The boundary is the same adapter layer used for external datasets: external schema in, AgentV-native cases at runtime.
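The detection rule from the overview (a top-level skill_name string plus an evals array) can be sketched in a few lines. This is only an illustration of the rule as stated here, not AgentV's actual implementation; the function name looks_like_agent_skills_evals is hypothetical.

```python
import json
from typing import Any


def looks_like_agent_skills_evals(data: Any) -> bool:
    """Return True when parsed JSON has a top-level skill_name string and evals array.

    Hypothetical helper: it mirrors the rule described in this guide and makes
    no claim about how AgentV performs detection internally.
    """
    return (
        isinstance(data, dict)
        and isinstance(data.get("skill_name"), str)
        and isinstance(data.get("evals"), list)
    )


# An Agent Skills file is detected; an arbitrary JSON document is not.
detected = looks_like_agent_skills_evals(
    json.loads('{"skill_name": "csv-analyzer", "evals": []}')
)
rejected = looks_like_agent_skills_evals(json.loads('{"name": "other", "items": []}'))
print(detected, rejected)
```

Checking both the key and its type is what keeps unrelated .json files, including ones that happen to contain an evals key of the wrong shape, out of the adapter path.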

Quick Start

Create evals.json:

{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in evals/files/sales.csv. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output includes exactly 3 months",
        "Revenue figures are included for each month"
      ]
    }
  ]
}

Run it directly or convert it first:

agentv eval evals.json --target claude

agentv convert evals.json --out EVAL.yaml
agentv eval EVAL.yaml --target claude

The --target flag selects the agent harness. The agent evaluates itself; skills load through the normal agent runtime.

CLI Surface

Run a detected Agent Skills file directly:

agentv eval evals.json --target claude --output .agentv/results/csv-analyzer

Import the definition into editable AgentV YAML without running a target:

agentv convert evals.json --out EVAL.yaml

Prepare one converted case for a human or external agent without running the target provider:

agentv prepare EVAL.yaml --test-id "1" --target claude --out .agentv/prepared/csv-analyzer-1

agentv import is reserved for agent session transcripts and selected external datasets such as Hugging Face. Agent Skills evals.json uses the eval read adapter for execution and convert for definition import.

Field Mapping

The read adapter promotes evals.json fields into AgentV cases. The converter writes the same mapping to YAML:

| evals.json | EVAL.yaml output | Notes |
| --- | --- | --- |
| `prompt` | `input` | Written as prompt text |
| `expected_output` | `criteria` + `llm-rubric` criterion | Agent Skills uses this as expected outcome/rubric context, not AgentV passive reference-data `expected_output` |
| `assertions[]` | `llm-rubric` criteria | Strings are grouped into one rubric with one criterion per assertion |
| `expectations[]` | `llm-rubric` criteria | Same handling as `assertions[]` |
| `files[]` | `input_files` | Resolved relative to the `evals.json` file |
| `skill_name` | `tags.skill`, `description` | Used for suite grouping |
| `id` | `id` | Converted to a string |

The generated llm-rubric assertion emits per-criterion grading rows in AgentV artifacts just like other assertion entries. The converted YAML is the editable source of truth after conversion.
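To make the mapping concrete, here is a minimal sketch that turns one evals.json entry into a dictionary shaped like a converted test case. It is not the converter's code: the function name to_agentv_case is hypothetical, and the choice to number expectations[] in the same assertion-N sequence as assertions[] is an assumption this guide does not confirm.

```python
from typing import Any


def to_agentv_case(entry: dict[str, Any]) -> dict[str, Any]:
    """Map one evals.json entry to a dict shaped like a converted test case."""
    criteria: list[dict[str, Any]] = []
    expected = entry.get("expected_output")
    if expected:
        # expected_output becomes rubric context, not passive reference data.
        criteria.append({"id": "expected-outcome", "outcome": expected, "required": True})

    # Assumption: assertions[] and expectations[] share one numbering sequence.
    statements = [*entry.get("assertions", []), *entry.get("expectations", [])]
    for number, text in enumerate(statements, start=1):
        criteria.append({"id": f"assertion-{number}", "outcome": text, "required": True})

    # id is always written as a string; prompt becomes input.
    case: dict[str, Any] = {"id": str(entry["id"]), "input": entry["prompt"]}
    if expected:
        case["criteria"] = expected
    if entry.get("files"):
        case["input_files"] = list(entry["files"])
    if criteria:
        case["assert"] = [
            {"name": "agent-skills-criteria", "type": "llm-rubric", "value": criteria}
        ]
    return case


example = to_agentv_case(
    {
        "id": 1,
        "prompt": "Find the top 3 months by revenue.",
        "expected_output": "The top 3 months by revenue are November, September, and December.",
        "files": ["evals/files/sales.csv"],
        "assertions": ["Output identifies November as the highest revenue month"],
    }
)
print(example["id"], [c["id"] for c in example["assert"][0]["value"]])
```

The sketch leaves out the suite-level fields (tags.skill and description from skill_name) because those belong to the file as a whole rather than to an individual case.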

Files

evals.json file paths map to AgentV input_files:

tests:
  - id: "1"
    input_files:
      - evals/files/sales.csv
    input: "Analyze the sales data."

Use workspace.repos when the eval should materialize a repository before those fixture paths are read.
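The field mapping notes that files[] entries are resolved relative to the evals.json file. A rough sketch of that rule follows; the helper name resolve_fixture and the skills/csv-analyzer directory layout are hypothetical, chosen only for illustration.

```python
from pathlib import PurePosixPath


def resolve_fixture(evals_json: PurePosixPath, relative: str) -> PurePosixPath:
    """Resolve a files[] entry against the directory that holds evals.json.

    Hypothetical helper illustrating the documented rule; AgentV's own
    resolution logic may handle absolute paths or workspaces differently.
    """
    return evals_json.parent / relative


# Hypothetical layout: the skill lives under skills/csv-analyzer/.
resolved = resolve_fixture(
    PurePosixPath("skills/csv-analyzer/evals.json"), "evals/files/sales.csv"
)
print(resolved)
```

PurePosixPath keeps the example independent of the host operating system; real code reading fixtures from disk would more likely use pathlib.Path.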

Offline Grading

Grade existing agent sessions offline by importing transcripts, then running either the evals.json file through the read adapter or the converted YAML:

agentv import claude --list
agentv import claude --session-id <uuid>

agentv eval evals.json --target copilot-log

If another tool owns the original evals.json, keep that file as the source and run it through the read adapter. Convert only when you need to edit the AgentV-native form.

Converted YAML

The converter writes comments that point to native AgentV features:

# Converted from Agent Skills evals.json
# Agent Skills expected_output is treated as expected outcome/rubric context,
# not as AgentV expected_output reference data.
# AgentV features you can add:
#   - type: is-json, contains, regex for deterministic graders
#   - type: script for custom scoring scripts
#   - type: llm-rubric value arrays with weights and score ranges for rubrics
#   - Multi-turn conversations via input message arrays
#   - Multiple assertions with weighted scoring
#   - Workspace isolation with repos and hooks

tags:
  skill: "csv-analyzer"
metadata:
  source_adapter: "agent-skills-evals-json"

tests:
  - id: "1"
    criteria: |-
      The top 3 months by revenue are November, September, and December.
    input: "Find the top 3 months by revenue."
    input_files:
      - "evals/files/sales.csv"
    # Promoted from evals.json expected_output, assertions[], and expectations[]
    # Replace with type: is-json, contains, or regex for deterministic checks
    assert:
      - name: agent-skills-criteria
        type: llm-rubric
        value:
          - id: "expected-outcome"
            outcome: "The top 3 months by revenue are November, September, and December."
            required: true
          - id: "assertion-1"
            outcome: "Output identifies November as the highest revenue month"
            required: true

From there you can add deterministic graders, workspace isolation, multi-turn inputs, target-specific configuration, or script graders in normal AgentV YAML.

When to Keep evals.json

Keep evals.json when another Agent Skills tool owns that file or when you are packaging a skill for an ecosystem that expects it. Use AgentV’s read adapter directly:

agentv eval evals/evals.json --target claude --output .agentv/results/csv-analyzer

Use AgentV YAML directly when AgentV owns the eval lifecycle.

Side-by-Side

evals.json

{
  "skill_name": "support-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "A customer says their order #12345 hasn't arrived after 2 weeks. Help them.",
      "expected_output": "An empathetic response that offers to track the order and provides next steps.",
      "assertions": [
        "Response acknowledges the customer's frustration",
        "Response offers to look up order #12345",
        "Response provides clear next steps"
      ]
    }
  ]
}

EVAL.yaml

tests:
  - id: "1"
    input: |
      A customer says their order #12345 hasn't arrived after 2 weeks. Help them.
    criteria: |
      An empathetic response that offers to track the order and provides next steps.
    assert:
      - name: agent-skills-criteria
        type: llm-rubric
        value:
          - id: expected-outcome
            outcome: "An empathetic response that offers to track the order and provides next steps."
            required: true
          - id: assertion-1
            outcome: "Response acknowledges the customer's frustration"
            required: true
          - id: assertion-2
            outcome: "Response offers to look up order #12345"
            required: true
          - id: assertion-3
            outcome: "Response provides clear next steps"
            required: true
      - name: order-number
        type: contains
        value: "12345"

The YAML version can mix rubric criteria with deterministic checks. The order-number assertion is a deterministic contains check, so it needs no LLM call and adds effectively no latency or cost.
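As an illustration of why a deterministic check is so cheap, a contains grader reduces to a substring test. The sketch below is a simplified stand-in, not AgentV's grader: the function name contains_check is hypothetical, and the real grader may normalize case or whitespace.

```python
def contains_check(output: str, needle: str) -> bool:
    """Pass when the expected substring appears anywhere in the agent output."""
    return needle in output


reply = "I'm sorry for the delay. Let me look up order #12345 for you."
print(contains_check(reply, "12345"))
```

Unlike a rubric criterion, this check gives the same verdict on every run, which makes it a useful guard alongside the LLM-graded criteria.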