Run evaluations

What you will do

Run an eval suite, write artifacts, and compare variants.

Dry run

memory eval run --suite evals/examples/memory-smoke --condition full-memory --profile offline --dry-run

Paired run shape

memory eval run --suite evals/suites/memory-improvement-v1 --condition no-memory --condition full-memory --allow-shell --repeat 5
memory eval compare --baseline 'target/memory-evals/*no-memory*.json' --candidate 'target/memory-evals/*full-memory*.json' --text

Use --allow-shell only after reviewing the suite's scripts and fixtures. A suite that executes shell commands runs code on your machine, so treat it as code to be reviewed, not as a passive data file.
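
One way to start that review is to list the files in the suite directory that look like scripts. The sketch below is only an illustration: list_suite_scripts is a hypothetical helper, not part of the memory CLI, and it assumes that scripts are either marked executable or end in .sh. A suite may also embed commands in non-executable files such as manifests, so this narrows where to look but does not replace reading the suite.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the memory CLI): list files that are
# worth reading before passing --allow-shell. It reports anything that is
# executable by its owner, plus any file with a .sh extension.
list_suite_scripts() {
  find "$1" -type f \( -perm -u+x -o -name '*.sh' \) | sort
}

# Suite path taken from the paired-run command above.
suite=evals/suites/memory-improvement-v1
if [ -d "$suite" ]; then
  list_suite_scripts "$suite"
fi
```

An empty listing does not prove a suite is safe; it only means no file matched these two heuristics.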

External retrievers

memory eval run --suite evals/suites/memory-improvement-v1 --condition full-memory --retriever-cmd './my-retriever' --allow-shell

Verify

Inspect the JSON artifacts under target/memory-evals/ and compare item-level results before making claims.
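
Before comparing, it can help to confirm that both globs passed to memory eval compare actually match artifacts. The following is a minimal sketch, assuming the artifact file names contain the condition name as the globs in the paired-run example imply. count_matches is a hypothetical helper, not part of the memory CLI, and it does not read the JSON contents, whose schema is not described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the memory CLI): count the files that
# match a glob pattern. Prints 0 when nothing matches.
count_matches() {
  local n=0 f
  # $1 is passed quoted and expanded here on purpose so the shell globs it.
  # Patterns containing spaces would be split and are not handled.
  for f in $1; do
    # An unmatched glob stays as the literal pattern, so check existence.
    [ -e "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

# Globs taken from the compare command in the paired-run example.
baseline=$(count_matches 'target/memory-evals/*no-memory*.json')
candidate=$(count_matches 'target/memory-evals/*full-memory*.json')
echo "baseline artifacts: $baseline, candidate artifacts: $candidate"

if [ "$baseline" -eq 0 ] || [ "$candidate" -eq 0 ]; then
  echo "warning: one side of the comparison has no artifacts" >&2
fi
```

If one side has no artifacts, or the two counts differ unexpectedly, rerun the missing condition before reading anything into the comparison.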

Next

Read Interpreting results and Limitations.

Ablation tests: Compare no-memory and memory-enabled variants item by item.

Benchmark reports: Read and write evaluation reports from immutable artifacts.
