Quickstart¶
This guide walks you through creating your first understudy test.
Creating a Scene¶
A Scene defines a test scenario with:
The starting prompt
A conversation plan for the simulated user
World state (context)
Expectations for what should/shouldn’t happen
from understudy import Scene, Persona, Expectations
scene = Scene(
id="return_nonreturnable_item",
starting_prompt="I want to return something I bought.",
conversation_plan="""
Goal: Return earbuds from order ORD-10027.
- If asked for order ID: provide ORD-10027
Return reason: defective.
If denied: accept escalation.
""",
persona=Persona.FRUSTRATED_BUT_COOPERATIVE,
max_turns=20,
context={
"orders": {
"ORD-10027": {
"items": [{"name": "Earbuds", "category": "personal_audio"}],
"status": "delivered",
}
},
"policy": {
"non_returnable_categories": ["personal_audio"],
},
},
expectations=Expectations(
required_tools=["lookup_order", "get_return_policy"],
forbidden_tools=["create_return"],
),
)
Or load from YAML:
scene = Scene.from_file("scenes/return_nonreturnable_item.yaml")
Running a Rehearsal¶
from understudy import run
from understudy.adk import ADKApp
app = ADKApp(agent=your_agent)
trace = run(app, scene)
Making Assertions¶
Assert against the trace (what happened), not the prose:
def test_policy_enforcement():
trace = run(app, scene)
assert trace.called("lookup_order")
assert trace.called("get_return_policy")
assert not trace.called("create_return")
Using LLM Judges¶
For subjective qualities like tone and clarity:
from understudy import Judge, TONE_EMPATHY
judge = Judge(rubric=TONE_EMPATHY, samples=5)
result = judge.evaluate(trace)
assert result.score == 1
assert result.agreement_rate >= 0.6
Running a Suite¶
Run all scenes in a directory:
from understudy import Suite
suite = Suite.from_directory("scenes/")
results = suite.run(app, parallel=4)
results.to_junit_xml("test-results/understudy.xml")
assert results.all_passed
Tagging and Comparing Runs¶
Tag runs for later comparison (e.g., comparing model versions):
from understudy import Suite, RunStorage
storage = RunStorage()
suite = Suite.from_directory("scenes/")
# Tag runs with version
suite.run(app, storage=storage, tags={"version": "v1"})
suite.run(app, storage=storage, tags={"version": "v2"})
Then compare via CLI:
# CLI comparison
understudy compare --tag version --before v1 --after v2
# HTML report
understudy compare --tag version --before v1 --after v2 --html comparison.html
The comparison shows per-scene pass rates when running multiple times.