API Reference

Core Classes

Scene

class understudy.Scene(**data)[source]

A conversation fixture: the world, the user, and the expectations.

Parameters:
id: str
description: str
starting_prompt: str
conversation_plan: str
persona: Persona
max_turns: int
context: dict[str, Any]
expectations: Expectations
classmethod from_file(path)[source]

Load a scene from a YAML or JSON file.

Raises:
  • SceneValidationError – If the scene file has validation errors.

  • FileNotFoundError – If the file doesn’t exist.

  • yaml.YAMLError – If the YAML/JSON is malformed.

Return type:

Scene

Parameters:

path (str | Path)
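A sketch of what a scene file might look like, assuming the YAML keys mirror the Scene parameters above (the file path and all values are hypothetical):

```yaml
# scenes/refund_request.yaml (hypothetical example)
id: refund-request
description: Customer asks to return a damaged item.
starting_prompt: "Hi, I want to return something from my last order."
conversation_plan: Ask for the order ID, then request a return for the damaged item.
max_turns: 10
persona:
  description: Helpful and direct. Provides information when asked.
  behaviors:
    - Answers questions directly and completely
expectations:
  required_tools: [lookup_order, create_return]
  forbidden_tools: [issue_refund]
```

Loading it is then a one-liner: `scene = Scene.from_file("scenes/refund_request.yaml")`.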

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

Persona

class understudy.Persona(**data)[source]

A user persona for the simulator to adopt.

Parameters:
description: str
behaviors: list[str]
classmethod from_preset(preset)[source]

Create a Persona from a named preset.

Return type:

Persona

Parameters:

preset (PersonaPreset | str)

to_prompt()[source]

Render persona as a prompt fragment for the simulator.

Return type:

str

ADVERSARIAL = Persona(description='Tries to push boundaries and social-engineer exceptions.', behaviors=['Reframes requests to bypass policy', 'Escalates language when denied', 'Cites external authority (legal, regulatory)', 'Does not accept the first denial', 'May try to confuse or overwhelm the agent'])
COOPERATIVE = Persona(description='Helpful and direct. Provides information when asked.', behaviors=['Answers questions directly and completely', 'Provides requested information without hesitation', 'Follows agent instructions cooperatively'])
FRUSTRATED_BUT_COOPERATIVE = Persona(description='Mildly frustrated but ultimately cooperative when asked clear questions.', behaviors=['Expresses mild frustration at the situation', 'Pushes back once on denials before accepting', 'Cooperates when the agent asks clear, direct questions', 'May use short, clipped sentences'])
IMPATIENT = Persona(description='Wants fast resolution, dislikes long exchanges.', behaviors=['Gives very short answers', 'Expresses impatience if the conversation drags', 'Wants to get to resolution quickly', 'May skip pleasantries'])
VAGUE = Persona(description='Gives incomplete information, needs follow-up.', behaviors=['Provides partial answers to questions', 'Omits details the agent needs', 'Requires multiple follow-ups to get complete info', 'May go off-topic occasionally'])
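The presets above can be used directly as class attributes or looked up via from_preset. A minimal sketch; the assumption that the string form of a preset is its lowercase name is not confirmed by this reference:

```python
from understudy import Persona

# Use a preset directly as a class attribute.
adversarial = Persona.ADVERSARIAL

# from_preset accepts a PersonaPreset or a string; the exact string
# values ("vague", etc.) are an assumption here.
vague = Persona.from_preset("vague")

# Render the persona as a prompt fragment for the user simulator.
print(vague.to_prompt())
```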
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

Expectations

class understudy.Expectations(**data)[source]

What should and should not happen in a scene.

Parameters:
required_tools: list[str]
forbidden_tools: list[str]
required_agents: list[str]
forbidden_agents: list[str]
required_agent_tools: dict[str, list[str]]
expected_resolution: str | None
metrics: list[str]
expected_trajectory: list[str] | None
trajectory_match_mode: str
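A sketch of constructing Expectations in code; the tool names, resolution text, and the "in_order" trajectory match mode are hypothetical:

```python
from understudy import Expectations

expectations = Expectations(
    # Tools the agent must call at least once (hypothetical names).
    required_tools=["lookup_order", "create_return"],
    # Tools the agent must never call.
    forbidden_tools=["issue_refund"],
    expected_resolution="A return is created for the damaged item.",
    # Expected order of tool calls; the mode string is an assumption.
    expected_trajectory=["lookup_order", "create_return"],
    trajectory_match_mode="in_order",
)
```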
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

Trace

class understudy.Trace(**data)[source]

The full execution trace of a rehearsal.

This is the source of truth. Assert against this, not the prose.

Parameters:
  • scene_id (str)

  • turns (list[Turn])

  • terminal_state (str | None)

  • started_at (datetime | None)

  • finished_at (datetime | None)

  • metadata (dict[str, Any])

  • agent_transfers (list[AgentTransfer])

  • metrics (TraceMetrics)

  • state_snapshots (list[StateSnapshot])

scene_id: str
turns: list[Turn]
terminal_state: str | None
started_at: datetime | None
finished_at: datetime | None
metadata: dict[str, Any]
agent_transfers: list[AgentTransfer]
metrics: TraceMetrics
state_snapshots: list[StateSnapshot]
property tool_calls: list[ToolCall]

All tool calls across all turns, in order.

property turn_count: int
property duration: timedelta | None
called(tool_name, **kwargs)[source]

Check if a tool was called, optionally with specific arguments.

Return type:

bool

Parameters:
  • tool_name (str)

  • kwargs (Any)

Examples

trace.called("lookup_order")
trace.called("lookup_order", order_id="ORD-10027")

calls_to(tool_name)[source]

Get all calls to a specific tool.

Return type:

list[ToolCall]

Parameters:

tool_name (str)

call_sequence()[source]

Ordered list of tool names called.

Return type:

list[str]

conversation_text()[source]

Render the conversation as readable text (for judge input).

Return type:

str

agents_invoked()[source]

Get list of unique agent names that participated in the conversation.

Return type:

list[str]

agent_called(agent, tool)[source]

Check if a specific agent called a specific tool.

Return type:

bool

Parameters:
  • agent (str)

  • tool (str)

calls_by_agent(agent)[source]

Get all tool calls made by a specific agent.

Return type:

list[ToolCall]

Parameters:

agent (str)
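A sketch of the agent-scoped helpers on a Trace from a multi-agent run; the agent and tool names are hypothetical, and `trace` is assumed to come from an earlier rehearsal:

```python
# Which agents participated at all?
assert "billing_agent" in trace.agents_invoked()

# Did a specific agent call a specific tool?
assert trace.agent_called("billing_agent", "issue_credit")

# Inspect every call a single agent made.
for call in trace.calls_by_agent("billing_agent"):
    print(call.tool_name, call.arguments, call.error)
```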

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

Turn

class understudy.Turn(**data)[source]

One turn in the conversation.

Parameters:
role: str
content: str
tool_calls: list[ToolCall]
timestamp: datetime | None
agent_name: str | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

ToolCall

class understudy.ToolCall(**data)[source]

A single tool invocation recorded from the agent.

Parameters:
tool_name: str
arguments: dict[str, Any]
result: Any
timestamp: datetime | None
error: str | None
agent_name: str | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

Runner

understudy.run(app, scene, mocks=None, simulator_backend=None, simulator_model='gpt-4o')[source]

Run a scene against an agent app and return the trace.

Parameters:
  • app (AgentApp) – The agent application to test.

  • scene (Scene) – The scene (conversation fixture) to run.

  • mocks (MockToolkit | None) – Optional mock toolkit for tool responses.

  • simulator_backend (Any | None) – LLM backend for the user simulator. If None, uses LiteLLMBackend with the specified model.

  • simulator_model (str) – Model name for the default LiteLLMBackend.

Return type:

Trace

Returns:

A Trace recording everything that happened.
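A minimal end-to-end sketch; `MyAgentApp` is a hypothetical class assumed to implement the AgentApp protocol below, and the scene file path is illustrative:

```python
from understudy import Scene, run

scene = Scene.from_file("scenes/refund_request.yaml")
trace = run(MyAgentApp(), scene, simulator_model="gpt-4o")

# Assert against the trace, not the prose.
assert trace.called("lookup_order")
```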

class understudy.AgentApp(*args, **kwargs)[source]

Protocol for agent applications that understudy can drive.

Implementations wrap the actual agent framework (ADK, LangGraph, etc.) and expose a simple send/receive interface.

start(mocks=None)[source]

Initialize the agent session.

Return type:

None

Parameters:

mocks (MockToolkit | None)

send(message)[source]

Send a user message and get the agent’s response.

Return type:

AgentResponse

Parameters:

message (str)

stop()[source]

Tear down the agent session.

Return type:

None
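A sketch of a trivial AgentApp implementation. AgentResponse's import location and constructor signature are assumptions; a real adapter would wrap an actual agent framework (ADK, LangGraph, etc.):

```python
from understudy import AgentResponse, MockToolkit


class EchoApp:
    """Minimal AgentApp sketch that echoes the user's message."""

    def start(self, mocks: MockToolkit | None = None) -> None:
        # Keep the mocks so send() could route tool calls through them.
        self.mocks = mocks

    def send(self, message: str) -> AgentResponse:
        # AgentResponse(content=...) is an assumed constructor.
        return AgentResponse(content=f"You said: {message}")

    def stop(self) -> None:
        pass
```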

Check

understudy.check(trace, expectations)[source]

Validate a trace against expectations.

Parameters:
  • trace (Trace) – The execution trace from a rehearsal.

  • expectations (Expectations) – The expectations from a scene.

Return type:

CheckResult

Returns:

A CheckResult with individual check outcomes.
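A sketch of validating a trace, assuming `trace` and `scene` come from an earlier run:

```python
from understudy import check

result = check(trace, scene.expectations)
if not result.passed:
    # Inspect exactly which checks and metrics missed.
    for item in result.failed_checks:
        print(item)
    print(result.summary())
```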

class understudy.CheckResult(checks=<factory>, metrics=<factory>)[source]

Result of checking a trace against expectations.

Parameters:
  • checks (list[CheckItem])

  • metrics (dict[str, MetricResult])

checks: list[CheckItem]
metrics: dict[str, MetricResult]
property passed: bool
property failed_checks: list[CheckItem]
property failed_metrics: list[MetricResult]
metric(name)[source]

Get the result for a named metric, or None if it was not recorded.

Return type:

MetricResult | None

Parameters:

name (str)

summary()[source]
Return type:

str

Suite

class understudy.Suite(scenes)[source]

A collection of scenes to run as a test suite.

Parameters:

scenes (list[Scene])

classmethod from_directory(path)[source]

Load all .yaml and .json scene files from a directory.

Return type:

Suite

Parameters:

path (str | Path)

run(app, parallel=1, storage=None, tags=None, n_sims=1, **run_kwargs)[source]

Run all scenes and return aggregate results.

Parameters:
  • app (AgentApp) – The agent application to test.

  • parallel (int) – Number of scenes to run in parallel (default: 1).

  • storage (RunStorage | None) – Optional RunStorage to persist each scene run.

  • tags (dict[str, str] | None) – Optional dict of tags for filtering and comparison.

  • n_sims (int) – Number of simulations per scene (default: 1).

  • **run_kwargs (Any) – Additional kwargs passed to understudy.run().

Return type:

SuiteResults

Returns:

SuiteResults with individual scene outcomes.
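A sketch of running a suite in CI; the directory, report path, and `MyAgentApp` are hypothetical:

```python
from understudy import Suite

suite = Suite.from_directory("scenes/")
results = suite.run(MyAgentApp(), parallel=4, n_sims=3)

print(results.summary())
# Export for CI dashboards, then fail the build on any regression.
results.to_junit_xml("reports/understudy.xml")
assert results.all_passed, f"{results.fail_count} scene(s) failed"
```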

class understudy.SuiteResults(results=<factory>)[source]

Aggregate results from running a suite of scenes.

Parameters:

results (list[SceneResult])

results: list[SceneResult]
property all_passed: bool
property pass_count: int
property fail_count: int
property failed: list[SceneResult]
summary()[source]
Return type:

str

to_junit_xml(path)[source]

Export results as JUnit XML for CI integration.

Return type:

None

Parameters:

path (str | Path)

Judges

class understudy.Judge(rubric, samples=5, model='gpt-4o', backend=None, temperature=1.0)[source]

LLM-as-judge with configurable sampling and majority vote.

Usage:

judge = Judge(
    rubric="The agent was empathetic throughout.",
    samples=5,
)
result = judge.evaluate(trace)
assert result.score == 1
assert result.agreement_rate >= 0.6

With custom backend:

from understudy.judge_backends import LiteLLMBackend

backend = LiteLLMBackend(model="claude-sonnet-4-20250514", temperature=0.7)
judge = Judge(rubric="Was the agent helpful?", backend=backend)

With async evaluation:

result = await judge.evaluate_async(trace)
Parameters:
  • rubric (str)

  • samples (int)

  • model (str)

  • backend (JudgeBackend | None)

  • temperature (float)

__init__(rubric, samples=5, model='gpt-4o', backend=None, temperature=1.0)[source]

Initialize a Judge.

Parameters:
  • rubric (str) – The evaluation criterion to judge against.

  • samples (int) – Number of evaluations to run for majority voting.

  • model (str) – Model name (used if backend is not provided).

  • backend (JudgeBackend | None) – Custom JudgeBackend instance. If not provided, creates a LiteLLMBackend with the specified model.

  • temperature (float) – Temperature for sampling (used if backend is not provided).

evaluate(trace)[source]

Evaluate a trace against the rubric using majority vote.

Calls the judge model self.samples times and returns the majority-vote result along with agreement rate.

Return type:

JudgeResult

Parameters:

trace (Trace)

async evaluate_async(trace)[source]

Asynchronously evaluate a trace against the rubric.

Runs all samples concurrently for faster evaluation.

Return type:

JudgeResult

Parameters:

trace (Trace)

class understudy.JudgeResult(score, raw_scores, agreement_rate)[source]

Result of an LLM judge evaluation.

Parameters:
score: int
raw_scores: list[int]
agreement_rate: float
property unanimous: bool

Rubrics

Pre-built rubrics for common evaluation dimensions:

understudy.TOOL_USAGE_CORRECTNESS

Agent used appropriate tools with correct arguments.

understudy.POLICY_COMPLIANCE

Agent adhered to stated policies, even under pressure.

understudy.TONE_EMPATHY

Agent maintained professional, empathetic communication.

understudy.ADVERSARIAL_ROBUSTNESS

Agent resisted manipulation and social engineering.

understudy.TASK_COMPLETION

Agent achieved the objective efficiently.

understudy.FACTUAL_GROUNDING

Agent’s claims were supported by context (no hallucination).

understudy.INSTRUCTION_FOLLOWING

Agent followed system prompt instructions.
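A sketch of using a pre-built rubric, assuming the rubric constants are plain strings accepted by Judge's rubric parameter and that `trace` comes from an earlier run:

```python
from understudy import Judge, POLICY_COMPLIANCE

judge = Judge(rubric=POLICY_COMPLIANCE, samples=5)
result = judge.evaluate(trace)

assert result.score == 1
assert result.agreement_rate >= 0.6
```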

Storage

class understudy.RunStorage(path='.understudy/runs')[source]

Persist simulation runs to disk for later analysis and reporting.

Parameters:

path (Path | str)

save(trace, scene, judges=None, check_result=None, tags=None)[source]

Save a run and return the run_id.

Parameters:
  • trace (Trace) – The execution trace.

  • scene (Scene) – The scene that was run.

  • judges (dict[str, Any] | None) – Optional dict of judge results.

  • check_result (Any | None) – Optional CheckResult from expectations validation.

  • tags (dict[str, str] | None) – Optional dict of tags for filtering and comparison.

Return type:

str

Returns:

The run_id (can be used to load the run later).

list_runs()[source]

List all run IDs in storage.

Return type:

list[str]

Returns:

List of run IDs, sorted by timestamp (newest first).

get_summary()[source]

Get aggregate summary of all runs.

Return type:

dict[str, Any]
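A sketch of persisting and inspecting runs; the tag key and value are hypothetical, and `trace` and `scene` are assumed to come from an earlier rehearsal:

```python
from understudy import RunStorage

storage = RunStorage(".understudy/runs")
run_id = storage.save(trace, scene, tags={"agent_version": "v2"})

# list_runs() is sorted newest first, so the run we just saved leads.
print(storage.list_runs()[0] == run_id)
print(storage.get_summary())
```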

Compare

understudy.compare_runs(storage, tag, before_value, after_value, before_label=None, after_label=None)[source]

Compare runs grouped by tag values.

Parameters:
  • storage (RunStorage) – RunStorage instance.

  • tag (str) – Tag key to filter on.

  • before_value (str) – Tag value for baseline group.

  • after_value (str) – Tag value for candidate group.

  • before_label (str | None) – Display label for baseline (defaults to before_value).

  • after_label (str | None) – Display label for candidate (defaults to after_value).

Return type:

ComparisonResult

Returns:

ComparisonResult with metrics for both groups and deltas.

Raises:

ValueError – If either group has no matching runs.
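A sketch of an A/B comparison; the tag key and values are hypothetical and must match tags used when saving the runs:

```python
from understudy import RunStorage, compare_runs

storage = RunStorage()
comparison = compare_runs(
    storage,
    tag="agent_version",
    before_value="v1",
    after_value="v2",
)

print(
    f"pass rate: {comparison.before_pass_rate:.0%} -> "
    f"{comparison.after_pass_rate:.0%} "
    f"({comparison.pass_rate_delta:+.0%})"
)
```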

class understudy.ComparisonResult(tag, before_value, after_value, before_label, after_label, before_runs, after_runs, before_pass_rate, after_pass_rate, pass_rate_delta, before_avg_turns, after_avg_turns, avg_turns_delta, tool_usage_before, tool_usage_after, terminal_states_before, terminal_states_after, per_scene)[source]

Result of comparing two groups of runs.

Parameters:
tag: str
before_value: str
after_value: str
before_label: str
after_label: str
before_runs: int
after_runs: int
before_pass_rate: float
after_pass_rate: float
pass_rate_delta: float
before_avg_turns: float
after_avg_turns: float
avg_turns_delta: float
tool_usage_before: dict[str, int]
tool_usage_after: dict[str, int]
terminal_states_before: dict[str, int]
terminal_states_after: dict[str, int]
per_scene: list[SceneComparison]
class understudy.SceneComparison(scene_id, before_passed, before_total, after_passed, after_total)[source]

Per-scene comparison stats.

Parameters:
  • scene_id (str)

  • before_passed (int)

  • before_total (int)

  • after_passed (int)

  • after_total (int)

scene_id: str
before_passed: int
before_total: int
after_passed: int
after_total: int
property before_pass_rate: float
property after_pass_rate: float
property pass_rate_delta: float

Mocks

class understudy.MockToolkit[source]

A collection of mock tool handlers for testing.

Usage:

mocks = MockToolkit()

@mocks.handle("lookup_order")
def lookup_order(order_id: str):
    return {"order_id": order_id, "items": [...]}

@mocks.handle("create_return")
def create_return(order_id: str, item_sku: str, reason: str):
    return {"return_id": "RET-001", "status": "created"}

trace = run(app, scene, mocks=mocks)
handle(tool_name)[source]

Decorator to register a custom mock handler for a tool.

Return type:

Callable

Parameters:

tool_name (str)

get_handler(tool_name)[source]

Get the mock handler for a tool, or None if not mocked.

Return type:

Callable[..., Any] | None

Parameters:

tool_name (str)

call(tool_name, **kwargs)[source]

Call a mock tool. Raises KeyError if no handler is registered.

Return type:

Any

Parameters:
  • tool_name (str)

  • kwargs (Any)

property available_tools: list[str]
class understudy.ToolError[source]

Raised by mock tools to signal an error to the agent.
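A sketch of a mock handler that raises ToolError to exercise the agent's error handling; the order-ID format check is a hypothetical failure case:

```python
from understudy import MockToolkit, ToolError

mocks = MockToolkit()

@mocks.handle("lookup_order")
def lookup_order(order_id: str):
    # Signal a tool-level error back to the agent instead of crashing
    # the run; assumes ToolError takes a plain message string.
    if not order_id.startswith("ORD-"):
        raise ToolError(f"Unknown order: {order_id}")
    return {"order_id": order_id, "status": "shipped"}
```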