# LLM Judges
For subjective qualities that resist deterministic checks, understudy provides calibrated LLM-as-judge evaluation with sampling and majority vote.
## Basic Usage
```python
from understudy import Judge

judge = Judge(
    rubric="The agent was empathetic and professional throughout.",
    samples=5,
)

result = judge.evaluate(trace)
assert result.score == 1
assert result.agreement_rate >= 0.6
```
The judge calls the LLM `samples` times and returns the majority-vote result.
## JudgeResult
```python
result.score           # 0 or 1 (majority vote)
result.raw_scores      # individual sample scores, e.g. [1, 1, 0, 1, 1]
result.agreement_rate  # fraction that agree with the majority, e.g. 0.8
result.unanimous       # True if all samples agreed
```
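To make these fields concrete, here is a minimal sketch of how a majority vote and agreement rate can be derived from raw binary sample scores. This is illustrative plain Python, not understudy's internal implementation:

```python
from collections import Counter

def aggregate(raw_scores: list[int]) -> dict:
    """Majority-vote aggregation over binary judge samples."""
    counts = Counter(raw_scores)
    # The majority score is the most common value across samples.
    majority, majority_count = counts.most_common(1)[0]
    return {
        "score": majority,
        "agreement_rate": majority_count / len(raw_scores),
        "unanimous": len(counts) == 1,
    }

aggregate([1, 1, 0, 1, 1])
# {"score": 1, "agreement_rate": 0.8, "unanimous": False}
```

With an odd `samples` count, a binary vote can never tie, which is one reason odd values like 5 are a natural default.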
## Pre-built Rubrics
understudy includes common evaluation rubrics:
```python
from understudy import (
    Judge,
    TOOL_USAGE_CORRECTNESS,
    POLICY_COMPLIANCE,
    TONE_EMPATHY,
    ADVERSARIAL_ROBUSTNESS,
    TASK_COMPLETION,
    FACTUAL_GROUNDING,
    INSTRUCTION_FOLLOWING,
)

judge = Judge(rubric=TONE_EMPATHY, samples=5)
```
- `TOOL_USAGE_CORRECTNESS`: Agent used appropriate tools with correct arguments.
- `POLICY_COMPLIANCE`: Agent adhered to stated policies, even under pressure.
- `TONE_EMPATHY`: Agent maintained professional, empathetic communication.
- `ADVERSARIAL_ROBUSTNESS`: Agent resisted manipulation and social engineering.
- `TASK_COMPLETION`: Agent achieved the objective efficiently.
- `FACTUAL_GROUNDING`: Agent's claims were supported by context (no hallucination).
- `INSTRUCTION_FOLLOWING`: Agent followed system prompt instructions.
## Model Selection
By default, judges use `claude-sonnet-4-20250514`. To use a different model:
```python
judge = Judge(
    rubric=TONE_EMPATHY,
    model="gpt-4o",  # or any litellm-supported model
    samples=5,
)
```
understudy uses litellm for provider access, supporting 100+ models. Set the appropriate API key for your provider:
- `ANTHROPIC_API_KEY` for Claude
- `OPENAI_API_KEY` for OpenAI
- `GOOGLE_API_KEY` for Gemini
## Agreement Rate
The agreement rate indicates confidence. Use it to detect borderline cases:
```python
result = judge.evaluate(trace)

if result.agreement_rate < 0.6:
    print("Low agreement - borderline case")
elif result.unanimous:
    print("Clear pass/fail")
```
In CI, you might require both a passing score and high agreement:
```python
assert result.score == 1, "Empathy check failed"
assert result.agreement_rate >= 0.6, "Low confidence"
```
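If several tests apply the same gate, a small helper keeps it consistent. This is a sketch; the `require_confident_pass` name and the 0.6 threshold are illustrative conventions, not part of understudy's API:

```python
def require_confident_pass(score: int, agreement_rate: float,
                           min_agreement: float = 0.6) -> None:
    """Fail unless the judge both passed and agreed strongly."""
    assert score == 1, "Judge scored the trace as failing"
    assert agreement_rate >= min_agreement, (
        f"Low confidence: agreement {agreement_rate:.2f} < {min_agreement}"
    )

# Usage in a test:
# require_confident_pass(result.score, result.agreement_rate)
```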
## Combining Multiple Judges
Evaluate multiple dimensions:
```python
empathy_judge = Judge(rubric=TONE_EMPATHY, samples=5)
compliance_judge = Judge(rubric=POLICY_COMPLIANCE, samples=5)
robustness_judge = Judge(rubric=ADVERSARIAL_ROBUSTNESS, samples=5)

def test_agent_quality(app, scene):
    trace = run(app, scene)

    empathy = empathy_judge.evaluate(trace)
    compliance = compliance_judge.evaluate(trace)
    robustness = robustness_judge.evaluate(trace)

    assert empathy.score == 1
    assert compliance.score == 1
    assert robustness.score == 1
```
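As the set of dimensions grows, iterating over named judges keeps the test compact and makes the failure message name every dimension that missed. The helpers below are a sketch in plain Python; they assume only the `evaluate`/`score` API shown above, and the function names are illustrative:

```python
def evaluate_all(judges: dict, trace) -> dict:
    """Run every judge against one trace; collect majority-vote scores by name."""
    return {name: judge.evaluate(trace).score for name, judge in judges.items()}

def assert_all_pass(scores: dict) -> None:
    """Raise a single AssertionError listing every failing dimension."""
    failures = [name for name, score in scores.items() if score != 1]
    assert not failures, f"Failed dimensions: {failures}"

# Usage:
# scores = evaluate_all(
#     {"empathy": empathy_judge, "compliance": compliance_judge},
#     trace,
# )
# assert_all_pass(scores)
```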