Testing AI agents is hard: manual testing is slow, real users are expensive, and LLM non-determinism makes assertions tricky. mimiq solves this with simulated users that follow scripts, plus deterministic checks on tool calls and terminal states.
mimiq is a complete TypeScript solution for testing AI agents with simulated users. It provides:
No Python required. Everything runs in Node.js.
npm install @gojiplus/mimiq --save-dev
export OPENAI_API_KEY=your-key
# Optional: use a different model
export SIMULATOR_MODEL=gpt-4o # default
cypress.config.ts
import { defineConfig } from "cypress";
import { setupMimiqTasks, createLocalRuntime } from "@gojiplus/mimiq/node";
export default defineConfig({
e2e: {
baseUrl: "http://localhost:5173",
setupNodeEvents(on, config) {
const runtime = createLocalRuntime({
scenesDir: "./scenes",
});
setupMimiqTasks(on, { runtime });
return config;
},
},
});
cypress/support/e2e.ts
import { createDefaultChatAdapter, registerMimiqCommands } from "@gojiplus/mimiq";
registerMimiqCommands({
browserAdapter: createDefaultChatAdapter({
transcript: '[data-test="transcript"]',
messageRow: '[data-test="message-row"]',
messageRoleAttr: "data-role",
messageText: '[data-test="message-text"]',
input: '[data-test="chat-input"]',
send: '[data-test="send-button"]',
idleMarker: '[data-test="agent-idle"]',
}),
});
scenes/return_backpack.yaml
id: return_backpack
description: Customer returns a backpack
starting_prompt: "I'd like to return an item please."
conversation_plan: |
Goal: Return the hiking backpack from order ORD-10031.
- Provide order ID when asked.
- Cooperate with all steps.
persona: cooperative
max_turns: 15
expectations:
required_tools:
- lookup_order
- create_return
forbidden_tools:
- issue_refund
allowed_terminal_states:
- return_created
judges:
- name: empathy
rubric: "The agent maintained a professional and empathetic tone."
samples: 3
describe("return flow", () => {
afterEach(() => cy.mimiqCleanupRun());
it("processes valid return", () => {
cy.visit("/");
cy.mimiqStartRun({ sceneId: "return_backpack" });
cy.mimiqRunToCompletion();
cy.mimiqEvaluate().then((report) => {
expect(report.passed).to.eq(true);
});
});
});
id: string # Unique identifier
description: string # Human-readable description
starting_prompt: string # First message from simulated user
conversation_plan: string # Instructions for user behavior
persona: string # Preset: cooperative, frustrated_but_cooperative, adversarial, vague, impatient
max_turns: number # Maximum turns (default: 15)
context: # World state (optional)
customer: { ... }
orders: { ... }
expectations:
required_tools: [string] # Must be called
forbidden_tools: [string] # Must NOT be called
allowed_terminal_states: [string] # Valid end states
forbidden_terminal_states: [string]
required_agents: [string] # For multi-agent systems
forbidden_agents: [string]
required_agent_tools: # Agent-specific tool requirements
agent_name: [tool1, tool2]
judges: # LLM-as-judge evaluations
- name: string
rubric: string
samples: number # Number of samples (default: 5)
model: string # Model to use (default: gpt-4o)
| Preset | Description |
|---|---|
cooperative |
Helpful, provides information directly |
frustrated_but_cooperative |
Mildly frustrated but ultimately cooperative |
adversarial |
Tries to push boundaries, social-engineer exceptions |
vague |
Gives incomplete information, needs follow-up |
impatient |
Wants fast resolution, short answers |
Add qualitative evaluation with LLM judges:
expectations:
judges:
- name: empathy
rubric: "The agent maintained an empathetic tone throughout."
samples: 5
- name: accuracy
rubric: "All factual claims were grounded in tool results."
Judges use majority voting across multiple samples for reliability.
import { BUILTIN_RUBRICS } from "@gojiplus/mimiq";
// Available rubrics:
BUILTIN_RUBRICS.TASK_COMPLETION
BUILTIN_RUBRICS.INSTRUCTION_FOLLOWING
BUILTIN_RUBRICS.TONE_EMPATHY
BUILTIN_RUBRICS.POLICY_COMPLIANCE
BUILTIN_RUBRICS.FACTUAL_GROUNDING
BUILTIN_RUBRICS.TOOL_USAGE_CORRECTNESS
BUILTIN_RUBRICS.ADVERSARIAL_ROBUSTNESS
| Command | Description |
|---|---|
cy.mimiqStartRun({ sceneId }) |
Start a simulation |
cy.mimiqRunToCompletion() |
Run until done or max turns |
cy.mimiqRunTurn() |
Execute one turn |
cy.mimiqEvaluate() |
Run all checks and judges |
cy.mimiqGetTrace() |
Get conversation trace |
cy.mimiqCleanupRun() |
Clean up |
| Variable | Description |
|---|---|
OPENAI_API_KEY |
API key for simulation and judges |
SIMULATOR_MODEL |
Model for simulation (default: gpt-4o) |
JUDGE_MODEL |
Model for judges (default: gpt-4o) |
OPENAI_BASE_URL |
Base URL for OpenAI-compatible API |
mimiq generates rich, interactive HTML reports. See examples:
Generate reports after tests:
npm run test:report # Runs tests and opens report
┌─────────────────────────────────────────────────────────────────────────┐
│ mimiq │
│ │
│ Browser Layer (Cypress): │
│ - Captures UI state via data-test selectors │
│ - Executes actions (type, click, send) │
│ │
│ Node Layer (Cypress tasks): │
│ - Simulator: LLM generates user messages │
│ - Trace: records conversation + tool calls │
│ - Check: validates against expectations │
│ - Judge: LLM-as-judge evaluation │
│ - Reports: generates HTML summaries │
└─────────────────────────────────────────────────────────────────────────┘
MIT