# Synthetic example walkthrough
Run the demo:

```bash
uv run python examples/quickstart.py
```
It generates the output files shown in the sections below.
## What the fake workload is doing
The synthetic trace contains three request classes:

- `chat`
- `rag`
- `reasoning`

The optimized scenario applies four changes:

- prompt compression
- more caching
- tighter generation caps
- reduced thinking-token budgets
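The shape of that workload can be sketched in a few lines. This is a hypothetical illustration, not the demo's actual generator: the class names match the trace, but the token counts and optimization factors below are invented for the example.

```python
import random

# Illustrative per-class token budgets (prompt, generation, thinking).
# The class names match the synthetic trace; the numbers are made up.
CLASSES = {
    "chat":      {"prompt": 400,  "gen": 200, "think": 0},
    "rag":       {"prompt": 2000, "gen": 300, "think": 0},
    "reasoning": {"prompt": 600,  "gen": 400, "think": 1500},
}

def sample_request(cls, optimized=False, rng=random):
    """Return total tokens for one request of the given class.

    The `optimized` branch applies the four changes from the text
    as simple multiplicative factors (all values hypothetical).
    """
    p = CLASSES[cls]
    prompt, gen, think = p["prompt"], p["gen"], p["think"]
    if optimized:
        prompt *= 0.7           # prompt compression
        gen = min(gen, 256)     # tighter generation caps
        think *= 0.5            # reduced thinking-token budgets
        if rng.random() < 0.3:  # more caching: some prompts mostly cached
            prompt *= 0.1
    return prompt + gen + think
```

The point of the sketch is that three of the four optimizations act per-request and deterministically, while caching is probabilistic, which is what lets the optimized scenario collapse the load tail rather than just shifting its mean.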
## Current synthetic results
| scenario | objective | target | recommended units | avg spare fraction (1s) | overload probability (1s) | achieved latency quantile |
|---|---|---|---|---|---|---|
| baseline | latency | p95 <= 1.5s | 5 | 0.718 | 0.030 | 1.315s |
| baseline | latency | p99 <= 1.5s | 7 | 0.794 | 0.006 | 1.428s |
| baseline | throughput | p99 units, overload <= 1% | 7 | 0.794 | 0.006 | - |
| optimized | latency | p95 <= 1.5s | 4 | 0.713 | 0.032 | 1.157s |
| optimized | latency | p99 <= 1.5s | 5 | 0.766 | 0.012 | 1.278s |
| optimized | throughput | p99 units, overload <= 1% | 6 | 0.804 | 0.005 | - |
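The throughput rows can be reproduced conceptually from per-second load samples. A minimal sketch, assuming the planner picks the smallest whole-unit capacity whose empirical overload probability stays within the budget (the function name and estimator are hypothetical; the demo's actual method may differ):

```python
import math

def plan_capacity(loads, max_overload=0.01):
    """Pick capacity and report the table's summary columns.

    loads: per-second required capacity (in units, floats).
    Returns (recommended units, avg spare fraction, overload probability),
    where overload probability is the fraction of seconds whose load
    exceeds the chosen capacity.
    """
    n = max(1, math.ceil(min(loads)))
    while sum(l > n for l in loads) / len(loads) > max_overload:
        n += 1
    overload = sum(l > n for l in loads) / len(loads)
    spare = sum(max(0.0, 1 - l / n) for l in loads) / len(loads)
    return n, spare, overload
```

With this framing, "avg spare fraction (1s)" is the mean unused share of the reserved capacity, which is why reserving more units for a stricter tail target also raises the spare fraction.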
## Rendered plots

### Latency vs capacity

### Distribution of required units

### Optimization benefit

### Percentile vs slack trade-off
The important pattern is not the exact numbers: stricter tail planning tends to buy more slack, while prompt/token optimizations can collapse the tail and shrink the reserved-capacity bill.
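The percentile-vs-slack trade-off can be seen directly on a toy sample. A sketch under invented data (the load values and helper names below are illustrative, not from the demo):

```python
import math

def nearest_rank_quantile(xs, q):
    # Nearest-rank empirical quantile of a sample.
    xs = sorted(xs)
    return xs[max(0, math.ceil(q * len(xs)) - 1)]

def spare_fraction(loads, units):
    # Average unused share of a fixed whole-unit capacity.
    return sum(max(0.0, 1 - l / units) for l in loads) / len(loads)

# Illustrative per-second loads (capacity units) with a heavy tail.
loads = [1.0] * 90 + [2.0] * 8 + [6.0] * 2

units_p95 = math.ceil(nearest_rank_quantile(loads, 0.95))
units_p99 = math.ceil(nearest_rank_quantile(loads, 0.99))
# The stricter percentile reserves more units, so the average spare
# fraction rises with it -- the same direction as the table above.
```

On this sample, planning at p99 reserves three times the units of p95, and the extra headroom is exactly the "slack" the trade-off plot is visualizing.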