Synthetic example walkthrough

Run the demo:

```shell
uv run python examples/quickstart.py
```

It writes its result tables and plots to disk; the rendered plots are listed below.

What the fake workload is doing

The synthetic trace contains three request classes:

  • chat
  • rag
  • reasoning
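A minimal sketch of how a synthetic trace with these three request classes might be generated. The class names come from the walkthrough; the token ranges, field names, and `make_trace` helper are illustrative assumptions, not the demo's actual generator.

```python
import random

# Hypothetical per-class (prompt-token range, output-token range);
# the ranges are assumptions chosen to make the classes distinguishable.
REQUEST_CLASSES = {
    "chat": ((50, 400), (50, 300)),
    "rag": ((800, 4000), (100, 500)),
    "reasoning": ((200, 1000), (500, 4000)),
}

def make_trace(n_requests: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible synthetic trace of request records."""
    rng = random.Random(seed)
    trace = []
    for _ in range(n_requests):
        cls = rng.choice(list(REQUEST_CLASSES))
        (p_lo, p_hi), (o_lo, o_hi) = REQUEST_CLASSES[cls]
        trace.append({
            "class": cls,
            "prompt_tokens": rng.randint(p_lo, p_hi),
            "output_tokens": rng.randint(o_lo, o_hi),
        })
    return trace

trace = make_trace(1000)
print(len(trace), sorted({r["class"] for r in trace}))
```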

The optimized scenario applies four changes:

  • prompt compression
  • more caching
  • tighter generation caps
  • reduced thinking-token budgets
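The four changes above can be sketched as a per-request transform. The scaling factors, cap value, and field names here are assumptions for illustration; the demo's actual parameters may differ.

```python
def optimize_request(req: dict) -> dict:
    """Apply the four optimizations to one trace record (illustrative)."""
    out = dict(req)
    # prompt compression: assume ~30% of prompt tokens removed
    out["prompt_tokens"] = int(req["prompt_tokens"] * 0.7)
    # more caching: assume the cache hit rate rises by 0.3 (capped at 1.0)
    out["cache_hit_rate"] = min(1.0, req.get("cache_hit_rate", 0.2) + 0.3)
    # tighter generation caps: assume a hard 1024-token output cap
    out["output_tokens"] = min(req["output_tokens"], 1024)
    # reduced thinking-token budgets: assume the budget is halved
    out["thinking_tokens"] = int(req.get("thinking_tokens", 0) * 0.5)
    return out

req = {"class": "reasoning", "prompt_tokens": 1000,
       "output_tokens": 2000, "thinking_tokens": 3000}
print(optimize_request(req))
```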

Current synthetic results

| scenario  | objective  | target                     | recommended units | avg spare fraction (1s) | overload probability (1s) | achieved latency quantile |
|-----------|------------|----------------------------|-------------------|-------------------------|---------------------------|---------------------------|
| baseline  | latency    | p95 <= 1.5s                | 5                 | 0.718                   | 0.030                     | 1.315s                    |
| baseline  | latency    | p99 <= 1.5s                | 7                 | 0.794                   | 0.006                     | 1.428s                    |
| baseline  | throughput | p99 units, overload <= 1%  | 7                 | 0.794                   | 0.006                     | -                         |
| optimized | latency    | p95 <= 1.5s                | 4                 | 0.713                   | 0.032                     | 1.157s                    |
| optimized | latency    | p99 <= 1.5s                | 5                 | 0.766                   | 0.012                     | 1.278s                    |
| optimized | throughput | p99 units, overload <= 1%  | 6                 | 0.804                   | 0.005                     | -                         |
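The two per-second columns can be reconstructed roughly as follows. This is an illustrative sketch, not the demo's actual definition: `unit_capacity` (the work one reserved unit absorbs per second) and the sample load are assumptions.

```python
def capacity_metrics(load_per_sec: list[float], units: int,
                     unit_capacity: float) -> tuple[float, float]:
    """Average spare fraction and overload probability over 1s buckets.

    Spare fraction in a bucket: unused capacity as a fraction of total
    capacity (floored at 0). Overload probability: fraction of buckets
    where demand exceeds capacity.
    """
    cap = units * unit_capacity
    spare = [max(0.0, cap - x) / cap for x in load_per_sec]
    overload = [x > cap for x in load_per_sec]
    return sum(spare) / len(spare), sum(overload) / len(overload)

# Hypothetical demand per 1-second bucket.
load = [3.0, 4.5, 2.0, 6.0, 1.0]
print(capacity_metrics(load, units=5, unit_capacity=1.0))
```

Adding units raises the spare fraction and drives the overload probability down, which is the trade the table's "recommended units" column is balancing.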

Rendered plots

  • Latency vs capacity
  • Distribution of required units
  • Optimization benefit
  • Percentile vs slack trade-off

The important pattern is not the exact numbers: stricter tail planning tends to buy more slack, while prompt/token optimizations can collapse the tail and shrink the reserved-capacity bill.