# Synthetic example walkthrough

Run the demo:

```bash
uv run python examples/quickstart.py
```

The script writes these files:

- [`examples/output/comparison.csv`](../examples/output/comparison.csv)
- [`examples/output/latency_vs_capacity.png`](../examples/output/latency_vs_capacity.png)
- [`examples/output/required_units_distribution.png`](../examples/output/required_units_distribution.png)
- [`examples/output/scenario_benefit.png`](../examples/output/scenario_benefit.png)
- [`examples/output/percentile_tradeoff.png`](../examples/output/percentile_tradeoff.png)
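
To inspect the CSV without opening a spreadsheet, a minimal sketch using only the standard library might look like the following. The helper name `load_comparison` is hypothetical, and the code assumes nothing about the column names beyond what the file's own header row provides.

```python
import csv
from pathlib import Path


def load_comparison(path):
    """Read the comparison CSV into a list of dicts keyed by the file's header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Path taken from the list of generated files above.
csv_path = Path("examples/output/comparison.csv")
if csv_path.exists():
    for row in load_comparison(csv_path):
        print(row)
else:
    print(f"{csv_path} not found; run the demo first")
```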

## What the synthetic workload is doing

The synthetic trace contains three request classes:

- chat
- rag
- reasoning

The optimized scenario applies four changes:

- prompt compression
- more caching
- tighter generation caps
- reduced thinking-token budgets
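
As an illustration of how these four changes could act on a single request, here is a rough sketch. Everything in it is an assumption made for the example: the `Request` fields, the `optimize` and `billable_tokens` helpers, and all parameter values are hypothetical and are not taken from `examples/quickstart.py`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Request:
    kind: str               # "chat", "rag", or "reasoning"
    prompt_tokens: int
    cached_fraction: float  # share of prompt tokens served from cache
    output_tokens: int
    thinking_tokens: int


def optimize(req, *, compression=0.7, extra_cache=0.2, output_cap=512, thinking_cap=1024):
    """Apply the four changes with hypothetical parameter values."""
    return replace(
        req,
        prompt_tokens=round(req.prompt_tokens * compression),            # prompt compression
        cached_fraction=min(1.0, req.cached_fraction + extra_cache),     # more caching
        output_tokens=min(req.output_tokens, output_cap),                # tighter generation cap
        thinking_tokens=min(req.thinking_tokens, thinking_cap),          # reduced thinking budget
    )


def billable_tokens(req):
    """Tokens that still need compute: uncached prompt plus generated tokens."""
    uncached_prompt = req.prompt_tokens * (1.0 - req.cached_fraction)
    return uncached_prompt + req.output_tokens + req.thinking_tokens


baseline_request = Request("reasoning", 2000, 0.1, 800, 4000)
optimized_request = optimize(baseline_request)
print(billable_tokens(baseline_request), billable_tokens(optimized_request))
```

Long reasoning requests shrink the most under these assumed caps, which is one way an optimization can pull in the tail more than the median.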

## Current synthetic results

| scenario | objective | target | recommended units | avg spare fraction (1s) | overload probability (1s) | achieved latency quantile |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| baseline | latency | p95 <= 1.5s | 5 | 0.718 | 0.030 | 1.315s |
| baseline | latency | p99 <= 1.5s | 7 | 0.794 | 0.006 | 1.428s |
| baseline | throughput | p99 units, overload <= 1% | 7 | 0.794 | 0.006 | - |
| optimized | latency | p95 <= 1.5s | 4 | 0.713 | 0.032 | 1.157s |
| optimized | latency | p99 <= 1.5s | 5 | 0.766 | 0.012 | 1.278s |
| optimized | throughput | p99 units, overload <= 1% | 6 | 0.804 | 0.005 | - |
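
To make the comparison between scenarios explicit, the recommended-unit columns can be reduced to a saving per target. The snippet below is a small worked calculation on the numbers copied from the table; the helper name `unit_savings` is introduced only for this example.

```python
# Recommended units copied from the table above: (baseline, optimized).
recommended_units = {
    "latency p95 <= 1.5s": (5, 4),
    "latency p99 <= 1.5s": (7, 5),
    "throughput p99, overload <= 1%": (7, 6),
}


def unit_savings(baseline, optimized):
    """Return (units saved, fractional reduction relative to baseline)."""
    saved = baseline - optimized
    return saved, saved / baseline


savings = {target: unit_savings(b, o) for target, (b, o) in recommended_units.items()}

for target, (saved, fraction) in savings.items():
    print(f"{target}: {saved} fewer unit(s), {fraction:.0%} reduction")
```

In this run the largest saving is on the p99 latency target, where the optimized scenario needs 2 fewer units than the baseline.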

## Rendered plots

### Latency vs capacity

![Latency vs capacity](assets/latency_vs_capacity.png)

### Distribution of required units

![Required units distribution](assets/required_units_distribution.png)

### Optimization benefit

![Scenario benefit](assets/scenario_benefit.png)

### Percentile vs slack trade-off

![Slack trade-off](assets/percentile_tradeoff.png)

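The important pattern is the direction, not the exact numbers. Planning to a stricter tail percentile (p99 instead of p95) requires more units and therefore leaves more average slack: in the baseline, 5 units with a spare fraction of 0.718 versus 7 units with 0.794. Prompt and token optimizations shorten the tail and reduce the reserved capacity needed for the same target: for p99 latency, the recommendation drops from 7 units to 5.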
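
The link between percentile and slack can be reproduced on a toy demand series. The sketch below is not the demo's implementation: the lognormal demand, the nearest-rank percentile rule, and the definitions of spare fraction and overload probability are all assumptions chosen for illustration.

```python
import math
import random


def plan_units(demand, percentile):
    """Smallest whole number of units covering `percentile` of samples (nearest-rank)."""
    ordered = sorted(demand)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return math.ceil(ordered[rank - 1])


def spare_and_overload(demand, units):
    """Assumed definitions: mean unused share of capacity, and share of samples above capacity."""
    spare = sum(max(0.0, units - d) for d in demand) / (units * len(demand))
    overload = sum(d > units for d in demand) / len(demand)
    return spare, overload


# Hypothetical per-second demand in units, heavy-tailed and seeded for repeatability.
rng = random.Random(0)
demand = [rng.lognormvariate(0.0, 0.8) for _ in range(10_000)]

units_p95 = plan_units(demand, 95)
units_p99 = plan_units(demand, 99)
spare_p95, overload_p95 = spare_and_overload(demand, units_p95)
spare_p99, overload_p99 = spare_and_overload(demand, units_p99)

print(f"p95: {units_p95} units, spare {spare_p95:.3f}, overload {overload_p95:.3f}")
print(f"p99: {units_p99} units, spare {spare_p99:.3f}, overload {overload_p99:.3f}")
```

Under these assumptions, the p99 plan never needs fewer units than the p95 plan, so it has at least as much spare capacity and no higher overload probability.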