Formalization¶
slosizer treats reserved-capacity planning as two related problems.
Throughput planning
Convert every request into provider-specific capacity work, then ask how many reserved units are needed so burst windows stay inside budget.
Latency planning
Split end-to-end latency into baseline model latency plus queue delay induced by bursty arrivals and finite reserved capacity.
Generic request representation¶
Each request has:
arrival time t
input tokens I
cached input tokens C
output tokens O
thinking tokens H
A provider profile supplies the burndown weights:
w_in
w_cache
w_out
w_think
The request work is:
B = w_in * I + w_cache * C + w_out * O + w_think * H
If a profile has long-context rules, the weights can change once the input-token threshold is crossed.
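As a concrete sketch, the burndown computation might look like the following. The `ProviderProfile` fields and `request_work` helper are illustrative only, not slosizer's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProviderProfile:
    w_in: float
    w_cache: float
    w_out: float
    w_think: float
    long_context_threshold: Optional[int] = None  # input-token cutoff, if any
    long_context_w_in: Optional[float] = None     # input weight past the cutoff

def request_work(profile: ProviderProfile, I: int, C: int, O: int, H: int) -> float:
    """B = w_in*I + w_cache*C + w_out*O + w_think*H, with an optional
    long-context override of the input weight."""
    w_in = profile.w_in
    if (profile.long_context_threshold is not None
            and profile.long_context_w_in is not None
            and I > profile.long_context_threshold):
        w_in = profile.long_context_w_in
    return w_in * I + profile.w_cache * C + profile.w_out * O + profile.w_think * H
```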
Throughput planning¶
For a window of length Delta, the total work in the bucket is:
D_n(Delta) = sum(B_j for requests in window n)
If one reserved unit serves tau adjusted tokens per second, the required reserved units in that bucket are:
X_n(Delta) = D_n(Delta) / (tau * Delta)
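The bucketed computation can be sketched as follows, assuming each request is an (arrival_time_seconds, B) pair; `required_units_per_bucket` is a hypothetical helper, not slosizer's API:

```python
from collections import defaultdict

def required_units_per_bucket(requests, delta, tau):
    """X_n = D_n / (tau * delta) for each window of length delta seconds."""
    work = defaultdict(float)
    for t, b in requests:
        work[int(t // delta)] += b  # D_n: total work landing in bucket n
    return {n: d / (tau * delta) for n, d in work.items()}
```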
Given a purchased unit count G, that lets us compute:
mean required units
p95 / p99 required units
overload probability P(X_n > G)
expected overflow E[(X_n - G)+]
average spare capacity E[(G - X_n)+]
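A minimal sketch of these metrics, assuming x is the list of per-bucket required units X_n and g is the purchased unit count G; `sizing_metrics` is illustrative, not slosizer's API:

```python
def sizing_metrics(x, g):
    """Fleet-sizing metrics over per-bucket required units x, purchased units g."""
    n = len(x)
    xs = sorted(x)

    def pct(p):
        # percentile via linear interpolation between closest ranks
        k = (n - 1) * p
        lo, hi = int(k), min(int(k) + 1, n - 1)
        return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

    overflow = [max(v - g, 0.0) for v in x]  # (X_n - G)+
    spare = [max(g - v, 0.0) for v in x]     # (G - X_n)+
    return {
        "mean_units": sum(x) / n,
        "p95_units": pct(0.95),
        "p99_units": pct(0.99),
        "overload_probability": sum(v > g for v in x) / n,
        "expected_overflow": sum(overflow) / n,
        "average_spare": sum(spare) / n,
    }
```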
Latency planning¶
Reserved capacity affects latency through queueing, not through the intrinsic model floor.
R = L_base + W
Where:
L_base: model latency with no capacity contention
W: queue delay caused by backlog
The package uses a simple FCFS fluid queue:
Q_(n+1) = max(0, Q_n + arrivals_work - service_rate * elapsed_time)
Queue delay for request j is approximated by:
W_j = backlog_before_j / service_rate
This is not a perfect service simulator; it is a deliberately pragmatic tail-latency approximation.
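The recursion above can be sketched as a short simulation. Requests are assumed to be (arrival_time, work) pairs sorted by arrival time, and `queue_delays` is a hypothetical helper, not slosizer's API:

```python
def queue_delays(requests, service_rate):
    """Return W_j = backlog_before_j / service_rate for each request."""
    backlog = 0.0
    last_t = None
    delays = []
    for t, work in requests:
        if last_t is not None:
            # drain the backlog at the service rate between arrivals
            backlog = max(0.0, backlog - service_rate * (t - last_t))
        delays.append(backlog / service_rate)  # W_j: wait behind existing backlog
        backlog += work                        # this request's work joins the queue
        last_t = t
    return delays
```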
Queue model assumptions¶
The queue model makes several simplifying assumptions:
Single-server FCFS: All capacity is treated as a single aggregate server processing requests first-come-first-served. Real deployments may have multiple replicas with their own queues.
Fluid approximation: Work is treated as continuous rather than discrete tokens. This smooths over per-token generation time variation.
No preemption: Once a request starts, it runs to completion. The model doesn’t account for request cancellation or timeouts.
Deterministic service rate: The service rate is fixed at units * throughput_per_unit. Real systems have variable throughput based on prompt complexity, cache hits, and hardware utilization.
Instantaneous queue joining: Requests join the queue at their arrival time with no network latency.
These assumptions mean the model tends to underestimate tail latencies when:
Workload is highly variable (bursty arrivals with long gaps)
Requests have significantly different sizes
The system operates near saturation
For safety margins, use headroom_factor to add buffer capacity.
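Applying headroom is just a multiplicative buffer on the sized requirement; this helper is hypothetical and only illustrates the idea behind the headroom_factor parameter:

```python
import math

def units_with_headroom(required_units: float, headroom_factor: float) -> int:
    """Scale the sized requirement by the headroom factor and round up
    to whole reserved units."""
    return math.ceil(required_units * headroom_factor)
```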
Why percentile choice matters¶
Optimizing for p95 usually buys fewer reserved units and therefore lower average slack.
Optimizing for p99 buys more headroom and therefore lower overload probability, but also more idle capacity on average.
That trade-off is not a bug. It is the whole game.
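A toy numeric illustration of the trade-off, with entirely synthetic X_n values (only the direction of the effect matters):

```python
def pct(xs, p):
    # percentile via linear interpolation between closest ranks
    xs = sorted(xs)
    k = (len(xs) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

# 100 buckets of required units: mostly ~10, with a bursty tail
x = [10.0] * 90 + [14.0] * 8 + [20.0, 25.0]

g95 = pct(x, 0.95)  # buy to p95
g99 = pct(x, 0.99)  # buy to p99

overload_95 = sum(v > g95 for v in x) / len(x)
overload_99 = sum(v > g99 for v in x) / len(x)
spare_95 = sum(max(g95 - v, 0.0) for v in x) / len(x)
spare_99 = sum(max(g99 - v, 0.0) for v in x) / len(x)

# Buying to p99 lowers overload probability but raises average idle capacity.
assert overload_99 <= overload_95
assert spare_99 >= spare_95
```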