Provider Adapters¶
The package is generic in the middle and provider-specific at the edges.
CapacityProfile¶
A provider adapter boils down to a CapacityProfile:
throughput_per_unit
purchase_increment
min_units
input_weight
cached_input_weight
output_weight
thinking_weight
optional long-context overrides
That is enough to turn requests into adjusted work and then into required reserved units.
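To make the two-step conversion concrete, here is a minimal sketch. `SketchProfile`, `adjusted_work`, and `required_units` are hypothetical names written for this illustration, not slosizer's actual classes or functions. It also assumes that `throughput_per_unit` is measured in adjusted tokens per second and that load arrives at a steady rate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchProfile:
    """Hypothetical stand-in for CapacityProfile, not slosizer's real class."""

    throughput_per_unit: float  # adjusted tokens per second per unit (assumed unit)
    purchase_increment: int = 1  # units are sold in multiples of this
    min_units: int = 1  # smallest commitment the provider accepts
    input_weight: float = 1.0
    cached_input_weight: float = 0.1
    output_weight: float = 4.0
    thinking_weight: float = 4.0


def adjusted_work(p, fresh_input, cached_input, output, thinking=0):
    """Weight each token class and sum into one adjusted-token figure."""
    return (
        fresh_input * p.input_weight
        + cached_input * p.cached_input_weight
        + output * p.output_weight
        + thinking * p.thinking_weight
    )


def required_units(p, adjusted_per_second):
    """Round the raw unit count up to a purchasable quantity."""
    raw_units = adjusted_per_second / p.throughput_per_unit
    increments = math.ceil(raw_units / p.purchase_increment)
    return max(p.min_units, increments * p.purchase_increment)


profile = SketchProfile(throughput_per_unit=3360.0)
per_request = adjusted_work(profile, fresh_input=1000, cached_input=0, output=200)
units = required_units(profile, adjusted_per_second=per_request * 10)  # 10 requests/s
```

In this sketch a request with 1,000 fresh input tokens and 200 output tokens counts as 1,800 adjusted tokens, and ten such requests per second round up to 6 units.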
Vertex AI GSU¶
The package ships built-in Vertex AI profiles based on Google Cloud’s provisioned throughput documentation.
Available Models¶
| Model | Throughput per GSU | Output Weight | Long Context |
|---|---|---|---|
| gemini-2.0-flash-001 | 3,360 | 4x | No |
| gemini-2.0-flash-lite-001 | 6,720 | 4x | No |
| gemini-2.5-flash | 2,690 | 9x | Yes (>200k) |
| gemini-2.5-flash-lite | 8,070 | 4x | No |
| gemini-2.5-pro | 650 | 8x | Yes (>200k) |
| gemini-3.1-flash-lite-preview | 4,030 | 6x | No |
Token Burndown Rates¶
Vertex AI uses different burndown rates for input vs output tokens:
Input tokens: 1x weight (baseline)
Cached input tokens: 0.1x weight (90% discount)
Output tokens: 4-9x weight depending on model
Thinking tokens: Same as output weight
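As a worked example of these rates, the sketch below computes the adjusted tokens a single request burns. The `burndown` helper and the request sizes are illustrative assumptions, not part of slosizer. The output weights are copied from the Available Models table, and "input" here means uncached input only.

```python
# Output weights copied from the Available Models table.
OUTPUT_WEIGHT = {
    "gemini-2.0-flash-001": 4.0,
    "gemini-2.0-flash-lite-001": 4.0,
    "gemini-2.5-flash": 9.0,
    "gemini-2.5-flash-lite": 4.0,
    "gemini-2.5-pro": 8.0,
    "gemini-3.1-flash-lite-preview": 6.0,
}
INPUT_WEIGHT = 1.0
CACHED_INPUT_WEIGHT = 0.1


def burndown(model, fresh_input, cached_input, output, thinking=0):
    """Adjusted tokens one request burns; thinking uses the output weight."""
    w_out = OUTPUT_WEIGHT[model]
    return (
        fresh_input * INPUT_WEIGHT
        + cached_input * CACHED_INPUT_WEIGHT
        + (output + thinking) * w_out
    )


# Same hypothetical request priced on two models.
request = dict(fresh_input=2000, cached_input=8000, output=500, thinking=300)
flash = burndown("gemini-2.5-flash", **request)
lite = burndown("gemini-2.5-flash-lite", **request)
```

With these numbers the request costs about 10,000 adjusted tokens on gemini-2.5-flash and about 6,000 on gemini-2.5-flash-lite. The whole difference comes from the 9x versus 4x weight on the 800 output and thinking tokens.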
Long Context Threshold¶
For models with long context support, requests exceeding 200,000 input tokens use elevated weights:
Input: 2x (instead of 1x)
Output: 12x (instead of 8-9x)
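The threshold rule can be sketched as a simple weight switch. The names below are hypothetical. The sketch assumes the threshold is checked against the request's total input tokens, and it leaves cached input out because this page does not state a long-context weight for it.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # input tokens

# (input_weight, output_weight) below the threshold, from the table.
STANDARD_WEIGHTS = {
    "gemini-2.5-flash": (1.0, 9.0),
    "gemini-2.5-pro": (1.0, 8.0),
}
LONG_CONTEXT_WEIGHTS = (2.0, 12.0)


def request_weights(model, input_tokens):
    """Pick elevated weights only when input strictly exceeds the threshold."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        return LONG_CONTEXT_WEIGHTS
    return STANDARD_WEIGHTS[model]


def long_context_burndown(model, input_tokens, output_tokens):
    """Adjusted tokens for one request under the selected weights."""
    w_in, w_out = request_weights(model, input_tokens)
    return input_tokens * w_in + output_tokens * w_out


at_threshold = long_context_burndown("gemini-2.5-pro", 200_000, 1_000)
over_threshold = long_context_burndown("gemini-2.5-pro", 250_000, 1_000)
```

Under these assumptions a gemini-2.5-pro request with exactly 200,000 input tokens burns 208,000 adjusted tokens, while one with 250,000 input tokens burns 512,000. Crossing the threshold doubles the weight on every input token, not only on the tokens past the threshold.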
Usage¶
import slosizer as slz
profile = slz.vertex_profile("gemini-2.5-flash")
These profiles are text-centric. If you use images, audio, video, or other token classes, add columns and extend the profile before trusting the numbers.
Azure OpenAI PTU¶
Azure PTU support is calibration-first. PTU behavior is highly workload-sensitive, so we don’t ship built-in profiles.
Reference: Azure OpenAI Provisioned Throughput
Key Characteristics¶
Workload-sensitive: Throughput varies significantly based on prompt/completion ratios
Token ratio: For GPT-4.1 and later, 1 output token ≈ 4 input tokens
Calibration required: Use Azure capacity calculator + benchmarks
Calibration Process¶
1. Use the Azure capacity calculator to estimate baseline throughput.
2. Deploy with your actual workload and measure via Azure Monitor.
3. Refine the profile based on observed throughput.
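One way to turn a measurement into a `throughput_per_unit` value is sketched below. `calibrated_throughput_per_unit` is a hypothetical helper, not part of slosizer, and the figures are invented for illustration rather than taken from a real deployment. It assumes the token counts were observed over one time window while the deployment was close to fully utilized; a measurement taken at low utilization would understate per-unit capacity.

```python
def calibrated_throughput_per_unit(
    units_deployed,
    input_tokens,
    output_tokens,
    input_weight=1.0,
    output_weight=4.0,
):
    """Adjusted tokens per unit per window, from one saturated measurement."""
    if units_deployed <= 0:
        raise ValueError("units_deployed must be positive")
    adjusted = input_tokens * input_weight + output_tokens * output_weight
    return adjusted / units_deployed


# Illustrative numbers only: 50 units serving 400k input and 50k output
# tokens in the measurement window.
estimate = calibrated_throughput_per_unit(
    units_deployed=50,
    input_tokens=400_000,
    output_tokens=50_000,
)
```

Here the estimate works out to 12,000 adjusted tokens per unit per window. Repeating the measurement across several windows and prompt mixes gives a sense of how much the figure moves with the workload.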
Usage¶
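import slosizer as slz

# No built-in Azure values exist; every number here should come from
# your own calibration.
profile = slz.azure_profile(
    "gpt-4.1",
    throughput_per_unit=12000.0,
    input_weight=1.0,
    output_weight=4.0,  # 1 output token ≈ 4 input tokens
    thinking_weight=4.0,
)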
Anthropic Claude (Planned)¶
Status: Not Yet Implemented
Anthropic doesn’t offer a provisioned throughput model like Vertex GSU or Azure PTU. Claude uses tier-based rate limits, which don’t map cleanly to slosizer’s capacity unit model. A future version may add support for modeling Claude rate limits, but there is currently no built-in anthropic_profile() function.
Reference: Anthropic Rate Limits