Provider Adapters

The package is generic in the middle and provider-specific at the edges.

CapacityProfile

A provider adapter boils down to a CapacityProfile:

  • throughput_per_unit

  • purchase_increment

  • min_units

  • input_weight

  • cached_input_weight

  • output_weight

  • thinking_weight

  • optional long-context overrides

These fields are enough to convert a request's token counts into weighted ("adjusted") work, and that work into the number of reserved units required.
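The conversion can be sketched in a few lines. This is an illustration only, not slosizer's implementation: SketchProfile is a hypothetical stand-in that reuses the field names listed above. The sketch assumes throughput_per_unit is measured in adjusted tokens per second, that units are bought in multiples of purchase_increment, and that min_units is a floor on the purchase.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchProfile:
    """Hypothetical stand-in for CapacityProfile; not the package's class."""

    throughput_per_unit: float  # assumed: adjusted tokens per second per unit
    purchase_increment: int     # assumed: units are bought in multiples of this
    min_units: int              # assumed: smallest purchasable quantity
    input_weight: float
    cached_input_weight: float
    output_weight: float
    thinking_weight: float


def adjusted_tokens(profile, *, input_tokens, cached_input_tokens=0,
                    output_tokens=0, thinking_tokens=0):
    """Weight each token class and sum into one 'adjusted work' figure."""
    return (
        profile.input_weight * input_tokens
        + profile.cached_input_weight * cached_input_tokens
        + profile.output_weight * output_tokens
        + profile.thinking_weight * thinking_tokens
    )


def required_units(profile, adjusted_tokens_per_second):
    """Round raw unit demand up to the next purchasable quantity."""
    raw_units = adjusted_tokens_per_second / profile.throughput_per_unit
    increments = math.ceil(raw_units / profile.purchase_increment)
    return max(profile.min_units, increments * profile.purchase_increment)


# Illustrative numbers: 1,000 input and 200 output tokens per request,
# arriving at a steady 10 requests per second.
profile = SketchProfile(3360.0, 1, 1, 1.0, 0.1, 4.0, 4.0)
per_request = adjusted_tokens(profile, input_tokens=1000, output_tokens=200)
units = required_units(profile, per_request * 10)
```

The sketch sizes for a steady average rate only; it says nothing about bursts or latency targets.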

Vertex AI GSU

The package ships built-in Vertex AI profiles based on Google Cloud’s provisioned throughput documentation.

Available Models

| Model | Throughput per GSU (tokens/sec) | Output Weight | Long Context |
|---|---|---|---|
| gemini-2.0-flash-001 | 3,360 | 4x | No |
| gemini-2.0-flash-lite-001 | 6,720 | 4x | No |
| gemini-2.5-flash | 2,690 | 9x | Yes (>200k) |
| gemini-2.5-flash-lite | 8,070 | 4x | No |
| gemini-2.5-pro | 650 | 8x | Yes (>200k) |
| gemini-3.1-flash-lite-preview | 4,030 | 6x | No |

Token Burndown Rates

Vertex AI applies a different burndown rate to each token type:

  • Input tokens: 1x weight (baseline)

  • Cached input tokens: 0.1x weight (90% discount)

  • Output tokens: 4-9x weight depending on model

  • Thinking tokens: Same as output weight
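As a worked example of these rates, the sketch below computes the burndown for a single request. The output weights are copied from the Available Models table; the function name and the assumption that cached and uncached input tokens are counted separately are illustrative, not part of the package. It ignores the long-context threshold.

```python
# Output weights copied from the Available Models table.
OUTPUT_WEIGHT = {
    "gemini-2.0-flash-001": 4.0,
    "gemini-2.0-flash-lite-001": 4.0,
    "gemini-2.5-flash": 9.0,
    "gemini-2.5-flash-lite": 4.0,
    "gemini-2.5-pro": 8.0,
    "gemini-3.1-flash-lite-preview": 6.0,
}
INPUT_WEIGHT = 1.0         # baseline
CACHED_INPUT_WEIGHT = 0.1  # 90% discount


def burndown_tokens(model, input_tokens, cached_input_tokens,
                    output_tokens, thinking_tokens):
    """Return the weighted token count for one request.

    Assumes input_tokens excludes cached tokens (the two counts are disjoint)
    and that thinking tokens burn down at the model's output weight.
    """
    output_weight = OUTPUT_WEIGHT[model]
    return (
        INPUT_WEIGHT * input_tokens
        + CACHED_INPUT_WEIGHT * cached_input_tokens
        + output_weight * (output_tokens + thinking_tokens)
    )


# 2,000 fresh input + 1,000 cached input + 300 output + 100 thinking tokens
example = burndown_tokens("gemini-2.5-flash", 2000, 1000, 300, 100)
```

Here the 400 output and thinking tokens account for 3,600 of the 5,700 weighted tokens, which is why output-heavy workloads need disproportionately more capacity.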

Long Context Threshold

For models with long context support, requests exceeding 200,000 input tokens use elevated weights:

  • Input: 2x (instead of 1x)

  • Output: 12x (instead of 8-9x)
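A minimal sketch of the weight selection follows. The function and constant names are hypothetical. It assumes "exceeding" means strictly more than 200,000 input tokens and that models without long-context support keep their base weights; the text above does not say how cached-input weights change past the threshold, so the sketch leaves them out.

```python
LONG_CONTEXT_THRESHOLD = 200_000    # input tokens
LONG_CONTEXT_INPUT_WEIGHT = 2.0     # instead of 1x
LONG_CONTEXT_OUTPUT_WEIGHT = 12.0   # instead of 8-9x


def request_weights(input_tokens, base_output_weight,
                    supports_long_context, base_input_weight=1.0):
    """Return (input_weight, output_weight) for a single request."""
    if supports_long_context and input_tokens > LONG_CONTEXT_THRESHOLD:
        return LONG_CONTEXT_INPUT_WEIGHT, LONG_CONTEXT_OUTPUT_WEIGHT
    return base_input_weight, base_output_weight


# gemini-2.5-pro (base output weight 8x) with a 250k-token prompt
long_weights = request_weights(250_000, 8.0, supports_long_context=True)
# the same model with a 150k-token prompt stays on the base weights
short_weights = request_weights(150_000, 8.0, supports_long_context=True)
```

Because the weights switch per request, a small share of very long prompts can dominate the adjusted work even when typical requests are short.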

Usage

import slosizer as slz

profile = slz.vertex_profile("gemini-2.5-flash")

These profiles are text-centric. If you use images, audio, video, or other token classes, add columns and extend the profile before trusting the numbers.

Azure OpenAI PTU

Azure PTU support is calibration-first. PTU behavior is highly workload-sensitive, so we don’t ship built-in profiles.

Reference: Azure OpenAI Provisioned Throughput

Key Characteristics

  • Workload-sensitive: Throughput varies significantly based on prompt/completion ratios

  • Token ratio: For GPT-4.1 and later, 1 output token counts as roughly 4 input tokens

  • Calibration required: Use Azure capacity calculator + benchmarks

Calibration Process

  1. Use the Azure capacity calculator to estimate baseline throughput

  2. Deploy with your actual workload and measure via Azure Monitor

  3. Refine the profile based on observed throughput
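Step 3 can be sketched as a small calculation. The helper below is hypothetical and not part of slosizer: it takes per-minute token rates and a utilization fraction that you have read from your own monitoring, and backs out a per-unit throughput. It assumes throughput_per_unit is expressed in adjusted tokens per second and that utilization scales linearly with load, which may not hold for every workload.

```python
def estimate_throughput_per_unit(input_tokens_per_min, output_tokens_per_min,
                                 deployed_units, utilization,
                                 input_weight=1.0, output_weight=4.0):
    """Estimate adjusted tokens per second that one unit sustains at 100% load.

    utilization is the observed fraction of provisioned capacity in use
    (0 < utilization <= 1) during the measurement window.
    """
    if deployed_units <= 0:
        raise ValueError("deployed_units must be positive")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    adjusted_per_sec = (
        input_weight * input_tokens_per_min
        + output_weight * output_tokens_per_min
    ) / 60.0
    # Scale up to full utilization, then divide across the deployed units.
    return adjusted_per_sec / (deployed_units * utilization)


# Illustrative measurement: 100 units at 50% utilization while serving
# 600k input and 150k output tokens per minute.
estimate = estimate_throughput_per_unit(600_000, 150_000, 100, 0.5)
```

Repeating the measurement at several load levels and comparing the estimates is a simple check on the linearity assumption before the value goes into a profile.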

Usage

import slosizer as slz

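# Example values only; replace them with figures from your own calibration.
profile = slz.azure_profile(
    "gpt-4.1",
    throughput_per_unit=12000.0,  # calibrated throughput for one PTU
    input_weight=1.0,
    output_weight=4.0,  # 1 output token ≈ 4 input tokens (GPT-4.1 and later)
    thinking_weight=4.0,
)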

Anthropic Claude (Planned)

Status: Not Yet Implemented

Anthropic doesn’t offer a provisioned throughput model like Vertex GSU or Azure PTU. Claude uses tier-based rate limits, which don’t map cleanly to slosizer’s capacity-unit model. A future version may add support for modeling Claude rate limits, but there is currently no built-in anthropic_profile() function.

Reference: Anthropic Rate Limits