# Provider Adapters

The package is generic in the middle and provider-specific at the edges.

## CapacityProfile

A provider adapter boils down to a `CapacityProfile`:

- `throughput_per_unit`
- `purchase_increment`
- `min_units`
- `input_weight`
- `cached_input_weight`
- `output_weight`
- `thinking_weight`
- optional long-context overrides (a token threshold plus the elevated weights that apply above it)

These fields are enough to convert each request's token counts into weighted ("adjusted") work, and that work into the number of reserved units required.
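
As a rough illustration of that pipeline, the sketch below uses a stand-in dataclass with the fields listed above. `SketchProfile`, `adjusted_work`, and `required_units` are hypothetical names for this example, not slosizer's actual API, and the rounding rule (round up to the purchase increment, then enforce the minimum) is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchProfile:
    """Hypothetical stand-in for a CapacityProfile (not slosizer's real class)."""

    throughput_per_unit: float  # adjusted tokens per second that one unit sustains
    purchase_increment: int = 1
    min_units: int = 1
    input_weight: float = 1.0
    cached_input_weight: float = 1.0
    output_weight: float = 1.0
    thinking_weight: float = 1.0


def adjusted_work(p, input_tokens, cached_tokens, output_tokens, thinking_tokens):
    """Weight each token class and sum into a single adjusted-token count."""
    return (
        input_tokens * p.input_weight
        + cached_tokens * p.cached_input_weight
        + output_tokens * p.output_weight
        + thinking_tokens * p.thinking_weight
    )


def required_units(p, adjusted_tokens_per_sec):
    """Assumed rule: round up to the purchase increment, then enforce the minimum."""
    raw = adjusted_tokens_per_sec / p.throughput_per_unit
    steps = math.ceil(raw / p.purchase_increment)
    return max(p.min_units, steps * p.purchase_increment)


# Made-up profile and traffic, purely for illustration.
profile = SketchProfile(throughput_per_unit=1000.0, output_weight=4.0, thinking_weight=4.0)
work = adjusted_work(profile, input_tokens=2000, cached_tokens=0, output_tokens=500, thinking_tokens=0)
units = required_units(profile, adjusted_tokens_per_sec=work * 3)  # 3 such requests per second
```

Keeping the weighting and the rounding in separate functions mirrors the two-step description above: provider-specific weights shape the work, while purchase rules shape the final unit count.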

## Vertex AI GSU

The package ships built-in Vertex AI profiles based on [Google Cloud's provisioned throughput documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/supported-models).

### Available Models

| Model | Throughput per GSU (tokens/sec) | Output Weight | Long Context |
|-------|-------------------|---------------|--------------|
| gemini-2.0-flash-001 | 3,360 | 4x | No |
| gemini-2.0-flash-lite-001 | 6,720 | 4x | No |
| gemini-2.5-flash | 2,690 | 9x | Yes (>200k) |
| gemini-2.5-flash-lite | 8,070 | 4x | No |
| gemini-2.5-pro | 650 | 8x | Yes (>200k) |
| gemini-3.1-flash-lite-preview | 4,030 | 6x | No |

### Token Burndown Rates

Vertex AI burns down provisioned throughput at a different rate for each token class:

- **Input tokens**: 1x weight (baseline)
- **Cached input tokens**: 0.1x weight (90% discount)
- **Output tokens**: 4x to 9x weight, depending on the model (see the table above)
- **Thinking tokens**: Same weight as output tokens
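
To make the arithmetic concrete, here is a small worked example using the weights above with `gemini-2.5-flash`'s 9x output weight and its 2,690 per-GSU throughput from the table. The request shape and the 10 requests per second rate are made-up numbers, `burndown` is a hypothetical helper rather than a slosizer function, and the GSU estimate assumes per-second throughput and ignores purchase increments and minimums.

```python
import math

# Weights from the list above; 9.0 is gemini-2.5-flash's output weight from the table.
WEIGHTS = {"input": 1.0, "cached_input": 0.1, "output": 9.0, "thinking": 9.0}
THROUGHPUT_PER_GSU = 2690  # gemini-2.5-flash, from the table


def burndown(tokens: dict[str, int]) -> float:
    """Sum token counts weighted by their burndown rate."""
    return sum(WEIGHTS[kind] * count for kind, count in tokens.items())


# Illustrative request shape (made-up numbers).
request = {"input": 1000, "cached_input": 4000, "output": 300, "thinking": 200}
adjusted = burndown(request)  # 1000 + 400 + 2700 + 1800, about 5,900 adjusted tokens

# At an assumed 10 such requests per second, before purchase rules are applied:
gsus = math.ceil(adjusted * 10 / THROUGHPUT_PER_GSU)
```

Note how the 500 generated tokens (output plus thinking) account for most of the adjusted work even though they are a small fraction of the raw token count.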

### Long Context Threshold

For models with long context support, requests exceeding 200,000 input tokens use elevated weights:

- Input: 2x (instead of 1x)
- Output: 12x (instead of 8-9x)
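
The threshold rule can be sketched as a simple weight selection. This is an illustrative helper, not slosizer's implementation; `weights_for` is a hypothetical name, and the sketch assumes the elevated weights apply to every token in a request once its input exceeds the threshold.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # input tokens, from the text above


def weights_for(input_tokens: int, base_output_weight: float) -> dict[str, float]:
    """Pick input/output weights for one request (assumes whole-request elevation)."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        return {"input": 2.0, "output": 12.0}
    return {"input": 1.0, "output": base_output_weight}


def adjusted_tokens(input_tokens: int, output_tokens: int, base_output_weight: float) -> float:
    """Weighted token count for one request, ignoring cached and thinking tokens."""
    w = weights_for(input_tokens, base_output_weight)
    return input_tokens * w["input"] + output_tokens * w["output"]


# gemini-2.5-pro has an 8x base output weight in the table.
short_request = adjusted_tokens(150_000, 1_000, base_output_weight=8.0)
long_request = adjusted_tokens(250_000, 1_000, base_output_weight=8.0)
```

With these made-up request sizes, crossing the threshold more than triples the adjusted work, so a small share of long-context traffic can dominate sizing.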

### Usage

```python
import slosizer as slz

profile = slz.vertex_profile("gemini-2.5-flash")
```

These profiles are text-centric. If you use images, audio, video, or other token classes, add columns and extend the profile before trusting the numbers.

## Azure OpenAI PTU

Azure PTU support is calibration-first. PTU behavior is highly workload-sensitive, so we don't ship built-in profiles.

Reference: [Azure OpenAI Provisioned Throughput](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput)

### Key Characteristics

- **Workload-sensitive**: Throughput varies significantly with the ratio of prompt tokens to completion tokens
- **Token ratio**: For GPT-4.1 and later, 1 output token counts as roughly 4 input tokens
- **Calibration required**: Combine the Azure capacity calculator with benchmarks of your own workload

### Calibration Process

1. Use the [Azure capacity calculator](https://oai.azure.com/portal/calculator) to estimate baseline throughput
2. Deploy with your actual workload and measure via Azure Monitor
3. Refine the profile based on observed throughput
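
Step 3 can be sketched as follows: take throughput samples measured while the deployment is near saturation, weight them, and divide by the number of deployed units. `estimate_throughput_per_unit` is a hypothetical helper, not part of slosizer; the sample values are made up, and using the median to damp outliers is an assumption rather than a recommendation from Azure.

```python
from statistics import median


def estimate_throughput_per_unit(
    samples: list[tuple[float, float]],
    deployed_units: int,
    input_weight: float = 1.0,
    output_weight: float = 4.0,
) -> float:
    """Estimate adjusted throughput per unit from observed traffic.

    samples: (input_tokens_per_interval, output_tokens_per_interval) pairs,
    all measured over the same interval length while near saturation.
    """
    if deployed_units <= 0:
        raise ValueError("deployed_units must be positive")
    if not samples:
        raise ValueError("at least one sample is required")
    adjusted = [i * input_weight + o * output_weight for i, o in samples]
    return median(adjusted) / deployed_units


# Made-up measurements from a 50-unit deployment.
observed = [(20000, 2500), (22000, 2250), (18000, 2750)]
calibrated = estimate_throughput_per_unit(observed, deployed_units=50)
```

The resulting figure is only as good as the benchmark: it should be re-measured whenever the prompt/completion mix or the model version changes.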

### Usage

```python
import slosizer as slz

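# The numbers below are placeholders; replace them with values from your own calibration.
profile = slz.azure_profile(
    "gpt-4.1",
    throughput_per_unit=12000.0,  # measured throughput per PTU
    input_weight=1.0,
    output_weight=4.0,  # 1 output token ≈ 4 input tokens
    thinking_weight=4.0,
)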
```

## Anthropic Claude (Planned)

> **Status: Not Yet Implemented**
>
> Anthropic doesn't offer a provisioned throughput model like Vertex GSU or Azure PTU. Claude uses tier-based rate limits, which don't map cleanly to slosizer's capacity unit model. A future version may add support for modeling Claude rate limits, but there is currently no built-in `anthropic_profile()` function.

Reference: [Anthropic Rate Limits](https://docs.anthropic.com/en/api/rate-limits)