API Reference

Schema

Core data structures for capacity planning.

This module defines the schema classes used throughout slosizer for representing request traces, capacity profiles, SLO targets, and planning results.

class slosizer.schema.OutputTokenSource(*values)

Source for output token counts in capacity planning.

OBSERVED

Use actual observed output token counts from trace data.

MAX_OUTPUT_TOKENS

Use max_output_tokens limit for worst-case planning.

class slosizer.schema.LatencyMetric(*values)

Latency metric for SLO evaluation.

E2E

End-to-end latency including baseline model latency and queue delay.

QUEUE_DELAY

Queue delay only, excluding baseline model latency.

class slosizer.schema.RequestSchema(time_col='ts', class_col='class_name', input_tokens_col='input_tokens', cached_input_tokens_col='cached_input_tokens', output_tokens_col='output_tokens', thinking_tokens_col='thinking_tokens', max_output_tokens_col='max_output_tokens', latency_col='latency_s')

Column mapping for request trace DataFrames.

Parameters:
  • time_col (str)

  • class_col (str | None)

  • input_tokens_col (str)

  • cached_input_tokens_col (str | None)

  • output_tokens_col (str)

  • thinking_tokens_col (str | None)

  • max_output_tokens_col (str | None)

  • latency_col (str | None)

time_col

Column containing request arrival timestamps.

class_col

Column containing request class labels.

input_tokens_col

Column containing input token counts.

cached_input_tokens_col

Column containing cached input token counts.

output_tokens_col

Column containing output token counts.

thinking_tokens_col

Column containing thinking/reasoning token counts.

max_output_tokens_col

Column containing max output token limits.

latency_col

Column containing observed latency in seconds.

class slosizer.schema.RequestTrace(frame, schema, provider=None, model=None, region=None, metadata=<factory>)

Normalized request trace with canonical columns.

Parameters:
  • frame (DataFrame)

  • schema (RequestSchema)

  • provider (str | None)

  • model (str | None)

  • region (str | None)

  • metadata (Mapping[str, Any])

frame

DataFrame with canonical columns (arrival_s, input_tokens, etc.).

schema

Original schema used to parse the trace.

provider

Cloud provider name (e.g., “vertex”, “azure”).

model

Model identifier.

region

Deployment region.

metadata

Additional trace metadata.

class slosizer.schema.CapacityProfile(provider, model, unit_name, throughput_per_unit, purchase_increment=1, min_units=1, input_weight=1.0, cached_input_weight=0.0, output_weight=4.0, thinking_weight=4.0, long_input_threshold=None, long_input_input_weight=None, long_input_cached_input_weight=None, long_input_output_weight=None, long_input_thinking_weight=None, source='', notes=())

Provider-specific capacity configuration.

Defines how tokens translate to reserved capacity units and the constraints on purchasing those units.

Parameters:
  • provider (str)

  • model (str)

  • unit_name (Literal['GSU', 'PTU', 'capacity_unit'])

  • throughput_per_unit (float | None)

  • purchase_increment (int)

  • min_units (int)

  • input_weight (float)

  • cached_input_weight (float)

  • output_weight (float)

  • thinking_weight (float)

  • long_input_threshold (int | None)

  • long_input_input_weight (float | None)

  • long_input_cached_input_weight (float | None)

  • long_input_output_weight (float | None)

  • long_input_thinking_weight (float | None)

  • source (str)

  • notes (tuple[str, ...])

provider

Cloud provider name.

model

Model identifier.

unit_name

Name of capacity unit (e.g., “GSU”, “PTU”).

throughput_per_unit

Tokens per second per capacity unit.

purchase_increment

Minimum increment for purchasing units.

min_units

Minimum number of units that can be provisioned.

input_weight

Token weight multiplier for input tokens.

cached_input_weight

Token weight multiplier for cached input tokens.

output_weight

Token weight multiplier for output tokens.

thinking_weight

Token weight multiplier for thinking tokens.

long_input_threshold

Input token count above which long-context weights apply.

long_input_input_weight

Input weight for long-context requests.

long_input_cached_input_weight

Cached input weight for long-context requests.

long_input_output_weight

Output weight for long-context requests.

long_input_thinking_weight

Thinking weight for long-context requests.

source

Documentation or calibration source for the profile.

notes

Additional notes about the profile.

class slosizer.schema.LatencySLO(threshold_s, percentile=0.99, metric=LatencyMetric.E2E)

Latency service level objective.

threshold_s

Maximum acceptable latency in seconds.

percentile

Target percentile (e.g., 0.99 for p99).

metric

Latency metric to measure (E2E or QUEUE_DELAY).

Raises:

ValueError – If threshold_s <= 0 or percentile not in (0, 1).

Parameters:
  • threshold_s (float)

  • percentile (float)

  • metric (LatencyMetric)

class slosizer.schema.ThroughputTarget(percentile=0.99, max_overload_probability=None, windows_s=(1.0, 5.0, 30.0))

Throughput-based capacity planning target.

percentile

Target percentile for required capacity.

max_overload_probability

Maximum acceptable probability of overload.

windows_s

Time window sizes for bucket analysis.

Raises:

ValueError – If percentile not in (0, 1) or max_overload_probability not in [0, 1].

Parameters:
  • percentile (float | None)

  • max_overload_probability (float | None)

  • windows_s (tuple[float, ...])

label()

Generate a human-readable label for this target.

Return type:

str

Returns:

Descriptive label string.

class slosizer.schema.LatencyTarget(slo)

Latency-based capacity planning target.

Parameters:

slo (LatencySLO)

slo

The latency SLO to meet.

label()

Generate a human-readable label for this target.

Return type:

str

Returns:

Descriptive label string.

class slosizer.schema.BaselineLatencyModel(intercept_s=0.15, input_token_s=3e-05, cached_input_token_s=8e-06, output_token_s=0.0009, thinking_token_s=0.0007)

Linear model for baseline request latency.

Predicts latency as a linear combination of token counts, useful for estimating processing time independent of queueing.

Parameters:
  • intercept_s (float)

  • input_token_s (float)

  • cached_input_token_s (float)

  • output_token_s (float)

  • thinking_token_s (float)

intercept_s

Base latency in seconds.

input_token_s

Seconds per input token.

cached_input_token_s

Seconds per cached input token.

output_token_s

Seconds per output token.

thinking_token_s

Seconds per thinking token.

predict(frame)

Predict baseline latency for each request.

Parameters:

frame (DataFrame) – DataFrame with token count columns.

Return type:

ndarray

Returns:

Array of predicted latencies in seconds.
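
The linear form is simply an intercept plus per-token coefficients. A self-contained sketch of the same arithmetic, using the default coefficients from the signature above (canonical column names assumed from the schema docs; the function here is illustrative, not slosizer's implementation):

```python
import numpy as np
import pandas as pd

def predict_baseline_latency(frame: pd.DataFrame,
                             intercept_s: float = 0.15,
                             input_token_s: float = 3e-05,
                             cached_input_token_s: float = 8e-06,
                             output_token_s: float = 0.0009,
                             thinking_token_s: float = 0.0007) -> np.ndarray:
    """Latency = intercept + sum(coefficient * token count), per request."""
    return (
        intercept_s
        + input_token_s * frame["input_tokens"].to_numpy()
        + cached_input_token_s * frame["cached_input_tokens"].to_numpy()
        + output_token_s * frame["output_tokens"].to_numpy()
        + thinking_token_s * frame["thinking_tokens"].to_numpy()
    )

frame = pd.DataFrame({
    "input_tokens": [1000],
    "cached_input_tokens": [0],
    "output_tokens": [100],
    "thinking_tokens": [0],
})
# 0.15 + 1000*3e-05 + 100*0.0009 = 0.27 seconds
latency = predict_baseline_latency(frame)
```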

class slosizer.schema.PlanOptions(output_token_source=OutputTokenSource.OBSERVED, max_units_to_search=200, headroom_factor=0.0, baseline_latency_model=None)

Options for capacity planning.

output_token_source

Use OBSERVED or MAX_OUTPUT_TOKENS for planning.

max_units_to_search

Maximum capacity units to consider during search.

headroom_factor

Additional capacity buffer as a fraction (e.g., 0.1 for 10%).

baseline_latency_model

Custom latency model; if None, one is fitted.

Raises:

ValueError – If max_units_to_search < 1 or headroom_factor < 0.

Parameters:
  • output_token_source (OutputTokenSource)

  • max_units_to_search (int)

  • headroom_factor (float)

  • baseline_latency_model (BaselineLatencyModel | None)

class slosizer.schema.SimulationResult(units, unit_name, request_level, latency_summary, slack_summary, assumptions)

Results from a capacity simulation.

Parameters:
  • units (int)

  • unit_name (str)

  • request_level (DataFrame)

  • latency_summary (DataFrame)

  • slack_summary (DataFrame)

  • assumptions (dict[str, Any])

units

Number of capacity units simulated.

unit_name

Name of capacity unit.

request_level

Per-request simulation results.

latency_summary

Aggregate latency statistics.

slack_summary

Spare capacity statistics by time window.

assumptions

Simulation parameters and settings.

class slosizer.schema.PlanResult(objective, target, recommended_units, unit_name, metrics, slack_summary, latency_summary=None, request_level=None, assumptions=<factory>)

Results from capacity planning.

Parameters:
  • objective (str)

  • target (str)

  • recommended_units (int)

  • unit_name (str)

  • metrics (dict[str, Any])

  • slack_summary (DataFrame)

  • latency_summary (DataFrame | None)

  • request_level (DataFrame | None)

  • assumptions (dict[str, Any])

objective

Planning objective (“throughput” or “latency”).

target

Human-readable target description.

recommended_units

Recommended number of capacity units.

unit_name

Name of capacity unit.

metrics

Planning metrics and statistics.

slack_summary

Spare capacity statistics.

latency_summary

Latency statistics (for latency planning).

request_level

Per-request results (for latency planning).

assumptions

Planning parameters and settings.

as_dict()

Convert result to a flat dictionary.

Return type:

dict[str, Any]

Returns:

Dictionary with all metrics and metadata.

Ingestion

Request trace ingestion and normalization.

This module provides functions to convert raw DataFrames into normalized RequestTrace objects with canonical column names.

slosizer.ingest.from_dataframe(df, *, schema, provider=None, model=None, region=None, validate=True, metadata=None)

Create a RequestTrace from a DataFrame.

Normalizes column names and validates data according to the schema.

Parameters:
  • df (DataFrame) – Source DataFrame with request data.

  • schema (RequestSchema) – Column mapping for the DataFrame.

  • provider (str | None) – Cloud provider name.

  • model (str | None) – Model identifier.

  • region (str | None) – Deployment region.

  • validate (bool) – Whether to validate data constraints.

  • metadata (dict[str, Any] | None) – Additional trace metadata.

Return type:

RequestTrace

Returns:

Normalized RequestTrace.

Raises:

ValueError – If required columns are missing or validation fails.
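
Conceptually, normalization renames the schema's columns to canonical names and checks basic constraints. A rough pandas sketch of that step, covering only three of the columns (canonical names inferred from the schema docs; this is not slosizer's actual implementation):

```python
import pandas as pd

def normalize(df: pd.DataFrame, *, time_col: str, input_tokens_col: str,
              output_tokens_col: str, validate: bool = True) -> pd.DataFrame:
    """Rename user columns to canonical names and validate basic constraints."""
    out = df.rename(columns={
        time_col: "arrival_s",
        input_tokens_col: "input_tokens",
        output_tokens_col: "output_tokens",
    })
    required = ["arrival_s", "input_tokens", "output_tokens"]
    missing = [c for c in required if c not in out.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if validate and (out[["input_tokens", "output_tokens"]] < 0).any().any():
        raise ValueError("token counts must be non-negative")
    return out

raw = pd.DataFrame({"ts": [0.0, 1.5], "in_tok": [800, 1200], "out_tok": [100, 300]})
trace = normalize(raw, time_col="ts", input_tokens_col="in_tok",
                  output_tokens_col="out_tok")
```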

Simulation

Capacity simulation for queue-based latency modeling.

This module simulates request processing with finite capacity to estimate latency distributions and capacity utilization.

slosizer.simulation.fit_baseline_latency_model(trace)

Fit a linear latency model from observed latencies.

Uses ordinary least squares to fit latency as a function of token counts. Coefficients are constrained to be non-negative. Only rows with valid (non-NaN) latency values are used for fitting.

Parameters:

trace (RequestTrace) – Request trace with observed latencies.

Return type:

BaselineLatencyModel

Returns:

Fitted baseline latency model, or default model if insufficient data.
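
The fit is an ordinary least squares on rows with valid latencies, with coefficients constrained to be non-negative. A simplified numpy sketch (clipping after an unconstrained fit approximates the non-negative constraint; a proper NNLS solver would project correctly):

```python
import numpy as np

def fit_linear_latency(tokens: np.ndarray, latency_s: np.ndarray) -> np.ndarray:
    """Fit latency = intercept + coeffs @ tokens; clip negative coefficients to 0."""
    valid = ~np.isnan(latency_s)                    # only rows with observed latency
    X = np.column_stack([np.ones(valid.sum()), tokens[valid]])
    coef, *_ = np.linalg.lstsq(X, latency_s[valid], rcond=None)
    return np.maximum(coef, 0.0)                    # [intercept, per-token slopes...]

tokens = np.array([[100.0], [200.0], [400.0], [800.0]])
latency = 0.1 + 0.001 * tokens[:, 0]                # perfectly linear toy data
coef = fit_linear_latency(tokens, latency)          # recovers ≈ [0.1, 0.001]
```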

slosizer.simulation.bucket_required_units(frame, profile, *, units, windows_s, output_token_source)

Compute required capacity units per time bucket.

Divides the trace into fixed-width time windows and calculates the capacity units needed to serve all requests in each window.

Parameters:
  • frame (DataFrame) – DataFrame with canonical columns.

  • profile (CapacityProfile) – Capacity profile with throughput settings.

  • units (int) – Reserved capacity units to compare against.

  • windows_s (Iterable[float]) – Time window sizes in seconds.

  • output_token_source (str) – Source for output tokens.

Return type:

DataFrame

Returns:

DataFrame with required_units, spare_units, and overflow_units per bucket.

Raises:

ValueError – If profile.throughput_per_unit is not set.
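
In essence, each request's weighted tokens are summed per fixed-width window, and the units needed are that sum divided by (window length × tokens per second per unit). A hedged numpy sketch of that bucketing (`required_units_per_bucket` is illustrative, not slosizer's function):

```python
import numpy as np

def required_units_per_bucket(arrival_s: np.ndarray,
                              weighted_tokens: np.ndarray,
                              *, window_s: float,
                              throughput_per_unit: float) -> np.ndarray:
    """Units per window = weighted tokens in window / (window_s * tokens/s/unit)."""
    buckets = np.floor(arrival_s / window_s).astype(int)
    totals = np.bincount(buckets, weights=weighted_tokens)
    return totals / (window_s * throughput_per_unit)

arrivals = np.array([0.2, 0.7, 1.1])      # two requests in bucket 0, one in bucket 1
tokens = np.array([500.0, 500.0, 250.0])
units = required_units_per_bucket(arrivals, tokens,
                                  window_s=1.0, throughput_per_unit=100.0)
# bucket 0 needs 1000/100 = 10 units; bucket 1 needs 250/100 = 2.5
```

Spare and overflow units then follow by comparing each bucket against the reserved `units` value.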

slosizer.simulation.summarize_slack(slack_table)

Summarize spare capacity statistics by time window.

Parameters:

slack_table (DataFrame) – Output from bucket_required_units.

Return type:

DataFrame

Returns:

DataFrame with aggregate statistics per window size.

slosizer.simulation.simulate_capacity(trace, profile, *, units, options=None, windows_s=(1.0, 5.0, 30.0))

Simulate request processing with fixed capacity.

Models a simple FIFO queue where requests arrive and are processed at a rate determined by the reserved capacity.

Parameters:
  • trace (RequestTrace) – Request trace to simulate.

  • profile (CapacityProfile) – Capacity profile with throughput settings.

  • units (int) – Number of reserved capacity units.

  • options (PlanOptions | None) – Planning options including output token source.

  • windows_s (tuple[float, ...]) – Time window sizes for slack analysis.

Return type:

SimulationResult

Returns:

SimulationResult with latency and slack statistics.

Raises:

ValueError – If profile.throughput_per_unit is not set or if trace contains fewer than 2 requests.
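
The queue model is a single FIFO server whose service rate is units × throughput_per_unit weighted tokens per second. A minimal sketch of that discipline (illustrative only, not slosizer's implementation):

```python
import numpy as np

def simulate_fifo(arrival_s, weighted_tokens, *, units, throughput_per_unit):
    """Return per-request end-to-end latency under a FIFO queue with fixed capacity."""
    rate = units * throughput_per_unit               # weighted tokens served per second
    free_at = 0.0                                    # time the server next becomes free
    latencies = []
    for t, tok in zip(arrival_s, weighted_tokens):
        start = max(t, free_at)                      # wait if the server is busy
        service = tok / rate
        free_at = start + service
        latencies.append(free_at - t)                # queue delay + service time
    return np.array(latencies)

lat = simulate_fifo([0.0, 0.0], [100.0, 100.0], units=1, throughput_per_unit=100.0)
# first request: 1.0 s of service; second waits 1.0 s then serves 1.0 s → 2.0 s
```

The QUEUE_DELAY metric would correspond to `start - t` alone, excluding service time.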

Planning

Capacity planning algorithms.

This module provides functions to determine optimal reserved capacity based on throughput or latency targets.

slosizer.planning.plan_capacity(trace, profile, target, *, options=None)

Determine optimal reserved capacity for a target.

Searches over candidate capacity levels to find the minimum that satisfies the given throughput or latency target.

Parameters:
  • trace (RequestTrace) – Request trace to plan for.

  • profile (CapacityProfile) – Capacity profile with throughput settings.

  • target (ThroughputTarget | LatencyTarget) – Planning target to satisfy.

  • options (PlanOptions | None) – Planning options.

Return type:

PlanResult

Returns:

PlanResult with recommended capacity and metrics.

Raises:
  • ValueError – If profile.throughput_per_unit is not set.

  • TypeError – If target is not a ThroughputTarget or LatencyTarget.
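
The search itself is a scan over candidate unit counts, returning the smallest that satisfies the target. A toy sketch of that loop (the predicate and names are illustrative; slosizer's search also applies the headroom factor and purchase increment):

```python
def find_min_units(meets_target, *, min_units: int = 1,
                   max_units_to_search: int = 200) -> int:
    """Return the smallest unit count for which meets_target(units) is True."""
    for units in range(min_units, max_units_to_search + 1):
        if meets_target(units):
            return units
    raise ValueError("no capacity within search range satisfies the target")

# Toy predicate: the target is met once capacity covers a peak of 7 units.
units = find_min_units(lambda u: u >= 7)
```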

slosizer.planning.compare_scenarios(scenarios, profile, targets, *, options=None)

Compare capacity requirements across scenarios and targets.

Parameters:
  • scenarios – Named request traces to compare.

  • profile (CapacityProfile) – Capacity profile.

  • targets – Planning targets to evaluate for each scenario.

  • options (PlanOptions | None) – Planning options.

Return type:

DataFrame

Returns:

DataFrame with planning results for each scenario/target combination.

Plotting

Visualization functions for capacity planning results.

slosizer.plotting.plot_latency_vs_units(trace, profile, *, units, options=None, target=None, path=None)

Plot latency percentiles as a function of reserved capacity.

Parameters:
  • trace (RequestTrace) – Request trace to simulate.

  • profile (CapacityProfile) – Capacity profile.

  • units (Iterable[int]) – Capacity unit values to plot.

  • options (PlanOptions | None) – Planning options.

  • target (LatencyTarget | None) – Optional latency target to show as horizontal line.

  • path (str | Path | None) – Optional path to save the figure.

Return type:

None

slosizer.plotting.plot_required_units_distribution(trace, profile, *, windows_s=(1.0, 5.0, 30.0), options=None, path=None)

Plot histogram of required capacity units per time window.

Parameters:
  • trace (RequestTrace) – Request trace to analyze.

  • profile (CapacityProfile) – Capacity profile.

  • windows_s (tuple[float, ...]) – Time window sizes to plot.

  • options (PlanOptions | None) – Planning options.

  • path (str | Path | None) – Optional path to save the figure.

Return type:

None

slosizer.plotting.plot_capacity_tradeoff(comparison, *, path=None)

Plot recommended capacity across scenarios and targets.

Parameters:
  • comparison (DataFrame) – Output from compare_scenarios.

  • path (str | Path | None) – Optional path to save the figure.

Return type:

None

slosizer.plotting.plot_slack_tradeoff(comparison, *, path=None)

Plot spare capacity fraction across scenarios and targets.

Parameters:
  • comparison (DataFrame) – Output from compare_scenarios.

  • path (str | Path | None) – Optional path to save the figure.

Raises:

ValueError – If comparison is missing the avg_spare_fraction_1s column.

Return type:

None

Synthetic Workloads

Synthetic workload generation for testing and demonstration.

slosizer.synthetic.optimize_trace(trace)

Apply prompt optimization to reduce token usage.

Simulates the effect of prompt engineering and caching improvements by reducing input, output, and thinking tokens.

Parameters:

trace (RequestTrace) – Original request trace.

Return type:

RequestTrace

Returns:

Optimized trace with reduced token counts.

slosizer.synthetic.make_synthetic_trace(*, horizon_s=14400, seed=42, scenario='baseline')

Generate a synthetic request trace for testing.

Parameters:
  • horizon_s (int) – Trace duration in seconds.

  • seed (int) – Random seed for reproducibility.

  • scenario (str) – Either “baseline” or “optimized”.

Return type:

RequestTrace

Returns:

Synthetic RequestTrace.
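
A synthetic trace can be as simple as Poisson arrivals with lognormal token counts. A self-contained sketch (the rates and distributions here are illustrative, not the scenarios slosizer ships):

```python
import numpy as np
import pandas as pd

def make_toy_trace(*, horizon_s: int = 3600, rate_per_s: float = 0.5,
                   seed: int = 42) -> pd.DataFrame:
    """Poisson arrivals over horizon_s with lognormal token counts."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(rate_per_s * horizon_s)          # total request count
    arrivals = np.sort(rng.uniform(0.0, horizon_s, size=n))
    return pd.DataFrame({
        "arrival_s": arrivals,
        "input_tokens": rng.lognormal(mean=6.5, sigma=0.8, size=n).astype(int),
        "output_tokens": rng.lognormal(mean=5.0, sigma=0.7, size=n).astype(int),
    })

trace = make_toy_trace(horizon_s=600, rate_per_s=1.0, seed=7)
```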

Provider Adapters

Vertex AI

Google Cloud Vertex AI capacity profiles.

This module provides built-in capacity profiles for Vertex AI Generative AI models using the GSU (Generative Service Unit) provisioned throughput model.

See: https://cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/supported-models

slosizer.providers.vertex.available_vertex_profiles()

List available built-in Vertex AI model profiles.

Return type:

list[str]

Returns:

Sorted list of model identifiers.

slosizer.providers.vertex.vertex_profile(model)

Get a built-in Vertex AI capacity profile.

Parameters:

model (str) – Model identifier (e.g., “gemini-2.5-flash”).

Return type:

CapacityProfile

Returns:

CapacityProfile configured for the specified Vertex model.

Raises:

KeyError – If the model is not in the built-in registry.

Azure OpenAI

Azure OpenAI PTU capacity profiles.

This module provides a factory function for creating Azure OpenAI capacity profiles using the PTU (Provisioned Throughput Unit) model.

Azure PTU throughput is workload-sensitive and must be calibrated per deployment. See: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput

slosizer.providers.azure.azure_profile(model, *, throughput_per_unit, purchase_increment=1, min_units=1, input_weight=1.0, cached_input_weight=0.0, output_weight=4.0, thinking_weight=4.0, notes=())

Create an Azure OpenAI PTU capacity profile.

Azure PTU capacity varies by workload, so profiles must be calibrated using the Azure capacity calculator and benchmark data.

Parameters:
  • model (str) – Model identifier (e.g., “gpt-4.1”).

  • throughput_per_unit (float) – Tokens per second per PTU.

  • purchase_increment (int) – Minimum PTU increment for purchasing.

  • min_units (int) – Minimum number of PTUs.

  • input_weight (float) – Token weight for input tokens.

  • cached_input_weight (float) – Token weight for cached input tokens.

  • output_weight (float) – Token weight for output tokens.

  • thinking_weight (float) – Token weight for thinking tokens.

  • notes (tuple[str, ...]) – Additional notes about the profile.

Return type:

CapacityProfile

Returns:

CapacityProfile configured for Azure OpenAI.
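
Once calibrated, sizing reduces to weighted-token demand divided by tokens per second per PTU, rounded up to the purchase increment. A worked sketch of that arithmetic (the figure of 100 tokens/s/PTU is a made-up calibration value, not an Azure number):

```python
import math

def ptus_needed(weighted_tokens_per_s: float, *, throughput_per_unit: float,
                purchase_increment: int = 1, min_units: int = 1) -> int:
    """Round demand up to the purchase increment, respecting the minimum."""
    raw = math.ceil(weighted_tokens_per_s / throughput_per_unit)
    increments = math.ceil(raw / purchase_increment)
    return max(increments * purchase_increment, min_units)

# 2,450 weighted tokens/s at 100 tokens/s/PTU, sold in blocks of 5 → 25 PTUs
units = ptus_needed(2450.0, throughput_per_unit=100.0, purchase_increment=5)
```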