API Reference¶
Schema¶
Core data structures for capacity planning.
This module defines the schema classes used throughout slosizer for representing request traces, capacity profiles, SLO targets, and planning results.
- class slosizer.schema.OutputTokenSource(*values)¶
Source for output token counts in capacity planning.
- OBSERVED¶
Use actual observed output token counts from trace data.
- MAX_OUTPUT_TOKENS¶
Use max_output_tokens limit for worst-case planning.
- class slosizer.schema.LatencyMetric(*values)¶
Latency metric for SLO evaluation.
- E2E¶
End-to-end latency including baseline model latency and queue delay.
- QUEUE_DELAY¶
Queue delay only, excluding baseline model latency.
- class slosizer.schema.RequestSchema(time_col='ts', class_col='class_name', input_tokens_col='input_tokens', cached_input_tokens_col='cached_input_tokens', output_tokens_col='output_tokens', thinking_tokens_col='thinking_tokens', max_output_tokens_col='max_output_tokens', latency_col='latency_s')¶
Column mapping for request trace DataFrames.
- Parameters:
time_col (str)
class_col (str | None)
input_tokens_col (str)
cached_input_tokens_col (str | None)
output_tokens_col (str)
thinking_tokens_col (str | None)
max_output_tokens_col (str | None)
latency_col (str | None)
- time_col¶
Column containing request arrival timestamps.
- class_col¶
Column containing request class labels.
- input_tokens_col¶
Column containing input token counts.
- cached_input_tokens_col¶
Column containing cached input token counts.
- output_tokens_col¶
Column containing output token counts.
- thinking_tokens_col¶
Column containing thinking/reasoning token counts.
- max_output_tokens_col¶
Column containing max output token limits.
- latency_col¶
Column containing observed latency in seconds.
- class slosizer.schema.RequestTrace(frame, schema, provider=None, model=None, region=None, metadata=<factory>)¶
Normalized request trace with canonical columns.
- Parameters:
frame (DataFrame)
schema (RequestSchema)
provider (str | None)
model (str | None)
region (str | None)
metadata (Mapping[str, Any])
- frame¶
DataFrame with canonical columns (arrival_s, input_tokens, etc.).
- schema¶
Original schema used to parse the trace.
- provider¶
Cloud provider name (e.g., “vertex”, “azure”).
- model¶
Model identifier.
- region¶
Deployment region.
- metadata¶
Additional trace metadata.
- class slosizer.schema.CapacityProfile(provider, model, unit_name, throughput_per_unit, purchase_increment=1, min_units=1, input_weight=1.0, cached_input_weight=0.0, output_weight=4.0, thinking_weight=4.0, long_input_threshold=None, long_input_input_weight=None, long_input_cached_input_weight=None, long_input_output_weight=None, long_input_thinking_weight=None, source='', notes=())¶
Provider-specific capacity configuration.
Defines how tokens translate to reserved capacity units and the constraints on purchasing those units.
- Parameters:
provider (str)
model (str)
unit_name (Literal['GSU', 'PTU', 'capacity_unit'])
throughput_per_unit (float | None)
purchase_increment (int)
min_units (int)
input_weight (float)
cached_input_weight (float)
output_weight (float)
thinking_weight (float)
long_input_threshold (int | None)
long_input_input_weight (float | None)
long_input_cached_input_weight (float | None)
long_input_output_weight (float | None)
long_input_thinking_weight (float | None)
source (str)
notes (tuple[str, ...])
- provider¶
Cloud provider name.
- model¶
Model identifier.
- unit_name¶
Name of capacity unit (e.g., “GSU”, “PTU”).
- throughput_per_unit¶
Tokens per second per capacity unit.
- purchase_increment¶
Minimum increment for purchasing units.
- min_units¶
Minimum number of units that can be provisioned.
- input_weight¶
Token weight multiplier for input tokens.
- cached_input_weight¶
Token weight multiplier for cached input tokens.
- output_weight¶
Token weight multiplier for output tokens.
- thinking_weight¶
Token weight multiplier for thinking tokens.
- long_input_threshold¶
Input token count above which long-context weights apply.
- long_input_input_weight¶
Input weight for long-context requests.
- long_input_cached_input_weight¶
Cached input weight for long-context requests.
- long_input_output_weight¶
Output weight for long-context requests.
- long_input_thinking_weight¶
Thinking weight for long-context requests.
- source¶
Documentation or calibration source for the profile.
- notes¶
Additional notes about the profile.
- class slosizer.schema.LatencySLO(threshold_s, percentile=0.99, metric=LatencyMetric.E2E)¶
Latency service level objective.
- threshold_s¶
Maximum acceptable latency in seconds.
- percentile¶
Target percentile (e.g., 0.99 for p99).
- metric¶
Latency metric to measure (E2E or QUEUE_DELAY).
- Raises:
ValueError – If threshold_s <= 0 or percentile not in (0, 1).
- Parameters:
threshold_s (float)
percentile (float)
metric (LatencyMetric)
- class slosizer.schema.ThroughputTarget(percentile=0.99, max_overload_probability=None, windows_s=(1.0, 5.0, 30.0))¶
Throughput-based capacity planning target.
- percentile¶
Target percentile for required capacity.
- max_overload_probability¶
Maximum acceptable probability of overload.
- windows_s¶
Time window sizes for bucket analysis.
- Raises:
ValueError – If percentile not in (0, 1) or max_overload_probability not in [0, 1].
- Parameters:
percentile (float | None)
max_overload_probability (float | None)
windows_s (tuple[float, ...])
- label()¶
Generate a human-readable label for this target.
- Return type:
str
- Returns:
Descriptive label string.
- class slosizer.schema.LatencyTarget(slo)¶
Latency-based capacity planning target.
- Parameters:
slo (LatencySLO)
- slo¶
The latency SLO to meet.
- label()¶
Generate a human-readable label for this target.
- Return type:
str
- Returns:
Descriptive label string.
- class slosizer.schema.BaselineLatencyModel(intercept_s=0.15, input_token_s=3e-05, cached_input_token_s=8e-06, output_token_s=0.0009, thinking_token_s=0.0007)¶
Linear model for baseline request latency.
Predicts latency as a linear combination of token counts, useful for estimating processing time independent of queueing.
- Parameters:
intercept_s (float)
input_token_s (float)
cached_input_token_s (float)
output_token_s (float)
thinking_token_s (float)
- intercept_s¶
Base latency in seconds.
- input_token_s¶
Seconds per input token.
- cached_input_token_s¶
Seconds per cached input token.
- output_token_s¶
Seconds per output token.
- thinking_token_s¶
Seconds per thinking token.
- predict(frame)¶
Predict baseline latency for each request.
- Parameters:
frame (DataFrame) – DataFrame with token count columns.
- Return type:
ndarray
- Returns:
Array of predicted latencies in seconds.
- class slosizer.schema.PlanOptions(output_token_source=OutputTokenSource.OBSERVED, max_units_to_search=200, headroom_factor=0.0, baseline_latency_model=None)¶
Options for capacity planning.
- output_token_source¶
Use OBSERVED or MAX_OUTPUT_TOKENS for planning.
- max_units_to_search¶
Maximum capacity units to consider during search.
- headroom_factor¶
Additional capacity buffer as a fraction (e.g., 0.1 for 10%).
- baseline_latency_model¶
Custom latency model; if None, one is fitted.
- Raises:
ValueError – If max_units_to_search < 1 or headroom_factor < 0.
- Parameters:
output_token_source (OutputTokenSource)
max_units_to_search (int)
headroom_factor (float)
baseline_latency_model (BaselineLatencyModel | None)
- class slosizer.schema.SimulationResult(units, unit_name, request_level, latency_summary, slack_summary, assumptions)¶
Results from a capacity simulation.
- Parameters:
units (int)
unit_name (str)
request_level (DataFrame)
latency_summary (DataFrame)
slack_summary (DataFrame)
assumptions (dict[str, Any])
- units¶
Number of capacity units simulated.
- unit_name¶
Name of capacity unit.
- request_level¶
Per-request simulation results.
- latency_summary¶
Aggregate latency statistics.
- slack_summary¶
Spare capacity statistics by time window.
- assumptions¶
Simulation parameters and settings.
- class slosizer.schema.PlanResult(objective, target, recommended_units, unit_name, metrics, slack_summary, latency_summary=None, request_level=None, assumptions=<factory>)¶
Results from capacity planning.
- Parameters:
objective (str)
target (str)
recommended_units (int)
unit_name (str)
metrics (dict[str, Any])
slack_summary (DataFrame)
latency_summary (DataFrame | None)
request_level (DataFrame | None)
assumptions (dict[str, Any])
- objective¶
Planning objective (“throughput” or “latency”).
- target¶
Human-readable target description.
- recommended_units¶
Recommended number of capacity units.
- unit_name¶
Name of capacity unit.
- metrics¶
Planning metrics and statistics.
- slack_summary¶
Spare capacity statistics.
- latency_summary¶
Latency statistics (for latency planning).
- request_level¶
Per-request results (for latency planning).
- assumptions¶
Planning parameters and settings.
- as_dict()¶
Convert result to a flat dictionary.
- Return type:
dict[str, Any]
- Returns:
Dictionary with all metrics and metadata.
Ingestion¶
Request trace ingestion and normalization.
This module provides functions to convert raw DataFrames into normalized RequestTrace objects with canonical column names.
- slosizer.ingest.from_dataframe(df, *, schema, provider=None, model=None, region=None, validate=True, metadata=None)¶
Create a RequestTrace from a DataFrame.
Normalizes column names and validates data according to the schema.
- Parameters:
df (DataFrame) – Source DataFrame with request data.
schema (RequestSchema) – Column mapping for the DataFrame.
provider (str | None) – Cloud provider name.
model (str | None) – Model identifier.
region (str | None) – Deployment region.
validate (bool) – Whether to validate data constraints.
metadata (dict[str, Any] | None) – Additional trace metadata.
- Return type:
RequestTrace
- Returns:
Normalized RequestTrace.
- Raises:
ValueError – If required columns are missing or validation fails.
Simulation¶
Capacity simulation for queue-based latency modeling.
This module simulates request processing with finite capacity to estimate latency distributions and capacity utilization.
- slosizer.simulation.fit_baseline_latency_model(trace)¶
Fit a linear latency model from observed latencies.
Uses ordinary least squares to fit latency as a function of token counts. Coefficients are constrained to be non-negative. Only rows with valid (non-NaN) latency values are used for fitting.
- Parameters:
trace (RequestTrace) – Request trace with observed latencies.
- Return type:
BaselineLatencyModel
- Returns:
Fitted baseline latency model, or default model if insufficient data.
- slosizer.simulation.bucket_required_units(frame, profile, *, units, windows_s, output_token_source)¶
Compute required capacity units per time bucket.
Divides the trace into fixed-width time windows and calculates the capacity units needed to serve all requests in each window.
- Parameters:
frame (DataFrame) – DataFrame with canonical columns.
profile (CapacityProfile) – Capacity profile with throughput settings.
units (int) – Reserved capacity units to compare against.
windows_s (Iterable[float]) – Time window sizes in seconds.
output_token_source (str) – Source for output tokens.
- Return type:
DataFrame
- Returns:
DataFrame with required_units, spare_units, and overflow_units per bucket.
- Raises:
ValueError – If profile.throughput_per_unit is not set.
- slosizer.simulation.summarize_slack(slack_table)¶
Summarize spare capacity statistics by time window.
- Parameters:
slack_table (DataFrame) – Output from bucket_required_units.
- Return type:
DataFrame
- Returns:
DataFrame with aggregate statistics per window size.
- slosizer.simulation.simulate_capacity(trace, profile, *, units, options=None, windows_s=(1.0, 5.0, 30.0))¶
Simulate request processing with fixed capacity.
Models a simple FIFO queue where requests arrive and are processed at a rate determined by the reserved capacity.
- Parameters:
trace (RequestTrace) – Request trace to simulate.
profile (CapacityProfile) – Capacity profile with throughput settings.
units (int) – Number of reserved capacity units.
options (PlanOptions | None) – Planning options including output token source.
windows_s (tuple[float, ...]) – Time window sizes for slack analysis.
- Return type:
SimulationResult
- Returns:
SimulationResult with latency and slack statistics.
- Raises:
ValueError – If profile.throughput_per_unit is not set or if trace contains fewer than 2 requests.
Planning¶
Capacity planning algorithms.
This module provides functions to determine optimal reserved capacity based on throughput or latency targets.
- slosizer.planning.plan_capacity(trace, profile, target, *, options=None)¶
Determine optimal reserved capacity for a target.
Searches over candidate capacity levels to find the minimum that satisfies the given throughput or latency target.
- Parameters:
trace (RequestTrace) – Request trace representing workload.
profile (CapacityProfile) – Capacity profile for the target provider/model.
target (ThroughputTarget | LatencyTarget) – Throughput or latency target to meet.
options (PlanOptions | None) – Planning options.
- Return type:
PlanResult
- Returns:
PlanResult with recommended capacity and metrics.
- Raises:
ValueError – If profile.throughput_per_unit is not set.
TypeError – If target is not a ThroughputTarget or LatencyTarget.
- slosizer.planning.compare_scenarios(scenarios, profile, targets, *, options=None)¶
Compare capacity requirements across scenarios and targets.
- Parameters:
scenarios (Mapping[str, RequestTrace]) – Named request traces to compare.
profile (CapacityProfile) – Capacity profile for planning.
targets (Sequence[ThroughputTarget | LatencyTarget]) – Throughput and/or latency targets.
options (PlanOptions | None) – Planning options.
- Return type:
DataFrame
- Returns:
DataFrame with planning results for each scenario/target combination.
Plotting¶
Visualization functions for capacity planning results.
- slosizer.plotting.plot_latency_vs_units(trace, profile, *, units, options=None, target=None, path=None)¶
Plot latency percentiles as a function of reserved capacity.
- Parameters:
trace (RequestTrace) – Request trace to simulate.
profile (CapacityProfile) – Capacity profile.
units (Iterable[int]) – Capacity unit values to plot.
options (PlanOptions | None) – Planning options.
target (LatencyTarget | None) – Optional latency target to show as a horizontal line.
path (str | Path | None) – Optional path to save the figure.
- Return type:
None
- slosizer.plotting.plot_required_units_distribution(trace, profile, *, windows_s=(1.0, 5.0, 30.0), options=None, path=None)¶
Plot histogram of required capacity units per time window.
- Parameters:
trace (RequestTrace) – Request trace to analyze.
profile (CapacityProfile) – Capacity profile.
windows_s (tuple[float, ...]) – Time window sizes to plot.
options (PlanOptions | None) – Planning options.
path (str | Path | None) – Optional path to save the figure.
- Return type:
None
- slosizer.plotting.plot_capacity_tradeoff(comparison, *, path=None)¶
Plot recommended capacity across scenarios and targets.
- Parameters:
comparison (DataFrame) – Output from compare_scenarios.
path (str | Path | None) – Optional path to save the figure.
- Return type:
None
- slosizer.plotting.plot_slack_tradeoff(comparison, *, path=None)¶
Plot spare capacity fraction across scenarios and targets.
- Parameters:
comparison (DataFrame) – Output from compare_scenarios.
path (str | Path | None) – Optional path to save the figure.
- Raises:
ValueError – If comparison is missing the avg_spare_fraction_1s column.
- Return type:
None
Synthetic Workloads¶
Synthetic workload generation for testing and demonstration.
- slosizer.synthetic.optimize_trace(trace)¶
Apply prompt optimization to reduce token usage.
Simulates the effect of prompt engineering and caching improvements by reducing input, output, and thinking tokens.
- Parameters:
trace (RequestTrace) – Original request trace.
- Return type:
RequestTrace
- Returns:
Optimized trace with reduced token counts.
- slosizer.synthetic.make_synthetic_trace(*, horizon_s=14400, seed=42, scenario='baseline')¶
Generate a synthetic request trace for testing.
- Parameters:
horizon_s (int) – Trace duration in seconds.
seed (int) – Random seed for reproducibility.
scenario (str) – Either “baseline” or “optimized”.
- Return type:
RequestTrace
- Returns:
Synthetic RequestTrace.
Provider Adapters¶
Vertex AI¶
Google Cloud Vertex AI capacity profiles.
This module provides built-in capacity profiles for Vertex AI Generative AI models using the GSU (Generative Service Unit) provisioned throughput model.
See: https://cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/supported-models
- slosizer.providers.vertex.available_vertex_profiles()¶
List available built-in Vertex AI model profiles.
- Return type:
list[str]
- Returns:
Sorted list of model identifiers.
- slosizer.providers.vertex.vertex_profile(model)¶
Get a built-in Vertex AI capacity profile.
- Parameters:
model (str) – Model identifier (e.g., “gemini-2.5-flash”).
- Return type:
CapacityProfile
- Returns:
CapacityProfile configured for the specified Vertex model.
- Raises:
KeyError – If the model is not in the built-in registry.
Azure OpenAI¶
Azure OpenAI PTU capacity profiles.
This module provides a factory function for creating Azure OpenAI capacity profiles using the PTU (Provisioned Throughput Unit) model.
Azure PTU throughput is workload-sensitive and must be calibrated per deployment. See: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput
- slosizer.providers.azure.azure_profile(model, *, throughput_per_unit, purchase_increment=1, min_units=1, input_weight=1.0, cached_input_weight=0.0, output_weight=4.0, thinking_weight=4.0, notes=())¶
Create an Azure OpenAI PTU capacity profile.
Azure PTU capacity varies by workload, so profiles must be calibrated using the Azure capacity calculator and benchmark data.
- Parameters:
model (str) – Model identifier (e.g., “gpt-4.1”).
throughput_per_unit (float) – Tokens per second per PTU.
purchase_increment (int) – Minimum PTU increment for purchasing.
min_units (int) – Minimum number of PTUs.
input_weight (float) – Token weight for input tokens.
cached_input_weight (float) – Token weight for cached input tokens.
output_weight (float) – Token weight for output tokens.
thinking_weight (float) – Token weight for thinking tokens.
notes (tuple[str, ...]) – Additional notes about the profile.
- Return type:
CapacityProfile
- Returns:
CapacityProfile configured for Azure OpenAI.