API Reference¶

This section contains the complete API reference for all modules in the rowvoi package.

Core Types¶

Type aliases for rowvoi.

This module defines type aliases used throughout the package for clarity and consistency. The actual data structures are in core.py.

rowvoi.types.RowIndex¶

Alias for row indices within a pandas DataFrame.

Can be int for positional indexing or any hashable for label-based indexing.

rowvoi.types.ColName¶

Alias for column identifiers within a pandas DataFrame.

Pandas allows various types for column labels (strings, integers, tuples, etc.), so we use Hashable to permit any immutable label type.
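Going by the descriptions above, both aliases plausibly reduce to Hashable. The sketch below is an assumption about their shape, not a copy of rowvoi.types, and is_valid_label is a hypothetical helper added only for illustration.

```python
from collections.abc import Hashable

# Assumed shape of the aliases; the real definitions live in rowvoi.types.
RowIndex = Hashable
ColName = Hashable


def is_valid_label(label: object) -> bool:
    """Return True if the value could serve as a row or column label."""
    return isinstance(label, Hashable)


# Strings, integers and tuples are hashable; lists are not.
labels = ["age", 3, ("year", 2020), ["not", "ok"]]
validity = [is_valid_label(label) for label in labels]
```

A list fails the check because it is mutable and therefore unhashable, which matches the requirement for an immutable label type.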

Logical Methods¶

Mutual Information¶

Machine Learning¶

Model‑based value‑of‑information routines for rowvoi.

🔮 USE CASE: Sequential Conditional Selection (Informed Prediction). This module addresses Use Case 2, in which column selection decisions are made sequentially, based on patterns learned from historical data combined with the values observed so far. This is CONDITIONAL prediction, not blind guessing.

KEY INSIGHT: The model uses learned mutual information patterns to predict expected information gain CONDITIONAL on what’s already been observed in the current case.

Examples

Interactive interviews: Given responses so far, what question provides most info?
Sequential experiments: Given current results, which test should we run next?
Adaptive diagnosis: Given initial symptoms, which additional test is most valuable?

🎯 For Use Case 1 (optimizing collection when complete information is available), see rowvoi.setcover and minimal_key_* functions instead.

This module defines a small class, RowVoiModel, that implements a conditional prediction policy for sequential feature acquisition. It learns patterns from historical data about mutual information relationships, then applies these patterns conditionally based on current observations. Using a CandidateState that captures what columns have been observed and their values, the model predicts which additional column will provide the most information gain for distinguishing the remaining candidate rows.

The design is inspired by the active feature acquisition literature but kept deliberately simple for ease of understanding and extension. For deterministic/local policies that do not require learning a model, see rowvoi.mi.

class rowvoi.ml.RowVoiModel(smoothing=1e-06, noise=0.0, normalize_cols=True)[source]¶

Bases: object

Model for computing expected value of information across features.

A RowVoiModel encapsulates global information about the distribution of values for each feature in a dataset, together with optional discretization rules and a simple noise model. It provides methods to fit to a DataFrame, rank features by expected information gain, and simulate sequential acquisition procedures to disambiguate an unknown row among a candidate set.

Parameters:

smoothing (float, optional) – A pseudo‑count added to each category when computing frequencies. This mitigates zero‑probability issues when some candidate values are rare. Default is 1e‑6.
noise (float, optional) – Probability that the observed feature value does not equal the candidate row’s true value. When greater than zero, noise spreads probability mass over other candidate values according to the global frequency distribution. Default is 0.0 (no noise).
normalize_cols (bool, optional) – Whether to compute and use normalized mutual information values (i.e. divide by the feature entropy) when ranking features. Default is True.
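The effect of smoothing and noise can be illustrated with a small standard-library sketch. It is one possible reading of the parameter descriptions above, not rowvoi's implementation: smoothed_frequencies and observation_likelihood are hypothetical helpers, and the noise model here assumes that a wrong reading is drawn from the global frequencies of the values other than the true one.

```python
from collections import Counter


def smoothed_frequencies(values, smoothing=1e-6):
    """Relative frequencies with a pseudo-count added to every category."""
    counts = Counter(values)
    total = sum(counts.values()) + smoothing * len(counts)
    return {v: (c + smoothing) / total for v, c in counts.items()}


def observation_likelihood(observed, true_value, freqs, noise=0.0):
    """P(observed | true_value) under the assumed noise model.

    With probability 1 - noise the true value is reported; otherwise the
    reported value is drawn from the global frequencies of the other values.
    """
    if observed == true_value:
        return 1.0 - noise
    other_mass = sum(p for v, p in freqs.items() if v != true_value)
    if other_mass == 0.0:
        return 0.0  # no alternative values exist to absorb the noise
    return noise * freqs.get(observed, 0.0) / other_mass


# smoothing=1.0 is exaggerated here so that its effect is visible.
freqs = smoothed_frequencies(["a", "a", "b", "c"], smoothing=1.0)
p_wrong = observation_likelihood("b", "a", freqs, noise=0.1)
p_right = observation_likelihood("a", "a", freqs, noise=0.1)
```

With the default smoothing of 1e-6 the frequencies stay very close to the raw proportions; the pseudo-count only matters for categories that are rare.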

__init__(smoothing=1e-06, noise=0.0, normalize_cols=True)[source]¶

Parameters:

smoothing (float)
noise (float)
normalize_cols (bool)

Return type:

None

fit(df, discrete_cols=None, bins=3)[source]¶

Fit the model to a DataFrame by computing column frequencies and entropies.

This method prepares the model to evaluate expected information gain by storing global frequencies and, if necessary, discretizing numeric columns. Only columns in discrete_cols will be treated as discrete; if None, all columns are treated as discrete. The DataFrame is not modified in place; a copy with discretized values is stored.

Parameters:

df (pandas.DataFrame) – The dataset from which to learn frequencies. Should contain no missing values; callers should handle missing values externally (e.g. by imputation or by treating NaN as a category).
discrete_cols (Sequence[ColName], optional) – Columns to treat as discrete. If None, all columns are considered discrete. Numeric columns not in this list are discretized into quantile bins, with the number of bins given by bins.
bins (int, optional) – Number of quantile bins for discretization of numeric columns not specified in discrete_cols. Default is 3.

Returns:

Returns self for chaining.

Return type:

RowVoiModel
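Quantile discretization of a numeric column can be sketched with the standard library alone. This is an illustration of the general technique, assuming equal-frequency cut points; rowvoi's own discretization may handle ties and edges differently, and quantile_bin is a hypothetical helper.

```python
import bisect
import statistics


def quantile_bin(values, bins=3):
    """Map each numeric value to a quantile bin index in 0..bins-1."""
    cuts = statistics.quantiles(values, n=bins)  # yields bins - 1 cut points
    return [bisect.bisect_right(cuts, v) for v in values]


ages = [7, 1, 4, 9, 2, 5, 8, 3, 6]
age_bins = quantile_bin(ages, bins=3)
```

Each of the three bins receives three of the nine values, and the output keeps the order of the input column.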

suggest_next_feature(df, state, candidate_cols=None, objective='mi', feature_costs=None)[source]¶

Rank candidate columns by expected value of information.

Given a DataFrame df, a current candidate state, and an optional set of candidate columns, this method evaluates each column for its expected mutual information I(R; X_col | E) under the model’s smoothing and noise assumptions. It then returns the column with the best score. Two objectives are supported:

'mi' – select the column with the highest expected MI.
'mi_over_cost' – divide expected MI by a user‑supplied cost for that feature. This allows penalizing expensive features.

If normalize_cols was set when constructing the model, the normalized mutual information (MI divided by the feature entropy) is returned in the FeatureSuggestion for diagnostic purposes but is not used to rank features unless objective is set accordingly.

Parameters:

df (pandas.DataFrame) – The data table. If different from the DataFrame passed to fit(), it will be discretized using the same rules.
state (CandidateState) – The current candidate state.
candidate_cols (Sequence[ColName], optional) – A list of columns to consider. If None, all columns not yet observed in state.observed_cols are considered.
objective (str, optional) – Objective used for ranking. One of 'mi' or 'mi_over_cost'. Default is 'mi'.
feature_costs (dict[ColName, float], optional) – Mapping of feature costs. Required if objective is 'mi_over_cost'. Costs must be positive.

Returns:

A FeatureSuggestion containing the best column and associated information gain estimates, or None if no eligible columns remain.

Return type:

FeatureSuggestion or None
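For the noise-free case, the ranking described above can be sketched in a few lines. When a column is observed without noise, its value is a deterministic function of the row, so I(R; X_col | E) equals the entropy of the column's values under the current posterior. The helpers expected_mi and best_column and the dict-based table are hypothetical stand-ins; the sketch ignores smoothing, noise, and normalization.

```python
import math
from collections import defaultdict


def expected_mi(posterior, column_values):
    """I(R; X | E) in bits when X is observed without noise.

    posterior: dict row -> p(row | E); column_values: dict row -> value of X.
    """
    mass = defaultdict(float)
    for row, p in posterior.items():
        mass[column_values[row]] += p
    return -sum(p * math.log2(p) for p in mass.values() if p > 0)


def best_column(posterior, table, objective="mi", feature_costs=None):
    """Return (column, score) with the highest score, or None if no columns."""
    scores = {}
    for col, values in table.items():
        score = expected_mi(posterior, values)
        if objective == "mi_over_cost":
            score /= feature_costs[col]  # costs must be positive
        scores[col] = score
    if not scores:
        return None
    col = max(scores, key=scores.get)
    return col, scores[col]


posterior = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
table = {
    "A": {0: "a", 1: "a", 2: "b", 3: "b"},  # splits candidates in half
    "B": {0: "w", 1: "x", 2: "y", 3: "z"},  # identifies the row outright
}
by_mi = best_column(posterior, table)
by_cost = best_column(posterior, table, "mi_over_cost", {"A": 1.0, "B": 4.0})
```

Under 'mi' the fully identifying column B wins with 2 bits; once B is made four times as expensive as A, 'mi_over_cost' prefers the cheaper column A.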

run_acquisition(df, true_row, initial_state, candidate_cols=None, stop_when_unique=True, max_steps=None, objective='mi', feature_costs=None)[source]¶

Simulate a sequential feature acquisition session.

Given a true row index and an initial candidate state, this method repeatedly calls suggest_next_feature() to select the next column to query. It then simulates acquiring the feature value from the true row, updates the posterior and candidate list accordingly, and continues until either only one candidate row remains or a maximum number of steps has been reached. The history of suggestions (with associated VOI metrics) is returned.

Parameters:

df (pandas.DataFrame) – The data table (same columns as used for fitting). If it differs from the DataFrame passed to fit(), it will be discretized using the same rules.
true_row (RowIndex) – The index of the actual row to identify. Must be contained in initial_state.candidate_rows.
initial_state (CandidateState) – The starting candidate state. This object is not modified; a new state is created for the simulation.
candidate_cols (Sequence[ColName], optional) – Optional subset of columns to consider when selecting features. If None, all columns not yet observed are considered at each step.
stop_when_unique (bool, optional) – If True (default), stop the acquisition as soon as the posterior concentrates all mass on a single row. If False, continue until max_steps is reached.
max_steps (int, optional) – Maximum number of features to query. If None, no explicit limit is imposed.
objective (str, optional) – Objective passed to suggest_next_feature() (either 'mi' or 'mi_over_cost'). Default is 'mi'.
feature_costs (dict[ColName, float], optional) – Feature cost mapping used if objective='mi_over_cost'.

Returns:

A list of suggestions (one per query) containing the column chosen at each step and the associated VOI metrics. The length of the list equals the number of queries made.

Return type:

list[FeatureSuggestion]
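The acquisition loop can be sketched for the simplest setting: no noise and equally likely candidates. The code below is a simplified stand-in, not the method's actual implementation; simulate_acquisition, split_entropy, and the dict-of-dicts table are hypothetical, and the history holds plain column names rather than FeatureSuggestion objects.

```python
import math
from collections import Counter


def split_entropy(candidates, table, col):
    """Entropy in bits of a column's values over equally likely candidates."""
    counts = Counter(table[r][col] for r in candidates)
    n = len(candidates)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def simulate_acquisition(table, true_row, candidates, columns, max_steps=None):
    """Greedily query columns until one candidate remains."""
    candidates = list(candidates)
    remaining = list(columns)
    history = []
    while len(candidates) > 1 and remaining:
        if max_steps is not None and len(history) >= max_steps:
            break
        col = max(remaining, key=lambda c: split_entropy(candidates, table, c))
        remaining.remove(col)
        observed = table[true_row][col]  # simulate reading the true row
        candidates = [r for r in candidates if table[r][col] == observed]
        history.append(col)
    return history, candidates


table = {
    0: {"a": "x", "b": "p", "c": "k"},
    1: {"a": "y", "b": "q", "c": "k"},
    2: {"a": "z", "b": "p", "c": "k"},
    3: {"a": "z", "b": "q", "c": "k"},
}
history, final = simulate_acquisition(table, 3, [0, 1, 2, 3], ["a", "b", "c"])
```

Column a is queried first because it carries 1.5 bits over the four candidates, then b separates the two rows that remain; the constant column c is never needed.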

Simulation¶

Core Utilities¶

Core types and data structures for rowvoi.

This module defines the fundamental building blocks used throughout the rowvoi package:

CandidateState: Represents current uncertainty over which row is “the one”
FeatureSuggestion: A recommendation for which column to query next

class rowvoi.core.CandidateState(candidate_rows, posterior, observed_cols, observed_values)[source]¶

Bases: object

Represents the current uncertainty over which row is “the one”.

candidate_rows¶

List of row indices under consideration

Type: Sequence[RowIndex]

posterior¶

Probabilities over candidate_rows, shape (n_candidates,). Use a uniform distribution if deterministic / no model.

Type: np.ndarray

observed_cols¶

Set of columns that have been queried

Type: set[ColName]

observed_values¶

Mapping col -> observed value (may be empty in planning mode)

Type: Mapping[ColName, Any]

Parameters:

candidate_rows (Sequence[Hashable])
posterior (ndarray)
observed_cols (set[Hashable])
observed_values (Mapping[Hashable, Any])

candidate_rows: Sequence[Hashable]¶

posterior: ndarray¶

observed_cols: set[Hashable]¶

observed_values: Mapping[Hashable, Any]¶

__post_init__()[source]¶: Validate state consistency.

property entropy: float¶: Shannon entropy H(posterior) in bits.

property max_posterior: float¶: max_r p(r | E).

property residual_uncertainty: float¶: 1 - max_posterior.

property is_unique: bool¶: True if there is a single candidate with posterior ~1.

property unique_row: Hashable | None¶: Return the most probable row if unique, else None.
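The summary properties above follow directly from the posterior vector. The sketch below computes them with plain Python lists under stated assumptions: entropy_bits and is_unique are hypothetical helpers, and the tolerance tol is an illustrative choice, since the reference does not say how close to 1 the posterior must be.

```python
import math


def entropy_bits(posterior):
    """Shannon entropy H(posterior) in bits; zero-probability terms skipped."""
    return -sum(p * math.log2(p) for p in posterior if p > 0)


def is_unique(posterior, tol=1e-9):
    """True if a single candidate holds (almost) all posterior mass."""
    return max(posterior) >= 1.0 - tol


posterior = [0.5, 0.25, 0.25]
h = entropy_bits(posterior)        # entropy in bits
max_posterior = max(posterior)     # max_r p(r | E)
residual = 1.0 - max_posterior     # residual_uncertainty
```

A posterior of [1.0, 0.0, 0.0] would give zero entropy, zero residual uncertainty, and a unique row.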

classmethod uniform(candidate_rows, observed_cols=None, observed_values=None)[source]¶

Create a state with uniform posterior over candidates.

Parameters:

candidate_rows (Sequence[RowIndex]) – The candidate row indices
observed_cols (set[ColName], optional) – Already observed columns
observed_values (Mapping[ColName, Any], optional) – Values of observed columns

Returns:

State with uniform posterior distribution

Return type:

CandidateState

filter_candidates(df, col, value)[source]¶

Filter candidates to those matching the observed value.

Parameters:

df (pd.DataFrame) – The data frame containing candidate rows
col (ColName) – The column that was observed
value (Any) – The observed value

Returns:

New state with filtered candidates and renormalized posterior

Return type:

CandidateState
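Filtering and renormalizing can be sketched without pandas. This is a hypothetical stand-alone function, not the method itself: it takes the column as a dict from row to value, and raising an error when nothing matches is an assumption, because the reference does not describe that case.

```python
def filter_candidates(candidates, posterior, column_values, value):
    """Keep candidates whose column value matches, then renormalize."""
    kept = [
        (row, p)
        for row, p in zip(candidates, posterior)
        if column_values[row] == value
    ]
    total = sum(p for _, p in kept)
    if total == 0:
        # Assumed behavior; the real method may handle this differently.
        raise ValueError("no candidate matches the observed value")
    rows = [row for row, _ in kept]
    probs = [p / total for _, p in kept]
    return rows, probs


rows, probs = filter_candidates(
    candidates=[10, 11, 12],
    posterior=[0.5, 0.3, 0.2],
    column_values={10: "a", 11: "b", 12: "a"},
    value="a",
)
```

Row 11 is dropped and the remaining mass of 0.7 is rescaled so the new posterior sums to 1 while keeping the 5:2 ratio between rows 10 and 12.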

__init__(candidate_rows, posterior, observed_cols, observed_values)¶

Parameters:

candidate_rows (Sequence[Hashable])
posterior (ndarray)
observed_cols (set[Hashable])
observed_values (Mapping[Hashable, Any])

Return type:

None

class rowvoi.core.FeatureSuggestion(col, score, expected_voi=None, marginal_cost=None, debug=None)[source]¶

Bases: object

A recommendation of which column to query next.

col¶

The column name suggested to query next

Type: ColName

score¶

Raw score used to rank columns (e.g., MI, coverage gain)

Type: float

expected_voi¶

Expected value of information in bits

Type: float, optional

marginal_cost¶

Cost of querying this column

Type: float, optional

debug¶

Additional debugging information

Type: dict[str, Any], optional

Parameters:

col (Hashable)
score (float)
expected_voi (float | None)
marginal_cost (float | None)
debug (dict[str, Any] | None)

col: Hashable¶

score: float¶

expected_voi: float | None = None¶

marginal_cost: float | None = None¶

debug: dict[str, Any] | None = None¶

property cost_adjusted_score: float¶: Score divided by cost (if cost is available).
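A cost-adjusted score of this kind can be sketched with a small dataclass. SuggestionSketch is a hypothetical stand-in carrying only a subset of the fields, and falling back to the raw score when no cost is recorded is an assumption; the reference only says the division applies if a cost is available.

```python
from collections.abc import Hashable
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuggestionSketch:
    """Hypothetical stand-in for FeatureSuggestion (subset of fields)."""

    col: Hashable
    score: float
    marginal_cost: Optional[float] = None

    @property
    def cost_adjusted_score(self) -> float:
        # Assumption: fall back to the raw score when no cost is recorded.
        if self.marginal_cost is None:
            return self.score
        return self.score / self.marginal_cost


priced = SuggestionSketch(col="age", score=1.2, marginal_cost=4.0)
unpriced = SuggestionSketch(col="zip", score=0.8)
```

Here priced.cost_adjusted_score divides 1.2 bits by a cost of 4.0, while unpriced.cost_adjusted_score simply returns its raw score.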

__init__(col, score, expected_voi=None, marginal_cost=None, debug=None)¶

Parameters:

col (Hashable)
score (float)
expected_voi (float | None)
marginal_cost (float | None)
debug (dict[str, Any] | None)

Return type:

None