tabula_rasa

Tabula Rasa: Production Table Knowledge LLM.

Teaching LLMs to accurately answer questions about tabular data through statistical sketching and execution grounding.

class tabula_rasa.AdvancedStatSketch(max_categories=50, confidence_level=0.95)

Production-grade statistical sketch with:

  • Automatic distribution detection

  • Robust copula estimation

  • Conditional distribution inference

  • Multi-table relationship tracking

__init__(max_categories=50, confidence_level=0.95)

Initialize the statistical sketch extractor.

Parameters:
  • max_categories (int) – Maximum number of categorical values to track

  • confidence_level (float) – Confidence level for statistical estimates

extract(df, table_name='table')

Extract comprehensive statistical sketch from a DataFrame.

Parameters:
  • df (DataFrame) – Input DataFrame

  • table_name (str) – Name identifier for the table

Return type:

dict

Returns:

Dictionary containing statistical sketch with columns, correlations, copula parameters, and conditional distributions
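To make the idea of a per-table sketch concrete, here is a minimal standard-library illustration. It is not the library's implementation: `toy_sketch` and every key name in the returned dictionary are assumptions made for this example, rows are plain dicts rather than a DataFrame, and the correlations, copula parameters, and conditional distributions mentioned above are omitted.

```python
import statistics
from collections import Counter


def toy_sketch(rows, table_name="table", max_categories=50):
    """Summarize a list of row dicts (hypothetical stand-in, not the real extractor)."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep a few moments and the observed range.
            columns[name] = {
                "type": "numeric",
                "mean": statistics.fmean(values),
                "std": statistics.pstdev(values),
                "min": min(values),
                "max": max(values),
            }
        else:
            # Categorical column: keep only the most frequent values,
            # capped at max_categories.
            top = Counter(values).most_common(max_categories)
            columns[name] = {"type": "categorical", "top_values": dict(top)}
    return {"table_name": table_name, "n_rows": len(rows), "columns": columns}


rows = [
    {"price": 10.0, "city": "Oslo"},
    {"price": 20.0, "city": "Oslo"},
    {"price": 30.0, "city": "Rome"},
]
sketch = toy_sketch(rows, table_name="sales", max_categories=1)
print(sketch["columns"]["price"]["mean"])       # 20.0
print(sketch["columns"]["city"]["top_values"])  # {'Oslo': 2}
```

With `max_categories=1` only the most frequent city survives, which is the kind of truncation the `max_categories` parameter suggests.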

class tabula_rasa.AdvancedQueryExecutor(df)

Execute queries on actual data.

Supports: aggregations, filters, conditionals, group-by

__init__(df)

Initialize query executor.

Parameters:

df (DataFrame) – DataFrame to execute queries against

execute(query)

Execute structured query.

Parameters:

query (Query) – Query object specifying the operation

Return type:

Any

Returns:

Query result (float for aggregates, int for counts, dict for group-by)

Raises:

ValueError – If query type is unknown
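The dispatch behaviour described here (float for aggregates, int for counts, dict for group-by, `ValueError` otherwise) can be sketched with a small stand-in. Everything below is hypothetical: `ToyQuery` only mirrors the field names of the documented `Query` class, `toy_execute` works on a list of dicts instead of a DataFrame, and the query type strings `"aggregate"`, `"count"`, and `"group_by"` are assumptions, since the accepted values are not listed in this reference.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Any, Optional


@dataclass
class ToyQuery:
    """Mirrors the documented Query fields; not the library class."""
    query_type: str
    target_column: Optional[str] = None
    aggregation: Optional[str] = None
    condition: Optional[str] = None
    percentile: Optional[float] = None
    group_by: Optional[str] = None


def toy_execute(rows: list, query: ToyQuery) -> Any:
    """Dispatch on query_type; the type names used here are assumptions."""
    if query.query_type == "aggregate":
        values = [row[query.target_column] for row in rows]
        funcs = {"mean": fmean, "sum": sum, "min": min, "max": max}
        return float(funcs[query.aggregation](values))
    if query.query_type == "count":
        return len(rows)
    if query.query_type == "group_by":
        groups = {}
        for row in rows:
            groups.setdefault(row[query.group_by], []).append(row[query.target_column])
        # One mean per group, keyed by the group value.
        return {key: fmean(vals) for key, vals in groups.items()}
    raise ValueError(f"Unknown query type: {query.query_type}")


rows = [
    {"price": 10.0, "city": "Oslo"},
    {"price": 20.0, "city": "Oslo"},
    {"price": 30.0, "city": "Rome"},
]
print(toy_execute(rows, ToyQuery("aggregate", target_column="price", aggregation="mean")))  # 20.0
print(toy_execute(rows, ToyQuery("count")))  # 3
print(toy_execute(rows, ToyQuery("group_by", target_column="price", group_by="city")))
# {'Oslo': 15.0, 'Rome': 30.0}
```

Filters (`condition`) and percentiles are left out of the stand-in to keep it short.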

class tabula_rasa.Query(query_type, target_column=None, aggregation=None, condition=None, percentile=None, group_by=None)

Structured query representation.

__init__(query_type, target_column=None, aggregation=None, condition=None, percentile=None, group_by=None)

query_type: str

target_column: str | None = None

aggregation: str | None = None

condition: str | None = None

percentile: float | None = None

group_by: str | None = None

class tabula_rasa.StatisticalEncoder(hidden_dim=256, output_dim=768)

Encode statistical sketch into neural representation.

Handles variable-length column lists via attention pooling.

__init__(hidden_dim=256, output_dim=768)

Initialize the statistical encoder.

Parameters:
  • hidden_dim (int) – Hidden dimension for column encoders

  • output_dim (int) – Output dimension of the encoded sketch

forward(sketch)

Encode sketch to fixed-size vector.

Parameters:

sketch (dict) – Statistical sketch dictionary

Return type:

Tensor

Returns:

Tensor of shape (output_dim,) representing the encoded table
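Attention pooling is what lets a table with any number of columns map to a vector of one fixed size. The plain-Python sketch below illustrates only that mechanism. It is not the encoder's code: the real class returns a Tensor, and the function name `attention_pool` and the fixed `query` vector are assumptions for this example (in a trained model the query would typically be a learned parameter).

```python
import math


def attention_pool(column_vectors, query):
    """Pool any number of equal-length vectors into one vector of that length."""
    # Score each column by its dot product with the query vector.
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in column_vectors]
    # Softmax, subtracting the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the column vectors, one output entry per dimension.
    dim = len(query)
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, column_vectors))
        for i in range(dim)
    ]
    return pooled, weights


query = [1.0, 0.0, 0.0]
two_cols = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
five_cols = two_cols + [[0.5, 0.5, 0.5]] * 3
pooled_two, w_two = attention_pool(two_cols, query)
pooled_five, w_five = attention_pool(five_cols, query)
print(len(pooled_two), len(pooled_five))  # 3 3
```

Two columns and five columns both pool to a vector of length 3, because the attention weights sum to 1 across however many columns are present.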

class tabula_rasa.ProductionTableQA(model_name='t5-small', stat_dim=768)

Production Table QA model with T5 backbone.

Combines:

  • Pretrained language understanding (T5)

  • Statistical table knowledge (StatEncoder)

  • Execution grounding (trained to match executor)

__init__(model_name='t5-small', stat_dim=768)

Initialize the Table QA model.

Parameters:
  • model_name (str) – Pretrained T5 model name

  • stat_dim (int) – Dimension for statistical encoder output

forward(question, sketch, return_features=False)

Forward pass.

Parameters:
  • question (str) – Natural language question

  • sketch (dict) – Statistical sketch dictionary

  • return_features (bool) – Whether to return intermediate features

Return type:

dict

Returns:

Dictionary with keys:

  • answer: Predicted numerical answer

  • confidence: Confidence score in [0, 1]

  • query_type_logits: Query type classification logits

  • features (optional): Fused representation, included when return_features is True
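As a rough illustration of how a caller might consume a result dictionary of this shape, the helper below picks the highest-scoring query type and applies a confidence cutoff. All of it is hypothetical: `summarize_output`, the `type_names` ordering, and the `min_confidence` threshold are assumptions for this example, and the hand-written `fake_output` uses plain floats where the real model would most likely return tensors.

```python
def summarize_output(output, type_names, min_confidence=0.5):
    """Turn a forward-style result dict into a small report (hypothetical helper)."""
    logits = list(output["query_type_logits"])
    # Index of the largest logit, i.e. the predicted query type.
    best = max(range(len(logits)), key=logits.__getitem__)
    confidence = float(output["confidence"])
    return {
        "answer": float(output["answer"]),
        "query_type": type_names[best],
        "trusted": confidence >= min_confidence,
    }


# Hand-written stand-in for a model result; not produced by the library.
fake_output = {"answer": 41.5, "confidence": 0.8, "query_type_logits": [0.1, 2.3, -1.0]}
report = summarize_output(fake_output, ["aggregate", "count", "group_by"])
print(report)  # {'answer': 41.5, 'query_type': 'count', 'trusted': True}
```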

class tabula_rasa.TableQADataset(df, sketch, n_samples=1000)

Dataset for table QA with synthetic query generation.

__getitem__(idx)

Get a single sample.

Return type:

dict

__init__(df, sketch, n_samples=1000)

Initialize dataset with synthetic query generation.

Parameters:
  • df (DataFrame) – Source DataFrame

  • sketch (dict) – Statistical sketch of the DataFrame

  • n_samples (int) – Number of training samples to generate

__len__()

Return number of samples.

Return type:

int
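The three methods above follow the usual map-style dataset protocol: build samples in `__init__`, report their count in `__len__`, and serve one dict per index in `__getitem__`. The toy class below shows one way synthetic question/answer pairs could be generated, under loudly stated assumptions: `ToyTableQADataset`, its `numeric_columns` and `seed` arguments, the question template, and the sample keys are all invented for this example, and it reads rows from a list of dicts rather than from a DataFrame and sketch.

```python
import random
from statistics import fmean


class ToyTableQADataset:
    """Generates question/answer pairs up front, then serves them by index."""

    def __init__(self, rows, numeric_columns, n_samples=1000, seed=0):
        rng = random.Random(seed)  # seeded so the samples are reproducible
        aggregations = {"mean": fmean, "min": min, "max": max}
        self.samples = []
        for _ in range(n_samples):
            column = rng.choice(numeric_columns)
            agg = rng.choice(sorted(aggregations))
            values = [row[column] for row in rows]
            self.samples.append({
                "question": f"What is the {agg} of {column}?",
                # Ground-truth label computed directly on the data.
                "answer": float(aggregations[agg](values)),
            })

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


rows = [{"price": 10.0}, {"price": 20.0}, {"price": 30.0}]
dataset = ToyTableQADataset(rows, ["price"], n_samples=5)
print(len(dataset))  # 5
print(sorted(dataset[0]))  # ['answer', 'question']
```

Computing each label by running the aggregation on the actual rows is the toy counterpart of the execution grounding described in this reference.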

class tabula_rasa.ProductionTrainer(model, df, sketch, lr=0.0001, batch_size=16, device='cpu')

Production training with best practices.

__init__(model, df, sketch, lr=0.0001, batch_size=16, device='cpu')

Initialize the trainer.

Parameters:
  • model (ProductionTableQA) – ProductionTableQA model to train

  • df (DataFrame) – Training DataFrame

  • sketch (dict) – Statistical sketch of the DataFrame

  • lr (float) – Learning rate

  • batch_size (int) – Batch size for training

  • device (str) – Device to train on ('cpu' or 'cuda')

train(n_epochs=10, n_train_samples=1000, n_val_samples=200)

Training loop with validation.

Parameters:
  • n_epochs (int) – Number of training epochs

  • n_train_samples (int) – Number of training samples to generate

  • n_val_samples (int) – Number of validation samples to generate

Return type:

tuple[float, dict]

Returns:

Tuple of (best_val_loss, history_dict)
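The `(best_val_loss, history_dict)` return contract can be illustrated with a deliberately tiny training loop. This is a sketch of the pattern only, not the trainer's code: `toy_train` fits a one-parameter model `y = w * x` by gradient descent, and the history keys `"train_loss"` and `"val_loss"` are assumptions, since the actual keys are not documented here.

```python
def toy_train(train_pairs, val_pairs, n_epochs=10, lr=0.01):
    """Fit y = w * x by gradient descent and track validation loss."""

    def mse(w, pairs):
        return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

    w = 0.0
    best_val_loss = float("inf")
    history = {"train_loss": [], "val_loss": []}
    for _ in range(n_epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in train_pairs) / len(train_pairs)
        w -= lr * grad
        train_loss = mse(w, train_pairs)
        val_loss = mse(w, val_pairs)
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        # Keep the best validation loss seen so far.
        best_val_loss = min(best_val_loss, val_loss)
    return best_val_loss, history


train_pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val_pairs = [(4.0, 8.0)]
best, history = toy_train(train_pairs, val_pairs, n_epochs=10)
print(len(history["val_loss"]))  # 10
```

Tracking the minimum validation loss separately from the per-epoch history is what allows a caller to tell the best epoch apart from the last one.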